Item Recommendations System Performance Measure
An Item Recommendations System Performance Measure is a ranked subset prediction task performance measure for an item recommendations task.
- Context:
- It can range from being an Offline Item Recommendations Task Performance Measure to being an Online Item Recommendations Task Performance Measure.
- It can range from being a Predictive Quality-based Item Recommendations Task Performance Measure to being a Computation Effort-based Item Recommendations Task Performance Measure.
- Example(s):
- Counter-Example(s):
- See: Item Recommendations System, IR System, Ordinal Prediction Task, Subset Prediction Task.
References
2020
- (Krichene & Rendle, 2020) ⇒ Walid Krichene, and Steffen Rendle. (2020). “On Sampled Metrics for Item Recommendation.” In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2020).
- QUOTE: ... This section starts by formalizing the most common evaluation scheme for item recommendation. Let there be a pool of 𝑛 items to recommend from. For a given instance x, a recommendation algorithm, 𝐴, returns a ranked list of the 𝑛 items. In an evaluation, the positions, 𝑅(𝐴, x) ⊆ {1, ... , 𝑛}, of the withheld relevant items within this ranking are computed – 𝑅 will also be referred to as the predicted ranks. For example, 𝑅(𝐴, x) = {3, 5} means for an instance x recommender 𝐴 ranked two relevant items at positions 3 and 5. Then, a metric 𝑀 is used to translate the positions into a single number measuring the quality of the ranking. This process is repeated for a set of instances, 𝐷 = {x1, x2, ...}, and an average metric is reported …
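The following is a minimal Python sketch, not from the paper, of the evaluation loop described in the quote: each instance's predicted ranks 𝑅(𝐴, x) are mapped to a score by a metric 𝑀 (Recall@K is used here purely for illustration), and the scores are averaged over the instance set 𝐷. All function and variable names are illustrative assumptions.
```python
def recall_at_k(predicted_ranks, k=10):
    """M(R): fraction of withheld relevant items ranked within the top k."""
    if not predicted_ranks:
        return 0.0
    return sum(1 for r in predicted_ranks if r <= k) / len(predicted_ranks)

def evaluate(ranks_per_instance, metric=recall_at_k):
    """Average the metric over a set of instances D = {x1, x2, ...}."""
    scores = [metric(ranks) for ranks in ranks_per_instance]
    return sum(scores) / len(scores)

# R(A, x) = {3, 5}: two relevant items ranked at positions 3 and 5.
print(evaluate([{3, 5}, {1}, {42}]))  # mean Recall@10 over three instances
```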
2018
- (Rybakov et al., 2018) ⇒ Oleg Rybakov, Vijai Mohan, Avishkar Misra, Scott LeGrand, Rejith Joseph, Kiuk Chung, Siddharth Singh, Qian You, Eric Nalisnick, Leo Dirac, and Runfei Luo. (2018). “The Effectiveness of a Two-layer Neural Network for Recommendations.”
- QUOTE: There are many different metrics focusing on specific properties of the recommendation algorithm (6, 2014). Among all, root mean square error (RMSE) is the most popular one ((Qu et al., 2016), (Sedhain et al., 2015)). It requires explicit feedback (ratings). Nevertheless, in many practical applications recommender systems need to be centered on implicit feedback (Hu et al., 2008). Implicit information such as clicks and purchases is normally tracked automatically; customers do not need to explicitly express their attitude, so it is easier to collect. In the scenario of predicting future purchases from implicit feedback data, we use two metrics throughout the evaluations in this paper: Precision at K and Product Converted Coverage (PCC) at K.
Precision at K is the accuracy of the predicted recommendations with respect to the actual purchases: [math]\displaystyle{ Precision@K = \frac{1}{C} \sum^{C-1}_{c=0} \frac{\mid \{Rec_c\} \cap \{T_c\} \mid}{K}, \quad (1) }[/math] where K is the position/rank of a recommendation, c is the customer index, Rec_c is the top-K recommended items for customer c, T_c is the actual consumptions for customer c represented as the set of items the customer purchased in the evaluation period (where an interaction can be a purchase, a watch, or a listen), |Rec| is the number of items in set Rec, Rec ∩ T is the intersection between sets Rec and T, and C is the number of customers.
While having high precision is necessary, it is not sufficient. A personalized recommender should also recommend a diverse set of items (Adomavicius & Kwon, 2012). For example, if precision is high with no diversity, then the recommendations look like a hall of mirrors showing only products in a single topic. Therefore, to guarantee the diversity of recommendations, we use product converted coverage at K. It captures the number of unique products being recommended in the top K and at the same time purchased: [math]\displaystyle{ PCC@K = \frac{1}{P} \mid \bigcup^{C-1}_{c=0} (\{Rec_c\} \cap \{T_c\}) \mid, \quad (2) }[/math] where [math]\displaystyle{ \bigcup^{C-1}_{c=0} (X_c) }[/math] represents the union of sets [math]\displaystyle{ X_0, X_1, \ldots, X_{C-1} }[/math] and P is the total number of products.
Using held-out labels to measure a recommender’s efficacy leaks future purchase information (Covington et al., 2016). Consequently, there is a risk of inconsistent performance between offline and online evaluation. In order to reduce this gap and emulate a real production environment, the test metrics in this paper are measured on future purchases instead of held-out data.
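The two metrics quoted above, Precision@K (Eq. 1) and PCC@K (Eq. 2), can be sketched in Python as follows; the data layout (per-customer recommendation lists and purchase sets) and all identifier names are assumptions for illustration, not from the paper.
```python
def precision_at_k(recs, purchases, k):
    """Mean over customers of |Rec_c ∩ T_c| / K (Eq. 1)."""
    per_customer = [len(set(recs[c][:k]) & set(purchases[c])) / k for c in recs]
    return sum(per_customer) / len(per_customer)

def pcc_at_k(recs, purchases, k, n_products):
    """|union over customers of (Rec_c ∩ T_c)| / P (Eq. 2):
    share of the catalog that was both recommended in the top K and purchased."""
    converted = set()
    for c in recs:
        converted |= set(recs[c][:k]) & set(purchases[c])
    return len(converted) / n_products

# Toy example: two customers, a catalog of 100 products, K = 3.
recs = {0: [5, 7, 9], 1: [5, 2, 11]}
purchases = {0: {7, 30}, 1: {5}}
print(precision_at_k(recs, purchases, k=3))             # (1/3 + 1/3) / 2
print(pcc_at_k(recs, purchases, k=3, n_products=100))   # |{5, 7}| / 100
```
A high Precision@K with a low PCC@K would indicate the "hall of mirrors" effect described in the quote: accurate but narrow recommendations concentrated on few products.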
2006
- (McNee et al., 2006) ⇒ Sean M. McNee, John Riedl, and Joseph A. Konstan. (2006). “Being Accurate is Not Enough: How Accuracy Metrics Have Hurt Recommender Systems.” In: CHI '06 Extended Abstracts on Human Factors in Computing Systems. ISBN:1-59593-298-4 doi:10.1145/1125451.1125659
- QUOTE: ... Recommender systems have shown great potential to help users find interesting and relevant items from within a large information space. Most research up to this point has focused on improving the accuracy of recommender systems. …