What is it about?
We discover a surprising degree of systematic disagreement between true-positive and false-positive metrics, a discrepancy that previous authors occasionally noted but did not explain. We find an explanation in the effect of popularity biases, which impact the two families of metrics in very different ways: whereas true-positive metrics reward the recommendation of popular items, false-positive metrics penalize it.
Why is it important?
Psychological studies have reported that bad experiences usually weigh more heavily than good ones in a person's overall assessment. Recommender system evaluation has mostly focused on measuring good experiences, that is, on counting true positives: the goal is to recommend as many good items as possible. However, if bad experiences are known to have a stronger impact, the recommendation task can also be framed as avoiding the recommendation of bad items. One might think that true- and false-positive metrics are complementary, but we found that in recommender system evaluation this is not always the case. Metrics that measure bad experiences can therefore provide a broader perspective on evaluation. In our study, we found that the main cause of the disagreement is a strong popularity bias in the data used to train and test algorithms. Moreover, we determine under which circumstances true- or false-positive metrics should be used for offline evaluation.
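As a rough illustration of the two metric families discussed above (this is a hypothetical sketch, not code or data from the paper): a true-positive metric such as precision counts how many recommended items are relevant, while a false-positive metric such as fall-out counts how many non-relevant catalog items were recommended.

```python
def precision(recommended, relevant):
    """True-positive metric: fraction of recommended items that are relevant."""
    hits = sum(1 for item in recommended if item in relevant)
    return hits / len(recommended)

def fallout(recommended, relevant, catalog):
    """False-positive metric: fraction of non-relevant catalog items recommended."""
    non_relevant = [item for item in catalog if item not in relevant]
    false_hits = sum(1 for item in recommended if item not in relevant)
    return false_hits / len(non_relevant)

# Toy example (hypothetical data): a 10-item catalog, 3 relevant items,
# and a recommendation list containing 2 relevant and 1 non-relevant item.
catalog = list("abcdefghij")
relevant = {"a", "b", "c"}
recommended = ["a", "b", "d"]

print(precision(recommended, relevant))        # 2 of 3 recommendations are relevant
print(fallout(recommended, relevant, catalog)) # 1 of 7 non-relevant items recommended
```

Precision rewards placing relevant (often popular) items in the list, while fall-out penalizes the non-relevant items that slip in; as the paper argues, popularity bias in the test data pulls these two signals in different directions.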
Read the Original
This page is a summary of: Agreement and Disagreement between True and False-Positive Metrics in Recommender Systems Evaluation, July 2020, ACM (Association for Computing Machinery),
DOI: 10.1145/3397271.3401096.