Mean Average Precision (MAP) Performance Metric


A Mean Average Precision (MAP) Performance Metric is a ranking performance measure, with a cut-off parameter [math]\displaystyle{ n }[/math], that averages, over a set of queries, the average precision of each query's top-[math]\displaystyle{ n }[/math] ranked results.

  • Context:
    • It can be defined as [math]\displaystyle{ \operatorname{MAP} = \frac{\sum_{q=1}^Q \operatorname{AveP}(q)}{Q} \! }[/math], where [math]\displaystyle{ Q }[/math] is the number of queries and [math]\displaystyle{ \operatorname{AveP}(q) }[/math] is the average precision of query [math]\displaystyle{ q }[/math].
    • It can be defined as [math]\displaystyle{ \mbox{MAP}(Q) = \frac{1}{\vert Q\vert} \sum_{j=1}^{\vert Q\vert} \frac{1}{m_j} \sum_{k=1}^{m_j} \mbox{Precision}(R_{jk}) }[/math], where the set of relevant documents for an information need [math]\displaystyle{ q_j \in Q }[/math] is [math]\displaystyle{ \{d_1, \ldots d_{m_j}\} }[/math] and [math]\displaystyle{ R_{jk} }[/math] is the set of ranked retrieval results from the top result until you get to document [math]\displaystyle{ d_k }[/math] (a minimal code sketch of this definition follows this list).
    • It can (often) be applied to Information Retrieval Tasks.
  • Counter-Example(s):
  • See: IR Task.
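The following is a minimal Python sketch of the second definition above, assuming each query's ranked results are represented as a list of 0/1 relevance flags; the function names average_precision and mean_average_precision, and the example inputs, are illustrative rather than taken from any particular library.

```python
def average_precision(ranked_relevances, num_relevant=None):
    """Average precision for a single query (information need).

    ranked_relevances: 0/1 flags for the ranked retrieval results, in
    rank order (1 = relevant). num_relevant: m_j, the total number of
    relevant documents for the query; relevant documents that were
    never retrieved contribute a precision of 0.
    """
    if num_relevant is None:
        num_relevant = sum(ranked_relevances)  # assumes all relevant docs were retrieved
    if num_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, rel in enumerate(ranked_relevances, start=1):
        if rel:
            hits += 1
            precision_sum += hits / k  # Precision(R_jk) at the rank of this relevant document
    return precision_sum / num_relevant


def mean_average_precision(all_queries):
    """MAP(Q): the arithmetic mean of the per-query average precision values."""
    return sum(average_precision(q) for q in all_queries) / len(all_queries)


# Two queries, each given as the relevance flags of its ranked results.
print(mean_average_precision([[1, 0, 1, 0], [0, 1, 1, 1]]))  # ~0.736
```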


References

2012a

  • http://www.kaggle.com/wiki/MeanAveragePrecision
    • QUOTE: Suppose there are m missing outbound edges from a user in a social graph, and you can predict up to 10 other nodes that the user is likely to follow. Then, by adapting the definition of average precision in IR, the average precision at n for this user is: [math]\displaystyle{ ap@n=\sum_{k=1}^{n}P(k)/\min(m,n) }[/math], where the result is set to zero if the denominator is zero; P(k) denotes the precision at cut-off k in the item list, i.e., the ratio of the number of users followed up to position k over the number k, and P(k) equals 0 when the k-th item is not followed upon recommendation; n = 10.
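A minimal Python sketch of the ap@n measure described in the quote above, assuming the ground truth is given as a set of followed nodes and the predictions as a ranked list; the function name apk and the example data are illustrative.

```python
def apk(actual, predicted, n=10):
    """Average precision at n for one user, following the quoted definition.

    actual: set of nodes the user actually follows (the m missing edges).
    predicted: ranked list of up to n recommended nodes.
    Returns sum_{k=1..n} P(k) / min(m, n); the result is 0 when the
    denominator is 0, and P(k) is 0 when the k-th prediction is not followed.
    """
    m = len(actual)
    if min(m, n) == 0:
        return 0.0
    hits = 0
    score = 0.0
    for k, item in enumerate(predicted[:n], start=1):
        if item in actual:
            hits += 1
            score += hits / k  # P(k): precision at cut-off k
    return score / min(m, n)


# 3 true follows, 2 of them predicted at ranks 1 and 3 -> (1/1 + 2/3) / 3 ≈ 0.556
print(apk({"a", "b", "c"}, ["a", "x", "b", "y"], n=10))
```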

2012b

  • http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html
    • QUOTE: ... In recent years, other measures have become more common. Most standard among the TREC community is Mean Average Precision (MAP), which provides a single-figure measure of quality across recall levels. Among evaluation measures, MAP has been shown to have especially good discrimination and stability. For a single information need, Average Precision is the average of the precision value obtained for the set of top [math]\displaystyle{ k }[/math] documents existing after each relevant document is retrieved, and this value is then averaged over information needs. That is, if the set of relevant documents for an information need [math]\displaystyle{ q_j \in Q }[/math] is [math]\displaystyle{ \{d_1, \ldots d_{m_j}\} }[/math] and [math]\displaystyle{ R_{jk} }[/math] is the set of ranked retrieval results from the top result until you get to document [math]\displaystyle{ d_k }[/math], then: [math]\displaystyle{ \mbox{MAP}(Q) = \frac{1}{\vert Q\vert} \sum_{j=1}^{\vert Q\vert} \frac{1}{m_j} \sum_{k=1}^{m_j} \mbox{Precision}(R_{jk}) }[/math]

      When a relevant document is not retrieved at all, the precision value in the above equation is taken to be 0. For a single information need, the average precision approximates the area under the uninterpolated precision-recall curve, and so the MAP is roughly the average area under the precision-recall curve for a set of queries.

      Using MAP, fixed recall levels are not chosen, and there is no interpolation. The MAP value for a test collection is the arithmetic mean of average precision values for individual information needs. (This has the effect of weighting each information need equally in the final reported number, even if many documents are relevant to some queries whereas very few are relevant to other queries.) Calculated MAP scores normally vary widely across information needs when measured within a single system, for instance, between 0.1 and 0.7. Indeed, there is normally more agreement in MAP for an individual information need across systems than for MAP scores for different information needs for the same system. This means that a set of test information needs must be large and diverse enough to be representative of system effectiveness across different queries.

      The above measures factor in precision at all recall levels. For many prominent applications, particularly web search, this may not be germane to users. What matters is rather how many good results there are on the first page or the first three pages. This leads to measuring precision at fixed low levels of retrieved results, such as 10 or 30 documents. This is referred to as “Precision at [math]\displaystyle{ k }[/math]”, for example “Precision at 10”. It has the advantage of not requiring any estimate of the size of the set of relevant documents but the disadvantages that it is the least stable of the commonly used evaluation measures and that it does not average well, since the total number of relevant documents for a query has a strong influence on precision at [math]\displaystyle{ k }[/math].
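A short Python sketch of the “Precision at [math]\displaystyle{ k }[/math]” measure from the last paragraph above, using the same 0/1 relevance-flag representation as the sketch earlier on this page; the function name precision_at_k is illustrative.

```python
def precision_at_k(ranked_relevances, k=10):
    """Precision at a fixed, low cut-off k (e.g. 10 or 30 retrieved documents).

    ranked_relevances: 0/1 flags for the ranked results (1 = relevant).
    Unlike MAP, no estimate of the total number of relevant documents
    is needed; only the top k positions are inspected.
    """
    return sum(ranked_relevances[:k]) / k


# 4 relevant documents among the top 10 retrieved -> 0.4
print(precision_at_k([1, 0, 1, 1, 0, 0, 1, 0, 0, 0], k=10))
```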

2008