Information Retrieval Evaluation Measure
An Information Retrieval Evaluation Measure is a Performance Measure that is used to evaluate the effectiveness of an Information Retrieval System.
- AKA: Evaluation Measure.
- Context:
- It can range from being an Offline Information Retrieval Metric to being an Online Information Retrieval Metric.
- It can range from being an Unranked Retrieval Evaluation Measure to being a Ranked Retrieval Evaluation Measure.
- ...
- Example(s):
- an Online Information Retrieval Metric such as: a Click-Through Rate or a Session Abandonment Rate.
- an Offline Information Retrieval Metric such as: a Precision Measure, a Recall Measure, a Mean Average Precision (MAP), or a Normalized Discounted Cumulative Gain (NDCG).
- …
- Counter-Example(s):
- See: Symmetric Difference, Information Retrieval System, Intersection (Set Theory), Cardinality, Integral, Summation.
References
2023
- (Wikipedia, 2023) ⇒ https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval) Retrieved:2023-9-29.
- Evaluation measures for an information retrieval (IR) system assess how well an index, search engine or database returns results from a collection of resources that satisfy a user's query. They are therefore fundamental to the success of information systems and digital platforms. The success of an IR system may be judged by a range of criteria including relevance, speed, user satisfaction, usability, efficiency and reliability. However, the most important factor in determining a system's effectiveness for users is the overall relevance of results retrieved in response to a query. Evaluation measures may be categorised in various ways including offline or online, user-based or system-based and include methods such as observed user behaviour, test collections, precision and recall, and scores from prepared benchmark test sets. Evaluation for an information retrieval system should also include a validation of the measures used, i.e. an assessment of how well they measure what they are intended to measure and how well the system fits its intended use case. Measures are generally used in two settings: online experimentation, which assesses users' interactions with the search system, and offline evaluation, which measures the effectiveness of an information retrieval system on a static offline collection.
2018
- (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval) Retrieved:2018-2-18.
- Evaluation measures for an information retrieval system are used to assess how well the search results satisfied the user's query intent. Such metrics are often split into kinds: online metrics look at users' interactions with the search system, while offline metrics measure relevance, in other words how likely each result, or SERP page as a whole, is to meet the information needs of the user.
The mathematical symbols used in the formulas below mean:
- [math]\displaystyle{ X \cap Y }[/math] - Intersection - in this case, specifying the documents in both sets X and Y
- [math]\displaystyle{ | X | }[/math] - Cardinality - in this case, the number of documents in set X
- [math]\displaystyle{ \int }[/math] - Integral.
- [math]\displaystyle{ \sum }[/math] - Summation.
- [math]\displaystyle{ \Delta }[/math] - Symmetric difference
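As an illustrative sketch (not part of the quoted article; the document-ID sets below are hypothetical), the set notation above maps directly onto the unranked measures precision and recall:
```python
# Illustrative sketch (not from the cited sources): computing unranked
# retrieval measures directly from the set notation defined above.
# "retrieved" and "relevant" are hypothetical document-ID sets.

retrieved = {"d1", "d2", "d3", "d5"}   # documents returned by the system (X)
relevant  = {"d1", "d3", "d4"}         # documents judged relevant to the query (Y)

true_positives = retrieved & relevant             # X ∩ Y
precision = len(true_positives) / len(retrieved)  # |X ∩ Y| / |X|
recall    = len(true_positives) / len(relevant)   # |X ∩ Y| / |Y|
disagreement = retrieved ^ relevant               # symmetric difference X Δ Y

print(f"Precision: {precision:.2f}")              # 2/4 = 0.50
print(f"Recall:    {recall:.2f}")                 # 2/3 ≈ 0.67
print(f"Mismatched documents (Δ): {sorted(disagreement)}")
```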
2008
- (Manning et al., 2008) ⇒ Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze (2008). ["Chapter 8: Evaluation in Information Retrieval"]. In: "Introduction to Information Retrieval". Cambridge: Cambridge University Press.
- We have seen in the preceding chapters many alternatives in designing an IR system. How do we know which of these techniques are effective in which applications? Should we use stop lists? Should we stem? Should we use inverse document frequency weighting? Information retrieval has developed as a highly empirical discipline, requiring careful and thorough evaluation to demonstrate the superior performance of novel techniques on representative document collections.
In this chapter we begin with a discussion of measuring the effectiveness of IR systems (Section 8.1 ) and the test collections that are most often used for this purpose (Section 8.2). We then present the straightforward notion of relevant and nonrelevant documents and the formal evaluation methodology that has been developed for evaluating unranked retrieval results (Section 8.3). This includes explaining the kinds of evaluation measures that are standardly used for document retrieval and related tasks like text classification and why they are appropriate. We then extend these notions and develop further measures for evaluating ranked retrieval results (Section 8.4 ) and discuss developing reliable and informative test collections (Section 8.5).
We then step back to introduce the notion of user utility, and how it is approximated by the use of document relevance (Section 8.6). The key utility measure is user happiness. Speed of response and the size of the index are factors in user happiness. It seems reasonable to assume that relevance of results is the most important factor: blindingly fast, useless answers do not make a user happy. However, user perceptions do not always coincide with system designers' notions of quality. For example, user happiness commonly depends very strongly on user interface design issues, including the layout, clarity, and responsiveness of the user interface, which are independent of the quality of the results returned. We touch on other measures of the quality of a system, in particular the generation of high-quality result summary snippets, which strongly influence user utility, but are not measured in the basic relevance ranking paradigm (Section 8.7 ).
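As a companion sketch for the ranked-retrieval measures discussed in the quoted chapter (its Section 8.4), the following hypothetical example computes precision@k and average precision over an assumed ranking and set of relevance judgments; it is an illustration, not the chapter's own code:
```python
# Illustrative sketch (not from the cited chapter): two common ranked
# retrieval measures, precision@k and (non-interpolated) average precision,
# computed over a hypothetical ranking and relevance judgments.

ranking  = ["d3", "d7", "d1", "d9", "d4"]  # system output, best first
relevant = {"d1", "d3", "d4"}              # judged-relevant documents

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def average_precision(ranking, relevant):
    """Mean of precision@k over the ranks at which relevant documents are retrieved,
    divided by the total number of relevant documents."""
    hits, precisions = 0, []
    for k, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

print(precision_at_k(ranking, relevant, 3))   # 2/3 ≈ 0.67
print(average_precision(ranking, relevant))   # (1/1 + 2/3 + 3/5) / 3 ≈ 0.76
```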