tf-idf Scoring Function

Context:
- inputs ([math]\displaystyle{ t,D,\mathbf{C} }[/math]):
  - a Multiset Member, [math]\displaystyle{ t }[/math] (e.g. a vocabulary member).
  - a Multiset, [math]\displaystyle{ D }[/math] (e.g. a document bag-of-words).
  - a Multiset Set, [math]\displaystyle{ \mathbf{C} }[/math] (e.g. a corpus).
- output(s):
  - tf-idf Score.
- definition:
  - [math]\displaystyle{ \operatorname{tf-idf}(t,D,\mathbf{C}) = \operatorname{tf}(t,D) \times \operatorname{idf}(t,\mathbf{C}) }[/math].
- …
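
The definition above can be sketched in code. This is a minimal illustration, not a canonical implementation: it assumes raw counts for tf and an unsmoothed natural logarithm for idf, which is only one of several variants in use. The function names `tf`, `idf`, `tf_idf` and the toy corpus are hypothetical, and each multiset is represented as a `collections.Counter`.

```python
import math
from collections import Counter

def tf(t, D):
    """Raw term frequency: how many times t occurs in the multiset D."""
    return D[t]

def idf(t, C):
    """Inverse document frequency: log(|C| / number of multisets in C containing t)."""
    n_containing = sum(1 for D in C if D[t] > 0)
    if n_containing == 0:
        return 0.0  # assumption: a term absent from the whole corpus scores 0
    return math.log(len(C) / n_containing)

def tf_idf(t, D, C):
    """tf-idf(t, D, C) = tf(t, D) * idf(t, C), as in the definition."""
    return tf(t, D) * idf(t, C)

# Toy corpus (hypothetical): each document is a bag-of-words multiset.
corpus = [
    Counter("the cat sat on the mat".split()),
    Counter("the dog sat".split()),
    Counter("the cat and the dog".split()),
]
doc = corpus[0]

score_the = tf_idf("the", doc, corpus)  # "the" occurs in every document
score_mat = tf_idf("mat", doc, corpus)  # "mat" occurs in only one document
print(score_the, score_mat)
```

Under these assumptions, a term found in every document gets an idf of log(1) = 0, so "the" scores zero despite appearing twice, while "mat" scores log(3).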
Counter-Example(s):
See: tf-idf Vector, Text Corpus.

References

(Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Tf–idf Retrieved:2015-2-22.
- tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining.
  The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
  Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.
  One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.
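
The simple ranking function described in the passage above (summing the tf–idf of each query term) might look like the following sketch. It assumes raw-count tf and unsmoothed log idf; the helper names `score` and `rank` and the example documents are hypothetical and not taken from the cited source.

```python
import math
from collections import Counter

def tf_idf(t, D, C):
    """Raw-count tf times log(|C| / df); 0 if the term never occurs in C."""
    df = sum(1 for doc in C if doc[t] > 0)
    return D[t] * math.log(len(C) / df) if df else 0.0

def score(query_terms, D, C):
    """Relevance of document D to the query: sum of tf-idf over the query terms."""
    return sum(tf_idf(t, D, C) for t in query_terms)

def rank(query_terms, C):
    """Return document indices ordered from highest to lowest score."""
    return sorted(range(len(C)),
                  key=lambda i: score(query_terms, C[i], C),
                  reverse=True)

# Hypothetical three-document corpus.
corpus = [
    Counter("the cat sat on the mat".split()),
    Counter("the dog chased the cat".split()),
    Counter("the dog barked at the dog".split()),
]
ranking = rank(["dog", "barked"], corpus)
print(ranking)
```

Here the third document ranks first because it contains "dog" twice plus the rarer term "barked"; the first document contains neither query term and ranks last.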

(Pazzani & Billsus, 2007) ⇒ Michael J. Pazzani, and Daniel Billsus. “Content-based Recommendation Systems.” In: The Adaptive Web, pp. 325-341. Springer Berlin Heidelberg, 2007.
- QUOTE: ... associated with a term is a real number that represents the importance or relevance. This value is called the tf*idf weight (term-frequency times inverse document frequency). The tf*idf weight, [math]\displaystyle{ w(t,d) }[/math], of a term [math]\displaystyle{ t }[/math] in a document [math]\displaystyle{ d }[/math] is a function of the frequency of [math]\displaystyle{ t }[/math] in the document ([math]\displaystyle{ tf_{t,d} }[/math]), the number of documents that contain the term ([math]\displaystyle{ df_t }[/math]) and the number of documents in the collection ([math]\displaystyle{ N }[/math]).[1]

[1] Note that in the description of tf*idf weights, the word “document” is traditionally used since the original motivation was to retrieve documents. While the chapter will stick with the original terminology, in a recommendation system, the documents correspond to a text description of an item to be recommended. Note that the equations here are representative of the class of formulae called tf*idf. In general, tf*idf systems have weights that increase monotonically with term frequency and decrease monotonically with document frequency.
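
The monotonicity property noted above can be checked numerically on one member of this class of formulae. The sketch below assumes an unnormalized weight of term frequency times log(N / document frequency); this is an illustrative choice and is not claimed to be the exact equation of the quoted chapter. The function name `weight` and the sample values are hypothetical.

```python
import math

def weight(tf_td, df_t, N):
    """Assumed tf*idf variant: term frequency times log(N / document frequency)."""
    return tf_td * math.log(N / df_t)

N = 1000  # hypothetical collection size

# Hold document frequency fixed and vary term frequency: weights should rise.
by_tf = [weight(tf_td, 10, N) for tf_td in range(1, 6)]

# Hold term frequency fixed and vary document frequency: weights should fall.
by_df = [weight(3, df_t, N) for df_t in (1, 10, 100, 1000)]

print(by_tf)
print(by_df)
```

With this variant, a term that occurs in all N documents receives a weight of zero, since log(N / N) = 0.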