IDF Scoring Function

AKA: Inverse Document Frequency Measure.
Context:
- inputs: [math]\displaystyle{ (t,\mathbf{C}) }[/math]
  - a Multiset Member, [math]\displaystyle{ t }[/math] (e.g. a vocabulary member).
  - a Multiset Set, [math]\displaystyle{ \mathbf{C} }[/math] (e.g. a corpus).
- outputs: an IDF Score.
- definition:
  - [math]\displaystyle{ \mathrm{idf}(t, \mathbf{C}) = \log \frac{\mid \mathbf{C} \mid}{ \mid \mathbf{C}(t) \mid} }[/math], where [math]\displaystyle{ \mid \mathbf{C} \mid }[/math] is the number of documents in the corpus and [math]\displaystyle{ \mid \mathbf{C}(t) \mid }[/math] is the number of documents in the corpus that contain the term [math]\displaystyle{ t }[/math].
- It can be used to create an IDF Model (of a multiset set).
- It can be a component of a TF-IDF Weight Function.
- …
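
The definition above can be illustrated with a minimal Python sketch. It assumes the corpus is a list of `collections.Counter` multisets and uses the natural logarithm; the function name `idf` and the toy corpus are hypothetical and are not taken from any particular library.

```python
import math
from collections import Counter

def idf(term, corpus):
    """Return log(|C| / |C(t)|) for a corpus given as a list of multisets."""
    # |C(t)|: number of documents in which the term occurs at least once.
    doc_freq = sum(1 for doc in corpus if doc.get(term, 0) > 0)
    if doc_freq == 0:
        # The unadjusted definition is undefined for a term absent from the corpus.
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(corpus) / doc_freq)

# Hypothetical toy corpus: each document is a multiset of tokens.
corpus = [
    Counter("the cat sat on the mat".split()),
    Counter("the dog sat".split()),
    Counter("a cat and a dog".split()),
    Counter("the end".split()),
]

for term in ("the", "cat", "mat"):
    print(term, round(idf(term, corpus), 4))
```

In this toy corpus "the" occurs in three of four documents and "mat" in only one, so "mat" receives the higher score, which matches the intent of weighting rare terms more heavily.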
Counter-Example(s):
- Relative Term Frequency.
- Pointwise Mutual Information Measure.
See: TF-IDF Weight, Stopword.

References

http://wikipedia.org/wiki/Tf%E2%80%93idf#Definition
- The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. [math]\displaystyle{ \mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D: t \in d\}|} }[/math] with
  - [math]\displaystyle{ N }[/math]: total number of documents in the corpus
  - [math]\displaystyle{ |\{d \in D: t \in d\}| }[/math] : number of documents where the term [math]\displaystyle{ t }[/math] appears (i.e., [math]\displaystyle{ \mathrm{tf}(t,d) \neq 0 }[/math]). If the term is not in the corpus, this will lead to a division-by-zero. It is therefore common to adjust the denominator to [math]\displaystyle{ 1 + |\{d \in D: t \in d\}| }[/math].
- Mathematically the base of the log function does not matter and constitutes a constant multiplicative factor towards the overall result.
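
The two remarks above (the adjusted denominator and the choice of log base) can be checked with a short Python sketch. The function name `smoothed_idf` and its `base` parameter are assumptions made for illustration; the function takes the document counts directly rather than a corpus.

```python
import math

def smoothed_idf(n_docs, doc_freq, base=math.e):
    """Return log_base(N / (1 + df)), the adjusted-denominator variant."""
    # Adding 1 keeps the denominator non-zero even when doc_freq is 0.
    return math.log(n_docs / (1 + doc_freq), base)

# A term absent from the corpus no longer causes a division by zero.
print(smoothed_idf(4, 0))

# Changing the base rescales every score by the same constant, 1 / ln(2) here.
for df in (1, 9, 49):
    ratio = smoothed_idf(100, df, base=2) / smoothed_idf(100, df)
    print(df, ratio)
```

One side effect of this adjustment is that a term occurring in every document gets a slightly negative score, since the denominator then exceeds the number of documents.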

(Robertson, 2004) ⇒ S. Robertson. (2004). “Understanding Inverse Document Frequency: On theoretical arguments for IDF.” In: Journal of Documentation, 60(5).

(Spärck Jones, 1972) ⇒ Karen Spärck Jones. (1972). “A Statistical Interpretation of Term Specificity and its Application in Retrieval.” In: Journal of Documentation, 28(1). doi:10.1108/eb026526
- NOTES: It introduced "term specificity", which later became known as inverse document frequency (IDF).