IDF Scoring Function

From GM-RKB
(Redirected from inverse document frequency)

An IDF Scoring Function is a vocabulary member scoring function based on the logarithm of the inverse of the proportion of multisets (e.g. documents in a corpus) that contain the vocabulary member.



References

2015

  • http://wikipedia.org/wiki/Tf%E2%80%93idf#Definition
    • The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient: $\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D: t \in d\}|}$ with
      • $N$: total number of documents in the corpus
      • $|\{d \in D: t \in d\}|$: number of documents where the term $t$ appears (i.e., $\mathrm{tf}(t,d) \neq 0$). If the term is not in the corpus, this will lead to a division by zero. It is therefore common to adjust the denominator to $1 + |\{d \in D: t \in d\}|$.
    • Mathematically, the base of the log function does not matter: changing the base only multiplies the overall result by a constant factor.
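
The definition quoted above can be illustrated with a minimal Python sketch. The function name `idf`, its `smooth` and `base` parameters, and the toy `corpus` are assumptions made for this example only; they do not come from the quoted source or from any particular library. Documents are assumed to be already tokenized into lists of strings.

```python
import math
from typing import Sequence

# Toy corpus (hypothetical data): each document is a list of tokens.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "and", "the", "dog"],
    ["a", "bird", "sang"],
]


def idf(term: str,
        documents: Sequence[Sequence[str]],
        smooth: bool = False,
        base: float = math.e) -> float:
    """Return log_base(N / df), where df counts documents containing `term`."""
    n_docs = len(documents)
    # Document frequency: documents in which the term occurs at least once.
    doc_freq = sum(1 for doc in documents if term in doc)
    # Optional adjustment described above: use 1 + df as the denominator
    # so that a term absent from the corpus does not divide by zero.
    denominator = 1 + doc_freq if smooth else doc_freq
    return math.log(n_docs / denominator, base)


if __name__ == "__main__":
    for term in ("the", "cat", "bird"):
        print(term, idf(term, corpus))
    # An unseen term only works with the adjusted denominator.
    print("zebra", idf("zebra", corpus, smooth=True))
    # Changing the base rescales every score by the same constant, 1 / ln(2).
    print(idf("cat", corpus, base=2) / idf("cat", corpus))
    print(idf("bird", corpus, base=2) / idf("bird", corpus))
```

Calling `idf("zebra", corpus)` without `smooth=True` raises `ZeroDivisionError`, which is the division-by-zero case mentioned in the quoted passage. Note that with the `1 + df` denominator, a term that occurs in every document receives a slightly negative score, since the quotient falls below 1.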

2004

  • (Robertson, 2004) ⇒ S. Robertson. (2004). “Understanding Inverse Document Frequency: On Theoretical Arguments for IDF.” In: Journal of Documentation, Volume 60, Number 5.

1972