IDF Scoring Function
An IDF Scoring Function is a vocabulary member scoring function based on the logarithm of the inverse of the proportion of multisets that contain the vocabulary member.
- AKA: Inverse Document Frequency Measure.
- Context:
- inputs: [math]\displaystyle{ (t,\mathbf{C}) }[/math]
- a Multiset Member, [math]\displaystyle{ t }[/math] (e.g. a vocabulary member).
- a Multiset Set, [math]\displaystyle{ \mathbf{C} }[/math] (e.g. a corpus).
- outputs: idf Score.
- definition:
- [math]\displaystyle{ \mathrm{idf}(t, \mathbf{C}) = \log \frac{\mid \mathbf{C} \mid}{\mid \mathbf{C}(t) \mid} }[/math], where [math]\displaystyle{ \mid \mathbf{C} \mid }[/math] is the number of documents in the corpus, and [math]\displaystyle{ \mid \mathbf{C}(t) \mid }[/math] is the number of documents in the corpus that contain the term [math]\displaystyle{ t }[/math].
- It can be used to create an IDF Model (of a multiset set).
- It can be a component of a TF-IDF Weight Function.
- …
- Counter-Example(s):
- See: TF-IDF Weight, Stopword.
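The definition above can be illustrated with a minimal Python sketch. It assumes each document is a list of tokens (a multiset) and the corpus is a list of such documents; the names `idf`, `idf_model`, and the toy corpus are hypothetical and chosen only for illustration. It uses the natural logarithm and the unsmoothed denominator, so it raises an error for a term that occurs in no document.

```python
import math


def idf(term, corpus):
    """Return log(|C| / |C(t)|) for `term` over `corpus` (a list of token lists)."""
    n_docs = len(corpus)
    # |C(t)|: number of documents that contain the term at least once.
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        # The unsmoothed definition is undefined for terms absent from the corpus.
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(n_docs / n_containing)


def idf_model(corpus):
    """Map every vocabulary member of `corpus` to its idf score."""
    vocabulary = set()
    for doc in corpus:
        vocabulary.update(doc)
    return {term: idf(term, corpus) for term in vocabulary}


# Toy corpus (an assumption for illustration only).
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
    ["a", "bird", "sang"],
]

model = idf_model(corpus)
```

In this toy corpus, "the" occurs in three of four documents and so scores lower than "bird", which occurs in only one; this is the sense in which the score rewards rare vocabulary members.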
References
2015
- http://wikipedia.org/wiki/Tf%E2%80%93idf#Definition
- The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient: [math]\displaystyle{ \mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D: t \in d\}|} }[/math] with
- [math]\displaystyle{ N }[/math]: total number of documents in the corpus
- [math]\displaystyle{ |\{d \in D: t \in d\}| }[/math] : number of documents where the term [math]\displaystyle{ t }[/math] appears (i.e., [math]\displaystyle{ \mathrm{tf}(t,d) \neq 0 }[/math]). If the term is not in the corpus, this will lead to a division-by-zero. It is therefore common to adjust the denominator to [math]\displaystyle{ 1 + |\{d \in D: t \in d\}| }[/math].
- Mathematically the base of the log function does not matter and constitutes a constant multiplicative factor towards the overall result.
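The two remarks above (the adjusted denominator and the irrelevance of the log base) can be sketched in Python as follows. The function name `idf_smoothed` and its parameters are hypothetical; it takes the document counts directly rather than a corpus.

```python
import math


def idf_smoothed(n_docs, doc_freq, base=math.e):
    """Return log_base(N / (1 + df)), the adjusted-denominator variant of idf."""
    # Adding 1 to the denominator avoids division by zero when doc_freq == 0.
    return math.log(n_docs / (1 + doc_freq), base)


# A term absent from the corpus no longer causes a division-by-zero.
unseen = idf_smoothed(4, 0)

# Changing the base only rescales the score by a constant factor:
# the natural-log score divided by the base-10 score equals ln(10).
natural = idf_smoothed(100, 9)
base10 = idf_smoothed(100, 9, base=10)
ratio = natural / base10
```

One consequence of this adjustment is that a term occurring in every document receives a slightly negative score, since N / (1 + N) is less than 1.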
2004
- (Robertson, 2004) ⇒ S. Robertson. (2004). “Understanding Inverse Document Frequency: On theoretical arguments for IDF.” In: Journal of Documentation, 60(5).
1972
- (Spärck Jones, 1972) ⇒ Karen Spärck Jones. (1972). “A Statistical Interpretation of Term Specificity and its Application in Retrieval.” In: Journal of Documentation, 28(1). doi:10.1108/eb026526
- NOTES: Introduced "term specificity", which later became known as inverse document frequency, or IDF.