tf-idf Score
A tf-idf Score is a non-negative real number score produced by a tf-idf function (for a vocabulary member relative to a member of a multiset collection, i.e. a term relative to a document within a corpus).
- Context:
- It can (typically) increase with respect to Set Member Frequency (vocabulary members that are frequent within a single multiset/document are more indicative of it than rare ones).
- It can (typically) increase with respect to IDF Score (vocabulary members that are frequent across the entire multiset collection/corpus are less informative than rare ones).
- It can be a member of a tf-idf Vector.
- Example(s):
- [math]\displaystyle{ 0 }[/math], when every multiset/document in the collection contains the vocabulary member (its IDF Score is then zero).
- [math]\displaystyle{ 0.0046... }[/math] for [math]\displaystyle{ \operatorname{tf-idf}(``\text{quaint}'',\text{doc}_{184}, \text{Newsgroups 20 corpus}) }[/math], i.e. [math]\displaystyle{ \frac{4}{2,000} \times \log_{10}\left(\frac{8,000}{40}\right) = \frac{\log_{10}(200)}{500} \approx 0.0046 }[/math] (using the base-10 logarithm), if the word quaint is present 4 times in document [math]\displaystyle{ \text{doc}_{184} }[/math] with 2,000 words, and is contained in 40 documents from a corpus with 8,000 documents (a runnable sketch of this computation follows the See item below).
- …
- Counter-Example(s):
- a PMI Score.
- See: TF-IDF Ranking Function.
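The worked example above can be reproduced with a few lines of code. The snippet below is a minimal illustrative sketch, assuming raw term frequency and a base-10 logarithm for the IDF factor; the function name and the counts are taken from the example, not from any standard library.
```python
import math

def tf_idf(term_count, doc_length, docs_with_term, total_docs):
    """Plain tf-idf: raw term frequency times base-10 inverse document frequency."""
    tf = term_count / doc_length                   # 4 / 2,000 = 0.002
    idf = math.log10(total_docs / docs_with_term)  # log10(8,000 / 40) = log10(200)
    return tf * idf

# "quaint" appears 4 times in a 2,000-word document and in 40 of 8,000 documents.
print(round(tf_idf(4, 2_000, 40, 8_000), 4))  # 0.0046

# A vocabulary member contained in every document has idf = log10(1) = 0, so its score is 0.
print(tf_idf(5, 1_000, 8_000, 8_000))         # 0.0
```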
References
2009
- http://en.wikipedia.org/wiki/Tf%E2%80%93idf
- The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.
- One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more sophisticated ranking functions are variants of this simple model.
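The "summing the tf-idf for each query term" remark can be made concrete with a short sketch. The toy corpus, whitespace tokenization, raw term frequency, and base-10 logarithm below are assumptions chosen for illustration; they are not part of the quoted definition.
```python
import math
from collections import Counter

def ranking_scores(query, documents):
    """Score each document by summing the tf-idf weight of every query term (simplest scheme)."""
    n = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    counts = [Counter(tokens) for tokens in tokenized]
    scores = []
    for tokens, count in zip(tokenized, counts):
        total = 0.0
        for term in query.lower().split():
            df = sum(1 for c in counts if term in c)  # document frequency of the term
            if df:                                    # skip terms absent from the corpus
                tf = count[term] / len(tokens)        # raw term frequency in this document
                total += tf * math.log10(n / df)
        scores.append(total)
    return scores

docs = ["the quaint old town", "the busy town square", "a quaint cottage garden"]
print(ranking_scores("quaint town", docs))  # the first document gets the highest score
```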
- http://en.wikipedia.org/wiki/Tf%E2%80%93idf#Mathematical_details
- A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. The tf-idf value for a term will always be greater than or equal to zero.
2007
- (Pazzani & Billsus, 2007) ⇒ Michael J. Pazzani, and Daniel Billsus. (2007). “Content-based Recommendation Systems.” In: The adaptive web. Springer Berlin Heidelberg, 2007.
- QUOTE: ... associated with a term is a real number that represents the importance or relevance. This value is called the tf*idf weight (term-frequency times inverse document frequency). The tf*idf weight, [math]\displaystyle{ w(t,d) }[/math], of a term t in a document d is a function of the frequency of t in the document ([math]\displaystyle{ tf_{t,d} }[/math]), the number of documents that contain the term ([math]\displaystyle{ df_t }[/math]), and the number of documents in the collection ([math]\displaystyle{ N }[/math])[1]
- ↑ Note that in the description of tf*idf weights, the word “document” is traditionally used since the original motivation was to retrieve documents. While the chapter will stick with the original terminology, in a recommendation system, the documents correspond to a text description of an item to be recommended. Note that the equations here are representative of the class of formulae called tf*idf. In general, tf*idf systems have weights that increase monotonically with term frequency and decrease monotonically with document frequency.
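A minimal sketch of one representative member of that class of tf*idf formulae is shown below, assuming the common form [math]\displaystyle{ w(t,d) = tf_{t,d} \times \log(N / df_t) }[/math]; the quoted passage only constrains the weights to increase monotonically with term frequency and decrease monotonically with document frequency, so this particular equation is an assumption.
```python
import math

def w(tf_td, df_t, n_docs):
    """One representative tf*idf weight: raw term frequency times log inverse document frequency."""
    return tf_td * math.log(n_docs / df_t)

# Monotonically increasing in term frequency (document frequency and N held fixed) ...
assert w(2, 10, 1_000) < w(5, 10, 1_000)
# ... and monotonically decreasing in document frequency (term frequency and N held fixed).
assert w(5, 100, 1_000) < w(5, 10, 1_000)
```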