tf-idf Vector Distance Function
A tf-idf Vector Distance Function is a cosine distance function between TF-IDF vectors (based on relative term frequency and inverse document frequency).
- Context:
- domain: two tf-idf Vectors; and an IDF Model (derived from the same underlying collection of multisets).
- range: a Distance Score.
- It can be calculated as [math]\displaystyle{ \mathrm{dist}(d_1,d_2,D) = 1 - \cos\left(\vec{v}_{d_1}, \vec{v}_{d_2}\right) }[/math], where each vector component is the term weight [math]\displaystyle{ \mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \times \mathrm{idf}(t,D) }[/math] (a worked sketch follows the See list below).
- It can (often) be used as:
- a String Distance Function, by mapping each string and the underlying Base Corpus to Multisets (however, it cannot handle the Word Semantic Challenge).
- a Document Distance Function, by mapping each Document and the underlying Base Corpus as Multisets.
- an Information Retrieval Ranking Function, to compare Document similarity and distance to a Keyword Query.
- a TF-IDF Ranking Function.
- ...
- Example(s):
- tf-idf Distance({a,b},{b,a},C) = 0
- tf-idf Distance({a,b},{c,d},C) = 1
- IF TF(a)=0.5, THEN TFIDF Distance({a,a,b},{a,b,b})= ???, because IDF(a)= ???
- ...
- Counter-Example(s):
- See: Term Vector Space Model; Stop-Word; TF-IDF-based Text-Item Feature Generation Algorithm.
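The definition above can be made concrete with a minimal Python sketch. The helper names (tfidf_vector, tfidf_distance) are invented for illustration, and it assumes raw term counts for tf and the plain log(N/df) idf, one of several common weighting variants. It reproduces the first two examples:

```python
import math
from collections import Counter

def idf(term, corpus):
    """Inverse document frequency: log(N / df(term)) over a corpus of token multisets."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tfidf_vector(doc, corpus, vocab):
    """Component for each term t: tfidf(t,d,D) = tf(t,d) * idf(t,D), with tf as raw count."""
    counts = Counter(doc)
    return [counts[t] * idf(t, corpus) for t in vocab]

def tfidf_distance(doc1, doc2, corpus):
    """Cosine distance (1 - cosine similarity) between the two tf-idf vectors."""
    vocab = sorted(set(doc1) | set(doc2))
    v1 = tfidf_vector(doc1, corpus, vocab)
    v2 = tfidf_vector(doc2, corpus, vocab)
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(x * x for x in v2))
    if norm1 == 0 or norm2 == 0:
        return 1.0  # convention when one tf-idf vector is all-zero
    return 1.0 - dot / (norm1 * norm2)

# A toy base corpus C (an assumption for the example).
C = [["a", "b"], ["c", "d"], ["a", "c"]]
print(round(tfidf_distance(["a", "b"], ["b", "a"], C), 6))  # 0.0: identical multisets
print(round(tfidf_distance(["a", "b"], ["c", "d"], C), 6))  # 1.0: no shared terms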
References
2020
- (Qi, 2020) ⇒ Zhang Qi. (2020). “The Text Classification of Theft Crime based on TF-IDF and XGBoost Model.” In: 2020 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA).
- NOTE:
- It utilizes 2622 preprocessed theft crime cases from a city spanning 2009-2019, aiming to enhance crime prediction accuracy using text classification.
- It employs the TF-IDF (Term Frequency-Inverse Document Frequency) model for feature extraction, determining the relevance of words in the crime data documents.
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Tf–idf Retrieved:2015-2-21.
- tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining.
The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.
One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.
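A hedged sketch of that "sum the tf–idf for each query term" ranking scheme may clarify it. The function simple_tfidf_score and the toy corpus are invented for illustration; production search engines use more elaborate variants of this model:

```python
import math
from collections import Counter

def simple_tfidf_score(query_terms, doc, corpus):
    """Score a document against a query: sum of tf(t,doc) * idf(t,corpus) over query terms."""
    counts = Counter(doc)
    n = len(corpus)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)
        if df:
            score += counts[t] * math.log(n / df)
    return score

corpus = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog"]]
query = ["cat", "sat"]
# Rank all documents against the query, highest score first.
ranked = sorted(corpus, key=lambda d: simple_tfidf_score(query, d, corpus), reverse=True)
print(ranked)  # ["cat", "sat", "mat"] ranks first: it matches both query terms
```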
2011
- (Sammut & Webb, 2011) ⇒ Claude Sammut, and Geoffrey I. Webb. (2011). “TF-IDF.” In: (Sammut & Webb, 2011) p.986
2010
- http://alias-i.com/lingpipe/docs/api/com/aliasi/spell/TfIdfDistance.html
- QUOTE: Note that there are a range of different distances called "TF/IDF" distance. The one in this class is defined to be symmetric, unlike typical TF/IDF distances defined for information retrieval. It scales inverse-document frequencies by logs, and both inverse-document frequencies and term frequencies by square roots. This causes the influence of IDF to grow logarithmically, and term frequency comparison to grow linearly.
Suppose we have a collection docs of n strings, which we will call documents in keeping with tradition. Further let df(t,docs) be the document frequency of token t, that is, the number of documents in which the token t appears. Then the inverse document frequency (IDF) of t is defined by:

idf(t,docs) = sqrt(log(n/df(t,docs)))

If the document frequency df(t,docs) of a term is zero, then idf(t,docs) is set to zero. As a result, only terms that appeared in at least one training document are used during comparison.

The term vector for a string is then defined by its term frequencies. If count(t,cs) is the count of term t in character sequence cs, then the term frequency (TF) is defined by:

tf(t,cs) = sqrt(count(t,cs))

The term-frequency/inverse-document frequency (TF/IDF) vector tfIdf(cs,docs) for a character sequence cs over a collection of documents docs has a value tfIdf(cs,docs)(t) for term t defined by:

tfIdf(cs,docs)(t) = tf(t,cs) * idf(t,docs)

The proximity between character sequences cs1 and cs2 is defined as the cosine of their TF/IDF vectors:

proximity(cs1,cs2) = cosine(tfIdf(cs1,docs),tfIdf(cs2,docs))

Recall that the cosine of two vectors is the dot product of the vectors divided by their lengths:

cos(x,y) = x . y / (|x| * |y|)

where dot products are defined by:

x . y = Σi x[i] * y[i]

and length is defined by:

|x| = sqrt(x . x)

Distance is then just 1 minus the proximity value:

distance(cs1,cs2) = 1 - proximity(cs1,cs2)
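A minimal Python re-implementation of the symmetric TF/IDF distance defined in the quote above may help make the sqrt/log dampening concrete. This is an illustrative sketch following the quoted formulas, not LingPipe's actual Java code:

```python
import math
from collections import Counter

def lingpipe_tfidf_distance(cs1, cs2, docs):
    """Symmetric TF/IDF distance per the quoted definition:
    idf(t) = sqrt(log(n/df(t))), tf(t) = sqrt(count(t)),
    distance = 1 - cosine of the two TF/IDF vectors."""
    n = len(docs)

    def idf(t):
        df = sum(1 for d in docs if t in d)
        # Terms unseen in training docs get idf 0, so they are ignored in comparison.
        return math.sqrt(math.log(n / df)) if df else 0.0

    def vector(tokens):
        counts = Counter(tokens)
        return {t: math.sqrt(c) * idf(t) for t, c in counts.items()}

    v1, v2 = vector(cs1), vector(cs2)
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    len1 = math.sqrt(sum(w * w for w in v1.values()))
    len2 = math.sqrt(sum(w * w for w in v2.values()))
    proximity = dot / (len1 * len2) if len1 and len2 else 0.0
    return 1.0 - proximity

# Toy training collection and comparison (invented for the example).
docs = [["a", "b"], ["b", "c"], ["a", "c", "c"]]
print(lingpipe_tfidf_distance(["a", "b"], ["a", "c"], docs))  # 0.5 for this toy corpus
```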
2009
- http://alias-i.com/lingpipe/demos/tutorial/stringCompare/read-me.html
- QUOTE: TF/IDF Distance: LingPipe implements a second kind of token-based distance in the class spell.TfIdfDistance. By varying tokenizers, different behaviors may be had with the same underlying implementation. TF/IDF distance is based on vector similarity (using the cosine measure of angular similarity) over dampened and discriminatively weighted term frequencies. The basic idea is that two strings are more similar if they contain many of the same tokens with the same relative number of occurrences of each. Tokens are weighted more heavily if they occur in few documents. See the class documentation for a full definition of TF/IDF distance.
2003
- (Cohen et al., 2003) ⇒ William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. (2003). “A Comparison of String Distance Metrics for Name-Matching Tasks.” In: Workshop on Information Integration on the Web (IIWeb-03).
- QUOTE: Two strings [math]\displaystyle{ s }[/math] and [math]\displaystyle{ t }[/math] can also be considered as multisets (or bags) of words (or tokens). We also considered several token-based distance metrics. The Jaccard similarity between the word sets [math]\displaystyle{ S }[/math] and [math]\displaystyle{ T }[/math] is simply [math]\displaystyle{ \frac{|S \cap T|}{|S \cup T|} }[/math]. TFIDF or cosine similarity, which is widely used in the information retrieval community ... (Footnote: Affine edit-distance functions assign a relatively lower cost to a sequence of insertions or deletions.)
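A small sketch of the Jaccard similarity mentioned in the quote, computed over word sets (illustrative only; whitespace tokenization is an assumption):

```python
def jaccard(s_tokens, t_tokens):
    """Jaccard similarity |S intersect T| / |S union T| over the word sets of two strings."""
    S, T = set(s_tokens), set(t_tokens)
    return len(S & T) / len(S | T) if S | T else 1.0

print(jaccard("william w cohen".split(), "william cohen".split()))  # 2/3 ≈ 0.667
```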