Document-Wise Co-Occurrence Statistic
A Document-Wise Co-Occurrence Statistic is a word-word co-occurrence statistic that uses in-the-same-document as a co-occurrence relation.
- Context:
- It can range from being a Boolean Document-Wise Co-Occurrence Statistic to being a Count-based Document-Wise Co-Occurrence Statistic.
- Counter-Example(s):
- See: Bag-of-Words Model.
References
2010
- (Momtazi et al., 2010) ⇒ Saeedeh Momtazi, Sanjeev Khudanpur, and Dietrich Klakow. (2010). “A Comparative Study of Word Co-occurrence for Term Clustering in Language Model-based Sentence Retrieval.” In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. ISBN:1-932432-65-5
- QUOTE: If two content words [math]\displaystyle{ w }[/math] and [math]\displaystyle{ w' }[/math] are seen in the same document, they are usually topically related. In this notion of co-occurrence, how near or far away from each other they are in the document is irrelevant, as is their order of appearance in the document. Document-wise co-occurrence has been successfully used in many NLP applications such as automatic thesaurus generation (Manning et al., 2008).
Statistics of document-wise co-occurrence may be collected in two different ways. In the first case, [math]\displaystyle{ f_{ww'} = f_{w'w} }[/math] is simply the number of documents that contain both [math]\displaystyle{ w }[/math] and [math]\displaystyle{ w' }[/math]. This is usually the notion used in ad hoc retrieval. Alternatively, we may want to treat each instance of [math]\displaystyle{ w' }[/math] in a document that contains an instance of [math]\displaystyle{ w }[/math] to be a co-occurrence event. Therefore if [math]\displaystyle{ w' }[/math] appears three times in a document that contains two instances of [math]\displaystyle{ w }[/math], the former method counts it as one co-occurrence, while the latter as six co-occurrences. We use the latter statistic, since we are concerned with retrieving sentence sized “documents,” wherein a repeated word is more significant.
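The two counting schemes in the quote can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `document_cooccurrence`, the toy documents, and the choice to store each pair once under a sorted key (and to skip pairing a word with itself) are assumptions made for illustration. The instance-level count is taken as the product of the two term frequencies, which matches the quote's example of 2 × 3 = 6.

```python
from collections import Counter
from itertools import combinations


def document_cooccurrence(documents):
    """Collect document-wise co-occurrence statistics for tokenized documents.

    Returns two Counters keyed by alphabetically sorted word pairs:
      - boolean_counts: number of documents containing both words
      - instance_counts: sum over documents of tf(w) * tf(w')
    Self-pairs (a word with itself) are skipped; that is an assumption here.
    """
    boolean_counts = Counter()
    instance_counts = Counter()
    for tokens in documents:
        term_freq = Counter(tokens)
        # Each unordered pair of distinct words in this document, visited once.
        for w, w_prime in combinations(sorted(term_freq), 2):
            pair = (w, w_prime)
            boolean_counts[pair] += 1
            instance_counts[pair] += term_freq[w] * term_freq[w_prime]
    return boolean_counts, instance_counts


# Toy corpus (hypothetical): "bank" appears twice and "loan" three times
# in the first document, mirroring the example in the quote.
docs = [
    ["bank", "loan", "bank", "loan", "loan"],
    ["bank", "river"],
]
boolean_counts, instance_counts = document_cooccurrence(docs)
print(boolean_counts[("bank", "loan")])   # 1
print(instance_counts[("bank", "loan")])  # 6
```

Because the statistic is symmetric, one sorted key per pair is enough; a lookup for the pair in the opposite order would need the key sorted first.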