Bag-of-Words Vector
A Bag-of-Words Vector is a text-item integer vector in which each vector member contains a statistic (typically a count) of the one-hot codes for the corresponding vocabulary term.
- AKA: BoW Record, Lexical Item Statistic Vector, Term-Count Vector.
- Context:
- It can (typically) be produced by a Bag-of-Words Vectorizing System (that implements a bag-of-words mapping model).
- It can (typically) be associated with a Word Set (typically a core word list that excludes stop words).
- It can (typically) be a Sparse Vector.
- It can (in the abstract) be a member of a Bag-of-Words Vector Space.
- It can range from being a Binarized BoW Vector to being a Weighted BoW Vector.
- It can range, based on Lexical Item Type, from being an Orthographic Word Vector to being a Stemmed Word Vector to being a Word Form Vector (with Compound Words such as Technical Terms).
- It can range from being a Binary Word Vector to being a Word Occurrence Vector (e.g. with the word's Word Relative Frequency).
- It can be used as a Lexical Pattern (e.g. for a keyword search query).
- Example(s):
- a Frequency-Count BoW Vector, such as: <0,2,0,0,0,1,...,0,3,0,1> (see the sketch after this list).
- a Binarized BoW Vector, such as: <0,1,0,0,0,1,...,0,1,0,1>.
- a BoW Document Vector.
- a Passage Word Vector, such as a Word Mention Context Window/Text Window Vector.
- …
- Counter-Example(s):
- a Distributional Text-Item Vector.
- a Bag of Character n-Grams.
- a Syntactic Pattern.
- a Text Token Substring, such as a Word Mention String.
- a Text Graph.
- See: Lexical Pattern, Text Window, Statistical Classification.
References
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Bag-of-words_model Retrieved: 2015-2-1.
- The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. Recently, the bag-of-words model has also been used for computer vision.
The bag-of-words model is commonly used in methods of document classification, where the (frequency of) occurrence of each word is used as a feature for training a classifier.
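As one concrete illustration of the document-classification usage described in the quote, the sketch below pairs BoW count features with a naive Bayes classifier using scikit-learn's CountVectorizer and MultinomialNB. This is one possible toolchain, not the only one, and the toy training texts and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus with invented labels, purely for illustration.
train_texts = [
    "the striker scored a late goal",
    "the keeper saved the penalty",
    "the court upheld the contract",
    "the judge dismissed the appeal",
]
train_labels = ["sports", "sports", "legal", "legal"]

# Map each text to a Frequency-Count BoW Vector over the induced vocabulary.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # sparse count matrix

# Each word's occurrence count serves as a feature for training the classifier.
classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

X_test = vectorizer.transform(["the referee allowed the goal"])
print(classifier.predict(X_test))  # expected: ['sports']
```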
1997
- (Joachims, 1997a) ⇒ Thorsten Joachims. (1997). “A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization.” In: Proceedings of the Fourteenth International Conference on Machine Learning (ICML 1997). ISBN:1-55860-486-3
- QUOTE:The representation of a problem has a strong impact on the generalization accuracy of a learning system. For categorization a document, which typically is a string of characters, has to be transformed into a representation which is suitable for the learning algorithm and the classification task. IR research suggests that words work well as representation units and that their ordering in a document is of minor importance for many tasks. This leads to a representation of documents as bags of words.
This bag-of-words representation is equivalent to an attribute-value representation as used in machine learning. Each distinct word corresponds to a feature with the number of times the word occurs in the document as its value. Figure 1 shows an example feature vector for a particular document. To avoid unnecessarily large feature vectors words are considered as features only if they occur in the training data at least [math]\displaystyle{ m }[/math] (e.g. [math]\displaystyle{ m }[/math] = 3) times. The set of considered features (i.e. words) will be called F.
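A minimal sketch of the feature-selection step Joachims describes: a word is kept as a feature only if it occurs at least [math]\displaystyle{ m }[/math] times in the training data (here m = 3; the toy corpus and helper names are illustrative assumptions). Note that scikit-learn's min_df parameter is similar in spirit but thresholds document frequency rather than total occurrence count, so the filter is written out explicitly.

```python
from collections import Counter

M = 3  # minimum number of occurrences in the training data (m in Joachims, 1997a)

# Hypothetical training corpus, purely for illustration.
train_docs = [
    "machine learning from text text text",
    "learning to rank text documents",
    "statistical learning theory",
]

# Count total occurrences of each word across the whole training collection.
total_counts = Counter(token for doc in train_docs for token in doc.split())

# F: the set of considered features, i.e. words occurring at least M times.
F = sorted(term for term, count in total_counts.items() if count >= M)
print(F)  # ['learning', 'text']

# Attribute-value representation: one count feature per word in F.
def to_bow_vector(doc: str) -> list[int]:
    counts = Counter(doc.split())
    return [counts[term] for term in F]

print(to_bow_vector("text about learning learning"))  # [2, 1]
```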
1993
- (Yarowsky, 1993) ⇒ David Yarowsky. (1993). “One Sense per Collocation.” In: Proceedings of the Workshop on Human Language Technology. doi:10.3115/1075671.1075731
- QUOTE:We discussed the implications of these results for data set creation and algorithm design, identifying potential weaknesses in the common “bag of words” approach to disambiguation.
1988
- (Salton, 1988) ⇒ Gerard M. Salton. (1988). “Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer.” Addison-Wesley. ISBN:0201122278
1954
- (Harris, 1954) ⇒ Zellig Harris. (1954). “Distributional Structure.” Word 10 (2/3)
- QUOTE: And this stock of combinations of elements becomes a factor in the way later choices are made … for language is not merely a bag of words but a tool with particular properties which have been fashioned in the course of its use.