Bag-of-Words Vectorization Task
A Bag-of-Words Vectorization Task is a corpus-based vectorization task that converts text items to bag-of-words vectors.
- AKA: Bag-of-Words Text Item Mapping.
- Context:
- It can range from (typically) being a Data-Driven Bag-of-Words Mapping Task to being a Heuristic Bag-of-Words Mapping Task.
- It can be solved by a Bag-of-Words Mapping System (that implements a bag-of-words mapping algorithm).
- See: Continuous Vector Space Mapping Task.
References
2014
- http://scikit-learn.org/stable/modules/feature_extraction.html#the-bag-of-words-representation
- QUOTE: … the most common ways to extract numerical features from text content, namely:
- tokenizing strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators.
- counting the occurrences of tokens in each document.
- normalizing and weighting with diminishing importance tokens that occur in the majority of samples / documents.
- In this scheme, features and samples are defined as follows:
- each individual token occurrence frequency (normalized or not) is treated as a feature.
- the vector of all the token frequencies for a given document is considered a multivariate sample.
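
The tokenizing and counting steps in the quote above can be illustrated with a minimal sketch in plain Python. It is not scikit-learn's implementation: the function names (`tokenize`, `build_vocabulary`, `vectorize`), the regular-expression tokenizer, and the two-document toy corpus are all assumptions made for illustration, and the normalizing/weighting step is left out.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, then keep runs of word characters,
    # so white-space and punctuation act as token separators.
    return re.findall(r"\w+", text.lower())


def build_vocabulary(documents):
    # Give every distinct token in the corpus an integer id (sorted order).
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def vectorize(text, vocabulary):
    # Count token occurrences and place each count at the token's id.
    # Tokens that are not in the vocabulary are ignored.
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        idx = vocabulary.get(tok)
        if idx is not None:
            vector[idx] = n
    return vector


# Toy corpus, assumed for illustration only.
corpus = ["the cat sat on the mat", "the dog sat"]
vocabulary = build_vocabulary(corpus)
vectors = [vectorize(doc, vocabulary) for doc in corpus]
```

Here each entry of `vectors` is one multivariate sample, and each position in it is the raw occurrence count of one token. Word order is discarded, which is what makes the representation a "bag" of words.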