Bag-of-Words Vectorization Task

From GM-RKB

A Bag-of-Words Vectorization Task is a corpus-based vectorization task that converts text items to bag-of-words vectors.

AKA: Bag-of-Words Text Item Mapping.
Context:
- It can range from being a Data-Driven Bag-of-Words Mapping Task (the typical case) to being a Heuristic Bag-of-Words Mapping Task.
- It can be solved by a Bag-of-Words Mapping System (that implements a bag-of-words mapping algorithm).
See: Continuous Vector Space Mapping Task.

References

2014

http://scikit-learn.org/stable/modules/feature_extraction.html#the-bag-of-words-representation
- QUOTE: … the most common ways to extract numerical features from text content, namely:
  - tokenizing strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators.
  - counting the occurrences of tokens in each document.
  - normalizing and weighting with diminishing importance tokens that occur in the majority of samples / documents.
- In this scheme, features and samples are defined as follows:
  - each individual token occurrence frequency (normalized or not) is treated as a feature.
  - the vector of all the token frequencies for a given document is considered a multivariate sample.
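- The three steps quoted above (tokenizing, counting, and weighting) can be sketched in plain Python. This is an illustrative sketch only, not the scikit-learn implementation: the helper names (tokenize, build_vocabulary, count_vector, idf_weights, tfidf_vector), the toy corpus, and the choice of a smoothed inverse-document-frequency weight followed by L2 normalization are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase, then treat every run of word characters as one token,
    # so whitespace and punctuation act as separators.
    return re.findall(r"\w+", text.lower())


def build_vocabulary(docs):
    # Give each distinct token an integer id (alphabetical order here).
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def count_vector(doc, vocab):
    # Count token occurrences; tokens outside the vocabulary are ignored.
    vec = [0] * len(vocab)
    for tok, n in Counter(tokenize(doc)).items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec


def idf_weights(docs, vocab):
    # Assumed weighting: smoothed idf = ln((1 + N) / (1 + df)) + 1,
    # which shrinks the weight of tokens found in most documents.
    n_docs = len(docs)
    df = [0] * len(vocab)
    for doc in docs:
        for tok in set(tokenize(doc)):
            df[vocab[tok]] += 1
    return [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]


def tfidf_vector(doc, vocab, idf):
    # Weight the raw counts, then scale the vector to unit L2 length.
    weighted = [c * w for c, w in zip(count_vector(doc, vocab), idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    return [x / norm for x in weighted] if norm else weighted


# Toy corpus (hypothetical data for illustration).
docs = ["the cat sat on the mat", "the dog sat", "the cat chased the dog"]
vocab = build_vocabulary(docs)
counts = [count_vector(d, vocab) for d in docs]
idf = idf_weights(docs, vocab)
tfidf = [tfidf_vector(d, vocab, idf) for d in docs]
```

- Each entry of counts or tfidf is one multivariate sample, with one feature per vocabulary token. In this toy corpus the token "the" occurs in every document, so it receives the smallest idf weight.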
