word2phrase System
A word2phrase System is an unsupervised lexing system developed by Mikolov et al. at Google, Inc.
References
2014
- https://code.google.com/p/word2vec/source/browse/trunk/word2phrase.c?r=3
#define MAX_STRING 60
const int vocab_hash_size = 500000000; // Maximum 500M entries in the vocabulary
typedef float real; // Precision of float numbers
- (Mikolov et al., 2014) ⇒ Tomáš Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. (2014). “Distributed Representations of Words and Phrases and their Compositionality.” In: Advances in Neural Information Processing Systems 26 (NIPS 2013).
- QUOTE: As discussed earlier, many phrases have a meaning that is not a simple composition of the meanings of its individual words. To learn vector representation for phrases, we first find words that appear frequently together, and infrequently in other contexts. For example, “New York Times” and “Toronto Maple Leafs” are replaced by unique tokens in the training data, while a bigram “this is” will remain unchanged. This way, we can form many reasonable phrases without greatly increasing the size of the vocabulary; in theory, we can train the Skip-gram model using all n-grams, but that would be too memory intensive. Many techniques have been previously developed to identify phrases in the text; however, it is out of scope of our work to compare them. We decided to use a simple data-driven approach, where phrases are formed based on the unigram and bigram counts, using: [math]\displaystyle{ \operatorname{score} (w_i, w_j) = \frac{\text{count} (w_i w_j) - \delta}{\text{count} (w_i) \times \text{count} (w_j) }. \quad (6) }[/math] The [math]\displaystyle{ \delta }[/math] is used as a discounting coefficient and prevents too many phrases consisting of very infrequent words to be formed. The bigrams with score above the chosen threshold are then used as phrases. Typically, we run 2-4 passes over the training data with decreasing threshold value, allowing longer phrases that consists of several words to be formed. We evaluate the quality of the phrase representations using a new analogical reasoning task that involves phrases. Table 2 shows examples of the five categories of analogies used in this task. This dataset is publicly available on the web [1].
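The scoring rule quoted above can be illustrated with a minimal C sketch. It is not taken from word2phrase.c: the function names `phrase_score` and `is_phrase`, the use of `double`, and the handling of zero counts are assumptions made for illustration only. It computes the discounted bigram score, with the bigram count minus the discount in the numerator and the product of the two unigram counts in the denominator, and compares it against a caller-chosen threshold.

```c
#include <assert.h>

/* Illustrative sketch only; names and types are hypothetical, not from word2phrase.c.
 * Computes (count(w_i w_j) - delta) / (count(w_i) * count(w_j)). */
static double phrase_score(long long count_bigram, long long count_wi,
                           long long count_wj, double delta) {
    /* Assumption: a negative discount is treated as a caller error. */
    assert(delta >= 0.0);
    /* Assumption: a pair involving an unseen word gets score 0 (never a phrase). */
    if (count_wi <= 0 || count_wj <= 0) {
        return 0.0;
    }
    return ((double)count_bigram - delta) /
           ((double)count_wi * (double)count_wj);
}

/* Returns 1 when the bigram scores strictly above the threshold, else 0. */
static int is_phrase(long long count_bigram, long long count_wi,
                     long long count_wj, double delta, double threshold) {
    return phrase_score(count_bigram, count_wi, count_wj, delta) > threshold;
}
```

With made-up counts, a bigram seen 100 times whose words occur 200 and 150 times scores (100 - 5) / (200 × 150) for a discount of 5, which passes a threshold of 0.001. The same words with a bigram count of 3 yield a negative score and are rejected, which is how the discount suppresses phrases built from rare co-occurrences. Repeating the pass over already-merged tokens with a lower threshold would then allow longer phrases to form, as the quoted passage describes.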