Skip-Gram NNLM Algorithm
A Skip-Gram NNLM Algorithm is an NNLM algorithm that uses skip-gram co-occurrence statistics (to predict surrounding words using the target word).
- Context:
- It can be a Skip-Gram with Negative-Sampling (SGNS).
- It can be applied by a Skip-Gram Word Embedding System (such as word2vec).
- It has training complexity proportional to [math]\displaystyle{ Q = C \times (D + D \times \log_2 (V)) }[/math], where C is the maximum distance of the words, D is the word-vector dimensionality, and V is the vocabulary size (Mikolov et al., 2013a).
- It has performance similar to CBOW NNLM.
- Example(s):
- Counter-Example(s):
- See: Skip-Gram Co-Occurrence Statistic.
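The training-complexity expression in the Context list can be evaluated directly. The following is a minimal sketch, assuming D denotes the word-vector dimensionality and V the vocabulary size as in (Mikolov et al., 2013a); the function name and the values D = 300 and V = 2^20 are illustrative assumptions, not figures from the source.

```python
import math


def skipgram_complexity(C: int, D: int, V: int) -> float:
    """Per-training-word cost Q = C * (D + D * log2(V)).

    C: maximum distance of the words (window size)
    D: word-vector dimensionality (assumed meaning)
    V: vocabulary size (assumed meaning)
    """
    return C * (D + D * math.log2(V))


# Hypothetical settings chosen only to make the arithmetic easy to follow.
q_small_window = skipgram_complexity(C=5, D=300, V=2**20)
q_large_window = skipgram_complexity(C=10, D=300, V=2**20)

# Doubling C doubles Q, since Q is linear in the window size.
print(q_small_window, q_large_window)
```

Because Q grows linearly in C and only logarithmically in V, widening the context window is the more expensive of the two choices under this cost model.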
References
2015
- (Vilnis & McCallum, 2015) ⇒ Luke Vilnis, and Andrew McCallum. (2015). “Word Representations via Gaussian Embedding.” In: arXiv preprint arXiv:1412.6623 submitted to ICLR 2015.
- QUOTE: In the word2vec Skip-Gram (Mikolov et al., 2013) word embedding model, the energy function takes the form of a dot product between the vectors of an observed word and an observed context [math]\displaystyle{ w^\top c }[/math]. The loss function is a binary logistic regression classifier that treats the score of a word and its observed context as the score of a positive example, and the score of a word and a randomly sampled context as the score of a negative example. ... In recent work, word2vec has been shown to be equivalent to factoring certain types of weighted pointwise mutual information matrices (Levy & Goldberg, 2014). In our work, we use a slightly different loss function than Skip-Gram word2vec embeddings. Our energy functions take on a more limited range of values than do vector dot products, and their dynamic ranges depend in complex ways on the parameters. Therefore, we had difficulty using the word2vec loss that treats scores of positive and negative pairs as positive and negative examples to a binary classifier, since this relies on the ability to push up on the energy surface in an absolute, rather than relative, manner.
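The loss described in the quote above (a dot-product score fed to a binary logistic classifier, with observed contexts as positives and randomly sampled contexts as negatives) can be illustrated for a single training pair. This is a minimal pure-Python sketch, not the word2vec implementation; the function names and the plain-list vector representation are assumptions made for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v) -> float:
    """Dot product of two equal-length vectors given as sequences."""
    return sum(a * b for a, b in zip(u, v))


def sgns_loss(word, context, negatives) -> float:
    """Negative-sampling loss for one (word, context) pair.

    word, context: vectors of the observed word and observed context
    negatives: vectors of randomly sampled (negative) contexts
    """
    # Positive example: push the score w . c up.
    loss = -log_sigmoid(dot(word, context))
    # Negative examples: push each score w . c_neg down.
    for neg in negatives:
        loss -= log_sigmoid(-dot(word, neg))
    return loss


# Hypothetical 2-dimensional vectors.
aligned = sgns_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
uninformed = sgns_loss([0.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
print(aligned < uninformed)
```

The loss falls when the word vector agrees with its observed context and disagrees with the sampled one, which is the absolute "push up / push down" behaviour the quote refers to.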
2013
- (Mikolov et al., 2013a) ⇒ Tomáš Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. (2013). “Efficient Estimation of Word Representations in Vector Space.” In: Proceedings of the International Conference on Learning Representations Workshop.
- QUOTE: The second architecture is similar to CBOW, but instead of predicting the current word based on the context, it tries to maximize classification of a word based on another word in the same sentence. More precisely, we use each current word as an input to a log-linear classifier with continuous projection layer, and predict words within a certain range before and after the current word. We found that increasing the range improves quality of the resulting word vectors, but it also increases the computational complexity. Since the more distant words are usually less related to the current word than those close to it, we give less weight to the distant words by sampling less from those words in our training examples.
The training complexity of this architecture is proportional to: [math]\displaystyle{ Q = C \times (D + D \times \log_2 (V)), \quad (5) }[/math] where C is the maximum distance of the words. Thus, if we choose C = 5, for each training word we will select randomly a number R in range <1; C>, and then use R words from history and R words from the future of the current word as correct labels. This will require us to do R × 2 word classifications, with the current word as input, and each of the R + R words as output. In the following experiments, we use C = 10.
- https://code.google.com/p/word2vec/
- This tool provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can be subsequently used in many natural language processing applications and for further research.
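The window sampling described in the (Mikolov et al., 2013a) quote above, where a number R is drawn from the range 1 to C for each training word and the R words before and after it serve as labels, can be sketched as follows. This is an illustrative sketch under stated assumptions and is not code from the word2vec tool; the function name, the token-list input, and the truncation of the window at sentence boundaries are assumptions.

```python
import random


def skipgram_pairs(tokens, C, rng=None):
    """Return (current_word, context_word) training pairs.

    tokens: one sentence as a list of words
    C: maximum distance of the words
    rng: optional random.Random instance for reproducibility
    """
    if rng is None:
        rng = random.Random()
    pairs = []
    for i, current in enumerate(tokens):
        # Draw the effective window R uniformly from 1..C (inclusive).
        R = rng.randint(1, C)
        start = max(0, i - R)                # assumed: clip at sentence start
        stop = min(len(tokens), i + R + 1)   # assumed: clip at sentence end
        for j in range(start, stop):
            if j != i:
                pairs.append((current, tokens[j]))
    return pairs


sentence = ["the", "quick", "brown", "fox", "jumps"]
for pair in skipgram_pairs(sentence, C=2, rng=random.Random(0)):
    print(pair)
```

Since R is at most C, a context word at distance d is included only when R is at least d, so more distant words are sampled less often, which matches the weighting the quote describes.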