Skip-Gram NNLM Algorithm
Latest revision as of 13:27, 2 August 2022
A Skip-Gram NNLM Algorithm is an NNLM algorithm that uses skip-gram co-occurrence statistics (to predict surrounding words using the target word).
- Context:
- It can be a Skip-Gram with Negative-Sampling (SGNS).
- It can be applied by a Skip-Gram Word Embedding System (such as word2vec).
- It has training complexity proportional to [math]\displaystyle{ Q = C \times (D + D \times \log_2 (V)) }[/math], where C is the maximum distance of the words, D is the word-representation dimensionality, and V is the vocabulary size.
- It has performance similar to CBOW NNLM.
- Example(s):
- https://code.google.com/p/word2vec/source/browse/trunk/word2vec.c?spec=svn42&r=42#482
- Counter-Example(s):
- Continuous-BoW Word Embedding Algorithm.
- See: Skip-Gram Co-Occurrence Statistic.
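The training-complexity expression in the Context section can be made concrete with a small numeric sketch (the hyperparameter values below are illustrative assumptions, not taken from the source):

```python
import math

def skipgram_complexity(C, D, V):
    """Per-word training cost Q = C * (D + D * log2(V)) for a skip-gram
    model with a hierarchical-softmax output (Mikolov et al., 2013)."""
    return C * (D + D * math.log2(V))

# Illustrative values: window C=10, dimensionality D=300, vocabulary V=1e6.
Q = skipgram_complexity(C=10, D=300, V=1_000_000)
print(f"Q = {Q:,.0f} operations per training word")
```

Note how the cost scales only logarithmically in the vocabulary size V, which is what makes the hierarchical-softmax formulation tractable for large vocabularies.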
References
2015
- (Vilnis & McCallum, 2015) ⇒ Luke Vilnis, and Andrew McCallum. (2015). “Word Representations via Gaussian Embedding.” In: arXiv preprint arXiv:1412.6623 submitted to ICLR 2015.
- QUOTE: In the word2vec Skip-Gram (Mikolov et al., 2013) word embedding model, the energy function takes the form of a dot product between the vectors of an observed word and an observed context [math]\displaystyle{ w^\text{T} c }[/math]. The loss function is a binary logistic regression classifier that treats the score of a word and its observed context as the score of a positive example, and the score of a word and a randomly sampled context as the score of a negative example. ... In recent work, word2vec has been shown to be equivalent to factoring certain types of weighted pointwise mutual information matrices (Levy & Goldberg, 2014). In our work, we use a slightly different loss function than Skip-Gram word2vec embeddings. Our energy functions take on a more limited range of values than do vector dot products, and their dynamic ranges depend in complex ways on the parameters. Therefore, we had difficulty using the word2vec loss that treats scores of positive and negative pairs as positive and negative examples to a binary classifier, since this relies on the ability to push up on the energy surface in an absolute, rather than relative, manner.
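The loss described in the quote — dot-product scores fed to a binary logistic classifier over observed and randomly sampled (word, context) pairs — can be sketched in NumPy as follows (the vector values, dimensionality, and sample count are illustrative assumptions):

```python
import numpy as np

def sgns_loss(w, c_pos, c_negs):
    """Skip-gram negative-sampling loss for one (word, context) observation.

    w      : embedding of the target word, shape (D,)
    c_pos  : embedding of an observed context word, shape (D,)
    c_negs : embeddings of k randomly sampled context words, shape (k, D)
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # The observed pair is scored as a positive example ...
    loss = -np.log(sigmoid(w @ c_pos))
    # ... and each sampled pair as a negative example.
    loss -= np.sum(np.log(sigmoid(-(c_negs @ w))))
    return loss

rng = np.random.default_rng(0)
D, k = 50, 5
w, c_pos = rng.normal(size=D), rng.normal(size=D)
c_negs = rng.normal(size=(k, D))
print(sgns_loss(w, c_pos, c_negs))  # scalar; smaller when w aligns with c_pos
```

Minimizing this loss pushes the energy (dot product) of observed pairs up and of sampled pairs down, which is exactly the relative ranking behavior the quote contrasts with the authors' own bounded energy functions.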
2013
- (Mikolov et al., 2013a) ⇒ Tomáš Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. (2013). “Efficient Estimation of Word Representations in Vector Space.” In: Proceedings of International Conference of Learning Representations Workshop.
- QUOTE: The second architecture is similar to CBOW, but instead of predicting the current word based on the context, it tries to maximize classification of a word based on another word in the same sentence. More precisely, we use each current word as an input to a log-linear classifier with continuous projection layer, and predict words within a certain range before and after the current word. We found that increasing the range improves quality of the resulting word vectors, but it also increases the computational complexity. Since the more distant words are usually less related to the current word than those close to it, we give less weight to the distant words by sampling less from those words in our training examples.
The training complexity of this architecture is proportional to: [math]\displaystyle{ Q = C \times (D + D \times \log_2 (V)); \ (5) }[/math] where C is the maximum distance of the words. Thus, if we choose C = 5, for each training word we will select randomly a number R in range < 1; C >, and then use R words from history and R words from the future of the current word as correct labels. This will require us to do [math]\displaystyle{ R \times 2 }[/math] word classifications, with the current word as input, and each of the R + R words as output. In the following experiments, we use C = 10.
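The dynamic window sampling described in the quote — draw R uniformly from < 1; C >, then use the R preceding and R following words as labels — can be sketched as follows (function and variable names are illustrative, not from the word2vec source):

```python
import random

def skipgram_pairs(sentence, C=5, seed=42):
    """Yield (input_word, context_word) training pairs with a dynamically
    shrunk window, as described in Mikolov et al. (2013)."""
    rng = random.Random(seed)
    for i, word in enumerate(sentence):
        R = rng.randint(1, C)  # effective window for this position
        for j in range(max(0, i - R), min(len(sentence), i + R + 1)):
            if j != i:
                yield word, sentence[j]

pairs = list(skipgram_pairs(["the", "quick", "brown", "fox", "jumps"], C=2))
# Each pair uses the current word as input and a nearby word as the label.
```

Because R is resampled per position, distant words appear as labels less often than adjacent ones, which implements the down-weighting of distant context the quote describes.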
- https://code.google.com/p/word2vec/
- This tool provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can be subsequently used in many natural language processing applications and for further research.