Skip-Gram NNLM Algorithm

A [[Skip-Gram NNLM Algorithm]] is an [[NNLM algorithm]] that uses [[skip-gram co-occurrence statistic]]s (to predict [[surrounding word]]s using the [[target word]]).
* <B>Context:</B>
** It can be a [[Skip-Gram with Negative-Sampling (SGNS)]].
** It can be implemented by a [[Skip-Gram Word Embedding System]] (such as [[word2vec]]).
** It has [[training complexity]] proportional to <math>Q = C \times (D + D \times \log_2 (V))</math>, where <math>C</math> is the [[maximum distance]] of the words, <math>D</math> is the [[word vector]] dimensionality, and <math>V</math> is the [[vocabulary]] size (see the sketch after this outline).
** It has performance similar to [[CBOW NNLM]].
* <B>Example(s):</B>
** https://code.google.com/p/word2vec/source/browse/trunk/word2vec.c?spec=svn42&r=42#482
** [[File:skip-gram_NNLM_architecture.150216.jpg|300px]].
** http://image.slidesharecdn.com/cikm-keynote-nov2014-141125182455-conversion-gate01/95/large-scale-deep-learning-jeff-dean-51-638.jpg
* <B>Counter-Example(s):</B>
** [[Continuous-BoW Word Embedding Algorithm]].
* <B>See:</B> [[Skip-Gram Co-Occurrence Statistic]].
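Below is a minimal Python sketch (not part of the [[word2vec]] implementation linked under the examples) of how a skip-gram model turns each [[target word]] into (target, [[surrounding word|context]]) training pairs, together with the training-complexity formula above. The function names <code>skipgram_pairs</code> and <code>training_complexity</code> and the parameter values (C, D, V) in the usage lines are illustrative assumptions.
<syntaxhighlight lang="python">
import math
import random

def skipgram_pairs(tokens, C=5):
    """Generate (target, context) training pairs with a window R drawn
    uniformly from [1, C] for each position, so more distant words are
    sampled less often (as described in Mikolov et al., 2013)."""
    pairs = []
    for i, target in enumerate(tokens):
        R = random.randint(1, C)                      # effective window for this position
        for j in range(max(0, i - R), min(len(tokens), i + R + 1)):
            if j != i:
                pairs.append((target, tokens[j]))     # current word predicts a surrounding word
    return pairs

def training_complexity(C, D, V):
    """Per-word training cost Q = C * (D + D * log2(V)) of the skip-gram
    architecture with a hierarchical-softmax output layer."""
    return C * (D + D * math.log2(V))

print(skipgram_pairs("the quick brown fox jumps".split(), C=2))
print(training_complexity(C=10, D=300, V=1_000_000))  # roughly 6.3e4 operations per word
</syntaxhighlight>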
----
----
== References ==


=== 2015 ===
* ([[2015_WordRepresentationsviaGaussianE|Vilnis & McCallum, 2015]]) &rArr; [[Luke Vilnis]], and [[Andrew McCallum]]. ([[2015]]). "[http://arxiv.org/pdf/1412.6623v1.pdf Word Representations via Gaussian Embedding]." In: arXiv preprint arXiv:1412.6623, submitted to ICLR 2015.
** QUOTE: In the [[Skip-Gram NNLM Algorithm|word2vec Skip-Gram]] ([[Mikolov et al., 2013]]) [[word embedding model]], the [[energy function]] takes the form of a [[dot product]] between the [[vectors of an observed word]] and an [[observed context]] <math>w^\text{T}c</math>. </s> The [[loss function]] is a [[binary logistic regression classifier]] that treats the score of a word and its [[observed]] [[word token context|context]] as the score of a [[positive example]], and the [[score]] of a [[word token|word]] and a [[randomly sampled]] [[word token context|context]] as the [[score]] of a [[negative example]]. </s> ... In recent work, [[word2vec]] has been shown to be equivalent to [[factoring]] certain types of [[weighted pointwise mutual information matrice]]s ([[Levy & Goldberg, 2014]]). </s> In [[2015_WordRepresentationsviaGaussianE|our work]], we use a slightly different [[loss function]] than [[Skip-Gram word2vec embeddings]]. </s> Our [[energy function]]s take on a more limited range of values than do [[vector dot product]]s, and their dynamic ranges depend in complex ways on the parameters. </s> Therefore, we had difficulty using the [[word2vec loss]] that treats scores of [[positive example|positive]] and [[negative example|negative]] [[word context pair|pair]]s as positive and [[negative example]]s to a [[binary classifier]], since this relies on the ability to push up on the [[energy surface]] in an [[absolute energy|absolute]], rather than [[relative energy|relative]], manner. </s>
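The [[Skip-Gram with Negative-Sampling (SGNS)]] loss described in this quote can be sketched as follows. This is an illustrative Python snippet with assumed variable names and embedding sizes, not the model of [[2015_WordRepresentationsviaGaussianE|Vilnis & McCallum (2015)]] nor the [[word2vec]] code itself: the pair score is the dot product <math>w^\text{T}c</math>, the observed context is a positive example, and each randomly sampled context is a negative example.
<syntaxhighlight lang="python">
import numpy as np

def sgns_loss(w, c_pos, c_negs):
    """Skip-gram negative-sampling loss for one target word w:
    maximize sigmoid(w.c_pos) for the observed context and
    sigmoid(-w.c_neg) for each randomly sampled context."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    loss = -np.log(sigmoid(w @ c_pos))            # push the observed pair's score up
    for c_neg in c_negs:
        loss -= np.log(sigmoid(-(w @ c_neg)))     # push each sampled pair's score down
    return loss

# toy usage with assumed 50-dimensional embeddings and 5 negative samples
rng = np.random.default_rng(0)
w, c_pos = rng.normal(size=50), rng.normal(size=50)
c_negs = rng.normal(size=(5, 50))
print(sgns_loss(w, c_pos, c_negs))
</syntaxhighlight>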


=== 2013 ===
* ([[2013_EfficientEstimationofWordRepres|Mikolov et al., 2013a]]) &rArr; [[Tomáš Mikolov]], [[Kai Chen]], [[Greg Corrado]], and [[Jeffrey Dean]]. ([[2013]]). "[http://arxiv.org/pdf/1301.3781 Efficient Estimation of Word Representations in Vector Space]." In: Proceedings of the International Conference on Learning Representations (ICLR) Workshop.
** QUOTE: The [[Skip-Gram NNLM Algorithm|second architecture]] is similar to [[CBOW]], but instead of predicting the current word based on the context, it tries to [[maximize classification]] of a word based on another word in the same [[sentence]]. </s> More precisely, [[Skip-Gram NNLM Algorithm|we]] use each current word as an input to a [[log-linear classifier]] with [[continuous projection layer]], and [[predict word]]s within a certain range before and after the current word. </s> We found that increasing the range improves quality of the resulting [[word vector]]s, but it also increases the [[computational complexity]]. </s> Since the more distant words are usually less related to the current word than those close to it, we give less weight to the distant words by sampling less from those words in our [[training examples]]. </s> <P> The training complexity of [[Skip-Gram NNLM Algorithm|this architecture]] is proportional to: <math>Q = C \times (D + D \times \log_2 (V)); \ (5) </math> where C is the [[maximum distance]] of the words. </s> Thus, if we choose C = 5, for each training word we will select randomly a number R in range < 1; C >, and then use R words from history and [[R word]]s from the future of the current word as [[correct label]]s. </s> This will require us to do <math>R \times 2</math> [[word classification]]s, with the current word as input, and each of the R + R words as [[output]]. </s> In the following [[experiment]]s, we use C = 10. </s>
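As a minimal sketch of the [[log-linear classifier]] with a [[continuous projection layer]] described in this quote: the current word is looked up in an input projection matrix and scored against every vocabulary word. The vocabulary size and dimensionality below are assumptions, and a full softmax stands in for the [[hierarchical softmax]] or [[negative sampling]] that [[word2vec]] uses to avoid the O(V) output cost.
<syntaxhighlight lang="python">
import numpy as np

V, D = 10_000, 100                                 # assumed vocabulary size and dimensionality
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.01, size=(V, D))         # continuous projection layer (input word vectors)
W_out = rng.normal(scale=0.01, size=(D, V))        # output layer weights

def predict_context(word_id):
    """Log-linear classifier: project the current word and score every
    vocabulary word as a possible surrounding word."""
    h = W_in[word_id]                              # projection of the current word
    scores = h @ W_out                             # log-linear scores over the vocabulary
    exp = np.exp(scores - scores.max())            # numerically stable softmax
    return exp / exp.sum()

probs = predict_context(42)
print(probs.shape, round(probs.sum(), 6))          # (10000,) 1.0
</syntaxhighlight>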
<BR>
* https://code.google.com/p/word2vec/
** [[word2vec System|This tool]] provides an efficient implementation of the [[Continuous Bag-of-Words Word Embedding Algorithm|continuous bag-of-words]] and [[Skip-Gram NNLM Algorithm|skip-gram architecture]]s for computing [[vector representations of words]]. These representations can be subsequently used in many [[natural language processing application]]s and for further research.


----
__NOTOC__
[[Category:Concept]]
