Shifted PPMI Matrix
A Shifted PPMI Matrix (SPPMI Matrix) is a word-context PMI matrix in which each cell, for word [math]\displaystyle{ w }[/math] and context [math]\displaystyle{ c }[/math], is computed as [math]\displaystyle{ \operatorname{max}(0.0, PMI(w, c) - \log(k)) }[/math], where [math]\displaystyle{ k }[/math] is the number of negative samples.
- …
- Counter-Example(s):
- See: Weighted Matrix, SPPMI-SVD Algorithm.
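A minimal NumPy sketch of this computation (the function name and the handling of zero counts are illustrative choices, not taken from the cited sources) builds an SPPMI matrix from a word-context co-occurrence count matrix:

```python
import numpy as np

def sppmi_matrix(counts, k=1):
    """Shifted Positive PMI matrix from a word-context co-occurrence count matrix.

    counts[w, c] = #wc, the number of times word w and context c co-occurred;
    k is the number of negative samples, so the shift is log(k).
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()                                    # sum over all (w, c) of #wc
    word_marginals = counts.sum(axis=1, keepdims=True)      # sum_c #wc for each word
    context_marginals = counts.sum(axis=0, keepdims=True)   # sum_w #wc for each context
    expected = word_marginals * context_marginals           # broadcasts to the full matrix shape
    sppmi = np.zeros_like(counts)
    observed = counts > 0                                   # PMI is only defined for observed pairs
    pmi = np.log(counts[observed] * total / expected[observed])
    sppmi[observed] = np.maximum(0.0, pmi - np.log(k))
    return sppmi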
References
2014
- (Řehůřek, 2014a) ⇒ Radim Řehůřek. (2014-12-23). “Making Sense of word2vec.” http://radimrehurek.com/2014/12/making-sense-of-word2vec/
- QUOTE: In a manner analogous to GloVe, Levy and Goldberg analyzed the objective function of word2vec with negative sampling. ... “SPPMI” name simply reflects that we’re subtracting [math]\displaystyle{ \log(k) }[/math] from PMI (“shifting”) ... the word x context objective “source” matrix is computed differently to GloVe. Each matrix cell, corresponding to word w and context word c, is computed as [math]\displaystyle{ \operatorname{max}(0.0, PMI(w, c) - \log(k)) }[/math], where k is the number of negative samples in word2vec (for example, [math]\displaystyle{ k=10 }[/math]). PMI is the standard pointwise mutual information — if we use the notation that word w and context c occurred together #wc times in the training corpus, then [math]\displaystyle{ PMI(w, c) = \log \frac{\#wc \cdot \sum_{w,c}\#wc}{\sum_c\#wc \cdot \sum_w\#wc} }[/math] (no smoothing).
The ... “SPPMI” name simply reflects that we’re subtracting [math]\displaystyle{ \log(k) }[/math] from PMI (“shifting”) and that we’re taking the [math]\displaystyle{ \operatorname{max}(0.0, SPMI) }[/math] (“positive”; should be non-negative, really). So, Shifted Positive Pointwise Mutual Information.
For example, for the same nine texts we used above and k=1, the SPPMI matrix looks like this:
[[ 0.    0.83  0.83  0.49  0.49  0.    0.49  0.13  0.    0.    0.    0.  ]
 [ 0.83  0.    1.16  0.    0.    0.83  0.    0.    0.98  0.    0.    0.  ]
 [ 0.83  1.16  0.    0.    0.    0.13  0.    0.47  0.98  0.    0.    0.  ]
 [ 0.49  0.    0.    0.    0.49  0.    1.18  0.83  0.    0.    0.    0.  ]
 [ 0.49  0.    0.    0.49  0.    0.    0.49  0.13  0.    0.    0.83  1.05]
 [ 0.    0.83  0.13  0.    0.    0.    0.    0.13  1.05  0.    0.    0.  ]
 [ 0.49  0.    0.    1.18  0.49  0.    0.    0.83  0.    0.    0.    0.  ]
 [ 0.13  0.    0.47  0.83  0.13  0.13  0.83  0.    0.29  0.    0.    0.  ]
 [ 0.    0.98  0.98  0.    0.    1.05  0.    0.29  0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    2.37  1.9 ]
 [ 0.    0.    0.    0.    0.83  0.    0.    0.    0.    2.37  0.    2.08]
 [ 0.    0.    0.    0.    1.05  0.    0.    0.    0.    1.9   2.08  0.  ]]
- No neural network training, no parameter tuning, we can directly take rows of this SPPMI matrix to be the word vectors.
... The SPPMI-SVD method simply factorizes the sparse SPPMI matrix using Singular Value Decomposition (SVD), rather than the gradient descent methods of word2vec/GloVe, and uses the (dense) left singular vectors as the final word embeddings.
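A minimal SciPy sketch of this pipeline (the helper names, the embedding dimensionality, and the cosine-similarity utility are illustrative; `sppmi` is assumed to be a matrix like the one built in the earlier sketch, with one row per vocabulary word):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def sppmi_svd_embeddings(sppmi, dim=100):
    """Factorize a (sparse) SPPMI matrix with a rank-`dim` truncated SVD and
    return dense word embeddings built from the left singular vectors.
    `dim` must be smaller than both matrix dimensions."""
    u, s, _vt = svds(csr_matrix(sppmi), k=dim)
    order = np.argsort(-s)          # sort components by descending singular value
    return u[:, order]

def cosine(u, v):
    """Cosine similarity between two word vectors (assumes non-zero rows)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical usage, with `word2id` an illustrative mapping from token to row index:
#   sparse_vectors = sppmi                        # rows of SPPMI used directly as word vectors
#   dense_vectors = sppmi_svd_embeddings(sppmi)   # SPPMI-SVD embeddings
#   sim = cosine(dense_vectors[word2id["graph"]], dense_vectors[word2id["trees"]])
```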
- (Levy & Goldberg, 2014) ⇒ Omer Levy, and Yoav Goldberg. (2014). “Neural Word Embedding As Implicit Matrix Factorization.” In: Advances in Neural Information Processing Systems (NIPS 2014).
- QUOTE: We show that using a sparse Shifted Positive PMI word-context matrix to represent words improves results on two word similarity tasks and one of two analogy tasks.