CBOW NNLM Algorithm
A CBOW NNLM Algorithm is an NNLM algorithm that predicts a target word from the other words in its context window, which serve as the network inputs (a minimal sketch follows the list below).
- Context:
- It can (typically) share its input weights across the different context-window positions.
- It can (typically) use a linear hidden (projection) layer, which is therefore often not shown as a separate layer in the neural network architecture.
- It can be a CBOW NNLM with Negative Sampling Algorithm.
- It can include Vocabulary Downsampling.
- …
- Example(s):
- the CBOW architecture implemented in the word2vec System.
- Counter-Example(s):
- a Skip-Gram NNLM Algorithm.
- See: word2vec System.
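The following is a minimal illustrative sketch (not the word2vec implementation) of the properties listed above: one input weight matrix shared across all context positions, a linear projection that simply averages the context word vectors, and a softmax prediction of the middle word. The vocabulary size, dimensionality, and weight initialization are toy assumptions chosen for illustration.

```python
# Minimal CBOW forward-pass sketch (illustrative only; not the word2vec C code).
# The same input matrix W_in is shared across all context positions, and the
# projection is linear: context word vectors are averaged, not concatenated.
import numpy as np

rng = np.random.default_rng(0)

V = 10      # toy vocabulary size (assumption for illustration)
D = 8       # projection / embedding dimensionality (assumption)
W_in = rng.normal(scale=0.1, size=(V, D))    # shared input -> projection weights
W_out = rng.normal(scale=0.1, size=(D, V))   # projection -> output weights

def cbow_forward(context_ids):
    """Return a probability distribution over the vocabulary for the middle
    word, given the ids of the surrounding context words (order-insensitive)."""
    h = W_in[context_ids].mean(axis=0)       # linear projection: average of context vectors
    scores = h @ W_out                       # one score per vocabulary word
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                   # softmax over the full vocabulary

# Example: predict the middle word from four history and four future words.
context = [1, 2, 3, 4, 6, 7, 8, 9]
probs = cbow_forward(context)
print(probs.argmax(), probs.max())
```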
References
2013
- https://code.google.com/p/word2vec/
- This tool provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can be subsequently used in many natural language processing applications and for further research.
- (Mikolov et al., 2013a) ⇒ Tomáš Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. (2013). “Efficient Estimation of Word Representations in Vector Space.” In: Proceedings of the International Conference on Learning Representations (ICLR) Workshop.
- QUOTE: The first proposed architecture is similar to the feedforward NNLM, where the non-linear hidden layer is removed and the projection layer is shared for all words (not just the projection matrix); thus, all words get projected into the same position (their vectors are averaged). We call this architecture a bag-of-words model as the order of words in the history does not influence the projection. Furthermore, we also use words from the future; we have obtained the best performance on the task introduced in the next section by building a log-linear classifier with four future and four history words at the input, where the training criterion is to correctly classify the current (middle) word. Training complexity is then: [math]\displaystyle{ Q = N \times D + D \times \log_2(V) \quad (4) }[/math] We denote this model further as CBOW, as unlike standard bag-of-words model, it uses continuous distributed representation of the context. The model architecture is shown at Figure 1. Note that the weight matrix between the input and the projection layer is shared for all word positions in the same way as in the NNLM.
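As a rough illustration of the quoted complexity term, the sketch below evaluates Q = N × D + D × log₂(V) with N = 8 (the four history plus four future context words mentioned in the quote); the values of D and V are illustrative assumptions, not figures taken from the paper.

```python
# Worked example of the quoted CBOW per-example training complexity,
# Q = N*D + D*log2(V).  N = 8 follows from the quoted setup (4 history +
# 4 future words); D and V below are assumed values for illustration.
from math import log2

N = 8          # number of context words fed to the model
D = 300        # projection dimensionality (assumption)
V = 1_000_000  # vocabulary size (assumption)

Q = N * D + D * log2(V)
print(f"Q ~ {Q:,.0f} operations per training example")  # ~ 8,379
```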