Continuous Dense Distributional Word Model Training Algorithm
A Continuous Dense Distributional Word Model Training Algorithm is a text-item embedding algorithm that is a dense distributional model training algorithm that can be implemented by a continuous dense distributional word model training system (to solve a continuous dense distributional word model training task). A minimal illustrative training sketch is shown below, after the See list.
- AKA: Word Embedding Algorithm.
- Context:
- …
- Example(s): an SGNS Algorithm, …
- Counter-Example(s): a Sparse Distributional Word Model Training Algorithm, …
- See: Embedding Algorithm, SGNS Algorithm, Word Embeddings, Distributional Word Model Training Algorithm, Continuous Dense Word Model, Text Item, Dense Word Model Training Algorithm.
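The following is a minimal, self-contained sketch (not the reference word2vec implementation) of one such algorithm: skip-gram with negative sampling (SGNS) trained with plain SGD on a toy corpus. The corpus, dimensionality, window size, learning rate, and the uniform negative-sampling distribution are illustrative simplifications.

```python
# Minimal SGNS sketch with NumPy only; all names and hyperparameters here are
# illustrative choices, not part of any cited implementation.
import numpy as np

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat and a dog played".split(),
]

# Build the vocabulary and an integer index for each word type.
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
V, dim, window, k, lr, epochs = len(vocab), 16, 2, 5, 0.05, 200

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, dim))   # target-word ("input") vectors
W_out = np.zeros((V, dim))                    # context-word ("output") vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(epochs):
    for sent in corpus:
        for pos, word in enumerate(sent):
            t = idx[word]
            lo, hi = max(0, pos - window), min(len(sent), pos + window + 1)
            for cpos in range(lo, hi):
                if cpos == pos:
                    continue
                c = idx[sent[cpos]]
                # One positive pair plus k uniformly drawn negatives
                # (word2vec instead samples from a smoothed unigram distribution).
                negs = rng.integers(0, V, size=k)
                for ctx, label in [(c, 1.0)] + [(int(n), 0.0) for n in negs]:
                    v_t, v_c = W_in[t], W_out[ctx]
                    grad = sigmoid(v_t @ v_c) - label   # d loss / d (v_t . v_c)
                    d_t, d_c = grad * v_c, grad * v_t   # computed before updating
                    W_in[t] -= lr * d_t
                    W_out[ctx] -= lr * d_c

# The learned dense word embeddings are the rows of W_in.
print({w: np.round(W_in[idx[w]][:4], 2) for w in ("cat", "dog")})
```

The dense word vectors are the rows of W_in; the full word2vec training procedure additionally uses subsampling of frequent words and a smoothed unigram noise distribution, which are omitted from this sketch.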
References
2014
- (Levy & Goldberg, 2014) ⇒ Omer Levy, and Yoav Goldberg. (2014). “Neural Word Embedding As Implicit Matrix Factorization.” In: Advances in Neural Information Processing Systems.
- QUOTE: Recently, there has been a surge of work proposing to represent words as dense vectors, derived using various training methods inspired from neural-network language modeling [3, 9, 23, 21].
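(A compressed statement of that paper's main result, added here for orientation: with k negative samples, the SGNS objective is optimized, given unconstrained dimensionality, when the word and context vectors satisfy a shifted pointwise mutual information identity.)

```latex
% SGNS as implicit matrix factorization (Levy & Goldberg, 2014), compressed:
% at the optimum, for word w and context c, with k negative samples
\vec{w} \cdot \vec{c} \;=\; \operatorname{PMI}(w, c) - \log k
                     \;=\; \log \frac{P(w, c)}{P(w)\, P(c)} - \log k
```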
- (Mikolov et al., 2014) ⇒ Tomáš Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. (2014). “Distributed Representations of Words and Phrases and their Compositionality.” In: Advances in Neural Information Processing Systems, 26.
2013
- (Chelba et al., 2013) ⇒ Ciprian Chelba, Tomáš Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. (2013). “One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling.” Technical Report, Google Research.
- (Mikolov et al., 2013a) ⇒ Tomáš Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. (2013). “Efficient Estimation of Word Representations in Vector Space.” In: CoRR, abs/1301.3781.
- (Mikolov et al., 2013b) ⇒ Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. (2013). “Linguistic Regularities in Continuous Space Word Representations.” In: HLT-NAACL.
2003
- (Bengio et al., 2003a) ⇒ Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. (2003). “A Neural Probabilistic Language Model.” In: The Journal of Machine Learning Research, 3.
- QUOTE: A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences.
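(For orientation, a compressed rendering of the model described in that paper: each word w is mapped to a shared feature vector C(w); the concatenation x of the context words' feature vectors feeds a one-hidden-layer network, and next-word probabilities come from a softmax over the scores y. The direct input-to-output connections W are the paper's optional variant.)

```latex
% Neural probabilistic language model (Bengio et al., 2003), compressed:
% x = (C(w_{t-1}), \ldots, C(w_{t-n+1}))  -- concatenated word feature vectors
y = b + W x + U \tanh(d + H x), \qquad
\hat{P}(w_t = i \mid w_{t-1}, \ldots, w_{t-n+1}) = \frac{e^{y_i}}{\sum_j e^{y_j}}
```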
1986
- (Hinton, 1986) ⇒ Geoffrey E. Hinton. (1986). “Learning Distributed Representations of Concepts.” In: Proceedings of the Eighth Annual Conference of the Cognitive Science Society.
- QUOTE: Concepts can be represented by distributed patterns of activity in networks of neuron-like units. One advantage of this kind of representation is that it leads to automatic generalization.