Distributional Co-Occurrence Text-Item Vector
A Distributional Co-Occurrence Text-Item Vector is a text-item vector based on neighborhood context and a distributional semantics heuristic.
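The neighborhood-context heuristic can be illustrated with a minimal Python sketch (an illustration, not part of the source definition): each word is represented by counts of the words that co-occur with it inside a fixed-size neighborhood window, so words with similar neighborhoods receive similar vectors. The toy corpus, window size, and names below are assumptions.
```python
import numpy as np

corpus = [
    "the king rules the land".split(),
    "the queen rules the realm".split(),
]
window = 2  # assumed neighborhood size

vocab = sorted({w for sent in corpus for w in sent})
w2i = {w: i for i, w in enumerate(vocab)}

# co-occurrence counts: row = target word, column = context word seen in its window
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[w2i[w], w2i[sent[j]]] += 1

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# words that occur in similar neighborhoods end up with similar vectors
print("king :", counts[w2i["king"]])
print("queen:", counts[w2i["queen"]])
print("cosine(king, queen) =", round(cosine(counts[w2i["king"]], counts[w2i["queen"]]), 3))
```
In practice, such sparse count vectors are often compressed into a dense, lower-dimensional space (see Dimensionally Compressed Space below), which is how dense vectors like the “King” example arise.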
- Context:
- It can range from being a Distributional Word Vector, to being a Distributional Phrase Vector, to being a Distributional Sentence Vector, to being a Distributional Document Vector, ...
- It can be associated with a Distributional Text-Item Vector Space, defined by a distributional text-item vectorizing function.
- It can range from being a Continuous Distributional Text-Item Vector to being a Discrete Distributional Text-Item Vector.
- It can range from being a Dense Distributional Text-Item Vector to being a Sparse Distributional Text-Item Vector.
- …
- Example(s):
- Distributional Word Vector (a word vector), such as [0.538, 0.019, ..., 0.830] for “King”.
- Distributional Phrase Vector (a phrase vector).
- Distributional Sentence Vector (a sentence vector).
- Distributional Paragraph Vector (a paragraph vector).
- Distributional Document Vector (a document vector).
- …
- Counter-Example(s):
- See: Distributional Text-Item Vector Model Creation, Harris' Hypothesis, Feature Learning, Hidden Neural Network Layer, Dimensionally Compressed Space.
References
2015
- (Dai et al., 2015) ⇒ Andrew M. Dai, Christopher Olah, and Quoc V. Le. (2015). “Document Embedding with Paragraph Vectors.” In: NIPS Deep Learning Workshop.
2014
- (Le & Mikolov, 2014) ⇒ Quoc V. Le, and Tomáš Mikolov. (2014). “Distributed Representations of Sentences and Documents.” In: Proceedings of The 31st International Conference on Machine Learning (ICML 2014).
- QUOTE: In this paper, we propose Paragraph Vector, an unsupervised framework that learns continuous distributed vector representations for pieces of texts. The texts can be of variable-length, ranging from sentences to documents. The name Paragraph Vector is to emphasize the fact that the method can be applied to variable-length pieces of texts, anything from a phrase or sentence to a large document. In our model, the vector representation is trained to be useful for predicting words in a paragraph. More precisely, we concatenate the paragraph vector with several word vectors from a paragraph and predict the following word in the given context. Both word vectors and paragraph vectors are trained by the stochastic gradient descent and backpropagation (Rumelhart et al., 1986). While paragraph vectors are unique among paragraphs, the word vectors are shared. At prediction time, the paragraph vectors are inferred by fixing the word vectors and training the new paragraph vector until convergence.
Our technique is inspired by the recent work in learning vector representations of words using neural networks (Bengio et al., 2006; Collobert & Weston, 2008; Mnih & Hinton, 2008; Turian et al., 2010; Mikolov et al., 2013a;c). In their formulation, each word is represented by a vector which is concatenated or averaged with other word vectors in a context, and the resulting vector is used to predict other words in the context. For example, the neural network language model proposed in (Bengio et al., 2006) uses the concatenation of several previous word vectors to form the input of a neural network, and tries to predict the next word. The outcome is that after the model is trained, the word vectors are mapped into a vector space such that semantically similar words have similar vector representations (e.g., “strong” is close to “powerful”).
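The quoted training scheme can be sketched in a few lines of NumPy. The toy example below illustrates the described PV-DM idea under assumed hyperparameters and a made-up two-paragraph corpus; it is not the authors' implementation. A paragraph vector is concatenated with the vectors of the n preceding words, the following word is predicted through a softmax layer, and all vectors are updated by stochastic gradient descent.
```python
import numpy as np

rng = np.random.default_rng(0)

paragraphs = [
    "the king rules the land".split(),
    "the queen rules the realm".split(),
]
vocab = sorted({w for p in paragraphs for w in p})
w2i = {w: i for i, w in enumerate(vocab)}

d, n, lr, epochs = 8, 2, 0.05, 200        # vector size, context size, SGD settings (assumed)
V, P = len(vocab), len(paragraphs)

W = rng.normal(0, 0.1, (V, d))            # shared word vectors
D = rng.normal(0, 0.1, (P, d))            # one vector per paragraph
U = rng.normal(0, 0.1, ((n + 1) * d, V))  # softmax weights over the vocabulary

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for _ in range(epochs):
    for p_id, words in enumerate(paragraphs):
        ids = [w2i[w] for w in words]
        for t in range(n, len(ids)):
            ctx = ids[t - n:t]                                # the n preceding word ids
            x = np.concatenate([D[p_id]] + [W[c] for c in ctx])
            probs = softmax(x @ U)
            dlogits = probs.copy()
            dlogits[ids[t]] -= 1.0                            # softmax cross-entropy gradient
            dx = U @ dlogits                                  # gradient w.r.t. the concatenated input
            U -= lr * np.outer(x, dlogits)
            D[p_id] -= lr * dx[:d]                            # update the paragraph vector ...
            for j, c in enumerate(ctx):                       # ... and each context word vector
                W[c] -= lr * dx[(j + 1) * d:(j + 2) * d]

cos = D[0] @ D[1] / (np.linalg.norm(D[0]) * np.linalg.norm(D[1]))
print("cosine similarity of the two toy paragraph vectors:", round(float(cos), 3))
```
As the quote notes, inference on a new paragraph would keep W and U fixed and run the same updates on a freshly initialized paragraph vector until convergence.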
2013
- (Mikolov et al., 2013) ⇒ Tomáš Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. (2013). “Distributed Representations of Words and Phrases and their Compositionality.” In: Advances in Neural Information Processing Systems 26 (NIPS 2013).
2006
- (Bengio et al., 2006) ⇒ Yoshua Bengio, Holger Schwenk, Jean-Sébastien Senécal, Fréderic Morin, and Jean-Luc Gauvain. (2006). “Neural Probabilistic Language Models.” In: Innovations in Machine Learning, D. Holmes and L.C. Jain, eds. doi:10.1007/3-540-33486-6_6
- ABSTRACT: A central goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on several methods to speed-up both training and probability computation, as well as comparative experiments to evaluate the improvements brought by these techniques. We finally describe the incorporation of this new language model into a state-of-the-art speech recognizer of conversational speech.
1991
- (Elman, 1991) ⇒ Jeffrey L. Elman. (1991). “Distributed Representations, Simple Recurrent Networks, and Grammatical Structure.” In: Machine Learning, 7.
1990a
- (Deerwester et al., 1990) ⇒ Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. (1990). “Indexing by Latent Semantic Analysis.” In: JASIS, 41(6).
1990b
- (Pollack, 1990) ⇒ Jordan B. Pollack. (1990). “Recursive Distributed Representations.” In: Artificial Intelligence, 46(1).
1984
- (Hinton, 1984) ⇒ Geoffrey E. Hinton. (1984). “Distributed Representations.”