Distributional Co-Occurrence Word Vector
A Distributional Co-Occurrence Word Vector is a distributional co-occurrence text-item vector that represents a word within a distributional word vector space (derived from word co-occurrence statistics over some corpus).
- Context:
- It can (typically) be a member of a Distributional Word Vector Space.
- It can (typically) be created by a Distributional Word Vectorizing Function.
- It can range from being a Sparse Distributional Word Vector (such as a bag-of-words vector) to being a Dense Distributional Word Vector (a construction sketch for the sparse case follows this outline).
- It can range from being a Continuous Distributional Word Vector to being a Discrete Distributional Word Vector.
- It can range from being a Raw Distributional Word Vector to being a Weighted Distributional Word Vector.
- It can range from being a Text Window-based Distributional Word Vector to being a Sentence-based Distributional Word Vector to being a Document-based Distributional Word Vector.
- …
- Example(s):
- [0.538, 0.019, ..., 0.830] ⇐ “King”, a word2vec Word Vector.
- …
- Counter-Example(s):
- See: Probabilistic Language Model, Word Co-Occurrence Pattern, Distributional Text Item Vector, Distributional Word Vector Model Creation Task, Lexical Item Feature.
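The following is a minimal, illustrative sketch (not drawn from any of the cited sources) of how a raw, sparse, text window-based distributional co-occurrence word vector can be constructed; the toy corpus, the symmetric window size of 2, and the vocabulary ordering are all assumptions made for the example.

```python
# Minimal sketch: raw, sparse, window-based co-occurrence word vectors
# built from a toy corpus (all values here are purely illustrative).
from collections import Counter, defaultdict

corpus = [
    "the king rules the kingdom".split(),
    "the queen rules the kingdom".split(),
]
window = 2  # symmetric text window size (an assumption for this sketch)

# Count how often each context word appears within +/- `window` of each target word.
cooccurrence = defaultdict(Counter)
for sentence in corpus:
    for i, target in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooccurrence[target][sentence[j]] += 1

# Fix a vocabulary ordering so that every word vector has the same dimensions.
vocab = sorted({w for s in corpus for w in s})

def word_vector(word):
    """Return the raw (unweighted) co-occurrence count vector for `word`."""
    return [cooccurrence[word][context] for context in vocab]

print(vocab)
print("king  ->", word_vector("king"))
print("queen ->", word_vector("queen"))
```

A dense distributional word vector (such as a word2vec vector) would typically be obtained by applying dimensionality reduction to such a co-occurrence matrix, or by training a prediction-based model rather than counting.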
References
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/word_embedding Retrieved:2015-1-31.
- Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing where words from the vocabulary (and possibly phrases thereof) are mapped to vectors of real numbers in a low dimensional space, relative to the vocabulary size ("continuous space").
There are several methods for generating this mapping. They include neural networks, dimensionality reduction on the word co-occurrence matrix, and explicit representation in terms of the context in which words appear. Word and phrase embeddings, when used as the underlying input representation, have been shown to boost the performance in NLP tasks such as syntactic parsing and sentiment analysis.
2014a
- Dec-23-2014 http://radimrehurek.com/2014/12/making-sense-of-word2vec/
- QUOTE: … word2vec, an unsupervised algorithm for learning the meaning behind words. …
… Using large amounts of unannotated plain text, word2vec learns relationships between words automatically. The output are vectors, one vector per word, with remarkable linear relationships that allow us to do things like vec(“king”) – vec(“man”) + vec(“woman”) =~ vec(“queen”), or vec(“Montreal Canadiens”) – vec(“Montreal”) + vec(“Toronto”) resembles the vector for “Toronto Maple Leafs”.
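A hedged sketch of the analogy arithmetic quoted above, using gensim's pretrained Google News word2vec vectors; the model name and the gensim downloader API are assumptions beyond the quote itself, and the exact nearest neighbour returned depends on the pretrained model.

```python
# Sketch of vec("king") - vec("man") + vec("woman") =~ vec("queen"), using
# gensim's downloader API (the pretrained model is a large ~1.6 GB download).
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")  # returns a KeyedVectors object

print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Expected to rank "queen" highest, as in the quoted example.
```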
2014b
- http://www.marekrei.com/blog/dont-count-predict/
- For example, to find the similarity between two words, we can represent the contexts as feature vectors and calculate the cosine similarity between their corresponding vectors.
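A minimal sketch of the comparison described above, with two made-up context count vectors standing in for real co-occurrence statistics:

```python
# Cosine similarity between two (hypothetical) context feature vectors.
import numpy as np

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||)"""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Illustrative co-occurrence counts over the same context vocabulary.
vec_king = np.array([3.0, 1.0, 0.0, 2.0])
vec_queen = np.array([2.0, 1.0, 1.0, 2.0])

print(cosine_similarity(vec_king, vec_queen))  # near 1.0 for similar contexts
```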
2014c
- (Baroni et al., 2014) ⇒ Marco Baroni, Georgiana Dinu, and Germán Kruszewski. (2014). “Don't Count, Predict! a Systematic Comparison of Context-counting Vs. Context-predicting Semantic Vectors."
- QUOTE: A long tradition in computational linguistics has shown that contextual information provides a good approximation to word meaning, since semantically similar words tend to have similar contextual distributions (Miller & Charles, 1991).
2013
- (Mikolov et al., 2013b) ⇒ Tomáš Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. (2013). “Distributed Representations of Words and Phrases and their Compositionality.” In: Advances in Neural Information Processing Systems, 26.
- (Mikolov et al., 2013a) ⇒ Tomáš Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. (2013). “Efficient Estimation of Word Representations in Vector Space.” In: CoRR, abs/1301.3781.
2012
- (Bordes et al., 2012) ⇒ Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. (2012). “Joint Learning of Words and Meaning Representations for Open-Text Semantic Parsing.” In: Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 127-135. 2012.
2011
- (Collobert et al., 2011) ⇒ Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. (2011). “Natural Language Processing (Almost) from Scratch.” In: The Journal of Machine Learning Research, 12.
- QUOTE: Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data.
2010
- (Turian et al., 2010) ⇒ Joseph Turian, Lev Ratinov, and Yoshua Bengio. (2010). “Word Representations: A Simple and General Method for Semi-supervised Learning.” In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics.
- QUOTE: If we take an existing supervised NLP system, a simple and general way to improve accuracy is to use unsupervised word representations as extra word features. We evaluate Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeddings of words on both NER and chunking.
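A hedged sketch of the “extra word features” idea in the quote: hand-crafted token features are concatenated with an unsupervised word representation before being passed to whatever supervised classifier the NLP system already uses. The tiny embedding table and the particular features below are illustrative assumptions, not taken from the paper.

```python
# Concatenating hand-crafted token features with an (assumed) unsupervised
# word representation, in the spirit of the quoted semi-supervised recipe.
import numpy as np

embeddings = {                      # stand-in for Brown/C&W/HLBL representations
    "Paris": np.array([0.1, 0.7, -0.2]),
    "runs":  np.array([0.5, -0.1, 0.3]),
}
UNKNOWN = np.zeros(3)               # fallback for out-of-vocabulary words

def token_features(word):
    """Hand-crafted features plus the unsupervised word representation."""
    handcrafted = np.array([
        float(word[0].isupper()),   # capitalisation feature
        float(word.isdigit()),      # all-digits feature
        float(len(word)),           # word-length feature
    ])
    return np.concatenate([handcrafted, embeddings.get(word, UNKNOWN)])

print(token_features("Paris"))      # 3 hand-crafted + 3 embedding dimensions
```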
2008
- (Collobert & Weston, 2008) ⇒ Ronan Collobert, and Jason Weston. (2008). “A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning.” In: Proceedings of the 25th International Conference on Machine Learning.
2006
- (Bengio et al., 2006) ⇒ Yoshua Bengio, Holger Schwenk, Jean-Sébastien Senécal, Fréderic Morin, and Jean-Luc Gauvain. (2006). “Neural Probabilistic Language Models.” In: Innovations in Machine Learning, D. Holmes and L.C. Jain, eds. doi:10.1007/3-540-33486-6_6
2003
- (Bengio et al., 2003a) ⇒ Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. (2003). “A Neural Probabilistic Language Model.” In: The Journal of Machine Learning Research, 3.
- QUOTE: A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language.
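The joint probability mentioned in the quote is conventionally factored with the chain rule, and the neural language model then approximates each conditional term; a standard statement of this factorization (not quoted verbatim from the paper) is:

```latex
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
```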
1991
- (Elman, 1991) ⇒ Jeffrey L. Elman. (1991). “Distributed Representations, Simple Recurrent Networks, and Grammatical Structure.” In: Machine Learning, 7(2).
1990
- (Deerwester et al., 1990) ⇒ Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. (1990). “Indexing by Latent Semantic Analysis.” In: JASIS, 41(6).
- (Pollack, 1990) ⇒ Jordan B. Pollack. (1990). “Recursive Distributed Representations.” In: Artificial Intelligence, 46(1).
1986
- (Hinton, 1986) ⇒ Geoffrey E. Hinton. (1986). “Learning Distributed Representations of Concepts.” In: Proceedings of the eighth annual conference of the cognitive science society.
- QUOTE: Concepts can be represented by distributed patterns of activity in networks of neuron-like units. One advantage of this kind of representation is that it leads to automatic generalization.
1984
- (Hinton, 1984) ⇒ Geoffrey E. Hinton. (1984). “Distributed Representations."