Text-Substring Probability Function
A Text-Substring Probability Function is a string probability function that assigns a likelihood score (possibly a probability) to a text substring.
- AKA: Language Model Probability Function.
- Context:
- Function Range: a Text String Probability.
- It can be produced by a Text String Probability Function Generation Task that is solved by a Text String Probability Function Generation system.
- It can be a probability function of a Language Model.
- Example(s):
- [math]\displaystyle{ f(\text{This is a phrase}) \Rightarrow 0.00014 }[/math].
- [math]\displaystyle{ f(\text{A language model is a predictive model that assigns a probability}) \Rightarrow 0.0183 }[/math] (a code sketch of such a function appears below).
- …
- Counter-Example(s):
- See: Bag-of-Words Vector, Natural Language Inference Task, Natural Language Understanding Task, Natural Language Processing Task, Probability Distribution.
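A minimal sketch of such a function (the vocabulary, probabilities, and the helper name `f` are hypothetical, chosen only to mirror the examples above):

```python
import math

# Hypothetical toy unigram model: word -> probability (illustrative values,
# not taken from the source).
UNIGRAM = {"this": 0.02, "is": 0.03, "a": 0.05, "phrase": 0.001}
UNK_PROB = 1e-6  # fallback probability for out-of-vocabulary words


def f(text: str) -> float:
    """Assign a probability to a text substring as the product of
    per-word probabilities under the toy unigram model."""
    log_p = sum(math.log(UNIGRAM.get(w, UNK_PROB)) for w in text.lower().split())
    return math.exp(log_p)


print(f("This is a phrase"))  # 3e-08 under this toy model
```

A real language model would replace the unigram table with chain-rule probabilities conditioned on context, but the signature stays the same: a text string in, a likelihood score out.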
References
2018
- (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/language_model Retrieved:2018-4-8.
- A statistical language model is a probability distribution over sequences of words. Given such a sequence, say of length m, it assigns a probability [math]\displaystyle{ P(w_1,\ldots,w_m) }[/math] to the whole sequence. Having a way to estimate the relative likelihood of different phrases is useful in many natural language processing applications, especially ones that generate text as an output. Language modeling is used in speech recognition, machine translation, part-of-speech tagging, parsing, handwriting recognition, information retrieval and other applications.
In speech recognition, the computer tries to match sounds with word sequences. The language model provides context to distinguish between words and phrases that sound similar. For example, in American English, the phrases "recognize speech" and "wreck a nice beach" are pronounced almost the same but mean very different things. These ambiguities are easier to resolve when evidence from the language model is incorporated with the pronunciation model and the acoustic model.
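The combination step can be sketched as follows (all scores are hypothetical log-probabilities, not measurements): the recognizer adds the language-model score to the acoustic score for each candidate transcription and keeps the best total.

```python
# Hypothetical log-probability scores: the acoustic model alone barely
# separates the two near-homophonous hypotheses, so the language model
# decides.
hypotheses = {
    "recognize speech":   {"acoustic": -10.1, "lm": -6.2},
    "wreck a nice beach": {"acoustic": -10.0, "lm": -13.5},
}

# Combine the two evidence sources (a product of probabilities is a sum
# of log-probabilities) and pick the highest-scoring hypothesis.
best = max(hypotheses, key=lambda h: hypotheses[h]["acoustic"] + hypotheses[h]["lm"])
print(best)  # -> "recognize speech"
```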
Language models are used in information retrieval in the query likelihood model. Here a separate language model is associated with each document in a collection. Documents are ranked based on the probability of the query Q in the document's language model [math]\displaystyle{ P(Q\mid M_d) }[/math]. Commonly, the unigram language model is used for this purpose; it is otherwise known as the bag-of-words model.
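A minimal sketch of query-likelihood ranking, assuming a unigram document model with additive smoothing (the documents, the query, and the smoothing constant `alpha` are hypothetical):

```python
from collections import Counter


def query_likelihood(query: str, document: str, alpha: float = 0.001) -> float:
    """P(Q | M_d): probability of the query under the document's unigram
    language model, with additive smoothing so unseen query words do not
    zero out the score."""
    doc_words = document.lower().split()
    counts = Counter(doc_words)
    denom = len(doc_words) + alpha * len(counts)
    p = 1.0
    for w in query.lower().split():
        p *= (counts[w] + alpha) / denom
    return p


docs = ["the cat sat on the mat", "stock markets fell sharply today"]
ranked = sorted(docs, key=lambda d: query_likelihood("cat mat", d), reverse=True)
print(ranked[0])  # the cat/mat document ranks first for this query
```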
Data sparsity is a major problem in building language models: most possible word sequences will not be observed in training. One solution is to assume that the probability of a word depends only on the previous n − 1 words. This is known as an n-gram model, or a unigram model when n = 1.
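A minimal sketch of a bigram (n = 2) model with add-one (Laplace) smoothing, one standard response to the sparsity problem; the training corpus below is a toy example, not from the source:

```python
from collections import Counter

corpus = "a language model is a predictive model".split()

# Count unigrams and bigrams in the toy training corpus.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size


def p_bigram(prev: str, word: str) -> float:
    """P(word | prev) with add-one smoothing: unseen bigrams receive a
    small non-zero probability instead of zero."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)


print(p_bigram("language", "model"))  # seen bigram: 2/6
print(p_bigram("language", "beach"))  # unseen bigram: 1/6, not zero
```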
2013
- (Chelba et al., 2013) ⇒ Ciprian Chelba, Tomáš Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. (2013). “One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling." Technical Report, Google Research.