Maximum Likelihood Estimation (MLE)-based Language Model
A Maximum Likelihood Estimation (MLE)-based Language Model is a language model whose probability distribution is a maximum likelihood estimate.
- AKA: n-Gram-based Text String Probability Function.
- Context:
- It can be mathematically represented by the statistical model $\mathcal{M}(\mathcal{S}, \mathcal{P})$ where:
- $\mathcal{S}$ is the set of all possible sequences of language model units (e.g. characters, words, strings) over a vocabulary $\mathcal{V}$.
- $\mathcal{P}$ is a probability distribution on $\mathcal{S}$ that is a maximum likelihood estimate (stated compactly in the equation after this definition list).
- Example(s):
- Counter-Example(s):
- See: Term Relative Frequency Function.
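Stated compactly (a sketch; the training-corpus symbol $\mathcal{D}$ is introduced here and is not taken from the cited sources), $\mathcal{P}$ is the distribution in the chosen model family that maximizes the likelihood of the training data:
$\hat{\mathcal{P}} = \displaystyle\arg\max_{\mathcal{P}} \prod_{s \in \mathcal{D}} \mathcal{P}(s)$
For an n-gram parameterization, this maximum is attained by relative-frequency (count-ratio) estimates, which is the estimator quoted from Jurafsky & Martin in the references below.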
References
2019
- (Jurafsky & Martin, 2019) ⇒ Dan Jurafsky, and James H. Martin. (2019). "N-gram Language Models". In: Speech and Language Processing (3rd ed. draft).
- QUOTE: Thus, the general equation for this n-gram approximation to the conditional probability of the next word in a sequence is $P(w_n|w^{n-1}_1) \approx P(w_n|w^{n-1}_{n-N+1})$ (3.8)
Given the bigram assumption for the probability of an individual word, we can compute the probability of a complete word sequence by substituting Eq. 3.7 into Eq. 3.4:
$P(w^n_1) \approx \displaystyle\prod_{k=1}^n P(w_k|w_{k-1})$ (3.9)
How do we estimate these bigram or n-gram probabilities? An intuitive way to estimate probabilities is called maximum likelihood estimation or MLE. We get the MLE estimate for the parameters of an n-gram model by getting counts from a corpus, and normalizing the counts so that they lie between 0 and 1[1].
For example, to compute a particular bigram probability of a word $y$ given a previous word $x$, we’ll compute the count of the bigram $C(xy)$ and normalize by the sum of all the bigrams that share the same first word $x$:
$P(w_n|w_{n-1}) = \dfrac{C(w_{n-1}w_n)}{C(w_{n-1})}$
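Because the MLE estimate is just a ratio of counts, it can be computed directly from a corpus. The following is a minimal sketch (not part of the quoted text) that counts unigrams and bigrams, normalizes them into the estimator above, and scores a sentence with the product in Eq. 3.9; the toy corpus and the <s>/</s> sentence markers follow the chapter's conventions.
```python
from collections import Counter

# Toy training corpus, padded with <s>/</s> sentence markers.
corpus = [
    "<s> i am sam </s>",
    "<s> sam i am </s>",
    "<s> i do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def mle_bigram_prob(prev, word):
    """MLE estimate P(word | prev) = C(prev word) / C(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(sentence):
    """Approximate P(w_1 .. w_n) as the product of bigram probabilities (Eq. 3.9)."""
    tokens = sentence.split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= mle_bigram_prob(prev, word)
    return prob

print(mle_bigram_prob("<s>", "i"))          # 2/3: <s> occurs 3 times, "<s> i" twice
print(sentence_prob("<s> i am sam </s>"))   # product of the four bigram probabilities
```
Note that the count-and-normalize estimator assigns probability zero to any bigram not seen in training, which is the motivation for the smoothing work cited further down the page.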
2016
- (Kuznetsov et al., 2016) ⇒ Vitaly Kuznetsov, Hank Liao, Mehryar Mohri, Michael Riley, and Brian Roark. (2016). “Learning N-Gram Language Models from Uncertain Data”. In: INTERSPEECH, pp. 2323-2327.
- ABSTRACT: We present a new algorithm for efficiently training n-gram language models on uncertain data, and illustrate its use for semi-supervised language model adaptation. We compute the probability that an n-gram occurs k times in the sample of uncertain data, and use the resulting histograms to derive a generalized Katz backoff model. We compare semi-supervised adaptation of language models for YouTube video speech recognition in two conditions: when using full lattices with our new algorithm versus just the 1-best output from the baseline speech recognizer. Unlike 1-best methods, the new algorithm provides models that yield solid improvements over the baseline on the full test set, and, further, achieves these gains without hurting performance on any of the set of channels. We show that channels with the most data yielded the largest gains. The algorithm was implemented via a new semiring in the OpenFst library and will be released as part of the OpenGrm ngram library.
2003
- (Croft & Lafferty, 2003) ⇒ W. Bruce Croft, and John D. Lafferty, editors. (2003). “Language Modeling for Information Retrieval." Kluwer Academic.
- QUOTE: A statistical language model, or more simply a language model, is a probabilistic mechanism for generating text. Such a definition is general enough to include an endless variety of schemes. However, a distinction should be made between generative models, which can in principle be used to synthesize artificial text, and discriminative techniques to classify text into predefined categories.
In the past several years a new framework for information retrieval has emerged that is based on statistical language modeling. The approach differs from traditional probabilistic approaches in interesting and subtle ways, and is fundamentally different from vector space methods. It is striking that the language modeling approach to information retrieval was not proposed until the late 1990s; however, until recently the information retrieval and language modeling research communities were somewhat isolated.
2001a
- (Jacquemin, 2001) ⇒ Christian Jacquemin. (2001). “Spotting and Discovering Terms Through Natural Language Processing." MIT Press. ISBN:0262100851
- QUOTE: n-gram language model: In the statistical analysis of a document, the model of language is an n-gram model if the elementary events are n-uples of words. If $n$ equals 2 or 3, the model is said bi- or tri-gram.
2001b
- (W3C, 2001) ⇒ (2001). "Stochastic Language Models (N-Gram) Specification". In: W3C Working Draft 3 January 2001. Michael K. Brown, Andreas Kellner, Dave Raggett (eds).
- QUOTE: N-Gram language models are traditionally used in large vocabulary speech recognition systems to provide the recognizer with an a-priori likelihood P(W) of a given word sequence W. The N-Gram language model is usually derived from large training texts that share the same language characteristics as expected input. This information complements the acoustic model P(W|O) that models the articulatory features of the speakers. Together, these two components allow a system to compute the most likely input sequence W' = argmax_W P(W|O), where O is the input signal observations, as W' = argmax_W P(O|W) P(W).
In contrast, N-Gram language models rely on the likelihood of sequences of words, such as word pairs (in the case of bigrams) or word triples (in the case of trigrams) and are therefore less restrictive. The use of stochastic N-Gram models has a long and successful history in the research community and is now more and more affecting commercial systems, as the market asks for more robust and flexible solutions.
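A minimal sketch (not from the W3C draft) of the combination described above: candidate word sequences are rescored with the sum of acoustic and language-model log-probabilities, and the highest-scoring hypothesis W' is returned. The candidate strings and score values below are made-up placeholders.
```python
# Hypothetical recognizer output: candidate word sequences W with their
# acoustic log-likelihoods log P(O|W). Values are illustrative only.
candidates = {
    "recognize speech": -12.4,
    "wreck a nice beach": -11.9,
}

# Hypothetical language-model log-probabilities log P(W) from an n-gram LM.
lm_logprob = {
    "recognize speech": -6.1,
    "wreck a nice beach": -9.7,
}

def best_hypothesis(acoustic_logprob, lm_logprob):
    """Return W' = argmax_W P(O|W) * P(W), computed as a sum in log space."""
    return max(acoustic_logprob, key=lambda w: acoustic_logprob[w] + lm_logprob[w])

print(best_hypothesis(candidates, lm_logprob))  # "recognize speech"
```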
1996
- (Chen & Goodman, 1996) ⇒ Stanley F. Chen, and Joshua Goodman. (1996). “An Empirical Study of Smoothing Techniques for Language Modeling". In: Proceedings of the 34th Annual Meeting of the ACL (ACL 1996).
- QUOTE: We present an extensive empirical comparison of several smoothing techniques in the domain of language modeling, including those described by Jelinek and Mercer (1980), Katz (1987), and Church and Gale (1991). We investigate for the first time how factors such as training data size, corpus (e.g., Brown versus Wall Street Journal), and n-gram order (bigram versus trigram) affect the relative performance of these methods, which we measure through the cross-entropy of test data. In addition, we introduce two novel smoothing techniques, one a variation of Jelinek-Mercer smoothing and one a very simple linear interpolation technique, both of which outperform existing methods.
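The smoothing methods compared in this work address the main weakness of the MLE estimates above: unseen n-grams receive zero probability. As a minimal sketch of the linear-interpolation idea (not the paper's exact method; the mixing weight and toy counts below are placeholders), a Jelinek-Mercer-style estimate mixes the MLE bigram with the MLE unigram.
```python
from collections import Counter

def interpolated_bigram_prob(prev, word, bigram_counts, unigram_counts, lam=0.7):
    """Jelinek-Mercer-style linear interpolation of MLE estimates:
        P(word | prev) = lam * P_mle(word | prev) + (1 - lam) * P_mle(word)
    lam is a placeholder weight; in practice it is tuned on held-out data."""
    total_tokens = sum(unigram_counts.values())
    bigram = (bigram_counts[(prev, word)] / unigram_counts[prev]
              if unigram_counts[prev] else 0.0)
    unigram = unigram_counts[word] / total_tokens if total_tokens else 0.0
    return lam * bigram + (1 - lam) * unigram

# Example with toy counts: the unseen bigram ("sam", "ham") still receives
# nonzero probability from the unigram term.
unigrams = Counter({"i": 3, "am": 2, "sam": 2, "ham": 1})
bigrams = Counter({("i", "am"): 2, ("am", "sam"): 1})
print(interpolated_bigram_prob("sam", "ham", bigrams, unigrams, lam=0.7))
```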
1992
- (Brown et al., 1992) ⇒ Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. (1992). “Class-based N-gram Models of Natural Language.” In: Computational Linguistics, 18(4).
- ↑ For probabilistic models, normalizing means dividing by some total count so that the resulting probabilities fall legally between 0 and 1.