Text-String Likelihood Scoring Function Training Algorithm
A Text-String Likelihood Scoring Function Training Algorithm is a probability function generation algorithm that can be implemented by a text string probability function generation system (to solve a text string probability function generation task).
- AKA: Language Modeling Method.
- Context:
- It can range from being a Character-Level Language Modeling Algorithm to being a Word-Level Language Modeling Algorithm.
- Example(s):
- a Neural-based LM Algorithm, such as an RNN-based LM algorithm or a Transformer-based LM algorithm.
- a Maximum-Likelihood Estimation (MLE)-based LM Algorithm (see the sketch after this list).
- …
- Counter-Example(s):
- See: DNA String Probability Function Generation Algorithm, Distributional Text-Item Model Training Algorithm.
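The contract shared by these algorithms can be illustrated with a minimal sketch (the function names below are hypothetical, not drawn from any cited source): given a training corpus, the algorithm produces a function that maps a text string to a likelihood score, here via simple MLE unigram counts at either the character or the word level. Real LM algorithms (n-gram, RNN-based, Transformer-based) estimate far richer conditional distributions; this only shows the input/output shape of the task.

```python
# Minimal sketch of a text-string likelihood scoring function trainer
# (hypothetical names; MLE unigram counts with add-one smoothing only).
import math
from collections import Counter

def train_unigram_scorer(corpus: str, level: str = "word"):
    """Train on a corpus and return a function scoring a string's log-likelihood."""
    tokens = list(corpus) if level == "char" else corpus.split()
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab_size = len(counts)

    def log_likelihood(text: str) -> float:
        items = list(text) if level == "char" else text.split()
        # Add-one smoothing so unseen tokens do not produce log(0).
        return sum(math.log((counts[t] + 1) / (total + vocab_size + 1)) for t in items)

    return log_likelihood

corpus = "the cat sat on the mat"
score_words = train_unigram_scorer(corpus, level="word")   # word-level LM
score_chars = train_unigram_scorer(corpus, level="char")   # character-level LM
print(score_words("the cat"), score_chars("the cat"))
```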
References
2013
- (Chelba et al., 2013) ⇒ Ciprian Chelba, Tomáš Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. (2013). “One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling.” Technical Report, Google Research.
- QUOTE: ... We show performance of several well-known types of language models, with the best results achieved with a recurrent neural network based language model. The baseline unpruned Kneser-Ney 5-gram model achieves perplexity 67.6; a combination of techniques leads to 35% reduction in perplexity, or 10% reduction in cross-entropy (bits), over that baseline. ...
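These figures are related by the standard identity between perplexity and per-word cross-entropy, [math]\displaystyle{ PPL = 2^{H} }[/math]: a baseline perplexity of 67.6 corresponds to [math]\displaystyle{ \log_2 67.6 \approx 6.08 }[/math] bits, and a 35% perplexity reduction (to roughly [math]\displaystyle{ 0.65 \times 67.6 \approx 43.9 }[/math]) corresponds to [math]\displaystyle{ \log_2 43.9 \approx 5.46 }[/math] bits, i.e. about a 10% reduction in cross-entropy.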
2001
- (Goodman, 2001) ⇒ Joshua T. Goodman. (2001). “A Bit of Progress in Language Modeling.” In: Computer Speech & Language, 15(4). doi:10.1006/csla.2001.0174
- QUOTE: The goal of a language model is to determine the probability of a word sequence [math]\displaystyle{ w_1...w_n }[/math], [math]\displaystyle{ P(w_1...w_n) }[/math]. This probability is typically broken down into its component probabilities:
: [math]\displaystyle{ P(w_1...w_i) = P(w_1) \times P(w_2 \mid w_1) \times ... \times P(w_i \mid w_1...w_{i-1}) }[/math]
Since it may be difficult to compute a probability of the form [math]\displaystyle{ P(w_i \mid w_1...w_{i-1}) }[/math] for large [math]\displaystyle{ i }[/math], we typically assume that the probability of a word depends on only the two previous words, the trigram assumption:
: [math]\displaystyle{ P(w_i \mid w_1...w_{i-1}) \approx P(w_i \mid w_{i-2}w_{i-1}) }[/math]
which has been shown to work well in practice. The trigram probabilities can then be estimated from their counts in a training corpus. We let [math]\displaystyle{ C(w_{i-2}w_{i-1}w_i) }[/math] represent the number of occurrences of [math]\displaystyle{ w_{i-2}w_{i-1}w_i }[/math] in our training corpus, and similarly for [math]\displaystyle{ C(w_{i-2}w_{i-1}) }[/math]. Then, we can approximate:
: [math]\displaystyle{ P(w_i \mid w_{i-2}w_{i-1}) \approx \frac{C(w_{i-2}w_{i-1}w_i)}{C(w_{i-2}w_{i-1})} }[/math]
Unfortunately, in general this approximation will be very noisy, because there are many three word sequences that never occur.
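The count-based estimate in the quote can be turned into a short sketch (the names below are illustrative, and no smoothing is applied, so the sparsity problem Goodman points out shows up directly as zero probabilities):

```python
# Trigram MLE as described in the Goodman quote:
#   P(w_i | w_{i-2} w_{i-1}) ~= C(w_{i-2} w_{i-1} w_i) / C(w_{i-2} w_{i-1})
# Illustrative sketch only: unsmoothed, so unseen trigrams get probability 0.0,
# which is exactly the noise/sparsity issue the quote ends on.
from collections import Counter

def train_trigram_mle(tokens):
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    def prob(w_prev2, w_prev1, w):
        history_count = bigram_counts[(w_prev2, w_prev1)]
        if history_count == 0:
            return 0.0  # unseen history: MLE is undefined; return 0 for the sketch
        return trigram_counts[(w_prev2, w_prev1, w)] / history_count

    return prob

tokens = "the cat sat on the mat".split()
p = train_trigram_mle(tokens)
print(p("the", "cat", "sat"))  # 1.0 in this tiny corpus
print(p("the", "cat", "ran"))  # 0.0: never observed, the data-sparsity problem
```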