Text-String Likelihood Scoring Function Training Algorithm
A Text-String Likelihood Scoring Function Training Algorithm is a text string probability function generation algorithm that can be implemented by a text string probability function generation system (to solve a text string probability function generation task).
- AKA: Language Modeling Method.
- Context:
- It can range from being a Character-Level Language Modeling Algorithm to being a Word-Level Language Modeling Algorithm.
- Example(s):
- a Neural-based LM Algorithm, such as an RNN-based LM algorithm or a Transformer-based LM algorithm.
- a Maximum Likelihood Estimation (MLE)-based LM Algorithm, such as a count-based n-gram LM algorithm (see the count-based sketch below).
- …
- Counter-Example(s):
- See: DNA String Probability Function Generation Algorithm, Distributional Text-Item Model Training Algorithm.
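The following is a minimal, hypothetical sketch of a character-level, MLE-based LM algorithm: it estimates bigram probabilities from raw counts in a toy corpus and uses them to score text strings by log-likelihood. The corpus string and the function names (`train_char_bigram_lm`, `string_log_likelihood`) are illustrative assumptions, not taken from any cited work.

```python
import math
from collections import Counter

def train_char_bigram_lm(corpus: str) -> dict:
    """Estimate P(next_char | prev_char) by maximum likelihood (raw counts)."""
    bigram_counts = Counter(zip(corpus, corpus[1:]))   # C(c_{i-1} c_i)
    unigram_counts = Counter(corpus[:-1])              # C(c_{i-1}) as a context
    return {
        (prev, nxt): count / unigram_counts[prev]
        for (prev, nxt), count in bigram_counts.items()
    }

def string_log_likelihood(model: dict, text: str, floor: float = 1e-12) -> float:
    """Score a text string: sum of log P(c_i | c_{i-1}), with a floor for unseen bigrams."""
    return sum(math.log(model.get((prev, nxt), floor)) for prev, nxt in zip(text, text[1:]))

# Toy usage: train on a tiny corpus and compare the scores of two strings.
model = train_char_bigram_lm("the cat sat on the mat. the cat ate.")
print(string_log_likelihood(model, "the cat"))   # relatively high: all bigrams were observed
print(string_log_likelihood(model, "xqz kjv"))   # very low: unseen bigrams hit the floor
```

A neural-based LM algorithm replaces the explicit count table with a parametric model (such as an RNN or a Transformer) trained to maximize the same log-likelihood objective.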
References
2013
- (Chelba et al., 2013) ⇒ Ciprian Chelba, Tomáš Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. (2013). “One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling." Technical Report, Google Research.
- QUOTE: ... We show performance of several well-known types of language models, with the best results achieved with a recurrent neural network based language model. The baseline unpruned Kneser-Ney 5-gram model achieves perplexity 67.6; a combination of techniques leads to 35% reduction in perplexity, or 10% reduction in cross-entropy (bits), over that baseline. ...
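As a rough arithmetic check of how the two quoted reductions relate, assuming the standard identity that perplexity equals 2 raised to the cross-entropy in bits, a 35% relative perplexity reduction from the 67.6 baseline corresponds to roughly a 10% reduction in bits:

```python
import math

baseline_ppl = 67.6                       # unpruned Kneser-Ney 5-gram baseline (Chelba et al., 2013)
combined_ppl = baseline_ppl * (1 - 0.35)  # ~35% relative perplexity reduction for the combination

baseline_bits = math.log2(baseline_ppl)   # cross-entropy in bits: H = log2(perplexity)
combined_bits = math.log2(combined_ppl)

print(round(combined_ppl, 1))                       # ~43.9 perplexity
print(round(1 - combined_bits / baseline_bits, 3))  # ~0.10, i.e. ~10% fewer bits
```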
2001
- (Goodman, 2001) ⇒ Joshua T. Goodman. (2001). “A Bit of Progress in Language Modeling.” In: Computer Speech & Language, 15(4). doi:10.1006/csla.2001.0174
- QUOTE: The goal of a language model is to determine the probability of a word sequence [math]\displaystyle{ w_1 \ldots w_n }[/math], [math]\displaystyle{ P(w_1 \ldots w_n) }[/math]. This probability is typically broken down into its component probabilities:
: [math]\displaystyle{ P(w_1 \ldots w_i) = P(w_1) \times P(w_2 \mid w_1) \times \ldots \times P(w_i \mid w_1 \ldots w_{i-1}) }[/math]
Since it may be difficult to compute a probability of the form [math]\displaystyle{ P(w_i \mid w_1 \ldots w_{i-1}) }[/math] for large [math]\displaystyle{ i }[/math], we typically assume that the probability of a word depends on only the two previous words, the trigram assumption:
: [math]\displaystyle{ P(w_i \mid w_1 \ldots w_{i-1}) \approx P(w_i \mid w_{i-2} w_{i-1}) }[/math]
which has been shown to work well in practice. The trigram probabilities can then be estimated from their counts in a training corpus. We let [math]\displaystyle{ C(w_{i-2} w_{i-1} w_i) }[/math] represent the number of occurrences of [math]\displaystyle{ w_{i-2} w_{i-1} w_i }[/math] in our training corpus, and similarly for [math]\displaystyle{ C(w_{i-2} w_{i-1}) }[/math]. Then, we can approximate:
: [math]\displaystyle{ P(w_i \mid w_{i-2} w_{i-1}) \approx \frac{C(w_{i-2} w_{i-1} w_i)}{C(w_{i-2} w_{i-1})} }[/math]
Unfortunately, in general this approximation will be very noisy, because there are many three word sequences that never occur.
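The count-ratio estimate in the quote can be sketched directly in code. This is a minimal illustration assuming a whitespace-tokenized toy corpus; the corpus text and the function name `mle_trigram_lm` are hypothetical, not from Goodman (2001).

```python
from collections import Counter

def mle_trigram_lm(tokens: list) -> dict:
    """Estimate P(w_i | w_{i-2} w_{i-1}) as C(w_{i-2} w_{i-1} w_i) / C(w_{i-2} w_{i-1})."""
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    # Count each bigram only where it appears as a trigram context (i.e., is followed by a word).
    context_counts = Counter((a, b) for (a, b, _c) in trigram_counts.elements())
    return {
        (a, b, c): count / context_counts[(a, b)]
        for (a, b, c), count in trigram_counts.items()
    }

# Toy usage on a whitespace-tokenized corpus (illustrative text).
tokens = "the cat sat on the mat and the cat sat down".split()
probs = mle_trigram_lm(tokens)
print(probs[("the", "cat", "sat")])  # 1.0: in this corpus "the cat" is always followed by "sat"
```

Any trigram absent from the training corpus receives zero probability under this estimate, which is the sparsity problem the quote points to and the reason smoothing methods such as the Kneser-Ney model cited by Chelba et al. (2013) are used in practice.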