Word/Token-Level Language Model
A Word/Token-Level Language Model is a language model that operates at the text token level.
- AKA: Joint Probability Function for Words.
- Context:
- It can range from being a Forward Word-Level Language Model to being a Backward Word-Level Language Model to being a Bi-Directional Word-Level Language Model.
- It can be produced by a Word-Level Language Model Training System (that solves a word-level LM task).
- It can range from being a Unigram Token-Level Language Model to being an n-Gram Token-Level Language Model (such as a Bigram Token-Level Language Model or a Trigram Token-Level Language Model).
- …
- Example(s):
- [math]\displaystyle{ f(\text{This is a phrase}) \Rightarrow P(\text{This}) \times P(\text{is} \mid \text{This}) \times P(\text{a} \mid \text{This, is}) \times P(\text{phrase} \mid \text{is, a}) \Rightarrow 0.00014 }[/math].
- the one used by https://mesotron.shinyapps.io/languagemodel/
- a Neural-based Word Token-Level Language Model.
- an MLE-based Word/Token-Level Language Model.
- …
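The chain-rule example above, combined with an MLE-based estimator, can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical toy corpus and a bigram (one previous word) context; real word-level LMs are trained on large corpora and use smoothing for unseen n-grams.

```python
from collections import Counter

# Hypothetical toy corpus; in practice, counts come from a large training corpus.
corpus = [
    ["this", "is", "a", "phrase"],
    ["this", "is", "a", "test"],
    ["this", "is", "fine"],
]

# MLE estimates: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}).
unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(
    (sent[i - 1], sent[i]) for sent in corpus for i in range(1, len(sent))
)
total_words = sum(unigram_counts.values())

def sentence_probability(words):
    """Score a sentence with the chain rule under a bigram assumption."""
    prob = unigram_counts[words[0]] / total_words  # P(w_1)
    for prev, cur in zip(words, words[1:]):
        prob *= bigram_counts[(prev, cur)] / unigram_counts[prev]  # P(w_i | w_{i-1})
    return prob

print(sentence_probability(["this", "is", "a", "phrase"]))
```

Note that, unlike the four-word-context example above, each factor here conditions on only the single previous word; an unsmoothed MLE model assigns probability zero to any sentence containing an unseen bigram.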
- Counter-Example(s):
- See: Word/Token Embedding Space, OOV Word, Language Model, Word Embedding System, Natural Language Processing System, Natural Language, Semantic Word Similarity, Seq2Seq Neural Network.
References
2003
- (Bengio et al., 2003a) ⇒ Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. (2003). “A Neural Probabilistic Language Model.” In: The Journal of Machine Learning Research, 3.
- QUOTE: A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language.
2001
- (Goodman, 2001) ⇒ Joshua T. Goodman. (2001). “A Bit of Progress in Language Modeling.” In: Computer Speech & Language, 15(4). doi:10.1006/csla.2001.0174
- QUOTE: The goal of a language model is to determine the probability of a word sequence [math]\displaystyle{ w_1 \ldots w_n }[/math], [math]\displaystyle{ P(w_1 \ldots w_n) }[/math]. This probability is typically broken down into its component probabilities: : [math]\displaystyle{ P(w_1 \ldots w_i) = P(w_1) \times P(w_2 \mid w_1) \times \ldots \times P(w_i \mid w_1 \ldots w_{i-1}) }[/math] Since it may be difficult to compute a probability of the form [math]\displaystyle{ P(w_i \mid w_1 \ldots w_{i-1}) }[/math] for large [math]\displaystyle{ i }[/math], we typically assume that the probability of a word depends on only the two previous words, the trigram assumption: : [math]\displaystyle{ P(w_i \mid w_1 \ldots w_{i-1}) \approx P(w_i \mid w_{i-2} w_{i-1}) }[/math] which has been shown to work well in practice.
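Goodman's trigram assumption can be illustrated with maximum-likelihood counts: each conditional probability is estimated as the trigram count divided by the count of its two-word history. This is a minimal sketch over a hypothetical two-sentence corpus, not a production estimator (no smoothing or backoff).

```python
from collections import Counter

# Hypothetical corpus. Trigram MLE:
#   P(w_i | w_{i-2}, w_{i-1}) = count(w_{i-2}, w_{i-1}, w_i) / count(w_{i-2}, w_{i-1}).
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "cat", "sat", "down"],
]

trigram_counts = Counter(tuple(s[i:i + 3]) for s in corpus for i in range(len(s) - 2))
bigram_counts = Counter(tuple(s[i:i + 2]) for s in corpus for i in range(len(s) - 1))

def p_trigram(w, w_prev2, w_prev1):
    """P(w | w_prev2 w_prev1) under the trigram assumption, via MLE counts."""
    return trigram_counts[(w_prev2, w_prev1, w)] / bigram_counts[(w_prev2, w_prev1)]

print(p_trigram("sat", "the", "cat"))  # count(the cat sat) / count(the cat)
print(p_trigram("on", "cat", "sat"))   # count(cat sat on) / count(cat sat)
```

Without smoothing, any trigram unseen in training receives probability zero, which is why practical trigram models combine these estimates with backoff or interpolation.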