Subword-level Language Model
A Subword-level Language Model is a Language Model that operates at the level of subword units.
- Example(s): a BPE-based Language Model, a SentencePiece-based Language Model.
- Counter-Example(s): a Word-level Language Model, a Character-level Language Model.
- See: SentencePiece, Subword Embedding System, Word/Token Embedding Space, OOV Word, Language Model, Word Embedding System, Natural Language Processing System, Natural Language, Semantic Word Similarity, Seq2Seq Neural Network.
References
2018a
- (Kudo, 2018) ⇒ Taku Kudo. (2018). “Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates". In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), Volume 1: Long Papers. DOI:10.18653/v1/P18-1007.
- QUOTE: ... Subword units are an effective way to alleviate the open vocabulary problems in neural machine translation (NMT). While sentences are usually converted into unique subword sequences, subword segmentation is potentially ambiguous and multiple segmentations are possible even with the same vocabulary. ...
... A common approach for dealing with the open vocabulary issue is to break up rare words into subword units (Schuster & Nakajima, 2012; Chitnis & DeNero, 2015; Sennrich et al., 2016; Wu et al., 2016).
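The segmentation ambiguity mentioned in the quote can be made concrete with a minimal sketch (the vocabulary and the word "unfold" below are illustrative, not from the paper): even with one fixed subword vocabulary, a single word often admits multiple valid segmentations.

```python
def segmentations(word, vocab):
    """Enumerate all ways to split `word` into units drawn from `vocab`."""
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in vocab:
            for rest in segmentations(word[i:], vocab):
                results.append([piece] + rest)
    return results

# Toy vocabulary: the same vocabulary yields several distinct segmentations,
# e.g. ["un", "fold"], ["unf", "old"], and fully character-level splits.
vocab = {"un", "fold", "unf", "old", "u", "n", "f", "o", "l", "d"}
for seg in segmentations("unfold", vocab):
    print(seg)
```

Subword regularization exploits exactly this ambiguity by sampling among such candidate segmentations during training rather than always using a single one.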
2018b
- (Kudo & Richardson, 2018) ⇒ Taku Kudo, and John Richardson. (2018). “SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing". In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018) System Demonstrations. DOI:10.18653/v1/d18-2012.
2016
- (Sennrich et al., 2016) ⇒ Rico Sennrich, Barry Haddow, and Alexandra Birch. (2016). “Neural Machine Translation of Rare Words with Subword Units.” In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL-2016).
- QUOTE: ... A simple method to manipulate the trade-off between vocabulary size and text size is to use shortlists of unsegmented words, using subword units only for rare words. As an alternative, we propose a segmentation algorithm based on byte pair encoding (BPE), which lets us learn a vocabulary that provides a good compression rate of the text. ...
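The BPE procedure described in the quote can be sketched in a few lines: repeatedly count adjacent symbol pairs over a word-frequency table and merge the most frequent pair into a new subword unit. The toy corpus below follows the style of the examples in Sennrich et al. (2016); it is illustrative, not the paper's reference implementation.

```python
import collections

def learn_bpe(corpus, num_merges):
    """Learn BPE merge operations from a {word: frequency} dict.

    Words are held as tuples of symbols; each step merges the most
    frequent adjacent symbol pair into a single new symbol.
    """
    vocab = {tuple(word): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pairs = collections.Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab

# Toy corpus: frequent character pairs such as ('e', 's') get merged first,
# so suffixes like "est" emerge as subword units.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(corpus, 4)
print(merges)
```

Growing `num_merges` trades a larger vocabulary for shorter subword sequences, which is the vocabulary-size/text-size trade-off the quote describes.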
1989
- (Mignosi, 1989) ⇒ Filippo Mignosi. (1989). “Infinite Words with Linear Subword Complexity.” In: Theoretical Computer Science, 65(2).
- QUOTE: ... Let $A$ be a set and let $A^{*}$ be the free monoid generated by $A$. The elements of $A^{*}$ are said to be words. The empty word is denoted by $\wedge$ and we set $A^{+}:=A^{*} \backslash \{\wedge\}$. Let $A^{m}$ be the set of all words of $A^{+}$ of length $m$ and denote by $|u|$ the length of the word $u$. An infinite word over $A$ is a sequence of elements in $A^{+}$; its length is $+\infty$. The set $A$ is called an alphabet and accordingly, elements of $A$ are called letters. Now a word which is not a power of another word is called primitive. Let $f=a_{1} a_{2} \ldots$ be an infinite (or a finite) word. A word $w$ is called a subword of $f$ (but also a factor) if $w=\wedge$ or $w=a_{i} a_{i+1} \ldots a_{j}$, $i, j \in \mathbb{N}$, $i \leqslant j \leqslant |f|$. We set for short $w \mid f$. Let $F:=F(f)$ be the set of the subwords of $f$.
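For a finite word, the set of subwords (factors) $F(f)$ in the combinatorial sense above, and the subword complexity function (the number of distinct factors of each length $m$), can be computed directly. A small illustrative sketch, using a prefix of the Fibonacci word as $f$:

```python
def factors(f):
    """Set of subwords (factors) of a finite word f, including the empty word."""
    return {""} | {f[i:j] for i in range(len(f)) for j in range(i + 1, len(f) + 1)}

def complexity(f, m):
    """Subword complexity of f at length m: number of distinct factors of length m."""
    return sum(1 for w in factors(f) if len(w) == m)

# "abaababa" is a prefix of the Fibonacci word, a standard example of an
# infinite word with linear subword complexity.
f = "abaababa"
print(sorted(w for w in factors(f) if len(w) == 2))  # its length-2 factors
```

For this prefix the distinct length-2 factors are "aa", "ab", and "ba", consistent with the Fibonacci word's complexity of $m+1$ factors at each length $m$.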