Subword Unit
A Subword Unit is a substring composed of grapheme characters.
- Context:
- It can be identified by a Subword Segmentation System, such as SentencePiece (that solves a subword segmentation task).
- …
- Example(s):
- ught, can be a subword within the passage "... bought ...", according to SentencePiece when trained by a GM-RKB snapshot.
- techn, can be a subword within the passage "... technical", according to SentencePiece when trained by a GM-RKB snapshot.
- chn, can be a subword within the passage "... Technnical", according to SentencePiece when trained by a GM-RKB snapshot.
- ▁play, can be a subword within the passage "... playing engineering ...", according to SentencePiece when trained by a GM-RKB snapshot (because the word "play" appears frequently enough in the corpus).
- ▁engineering, can be a subword within the passage "... playing engineering ...", according to SentencePiece when trained by a GM-RKB snapshot (because the word "engineer" does not appear frequently enough in the corpus).
- ▁is▁a, can be a subword within the passage "... is a ...", according to SentencePiece when trained by a GM-RKB snapshot (likely because the phrase is used in nearly all pages in the corpus).
- >[[, can be a subword within the passage "... <i>[[Technical ...", according to SentencePiece when trained by a GM-RKB snapshot.
- …
- Counter-Example(s):
- See: Rare Word, n-Gram, Byte Pair Encoding (BPE).
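The "▁" character (U+2581) in the examples above is the meta-symbol that SentencePiece uses to stand for whitespace, so that a sequence of pieces can be joined back into the original text. The following is a minimal sketch of that convention in plain Python; it does not call the SentencePiece library, the function names are hypothetical, and the piece sequence is an assumed segmentation rather than actual model output.

```python
WS = "\u2581"  # "▁", the whitespace meta-symbol


def to_marked(text: str) -> str:
    """Prefix the text with the meta-symbol and replace every space with it."""
    return WS + text.replace(" ", WS)


def detokenize(pieces: list[str]) -> str:
    """Join the pieces, map the meta-symbol back to spaces, drop the leading space."""
    text = "".join(pieces).replace(WS, " ")
    return text[1:] if text.startswith(" ") else text


# Hypothetical segmentation of the passage "playing engineering".
pieces = [WS + "play", "ing", WS + "engineering"]

# The pieces concatenate to the marked text, so no information is lost.
assert "".join(pieces) == to_marked("playing engineering")
print(detokenize(pieces))
```

Because whitespace is carried inside the pieces themselves, a piece such as ▁is▁a can span a word boundary and still be reversed by the same join-and-replace step.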
References
2018a
- (Kudo, 2018) ⇒ Taku Kudo. (2018). “Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates.” In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018) Volume 1: Long Papers. DOI:10.18653/v1/P18-1007.
- QUOTE: ... Subword units are an effective way to alleviate the open vocabulary problems in neural machine translation (NMT). While sentences are usually converted into unique subword sequences, subword segmentation is potentially ambiguous and multiple segmentations are possible even with the same vocabulary. ...
... A common approach for dealing with the open vocabulary issue is to break up rare words into subword units (Schuster & Nakajima, 2012; Chitnis & DeNero, 2015; Sennrich et al., 2016; Wu et al., 2016).
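The point that one vocabulary can admit several segmentations of the same string can be illustrated with a short sketch. The function name and the toy vocabulary below are assumptions made for illustration; a trained model would assign scores to the candidates and normally pick one, which this sketch does not attempt.

```python
def segmentations(text, vocab):
    """Return every way to split `text` into pieces that all belong to `vocab`."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            # Keep this prefix and segment the remainder recursively.
            for rest in segmentations(text[end:], vocab):
                results.append([piece] + rest)
    return results


# Hypothetical toy vocabulary; not taken from any trained model.
vocab = {"b", "bo", "ought", "ught", "bought"}

for candidate in segmentations("bought", vocab):
    print(candidate)
# ['b', 'ought']
# ['bo', 'ught']
# ['bought']
```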
2018b
- (Kudo & Richardson, 2018) ⇒ Taku Kudo, and John Richardson. (2018). “SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing.” In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018) System Demonstrations. DOI:10.18653/v1/d18-2012.
2016
- (Sennrich et al., 2016) ⇒ Rico Sennrich, Barry Haddow, and Alexandra Birch. (2016). “Neural Machine Translation of Rare Words with Subword Units.” In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL-2016).
- QUOTE: ... A simple method to manipulate the trade-off between vocabulary size and text size is to use shortlists of unsegmented words, using subword units only for rare words. As an alternative, we propose a segmentation algorithm based on byte pair encoding (BPE), which lets us learn a vocabulary that provides a good compression rate of the text. ...
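The BPE-based segmentation mentioned in the quote learns its vocabulary by repeatedly merging the most frequent adjacent pair of symbols. The sketch below shows that idea under simplifying assumptions: the function name `learn_bpe`, the toy word counts, and the tie-breaking rule (first pair encountered wins) are illustrative choices, and `</w>` is used as an end-of-word marker.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a {word: count} mapping."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair encountered
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


# Toy corpus statistics, assumed for illustration.
merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
print(merges)  # [('e', 's'), ('es', 't'), ('est', '</w>')]
```

The number of merges controls the trade-off described in the quote: more merges give a larger symbol vocabulary and shorter encoded text.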
1989
- (Mignosi, 1989) ⇒ Filippo Mignosi. (1989). “Infinite Words with Linear Subword Complexity.” In: Theoretical Computer Science, 65(2).
- QUOTE: ... Let $A$ be a set and let $A^{*}$ be the free monoid generated by $A$. The elements of $A^{*}$ are said to be words. The empty word is denoted by $\wedge$ and we set $A^{+}:=A^{*} \backslash\{1\}$. Let $A^{m}$ be the set of all words of $A^{+}$ of length $m$ and denote by $|u|$ the length of the word $u$. An infinite word over $A$ is a sequence of elements in $A^{+}$; its length is $+\infty$. The set $A$ is called an alphabet and accordingly, elements of $A$ are called letters. Now a word which is not a power of another word is called primitive. Let $f=a_{1} a_{2} \ldots$ be an infinite (or a finite) word. A word $w$ is called a subword of $f$ (but also a factor) if $w=\wedge$ or $w=a_{i} a_{i+1} \ldots a_{j}$, $i, j \in \mathbb{N}$, $i \leqslant j \leqslant|f|$. We set for short $w \mid f$. Let $F:=F(f)$ be the set of the subwords of $f$.
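In this combinatorial sense a subword is any contiguous block of letters, with the empty word included. For a finite word the set $F(f)$ can be enumerated directly, as in the sketch below. The function names are hypothetical, and `complexity` uses the usual notion of subword complexity (the number of distinct subwords of a given length), which is not spelled out in the quoted passage.

```python
def subwords(f: str) -> set[str]:
    """All factors of the finite word f, including the empty word."""
    n = len(f)
    return {f[i:j] for i in range(n + 1) for j in range(i, n + 1)}


def complexity(f: str, length: int) -> int:
    """Number of distinct subwords of f that have exactly `length` letters."""
    return len({f[i:i + length] for i in range(len(f) - length + 1)})


print(sorted(subwords("abab"), key=lambda w: (len(w), w)))
# ['', 'a', 'b', 'ab', 'ba', 'aba', 'bab', 'abab']
print(complexity("abab", 2))  # 2
```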