Subword Unit

Context:
- It can be identified by a Subword Segmentation System (that solves a Subword Segmentation Task), such as SentencePiece.
- …
Example(s):
- ught, can be a subword within the passage "... bought ...", according to SentencePiece when trained on a GM-RKB snapshot.
- techn, can be a subword within the passage "... technical ...", according to SentencePiece when trained on a GM-RKB snapshot.
- chn, can be a subword within the passage "... Technical ...", according to SentencePiece when trained on a GM-RKB snapshot.
- ▁play, can be a subword within the passage "... playing engineering ...", according to SentencePiece when trained on a GM-RKB snapshot (because the word "play" appears frequently enough in the corpus).
- ▁engineering, can be a subword within the passage "... playing engineering ...", according to SentencePiece when trained on a GM-RKB snapshot (because the word "engineer" on its own does not appear frequently enough in the corpus).
- ▁is▁a, can be a subword within the passage "... is a ...", according to SentencePiece when trained on a GM-RKB snapshot (likely because the phrase is used in nearly all pages in the corpus).
- >[[, can be a subword within the passage "... <i>[[Technical ...", according to SentencePiece when trained on a GM-RKB snapshot.
- …
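The examples above can be illustrated with a minimal sketch of how a passage maps to subword units once a vocabulary is fixed. It is not SentencePiece's own algorithm (which learns a unigram language model or BPE merges from a corpus); it swaps in a plain greedy longest-match lookup. The vocabulary VOCAB and the function name segment are hypothetical, chosen only to mirror the examples. The "▁" marker stands for whitespace, following the SentencePiece convention.

```python
# Hypothetical vocabulary for illustration; a real one is learned from a corpus.
VOCAB = {"▁play", "ing", "▁engineering", "▁b", "o", "ught"}

def segment(text, vocab):
    """Greedy longest-match segmentation (a simplification, not SentencePiece's method)."""
    # Mark whitespace with "▁" and prefix one marker, as SentencePiece-style tools do by default.
    s = "▁" + text.replace(" ", "▁")
    pieces = []
    i = 0
    while i < len(s):
        # Try the longest remaining candidate first; fall back to a single character.
        for j in range(len(s), i, -1):
            if s[i:j] in vocab or j == i + 1:
                pieces.append(s[i:j])
                i = j
                break
    return pieces

print(segment("playing engineering", VOCAB))
# ['▁play', 'ing', '▁engineering']
```

Under this toy vocabulary, "engineering" stays whole because no shorter piece such as "▁engineer" is available, which parallels the frequency-based explanation given in the example list.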
Counter-Example(s):
- a Word.
- a Grapheme.
See: Rare Word, n-Gram, Byte Pair Encoding (BPE).

References

(Mignosi, 1989) ⇒ Filippo Mignosi. (1989). “Infinite Words with Linear Subword Complexity.” In: Theoretical Computer Science, 65(2).
- QUOTE: ... Let $A$ be a set and let $A^{*}$ be the free monoid generated by $A$. The elements of $A^{*}$ are said to be words. The empty word is denoted by $\wedge$ and we set $A^{+} := A^{*} \setminus \{\wedge\}$. Let $A^{m}$ be the set of all words of $A^{+}$ of length $m$ and denote by $|u|$ the length of the word $u$. An infinite word over $A$ is a sequence of elements in $A^{+}$; its length is $+\infty$. The set $A$ is called an alphabet and accordingly, elements of $A$ are called letters. Now a word which is not a power of another word is called primitive. Let $f = a_{1} a_{2} \ldots$ be an infinite (or a finite) word. A word $w$ is called a subword of $f$ (but also a factor) if $w = \wedge$ or $w = a_{i} a_{i+1} \ldots a_{j}$, $i, j \in \mathbb{N}$, $i \leqslant j \leqslant |f|$. We set for short $w \mid f$. Let $F := F(f)$ be the set of the subwords of $f$.
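For a finite word, the definition quoted above can be sketched directly in code. The following is a minimal illustration, assuming words are represented as Python strings and the empty word as the empty string; the function names subwords and complexity are hypothetical, and complexity simply counts the distinct subwords of a given length.

```python
def subwords(f):
    """Return the set F(f) of subwords (factors) of the finite word f, including the empty word."""
    factors = {""}  # the empty word is always a subword
    for i in range(len(f)):
        for j in range(i + 1, len(f) + 1):
            # f[i:j] is the contiguous block a_{i+1} ... a_j in 1-based notation
            factors.add(f[i:j])
    return factors

def complexity(f, n):
    """Count the distinct subwords of f that have length n."""
    return sum(1 for w in subwords(f) if len(w) == n)

print(sorted(subwords("abab"), key=lambda w: (len(w), w)))
# ['', 'a', 'b', 'ab', 'ba', 'aba', 'bab', 'abab']
print(complexity("abab", 2))
# 2
```

In this combinatorial sense, "ught" is a subword of "bought" simply because it occurs as a contiguous block, independent of any trained segmentation model.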