Subword Tokenization Task
A Subword Tokenization Task is a text segmentation task that requires the detection of subword units.
- AKA: Sub-Word Segmentation.
- Context:
- It can be solved by a Subword Tokenization System (that implements a subword tokenization algorithm, such as Byte-Pair Encoding; see the sketches below).
- It can be an NLP Preprocessing Task that helps handle Out-of-Vocabulary Words and Rare Words.
- …
- Example(s):
- as applied in Kudo & Richardson (2018) as a preprocessing step before Neural Machine Translation (see the SentencePiece sketch after this list).
- SWS("Abwasserbehandlungsanlage") ⇒ ["Abwasser", "behandlungs", "anlage"].
- SWS("폐수처리장") ⇒ ["폐수", "처리", "장"]
- SWS("sewage water treatment plant") ⇒ ["sew", "age", " ", "water", " ", "treat", "ment", " ", "plant"]
- …
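A subword tokenization system such as SentencePiece (Kudo & Richardson, 2018) can be trained directly on raw text and then used to produce segmentations like the ones above. The following is a minimal sketch, assuming the `sentencepiece` Python package (recent versions) and a hypothetical plain-text corpus file `corpus.txt`; the model file name, vocabulary size, and the actual subword units produced are all illustrative and depend on the trained model.

```python
# Minimal sketch of training and applying a subword tokenizer with SentencePiece.
# Assumes a hypothetical corpus file `corpus.txt`; outputs depend on training data.
import sentencepiece as spm

# Train a subword model on the (hypothetical) raw-text corpus.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="subword", vocab_size=8000
)

# Load the trained model and segment a compound word into subword units.
sp = spm.SentencePieceProcessor(model_file="subword.model")
print(sp.encode("Abwasserbehandlungsanlage", out_type=str))
# e.g. ['▁Abwasser', 'behandlungs', 'anlage']  (actual pieces depend on the vocabulary)
```

Because SentencePiece treats its input as a raw character sequence (whitespace included), the same procedure applies to languages without explicit word boundaries, as in the Korean example above.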
- Counter-Example(s):
- Orthographic Tokenization, which segments text according to a set of predetermined rules rather than learned subword units.
- See: SentencePiece, Subword Unit, Error Correction Task, Natural Language Processing, Artificial Error Generation.
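Sennrich et al. (2016), referenced below, popularized Byte-Pair Encoding (BPE) as a subword tokenization algorithm: the most frequent pair of adjacent symbols is repeatedly merged into a new symbol. The following is a minimal sketch in the spirit of their learn-BPE procedure, using a toy vocabulary of word frequencies similar to the paper's example; it is illustrative, not the reference implementation, and the number of merges is a free parameter.

```python
# Minimal byte-pair-encoding (BPE) sketch: repeatedly merge the most frequent
# adjacent symbol pair in a toy vocabulary of space-separated symbols.
import collections
import re

def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge every occurrence of `pair` into a single new symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy word frequencies; `</w>` marks the end of a word.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(10):                   # number of merges is a free parameter
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent symbol pair
    vocab = merge_vocab(best, vocab)
    print(best)
```

Learned merges of this kind are what distinguish subword tokenization from the rule-based Orthographic Tokenization listed as a counter-example above.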
References
2018
- (Kudo & Richardson, 2018) ⇒ Taku Kudo, and John Richardson. (2018). “SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing". In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018) System Demonstrations. DOI:10.18653/v1/d18-2012.
2016
- (Sennrich et al., 2016) ⇒ Rico Sennrich, Barry Haddow, and Alexandra Birch. (2016). “Neural Machine Translation of Rare Words with Subword Units.” In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL-2016).
- QUOTE: ... One technical difference from our work is that the attention mechanism still operates on the level of words in the model by Ling et al. (2015b), and that the representation of each word is fixed-length. We expect that the attention mechanism benefits from our variable-length representation: the network can learn to place attention on different subword units at each step. Recall our introductory example Abwasserbehandlungsanlage, for which a subword segmentation avoids the information bottleneck of a fixed-length representation.