Subword Tokenization Task
A Subword Tokenization Task is a text segmentation task that requires the detection of subword units.
- AKA: Sub-Word Segmentation.
- Context:
- It can be solved by a Subword Tokenization System (that implements a subword tokenization algorithm, such as Byte-Pair Encoding (Sennrich et al., 2016); see the sketch after this outline).
- It can be an NLP Preprocessing Task that helps handle Out-of-Vocabulary Words and Rare Words.
- …
- Example(s):
- as applied in Kudo & Richardson (2018), as a preprocessing step before Neural Machine Translation.
- SWS("Abwasserbehandlungsanlage") ⇒ ["Abwasser", "behandlungs", "anlage"].
- SWS("폐수처리장") ⇒ ["폐수", "처리", "장"]
- SWS("sewage water treatment plant") ⇒ ["sew", "age", " ", "water", " ", "treat", "ment", " ", "plant"]
- …
- Counter-Example(s):
- Orthographic Tokenization, based on a set of predetermined rules.
- See: SentencePiece, Subword Unit, Error Correction Task, Natural Language Processing, Artificial Error Generation.
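Below is a minimal, self-contained Python sketch of the Byte-Pair Encoding (BPE) merge-learning loop described in Sennrich et al. (2016), one common subword tokenization algorithm. The toy word-frequency vocabulary and the number of merges are illustrative assumptions, not values from either cited paper.

```python
import collections
import re

def get_stats(vocab):
    """Count the frequency of each adjacent symbol pair in the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab_in):
    """Merge every occurrence of the given symbol pair into a single symbol."""
    vocab_out = {}
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word, freq in vocab_in.items():
        vocab_out[pattern.sub(''.join(pair), word)] = freq
    return vocab_out

# Toy corpus statistics (illustrative): each word is a space-separated
# sequence of characters plus an end-of-word marker '</w>'.
vocab = {
    'l o w </w>': 5,
    'l o w e r </w>': 2,
    'n e w e s t </w>': 6,
    'w i d e s t </w>': 3,
}

num_merges = 10  # illustrative; real systems learn thousands of merges
for _ in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent symbol pair
    vocab = merge_vocab(best, vocab)
    print(best)  # the learned merge operation, e.g. ('e', 's')
```

The learned merge operations can then be applied greedily to segment unseen words into subword units, which is how such systems handle rare and out-of-vocabulary words.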
References
2018
- (Kudo & Richardson, 2018) ⇒ Taku Kudo, and John Richardson. (2018). “SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing". In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018) System Demonstrations. DOI:10.18653/v1/d18-2012.
2016
- (Sennrich et al., 2016) ⇒ Rico Sennrich, Barry Haddow, and Alexandra Birch. (2016). “Neural Machine Translation of Rare Words with Subword Units.” In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL-2016).
- QUOTE: ... One technical difference from our work is that the attention mechanism still operates on the level of words in the model by Ling et al. (2015b), and that the representation of each word is fixed-length. We expect that the attention mechanism benefits from our variable-length representation: the network can learn to place attention on different subword units at each step. Recall our introductory example Abwasserbehandlungsanlange, for which a subword segmentation avoids the information bottleneck of a fixed-length representation.