Subword Sequence
(Redirected from subword sequence)
Jump to navigation
Jump to search
A Subword Sequence is a symbol string composed of Subword units.
References
2018
- (Kudo & Richardson, 2018) ⇒ Taku Kudo, and John Richardson. (2018). “SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing". In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018) System Demonstrations. DOI:10.18653/v1/d18-2012.
- QUOTE: For the sake of clarity, SentencePiece first escapes the whitespace with a meta symbol _ (U+2581), and tokenizes the input into an arbitrary subword sequence, for example:
- Raw text: Hello_world.
- Tokenized: [Hello] [_wor] [ld] [.]
- QUOTE: For the sake of clarity, SentencePiece first escapes the whitespace with a meta symbol _ (U+2581), and tokenizes the input into an arbitrary subword sequence, for example: