SentencePiece Tokenization System

A SentencePiece Tokenization System is an unsupervised subword tokenization system and unsupervised subword detokenization system.

Context:
- It can have performance reported in (Kudo & Richardson, 2018).
- Resource(s):
  - software is available at https://github.com/google/sentencepiece
- System's Architecture:
  - It can (typically) consist of 4 system-modules:
    - SentencePiece Normalizer - normalizes semantically-equivalent Unicode characters into canonical forms;
    - SentencePiece Trainer - trains a subword segmentation model from a normalized corpus;
    - SentencePiece Encoder - executes the Normalizer on the input text and tokenizes the normalized text into a subword sequence with the subword model trained by the Trainer;
    - SentencePiece Decoder - converts the subword sequence into the normalized text.
- Training and other ML Tools:
  - It can use a SentencePiece Trainer that implements fast subword training and segmentation algorithms to produce lossless tokenization from large raw datasets.
  - It can use a SentencePiece self-contained model file containing normalization rules, segmentation parameters, and a pre-compiled finite-state transducer for character normalization.
  - It can train without pre-tokenization; (Kudo & Richardson, 2018) uses the Moses tokenizer and KyTea only for the English and Japanese pre-tokenization baselines in its experiments.
- It can solve a SentencePiece Task by implementing a SentencePiece Algorithm.
- It can be designed for a Language-independent Neural Text Processing Task, including Neural Machine Translation.
- It can provide open-source C++ and Python implementations for subword units such as Byte-Pair-Encoding (BPE).
- It can implement Kudo's Unigram Language Model.
- … SentencePiece Python API;
- … SentencePiece C++ API;
- … SentencePiece TensorFlow API.
- ...
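
The unigram language model item above can be illustrated with a minimal sketch of Viterbi segmentation: among all ways to split a string into known pieces, pick the one whose summed piece log-probabilities is highest. This is not the library's implementation. The function name `viterbi_segment`, the table `TOY_LOGPROB`, and every probability in it are hypothetical values made up for illustration, not taken from a trained model.

```python
import math


def viterbi_segment(text, piece_logprob):
    """Return the highest-scoring split of `text` into known pieces.

    `piece_logprob` maps each piece to a log-probability; a segmentation's
    score is the sum of the log-probabilities of its pieces.
    """
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best score for the prefix text[:i]
    back = [0] * (n + 1)                # start index of the last piece in text[:i]
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in piece_logprob and best_score[start] > -math.inf:
                score = best_score[start] + piece_logprob[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the string to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Hypothetical toy vocabulary; the scores are invented for this example.
TOY_LOGPROB = {piece: math.log(p) for piece, p in {
    "▁": 0.05, "▁real": 0.10, "▁time": 0.10, "▁re": 0.05, "al": 0.05,
    "r": 0.02, "e": 0.02, "a": 0.02, "l": 0.02, "t": 0.02, "i": 0.02, "m": 0.02,
}.items()}

print(viterbi_segment("▁real▁time", TOY_LOGPROB))
```

With these invented scores, whole-word pieces outscore splits into shorter pieces, and a string with no matching multi-character piece falls back to single characters.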
Example(s):
- SentencePiece, v0.1.92 (~2020/06/08).
  - SentencePiece("I bought a real time operating system") ⇒ '▁I', '▁bo', 'ught', '▁a', '▁real', '▁time', '▁oper', 'ating', '▁system' .
  - SentencePiece("[[software engineering framework]] of [[Technical Debt Measure|technical debt]]") ⇒ '▁[[', 'software', '▁engineering', '▁framework', ']]', '▁of', '▁<', 'i', '>[[', 'Te', 'chn', 'ical', '▁D', 'eb', 't', '▁Measure', '|', 'techn', 'ical', '▁deb', 't', ']]' .
  - SentencePiece("A [[Supervised Learning Act]] is a [[learning act]]") ⇒ '▁A', '▁[[', 'Supervised', '▁Learning', '▁Act', ']]', '▁is▁a', '▁[[', 'learning', '▁act', ']]' .
  - SentencePiece("Weres th eeror?") ⇒ '▁W', 'er', 'es', '▁th', '▁e', 'er', 'or', '?'.
  - SentencePiece("ilike yester days weaather.") ⇒ '▁', 'il', 'ike', '▁y', 'ester', '▁days', '▁we', 'a', 'ather', '.'.
- …
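
The piece sequences in the examples above can be turned back into text by concatenating the pieces and replacing the "▁" meta symbol with a space. The sketch below assumes the leading "▁" comes from a dummy prefix added before encoding; the helper name `detokenize` is hypothetical and is not part of the SentencePiece API.

```python
WS = "\u2581"  # "▁", the meta symbol that stands for an original space


def detokenize(pieces):
    """Concatenate pieces and map the whitespace marker back to spaces."""
    text = "".join(pieces).replace(WS, " ")
    # Assumption: a single leading space is the dummy prefix, so drop it.
    return text[1:] if text.startswith(" ") else text


pieces = ['▁I', '▁bo', 'ught', '▁a', '▁real', '▁time', '▁oper', 'ating', '▁system']
print(detokenize(pieces))  # I bought a real time operating system
```

Because spaces are kept inside the pieces, no language-specific rule is needed to decide where spaces belong in the output.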
Counter-Example(s):
See: Neural Machine Translation, Neural Text Generation System, Neural Encoder-Decoder, Subword Neural Machine Translation.

References

2020

(GitHub, 2020) ⇒ https://github.com/google/sentencepiece Retrieved:2020-05-17.
- QUOTE: SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.]) and unigram language model [Kudo.]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.
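
As a rough illustration of the byte-pair-encoding idea mentioned in the quote, the following toy sketch learns merge rules from raw sentences by repeatedly merging the most frequent adjacent symbol pair. It is a simplified stand-in and not SentencePiece's trainer: it assumes pieces never cross whitespace, breaks ties deterministically by symbol order, and the names `learn_bpe`, `apply_merge`, and `encode_bpe` are hypothetical.

```python
from collections import Counter

WS = "\u2581"  # "▁", marks the space that preceded each word


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(sentences, num_merges):
    """Learn up to `num_merges` merge rules from raw, untokenized sentences."""
    # Each whitespace-separated chunk keeps a leading marker and starts as characters.
    word_counts = Counter()
    for sentence in sentences:
        for word in sentence.split():
            word_counts[tuple(WS + word)] += 1
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in word_counts.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, count in word_counts.items():
            merged[apply_merge(symbols, best)] += count
        word_counts = merged
    return merges


def encode_bpe(word, merges):
    """Segment one word by replaying the learned merges in order."""
    symbols = tuple(WS + word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


merges = learn_bpe(["low lower lowest", "low low"], num_merges=3)
print(merges)
print(encode_bpe("lowest", merges))
```

The number of merges plays the role of the predetermined vocabulary size: each merge adds one new piece to the initial character inventory.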

2018

(Kudo & Richardson, 2018) ⇒ Taku Kudo, and John Richardson. (2018). "SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing". In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018) System Demonstrations. DOI:10.18653/v1/d18-2012.
- QUOTE: This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation. It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to make a purely end-to-end and language independent system.
  (...)
  SentencePiece comprises four main components: Normalizer, Trainer, Encoder, and Decoder. Normalizer is a module to normalize semantically-equivalent Unicode characters into canonical forms. Trainer trains the subword segmentation model from the normalized corpus. We specify a type of subword model as the parameter of Trainer. Encoder internally executes Normalizer to normalize the input text and tokenizes it into a subword sequence with the subword model trained by Trainer. Decoder converts the subword sequence into the normalized text.
  The roles of Encoder and Decoder correspond to preprocessing (tokenization) and postprocessing (detokenization) respectively. However, we call them encoding and decoding as SentencePiece manages the vocabulary to id mapping and can directly convert the text into an id sequence and vice versa. Direct encoding and decoding to / from id sequences are useful for most of NMT systems as their input and output are id sequences.
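
The division of labor described in the quote (normalize, encode directly to ids, decode back to normalized text) can be sketched with a toy codec. Several assumptions are made here and none of them reflect the library's internals: NFKC is assumed as the normalization form, greedy longest-match stands in for the trained segmentation model, id 0 is reserved for unknown input, and the class name `ToyCodec` is hypothetical.

```python
import unicodedata

WS = "\u2581"  # "▁", the meta symbol that stands for a space


class ToyCodec:
    """Toy Normalizer + Encoder + Decoder over a fixed piece vocabulary."""

    def __init__(self, pieces):
        # Assumption: id 0 is reserved for characters not covered by any piece.
        self.id_to_piece = ["<unk>"] + list(pieces)
        self.piece_to_id = {p: i for i, p in enumerate(pieces, start=1)}

    def normalize(self, text):
        # Assumption: NFKC folds compatibility characters into canonical forms.
        return unicodedata.normalize("NFKC", text)

    def encode(self, text):
        """Normalize, escape spaces, then map the text to a list of piece ids."""
        s = WS + self.normalize(text).replace(" ", WS)
        ids, i = [], 0
        while i < len(s):
            # Greedy longest match (a stand-in for a trained subword model).
            for j in range(len(s), i, -1):
                if s[i:j] in self.piece_to_id:
                    ids.append(self.piece_to_id[s[i:j]])
                    i = j
                    break
            else:
                ids.append(0)  # unknown character
                i += 1
        return ids

    def decode(self, ids):
        """Map ids back to pieces and restore spaces in the normalized text."""
        text = "".join(self.id_to_piece[k] for k in ids).replace(WS, " ")
        return text[1:] if text.startswith(" ") else text


codec = ToyCodec(["\u2581Hello", "\u2581world", "."])
# The first word is written with full-width letters to exercise normalization.
ids = codec.encode("\uff28\uff45\uff4c\uff4c\uff4f world.")
print(ids)
print(codec.decode(ids))
```

Decoding returns the normalized text rather than the raw input, which matches the quote's statement that the Decoder converts the subword sequence into the normalized text.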