SentencePiece Task
A SentencePiece Task is a Text Tokenization Task that is an Unsupervised Subword Tokenization-Detokenization Task: it produces tokenized text directly from raw text data.
- AKA: SentencePiece Benchmark Task.
- Context:
- Task Input(s): raw text;
- Task Output(s): tokenized text;
- Task Requirement(s):
- Benchmark Datasets:
- Kyoto Free Translation Task (KFTT) datasets - the training, development, and test datasets of KFTT consist of 440k, 1,166, and 1,160 sentences, respectively.
- Benchmark Performance Metrics:
- BLEU score for evaluating the performance of English-Japanese Neural Machine Translation Systems under each preprocessing (tokenization) scheme.
- Running time for evaluating the speed of subword training and segmentation systems.
- Baseline Models:
- It can be solved by a SentencePiece System that implements a SentencePiece Algorithm included in the SentencePiece Software Library (see the sketch below).
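The workflow can be illustrated with the SentencePiece Python API. The following is a minimal sketch, assuming a raw-text corpus file named corpus.txt and a model prefix kftt_sp (both hypothetical names); vocab_size=8000 mirrors the 8k shared vocabulary used in the results below:

```python
import sentencepiece as spm

# Train a subword model directly from raw (untokenized) sentences.
# "corpus.txt" and the "kftt_sp" prefix are hypothetical names;
# vocab_size=8000 mirrors the 8k shared vocabulary in the results below.
spm.SentencePieceTrainer.train(
    input="corpus.txt",      # one raw sentence per line
    model_prefix="kftt_sp",  # writes kftt_sp.model and kftt_sp.vocab
    vocab_size=8000,
    model_type="unigram",    # the default; "bpe" is also supported
)

# Load the trained model and segment raw text into subword pieces.
sp = spm.SentencePieceProcessor(model_file="kftt_sp.model")
print(sp.encode("Hello world.", out_type=str))  # e.g. ['▁Hello', '▁world', '.']
```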
- Example(s):
English-Japanese translation results (BLEU) on the KFTT test data (Kudo & Richardson, 2018):

| Lang. pair | Setting (source/target) | Vocab. | BLEU |
| --- | --- | --- | --- |
| ja → en | Word model (baseline) | 80k/80k | 28.24 |
| ja → en | SentencePiece | 8k (shared) | 29.55 |
| ja → en | SentencePiece w/ pre-tok. | 8k (shared) | 29.85 |
| ja → en | Word/SentencePiece | 80k/8k | 27.24 |
| ja → en | SentencePiece/Word | 8k/80k | 29.14 |
| en → ja | Word model (baseline) | 80k/80k | 20.06 |
| en → ja | SentencePiece | 8k (shared) | 21.62 |
| en → ja | SentencePiece w/ pre-tok. | 8k (shared) | 20.86 |
| en → ja | Word/SentencePiece | 80k/8k | 21.41 |
| en → ja | SentencePiece/Word | 8k/80k | 19.94 |
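The BLEU figures above are corpus-level scores. A minimal sketch of how such a score can be computed, assuming the sacrebleu package as the scoring tool (any BLEU implementation would do; the sentences below are made-up placeholders):

```python
import sacrebleu  # an assumed choice of BLEU tool (pip install sacrebleu)

# Hypothetical detokenized system outputs and reference translations.
hypotheses = ["this is a test .", "another sentence ."]
references = [["this is the test .", "one more sentence ."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # corpus-level BLEU score on a 0-100 scale
```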
Training and segmentation times in seconds (Kudo & Richardson, 2018):

| Task | Tool | Pre-tok. | Japanese (sec.) | English (sec.) |
| --- | --- | --- | --- | --- |
| Train | subword-nmt | yes | 56.9 | 54.1 |
| Train | SentencePiece | yes | 10.1 | 16.8 |
| Train | subword-nmt | no | 528.0 | 94.7 |
| Train | SentencePiece | no | 217.3 | 21.8 |
| Seg. | subword-nmt | yes | 23.7 | 28.6 |
| Seg. | SentencePiece | yes | 8.2 | 20.3 |
| Seg. | subword-nmt | no | 216.2 | 36.1 |
| Seg. | SentencePiece | no | 5.9 | 20.3 |
| Pre-tokenization | KyTea (ja) / Moses (en) | | 24.6 | 15.8 |
- Counter-Example(s):
- See: Subword Neural Machine Translation, Moses, Kyoto Free Translation Task (KFTT), Kyoto Text Analysis Toolkit (KyTea), Neural Machine Translation Task, Neural Text Generation Task, Natural Language Processing Task, Neural Encoder-Decoder Task, SentencePiece Python API, SentencePiece C++ API, SentencePiece TensorFlow API.
References
2020
- (GitHub, 2020) ⇒ https://github.com/google/sentencepiece Retrieved:2020-05-17.
- QUOTE: SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) (Sennrich et al.)) and the unigram language model (Kudo) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.
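Both algorithms named in the quote are exposed through a single training parameter. A minimal sketch, reusing the hypothetical corpus.txt from the earlier example:

```python
import sentencepiece as spm

# Both subword algorithms mentioned in the quote are selected via
# model_type; the corpus and prefix names are illustrative assumptions.
for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix=f"demo_{model_type}",
        vocab_size=8000,
        model_type=model_type,  # byte-pair encoding vs. unigram LM
    )
```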
2018
- (Kudo & Richardson, 2018) ⇒ Taku Kudo, and John Richardson. (2018). “SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing". In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018) System Demonstrations. DOI:10.18653/v1/d18-2012.
- QUOTE: This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation. It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to make a purely end-to-end and language independent system. (...) SentencePiece comprises four main components: Normalizer, Trainer, Encoder, and Decoder. Normalizer is a module to normalize semantically-equivalent Unicode characters into canonical forms. Trainer trains the subword segmentation model from the normalized corpus. We specify a type of subword model as the parameter of Trainer. Encoder internally executes Normalizer to normalize the input text and tokenizes it into a subword sequence with the subword model trained by Trainer. Decoder converts the subword sequence into the normalized text.
The roles of Encoder and Decoder correspond to preprocessing (tokenization) and postprocessing (detokenization) respectively. However, we call them encoding and decoding as SentencePiece manages the vocabulary to id mapping and can directly convert the text into an id sequence and vice versa. Direct encoding and decoding to / from id sequences are useful for most of NMT systems as their input and output are id sequences.
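The four components map onto the SentencePiece Python API roughly as follows; a minimal sketch, assuming the kftt_sp.model file trained in the earlier example (training itself corresponds to the Trainer component shown above):

```python
import sentencepiece as spm

# Model file assumed from the earlier sketches.
sp = spm.SentencePieceProcessor(model_file="kftt_sp.model")

text = "Hello world."

# Encoder: runs the Normalizer internally, then tokenizes into subwords.
# Because SentencePiece manages the vocabulary-to-id mapping itself, it can
# emit either surface pieces or the id sequence an NMT system consumes.
pieces = sp.encode(text, out_type=str)  # e.g. ['▁Hello', '▁world', '.']
ids = sp.encode(text, out_type=int)     # the corresponding id sequence

# Decoder: maps the id sequence back to detokenized text; for input that
# is already in normalized form, the round trip is lossless.
assert sp.decode(ids) == text
```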