SentencePiece Task
A SentencePiece Task is a Text Tokenization Task that is an Unsupervised Subword Tokenization-Detokenization Task: it produces tokenized text directly from raw text data.
- AKA: SentencePiece Benchmark Task.
- Context:
- Task Input(s): raw text;
- Task Output(s): tokenized text;
- Task Requirement(s):
- Benchmark Datasets:
- Kyoto Free Translation Task (KFTT) datasets - the training, development, and test datasets of KFTT consist of 440k, 1,166, and 1,160 sentences, respectively.
- Benchmark Performance Metrics:
- BLEU score for evaluating the performance of English-Japanese Neural Machine Translation Systems under each preprocessing (tokenization) scheme.
- Running time for evaluating the speed of subword training and segmentation systems.
- Baseline Models:
- It can be solved by a SentencePiece System that implements a SentencePiece Algorithm included in the SentencePiece Software Library (see the sketch below).
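The workflow can be illustrated with the SentencePiece Python API. The following is a minimal sketch, assuming a raw-text corpus file named corpus.txt and a model prefix kftt_sp (both hypothetical names); vocab_size=8000 mirrors the 8k shared vocabulary used in the results below:

```python
import sentencepiece as spm

# Train a subword model directly from raw (untokenized) sentences.
# "corpus.txt" and the "kftt_sp" prefix are hypothetical names;
# vocab_size=8000 mirrors the 8k shared vocabulary in the results below.
spm.SentencePieceTrainer.train(
    input="corpus.txt",      # one raw sentence per line
    model_prefix="kftt_sp",  # writes kftt_sp.model and kftt_sp.vocab
    vocab_size=8000,
    model_type="unigram",    # the default; "bpe" is also supported
)

# Load the trained model and segment raw text into subword pieces.
sp = spm.SentencePieceProcessor(model_file="kftt_sp.model")
print(sp.encode("Hello world.", out_type=str))  # e.g. ['▁Hello', '▁world', '.']
```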
- Example(s):
English-Japanese translation results (BLEU) on the KFTT test data (Kudo & Richardson, 2018):

| Lang. pair | Setting (source/target) | Vocab. | BLEU |
| --- | --- | --- | --- |
| ja → en | Word model (baseline) | 80k/80k | 28.24 |
| ja → en | SentencePiece | 8k (shared) | 29.55 |
| ja → en | SentencePiece w/ pre-tok. | 8k (shared) | 29.85 |
| ja → en | Word/SentencePiece | 80k/8k | 27.24 |
| ja → en | SentencePiece/Word | 8k/80k | 29.14 |
| en → ja | Word model (baseline) | 80k/80k | 20.06 |
| en → ja | SentencePiece | 8k (shared) | 21.62 |
| en → ja | SentencePiece w/ pre-tok. | 8k (shared) | 20.86 |
| en → ja | Word/SentencePiece | 80k/8k | 21.41 |
| en → ja | SentencePiece/Word | 8k/80k | 19.94 |
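The BLEU figures above are corpus-level scores. A minimal sketch of how such a score can be computed, assuming the sacrebleu package as the scoring tool (any BLEU implementation would do; the sentences below are made-up placeholders):

```python
import sacrebleu  # an assumed choice of BLEU tool (pip install sacrebleu)

# Hypothetical detokenized system outputs and reference translations.
hypotheses = ["this is a test .", "another sentence ."]
references = [["this is the test .", "one more sentence ."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # corpus-level BLEU score on a 0-100 scale
```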
Training and segmentation times in seconds (Kudo & Richardson, 2018):

| Task | Tool | Pre-tok. | Japanese (sec.) | English (sec.) |
| --- | --- | --- | --- | --- |
| Train | subword-nmt | yes | 56.9 | 54.1 |
| Train | SentencePiece | yes | 10.1 | 16.8 |
| Train | subword-nmt | no | 528.0 | 94.7 |
| Train | SentencePiece | no | 217.3 | 21.8 |
| Seg. | subword-nmt | yes | 23.7 | 28.6 |
| Seg. | SentencePiece | yes | 8.2 | 20.3 |
| Seg. | subword-nmt | no | 216.2 | 36.1 |
| Seg. | SentencePiece | no | 5.9 | 20.3 |
| Pre-tokenization | KyTea (ja) / Moses (en) | | 24.6 | 15.8 |
- Counter-Example(s):
- See: Subword Neural Machine Translation, Moses, Kyoto Free Translation Task (KFTT), Kyoto Text Analysis Toolkit (KyTea), Neural Machine Translation Task, Neural Text Generation Task, Natural Language Processing Task, Neural Encoder-Decoder Task, SentencePiece Python API, SentencePiece C++ API, SentencePiece TensorFlow API.
References
2020
- (GitHub, 2020) ⇒ https://github.com/google/sentencepiece Retrieved:2020-05-17.
- QUOTE: SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) (Sennrich et al.)) and the unigram language model (Kudo) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.
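Both algorithms named in the quote are exposed through a single training parameter. A minimal sketch, reusing the hypothetical corpus.txt from the earlier example:

```python
import sentencepiece as spm

# Both subword algorithms mentioned in the quote are selected via
# model_type; the corpus and prefix names are illustrative assumptions.
for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix=f"demo_{model_type}",
        vocab_size=8000,
        model_type=model_type,  # byte-pair encoding vs. unigram LM
    )
```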
2018
- (Kudo & Richardson, 2018) ⇒ Taku Kudo, and John Richardson. (2018). “SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing". In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018) System Demonstrations. DOI:10.18653/v1/d18-2012.
- QUOTE: This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation. It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to make a purely end-to-end and language independent system. (...) SentencePiece comprises four main components: Normalizer, Trainer, Encoder, and Decoder. Normalizer is a module to normalize semantically-equivalent Unicode characters into canonical forms. Trainer trains the subword segmentation model from the normalized corpus. We specify a type of subword model as the parameter of Trainer. Encoder internally executes Normalizer to normalize the input text and tokenizes it into a subword sequence with the subword model trained by Trainer. Decoder converts the subword sequence into the normalized text.
The roles of Encoder and Decoder correspond to preprocessing (tokenization) and postprocessing (detokenization) respectively. However, we call them encoding and decoding as SentencePiece manages the vocabulary to id mapping and can directly convert the text into an id sequence and vice versa. Direct encoding and decoding to / from id sequences are useful for most of NMT systems as their input and output are id sequences.
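The four components map onto the SentencePiece Python API roughly as follows; a minimal sketch, assuming the kftt_sp.model file trained in the earlier example (training itself corresponds to the Trainer component shown above):

```python
import sentencepiece as spm

# Model file assumed from the earlier sketches.
sp = spm.SentencePieceProcessor(model_file="kftt_sp.model")

text = "Hello world."

# Encoder: runs the Normalizer internally, then tokenizes into subwords.
# Because SentencePiece manages the vocabulary-to-id mapping itself, it can
# emit either surface pieces or the id sequence an NMT system consumes.
pieces = sp.encode(text, out_type=str)  # e.g. ['▁Hello', '▁world', '.']
ids = sp.encode(text, out_type=int)     # the corresponding id sequence

# Decoder: maps the id sequence back to detokenized text; for input that
# is already in normalized form, the round trip is lossless.
assert sp.decode(ids) == text
```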