SentencePiece Task

From GM-RKB

A SentencePiece Task is a Text Tokenization Task that is an Unsupervised Subword Tokenization-Detokenization Task and that can produce tokenized text directly from raw data.
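The tokenization-detokenization round trip can be illustrated with a minimal sketch. It assumes a hand-picked toy vocabulary (`TOY_VOCAB`) and a greedy longest-match segmenter, both of which are illustrative stand-ins: SentencePiece itself learns its vocabulary from raw text with a unigram language model or BPE. The one detail taken from SentencePiece is the whitespace meta symbol "▁" (U+2581), which lets spaces be treated as ordinary symbols so that the original text can be recovered from the pieces.

```python
# Toy illustration only: the vocabulary is hand-picked, not learned, and
# greedy longest-match stands in for SentencePiece's learned segmentation.
WS = "\u2581"  # "▁", the meta symbol SentencePiece uses to mark whitespace


def tokenize(text, vocab):
    """Segment raw text into pieces after escaping spaces as WS."""
    escaped = WS + text.replace(" ", WS)
    max_len = max(len(piece) for piece in vocab)
    pieces, i = [], 0
    while i < len(escaped):
        # Try the longest candidate first; fall back to a single character.
        for j in range(min(len(escaped), i + max_len), i, -1):
            if escaped[i:j] in vocab or j == i + 1:
                pieces.append(escaped[i:j])
                i = j
                break
    return pieces


def detokenize(pieces):
    """Invert tokenize(): concatenate the pieces and restore the spaces."""
    return "".join(pieces).replace(WS, " ")[1:]


TOY_VOCAB = {WS + "Hello", WS + "wor", "ld", "."}  # hypothetical vocabulary
pieces = tokenize("Hello world.", TOY_VOCAB)
restored = detokenize(pieces)
print(pieces)    # ['▁Hello', '▁wor', 'ld', '.']
print(restored)  # Hello world.
```

Because whitespace is kept inside the pieces, detokenization needs no language-specific rules; this sketch holds for any input that does not itself contain the "▁" character.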

Table 1

| Lang pair | Setting (source/target) | Vocab. size | BLEU |
|---|---|---|---|
| ja → en | Word model (baseline) | 80k/80k | 28.24 |
| ja → en | SentencePiece | 8k (shared) | 29.55 |
| ja → en | SentencePiece w/ pre-tok. | 8k (shared) | 29.85 |
| ja → en | Word/SentencePiece | 80k/8k | 27.24 |
| ja → en | SentencePiece/Word | 8k/80k | 29.14 |
| en → ja | Word model (baseline) | 80k/80k | 20.06 |
| en → ja | SentencePiece | 8k (shared) | 21.62 |
| en → ja | SentencePiece w/ pre-tok. | 8k (shared) | 20.86 |
| en → ja | Word/SentencePiece | 80k/8k | 21.41 |
| en → ja | SentencePiece/Word | 8k/80k | 19.94 |
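To make the comparison in Table 1 easier to read, the following sketch computes each setting's BLEU difference from the word-model baseline. The scores are copied from the table; the names `BLEU`, `BASELINE`, `gains` and `GAINS` are illustrative choices and not part of any tool.

```python
# BLEU scores copied from Table 1, keyed by translation direction and setting.
BLEU = {
    "ja->en": {
        "Word model (baseline)": 28.24,
        "SentencePiece": 29.55,
        "SentencePiece w/ pre-tok.": 29.85,
        "Word/SentencePiece": 27.24,
        "SentencePiece/Word": 29.14,
    },
    "en->ja": {
        "Word model (baseline)": 20.06,
        "SentencePiece": 21.62,
        "SentencePiece w/ pre-tok.": 20.86,
        "Word/SentencePiece": 21.41,
        "SentencePiece/Word": 19.94,
    },
}
BASELINE = "Word model (baseline)"


def gains(scores):
    """Return each setting's BLEU difference from the baseline, in points."""
    base = scores[BASELINE]
    return {
        name: round(score - base, 2)
        for name, score in scores.items()
        if name != BASELINE
    }


GAINS = {pair: gains(scores) for pair, scores in BLEU.items()}
for pair, deltas in GAINS.items():
    for name, delta in deltas.items():
        print(f"{pair}  {name:<26} {delta:+.2f}")
```

On these numbers, the only settings below the baseline are Word/SentencePiece for ja → en (-1.00) and SentencePiece/Word for en → ja (-0.12); in both of them the word model is applied to the Japanese side.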
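Table 2

| Task | Tool | Pre-tok. | Japanese time (sec.) | English time (sec.) |
|---|---|---|---|---|
| Train | subword-nmt | yes | 56.9 | 54.1 |
| Train | SentencePiece | yes | 10.1 | 16.8 |
| Train | subword-nmt | no | 528.0 | 94.7 |
| Train | SentencePiece | no | 217.3 | 21.8 |
| Seg. | subword-nmt | yes | 23.7 | 28.6 |
| Seg. | SentencePiece | yes | 8.2 | 20.3 |
| Seg. | subword-nmt | no | 216.2 | 36.1 |
| Seg. | SentencePiece | no | 5.9 | 20.3 |
| Pre-tokenization | KyTea (ja) / Moses (en) | | 24.6 | 15.8 |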
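The timings in Table 2 can be summarized as speed ratios. The sketch below divides the subword-nmt time by the SentencePiece time for each task and pre-tokenization setting; the times are copied from the table, while the names `TIMES`, `speedups` and `SPEEDUPS` are illustrative.

```python
# Times in seconds copied from Table 2, keyed by (task, pre-tokenized?).
# Each inner value maps a tool to its (Japanese, English) times.
TIMES = {
    ("Train", True): {"subword-nmt": (56.9, 54.1), "SentencePiece": (10.1, 16.8)},
    ("Train", False): {"subword-nmt": (528.0, 94.7), "SentencePiece": (217.3, 21.8)},
    ("Seg.", True): {"subword-nmt": (23.7, 28.6), "SentencePiece": (8.2, 20.3)},
    ("Seg.", False): {"subword-nmt": (216.2, 36.1), "SentencePiece": (5.9, 20.3)},
}


def speedups(times):
    """Return {(task, pretok): (ja_ratio, en_ratio)}: subword-nmt time / SentencePiece time."""
    result = {}
    for key, tools in times.items():
        slow, fast = tools["subword-nmt"], tools["SentencePiece"]
        result[key] = tuple(round(s / f, 1) for s, f in zip(slow, fast))
    return result


SPEEDUPS = speedups(TIMES)
for (task, pretok), (ja, en) in SPEEDUPS.items():
    label = "yes" if pretok else "no"
    print(f"{task:<5} pre-tok={label:<3} ja x{ja}  en x{en}")
```

A ratio above 1 means SentencePiece was faster in that row. On these numbers the largest gap is Japanese segmentation without pre-tokenization (216.2 s against 5.9 s, roughly 37 times).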

