Byte-Pair Encoding (BPE) Subword Segmentation Algorithm

A Byte-Pair Encoding (BPE) Subword Segmentation Algorithm is a Subword Segmentation Algorithm that is based on a Byte-Pair Encoding (BPE) Algorithm.

AKA: E Subword Tokenization Algorithm.
Context:
- It usually an unsupervised learning algorithm.
- It merges the most frequent symbol pair into a new symbol.
Example(s):
Counter-Example(s):
See: Subword Unit, Lossless Compression Algorithm, Word Segmentation Algorithm, Entropy Encoding Algorithm, Data Compression Algorithm, Context Tree Weighting (CTW) Algorithm, OOV.

References

2021a

(Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Byte_pair_encoding Retrieved:2021-5-2.
- Byte pair encoding^[1] or digram coding is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data. A table of the replacements is required to rebuild the original data. The algorithm was first described publicly by Philip Gage in a February 1994 article "A New Algorithm for Data Compression" in the C Users Journal. A variant of the technique has shown to be useful in several natural language processing (NLP) applications, such as Google's SentencePiece, and OpenAI's GPT-3.

↑ Gage, Philip (1994). "A New Algorithm for Data Compression". The C User Journal.

2021b

(bpemb, 2021) ⇒ What are subword embeddings and why should I use them? Retrieved:2021-5-2.
- QUOTE: Enter Byte-Pair Encoding (BPE) (Sennrich et al, 2016), an unsupervised subword segmentation method. BPE starts with a sequence of symbols, for example characters, and iteratively merges the most frequent symbol pair into a new symbol.
  For example, applying BPE to English might first merge the characters h and e into a new symbol he, then t and h into th, then t and he into the, and so on.
  Learning these merge operations from a large corpus (e.g. all Wikipedia articles in a given language) often yields reasonable subword segementations. For example, a BPE model trained on English Wikipedia splits melfordshire into mel, ford, and shire.

2020

(Bostrom & Durrett, 2020) ⇒ Kaj Bostrom, and Greg Durrett. (2020). “Byte Pair Encoding is Suboptimal for Language Model Pretraining.” In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020).
- QUOTE : A BPE vocabulary is constructed as follows:

Algorithm 1 Byte-pair encoding (Sennrich et al., 2016; Gage, 1994)
1: Input: set of strings $D$, target vocab size $k$ 2: procedure BPE$\left(D, k\right)$ 3: $V \leftarrow$ all unique characters in $D$ (about 4,000 in English Wikipedia) 4: while $\vert V \vert < k$ do ⊳ Merge tokens 5: $t_L, t_R \leftarrow$ Most frequent bigram in $D$ 6: $t_{NEW} \leftarrow t_L + t_R$ ⊳ Make new token 7: $V \leftarrow V + [t_{NEW}]$ 8: Replace each occurrence of $t_L$, $t_R$ in $D$ with $t_{NEW}$ 9: end while 10: return $V$ 11: end procedure

:: BPE tokenization takes the vocabulary $V$ containing ordered merges and applies them to new text in the same order as they occurred during vocabulary construction.

2016

(Sennrich et al., 2016) ⇒ Rico Sennrich, Barry Haddow, and Alexandra Birch. (2016). “Neural Machine Translation of Rare Words with Subword Units.” In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL-2016).
- QUOTE: ... Byte Pair Encoding (BPE) (Gage, 1994) is a simple data compression technique that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte. We adapt this algorithm for word segmentation. Instead of merging frequent pairs of bytes, we merge characters or character sequences. ...
  Figure 1 shows a toy example of learned BPE operations. At test time, we first split words into sequences of characters, then apply the learned operations to merge the characters into larger, known symbols. This is applicable to any word, and allows for open-vocabulary networks with fixed symbol vocabularies^[1]. In our example, the OOV “lower” would be segmented into “low-er".

**Figure 1:** BPE merge operations learned from dictionary {‘low’, ‘lowest’, ‘newer’, ‘wider’}.

↑ The only symbols that will be unknown at test time are unknown characters, or symbols of which all occurrences in the training text have been merged into larger symbols, like “safeguar", which has all occurrences in our training text merged into “safeguard". We observed no such symbols at test time, but the issue could be easily solved by recursively reversing specific merges until all symbols are known.

2017

(Heinzerling & Strube, 2017) ⇒ Benjamin Heinzerling, and Michael Strube. (2017). “BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages.” In: arXiv preprint arXiv:1710.02187.
- QUOTE: ... Byte Pair Encoding is a variable-length encoding that views text as a sequence of symbols and iteratively merges the most frequent symbol pair into a new symbol. E.g., encoding an English text might consist of first merging the most frequent symbol pair $t h$ into a new symbol $th$, then merging the pair $th e$ into the in the next iteration, and so on. The number of merge operations o determines if the resulting encoding mostly creates short character sequences (e.g. $o = 1000$) or if it includes symbols for many frequently occurring words, e.g. o = 30; 000 (cf. Table 1). Since the BPE algorithm works with any sequence of symbols, it requires no preprocessing and can be applied to untokenized text. …

1994

(Gage, 1994) ⇒ Philip Gage (1994). "A New Algorithm for Data Compression". In: C User Journal, 12(2):23–38, February. DOI:10.5555/177910.177914.
- QUOTE: Many compression algorithms replace frequently occurring bit patterns with shorter representations. The simple approach I present is to replace common pairs of bytes by single bytes.
  The algorithm compresses data by finding the most frequently occurring pairs of adjacent bytes in the data and replacing all instances of the pair with a byte that was not in the original data. The algorithm repeats this process until no further compression is possible, either because there are no more frequently occurring pairs or there are no more unused bytes to represent pairs. The algorithm writes out the table of pair substitutions before the packed data.
  ----

[CUsersJ_Gage_1994-1] Gage, Philip (1994). "A New Algorithm for Data Compression". The C User Journal.

[2] The only symbols that will be unknown at test time are unknown characters, or symbols of which all occurrences in the training text have been merged into larger symbols, like “safeguar", which has all occurrences in our training text merged into “safeguard". We observed no such symbols at test time, but the issue could be easily solved by recursively reversing specific merges until all symbols are known.

[1]

[1]

Byte-Pair Encoding (BPE) Subword Segmentation Algorithm

References

2021a

2021b

2020

2016

2017

1994

Navigation menu

Search