Byte-Pair Encoding (BPE) Tokenization System
A Byte-Pair Encoding (BPE) Tokenization System is a dictionary encoding system that implements a BPE-based tokenization algorithm to solve a BPE-based tokenization task (iteratively replacing the most frequent pair of adjacent bytes with a single unused byte or, in subword tokenization, with a new vocabulary symbol; a minimal sketch appears below the See list).
- Context:
- …
- Example(s):
- Counter-Example(s):
- …
- See: Subword Unit, Lossless Compression Algorithm, Subword Tokenization Algorithm, Word Segmentation Algorithm, Entropy Encoding Algorithm, Data Compression Algorithm, Context Tree Weighting (CTW) Algorithm.
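The core merge loop can be illustrated with a short sketch (Python; the `learn_bpe_merges` function and the toy corpus are illustrative assumptions, not a reference implementation):

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a toy corpus (illustrative sketch).

    words: dict mapping a word (as a tuple of symbols) to its frequency.
    Returns the list of learned merges, in the order they were applied.
    """
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the most frequent pair
        # with a single merged symbol.
        merged_words = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] = merged_words.get(tuple(out), 0) + freq
        words = merged_words
    return merges

# Toy corpus: word -> frequency, each word pre-split into characters.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
print(learn_bpe_merges(corpus, 3))
# -> [('w', 'e'), ('we', 'r'), ('l', 'o')] for this toy corpus
```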
References
2022
- https://dugas.ch/artificial_curiosity/GPT_architecture.html
- QUOTE:
- Note: For efficiency, GPT-3 actually uses byte-level Byte Pair Encoding (BPE) tokenization. What this means is that "words" in the vocabulary are not full words, but groups of characters (for byte-level BPE, bytes) which occur often in text. Using the GPT-3 Byte-level BPE tokenizer, "Not all heroes wear capes" is split into tokens "Not" "all" "heroes" "wear" "cap" "es", which have ids 3673, 477, 10281, 5806, 1451, 274 in the vocabulary. Here is a very good introduction to the subject, and a github implementation so you can try it yourself.
- 2022 edit: OpenAI now has a tokenizer tool, which allows you to type some text and see how it gets broken down into tokens. [1] ...
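The byte-level BPE tokenization described in the quote can be reproduced with OpenAI's open-source tiktoken library (an assumption here: tiktoken must be installed separately, and the "gpt2" encoding name is used for the GPT-2/GPT-3 byte-level BPE vocabulary):

```python
import tiktoken  # OpenAI's open-source BPE tokenizer library

# "gpt2" names the byte-level BPE encoding used by GPT-2/GPT-3.
enc = tiktoken.get_encoding("gpt2")

ids = enc.encode("Not all heroes wear capes")
print(ids)              # [3673, 477, 10281, 5806, 1451, 274] per the quoted example
print(enc.decode(ids))  # "Not all heroes wear capes"

# Inspect each token's surface form; byte-level BPE keeps leading spaces.
print([enc.decode([i]) for i in ids])
# ['Not', ' all', ' heroes', ' wear', ' cap', 'es']
```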