BPEmb Subword Embedding Algorithm
A BPEmb Subword Embedding Algorithm is a Subword Embedding Algorithm that is based on a BPE Algorithm and is trained on Wikipedia editions in 275 languages (a usage sketch follows the lists below).
- AKA: BPEmb, BPEmb Subword Tokenization Algorithm.
- Context:
- Example(s):
- MultiBPEmb,
- …
- Counter-Example(s):
- See: Subword Unit, Dictionary Encoding Algorithm, Lossless Compression Algorithm, Subword Tokenization Algorithm, Word Segmentation Algorithm, Entropy Encoding Algorithm, Data Compression Algorithm, Context Tree Weighting (CTW) Algorithm, OOV, BPE Subword Tokenization Algorithm, word2vec, GloVe.
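Conceptually, a pre-trained BPEmb model segments raw text into BPE subword units and maps each unit to a vector. The following is a minimal usage sketch, assuming the authors' `bpemb` Python package (`pip install bpemb`); the exact segmentation and array shapes depend on the chosen language, vocabulary size, and embedding dimension, so the commented outputs are illustrative.

```python
# Minimal usage sketch of the bpemb package; outputs shown are illustrative.
from bpemb import BPEmb

# Load English subword embeddings (50-dimensional vectors here).
bpemb_en = BPEmb(lang="en", dim=50)

# Segment a raw string into BPE subword units -- no pre-tokenization required.
print(bpemb_en.encode("Stratford"))       # e.g. ['▁strat', 'ford']
print(bpemb_en.encode_ids("Stratford"))   # the corresponding vocabulary ids

# Embed the string: one 50-dimensional vector per subword unit.
print(bpemb_en.embed("Stratford").shape)  # e.g. (2, 50)
```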
References
2018
- (Heinzerling & Strube, 2018) ⇒ Benjamin Heinzerling, and Michael Strube. (2018). “BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages.” In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
- QUOTE: We presented BPEmb, a collection of subword embeddings trained on Wikipedias in 275 languages. Our evaluation showed that BPEmb performs as well as, and for some languages, better than other subword-based approaches. BPEmb requires no tokenization and is orders of magnitudes smaller than alternative embeddings, enabling potential use under resource constraints, e.g. on mobile devices.
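The quoted conclusion rests on the BPE Algorithm referenced above, which learns a subword vocabulary by repeatedly merging the most frequent adjacent symbol pair. As a loose illustration only (not the authors' implementation), the following is a minimal sketch of that merge-learning loop in the style of byte-pair encoding for NLP; the toy corpus and all names are hypothetical.

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every free-standing occurrence of the pair with its merged symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words as space-separated characters plus an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for step in range(10):  # the number of merges controls the vocabulary size
    stats = get_pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(step, best)
```

Each merge adds one new symbol to the vocabulary, so stopping after a fixed number of merges yields a subword vocabulary of a chosen size; BPEmb pre-trains embeddings over such vocabularies at several sizes per language.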