BPEmb Subword Embedding Algorithm
A BPEmb Subword Embedding Algorithm is a Subword Embedding Algorithm that is based on a BPE Algorithm and is trained on Wikipedia editions in 275 languages (a usage sketch follows the lists below).
- AKA: BPEmb, BPEmb Subword Tokenization Algorithm.
- Context:
- Example(s):
- MultiBPEmb,
- …
- Counter-Example(s):
- See: Subword Unit, Dictionary Encoding Algorithm, Lossless Compression Algorithm, Subword Tokenization Algorithm, Word Segmentation Algorithm, Entropy Encoding Algorithm, Data Compression Algorithm, Context Tree Weighting (CTW) Algorithm, OOV, BPE Subword Tokenization Algorithm, word2vec, GloVe.
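Conceptually, a pre-trained BPEmb model segments raw text into BPE subword units and maps each unit to a vector. The following is a minimal usage sketch, assuming the authors' `bpemb` Python package (`pip install bpemb`); the exact segmentation and array shapes depend on the chosen language, vocabulary size, and embedding dimension, so the commented outputs are illustrative.

```python
# Minimal usage sketch of the bpemb package; outputs shown are illustrative.
from bpemb import BPEmb

# Load English subword embeddings (50-dimensional vectors here).
bpemb_en = BPEmb(lang="en", dim=50)

# Segment a raw string into BPE subword units -- no pre-tokenization required.
print(bpemb_en.encode("Stratford"))       # e.g. ['▁strat', 'ford']
print(bpemb_en.encode_ids("Stratford"))   # the corresponding vocabulary ids

# Embed the string: one 50-dimensional vector per subword unit.
print(bpemb_en.embed("Stratford").shape)  # e.g. (2, 50)
```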
References
2018
- (Heinzerling & Strube, 2018) ⇒ Benjamin Heinzerling, and Michael Strube. (2018). “BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages.” In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
- QUOTE: We presented BPEmb, a collection of subword embeddings trained on Wikipedias in 275 languages. Our evaluation showed that BPEmb performs as well as, and for some languages, better than other subword-based approaches. BPEmb requires no tokenization and is orders of magnitudes smaller than alternative embeddings, enabling potential use under resource constraints, e.g. on mobile devices.
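The quoted conclusion rests on the BPE Algorithm referenced above, which learns a subword vocabulary by repeatedly merging the most frequent adjacent symbol pair. As a loose illustration only (not the authors' implementation), the following is a minimal sketch of that merge-learning loop in the style of byte-pair encoding for NLP; the toy corpus and all names are hypothetical.

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every free-standing occurrence of the pair with its merged symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words as space-separated characters plus an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for step in range(10):  # the number of merges controls the vocabulary size
    stats = get_pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(step, best)
```

Each merge adds one new symbol to the vocabulary, so stopping after a fixed number of merges yields a subword vocabulary of a chosen size; BPEmb pre-trains embeddings over such vocabularies at several sizes per language.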