mT5 LLM
An mT5 LLM is a multilingual LLM based on the T5 LLM architecture and pre-trained on the mC4 corpus.
- Context:
- It can (typically) be pre-trained on an mC4 Multilingual Corpus.
- ...
- Example(s):
- mT5 (released in mT5-Small, mT5-Base, mT5-Large, mT5-XL, and mT5-XXL sizes).
- BERT Multilingual.
- ...
- Counter-Example(s):
- Monolingual Language Models like GPT-3 (trained predominantly on English-language data).
- Language-Specific NLP Models like BERT English or BERT Japanese.
- See: Natural Language Processing, Cross-Lingual Transfer, Language Model Pre-training, Zero-Shot Learning.
References
2024
- (Hugging Face, 2024) ⇒ In: Hugging Face Documentation.
- QUOTE: ... mT5 was only pre-trained on mC4 excluding any supervised training. Therefore, this model has to be fine-tuned before it is usable on a downstream task, unlike the original T5 model. Since mT5 was pre-trained unsupervisedly, there’s no real advantage to using a task prefix during single-task fine-tuning. If you are doing multi-task fine-tuning, you should use a prefix.
- Google has released the following variants: google/mt5-small, google/mt5-base, google/mt5-large, google/mt5-xl, and google/mt5-xxl.
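- The following is a minimal, illustrative sketch (not from the documentation) of loading one of these released checkpoints with the Hugging Face transformers library and running a single fine-tuning step. The "google/mt5-small" checkpoint choice, the example sentence pair, and the learning rate are assumptions for illustration.

```python
# Minimal sketch (assumes the Hugging Face `transformers` and `sentencepiece`
# packages and the released "google/mt5-small" checkpoint): since mT5 is only
# pre-trained on mC4, it has to be fine-tuned before it is usable downstream.
import torch
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# One illustrative (source, target) pair for single-task fine-tuning;
# no task prefix is used because mT5's pre-training was unsupervised.
source = "Das Haus ist wunderbar."
target = "The house is wonderful."

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # lr is illustrative

model.train()
outputs = model(input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                labels=labels)  # seq2seq cross-entropy is computed internally
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```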
2021
- (Xue et al., 2021) ⇒ Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. (2021). “mT5: A Massively Multilingual Pre-trained Text-to-text Transformer.” doi:10.48550/arXiv.2010.11934
- QUOTE: ... In this paper, we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We detail the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. ...
- NOTES:
- The paper introduces mT5, a multilingual variant of the T5 model, pre-trained on a newly developed Common Crawl-based dataset covering 101 languages, which is named mC4.
- The objective behind mT5 is to offer a massively multilingual model that closely follows the original T5 design, thereby inheriting all its advantages, such as its general-purpose text-to-text format, insights from empirical studies, and scalability.
- The mC4 dataset is an extended version of the C4 dataset, designed to include natural text across 101 languages, derived from the public Common Crawl web scrape, with modifications made to ensure data quality and relevance.
- mT5's performance was validated on multiple multilingual benchmarks, where it demonstrated state-of-the-art results in many cases. The paper also explores the issue of "accidental translation" during zero-shot tasks and proposes a simple technique to mitigate it.
- mT5 employs a text-to-text framework for all NLP tasks, using an encoder-decoder Transformer architecture. It was pre-trained with a masked language modeling "span-corruption" objective, in which consecutive spans of input tokens are replaced with sentinel tokens and the model is trained to predict the dropped-out spans (see the sketch after this list).
- The paper discusses the importance of the sampling strategy when training on data from multiple languages. To boost lower-resource languages, the authors sample examples from a language with probability proportional to the size of that language's dataset raised to the power of a hyperparameter α (a worked sketch of this rule follows this list).
- The mT5 vocabulary size was increased to 250,000 wordpieces to accommodate over 100 languages, with adjustments made for languages with large character sets through the use of SentencePiece's "byte-fallback" feature.
- The authors compare mT5 to other massively multilingual pre-trained language models, highlighting its unique position in terms of architecture, parameter count, language coverage, and data source.
- The paper details experiments on the XTREME multilingual benchmark, showing that mT5 models of various sizes exceed or approach state-of-the-art performance across different NLP tasks and languages, underscoring the benefits of scaling up a simple pre-training recipe.
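- As referenced in the notes above, the following is a toy Python sketch of a T5-style span-corruption example. It is illustrative only: the real mT5 pipeline operates on SentencePiece token ids drawn from mC4, samples span lengths from a distribution, and batches examples. The function name, the whitespace tokenization, and the raised corruption rate in the demo call are assumptions; the <extra_id_N> sentinel naming follows T5's convention.

```python
# Illustrative sketch of a T5-style "span-corruption" training example:
# consecutive spans of input tokens are replaced with sentinel tokens
# (<extra_id_N>), and the target is the sequence of dropped-out spans.
import random

def span_corrupt(tokens, corruption_rate=0.15, span_len=3, seed=0):
    """Return (inputs, targets) for one span-corruption example."""
    rng = random.Random(seed)
    budget = max(1, int(len(tokens) * corruption_rate))  # tokens to mask
    inputs, targets = [], []
    i, sentinel_id, masked = 0, 0, 0
    while i < len(tokens):
        if masked < budget and rng.random() < corruption_rate:
            sentinel = f"<extra_id_{sentinel_id}>"
            inputs.append(sentinel)           # span removed from the input...
            targets.append(sentinel)          # ...and moved to the target
            targets.extend(tokens[i:i + span_len])
            i += span_len
            masked += span_len
            sentinel_id += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

tokens = "The mT5 model was pre-trained on the multilingual mC4 corpus".split()
# A higher rate than T5's ~15% so this short example visibly masks a span.
inp, tgt = span_corrupt(tokens, corruption_rate=0.3)
print("input :", " ".join(inp))
print("target:", " ".join(tgt))
```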
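- Also referenced above, a small sketch of the language-sampling rule: a language L is sampled with probability proportional to |D_L|^α, which boosts lower-resource languages when α < 1; the mT5 paper selects α = 0.3. The per-language counts below are hypothetical placeholders, not actual mC4 statistics.

```python
# Sketch of temperature-style language sampling for multilingual pre-training:
# p(L) is proportional to |D_L| ** alpha, with alpha = 0.3 as in the mT5 paper.

def language_sampling_probs(example_counts, alpha=0.3):
    """Map {language: dataset size} to {language: sampling probability}."""
    weights = {lang: count ** alpha for lang, count in example_counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# Hypothetical high-, mid-, and low-resource language sizes.
counts = {"en": 3_000_000_000, "de": 300_000_000, "sw": 3_000_000}
for lang, p in language_sampling_probs(counts).items():
    print(f"{lang}: {p:.3f}")
```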