2021 mT5: A Massively Multilingual Pre-trained Text-to-text Transformer
- (Xue et al., 2021) ⇒ Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. (2021). “mT5: A Massively Multilingual Pre-trained Text-to-text Transformer.” doi:10.48550/arXiv.2010.11934
Subject Headings: mT5 LLM, Multilingual LLM, mC4 Corpus.
Notes
- The paper introduces mT5, a multilingual variant of the T5 model, pre-trained on mC4, a newly developed Common Crawl-based dataset covering 101 languages.
- The objective behind mT5 is to offer a massively multilingual model that closely follows the original T5 design, thereby inheriting all its advantages, such as its general-purpose text-to-text format, insights from empirical studies, and scalability.
- The mC4 dataset extends the C4 dataset to natural text in 101 languages, drawn from the public Common Crawl web scrape, with filtering applied to ensure data quality and relevance.
- mT5's performance was validated on multiple multilingual benchmarks, achieving state-of-the-art results in many cases. The paper also examines "accidental translation" in zero-shot tasks, where the model partially translates its prediction into the wrong language, and proposes a simple mitigation: mixing the unsupervised pre-training task into fine-tuning (see the sketch after this list).
- mT5 employs a text-to-text framework for all NLP tasks, using an encoder-decoder Transformer architecture. It was trained with a masked language modeling "span-corruption" objective: consecutive spans of input tokens are replaced with sentinel (mask) tokens, and the model is trained to predict the dropped-out spans (see the sketch after this list).
- The paper discusses the importance of the sampling strategy when training on data from many languages. To boost lower-resource languages, the authors sample examples from each language with probability proportional to the size of that language's dataset raised to the power of a hyperparameter α, where α < 1 flattens the distribution (a worked example follows this list).
- The mT5 vocabulary was increased to 250,000 SentencePiece wordpieces to cover over 100 languages, with increased character coverage for languages with large character sets and SentencePiece's "byte-fallback" feature to handle characters outside the vocabulary (see the sketch after this list).
- The authors compare mT5 to other massively multilingual pre-trained language models, highlighting its unique position in terms of architecture, parameter count, language coverage, and data source.
- The paper details experiments on the XTREME multilingual benchmark, showing that mT5 models of various sizes exceed or approach state-of-the-art performance across different NLP tasks and languages, underscoring the benefits of scaling up a simple pre-training recipe.
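
Below are a few illustrative sketches for the notes above, written against simplified assumptions rather than the authors' released code. First, the accidental-translation mitigation: a small amount of the unsupervised multilingual span-corruption task is mixed into fine-tuning so the decoder keeps seeing targets in all languages. The mixing ratio and helper names here are placeholders.

```python
import random

def mix_in_pretraining(finetune_examples, span_corruption_examples,
                       pretrain_ratio=0.01, seed=0):
    """Illustrative task mixing during fine-tuning: occasionally yield an
    unsupervised multilingual span-corruption example alongside the labeled
    task data. The 1% ratio is a placeholder, not a value from the paper."""
    rng = random.Random(seed)
    pretrain_iter = iter(span_corruption_examples)
    for example in finetune_examples:
        if rng.random() < pretrain_ratio:
            yield next(pretrain_iter)
        yield example
```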
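Next, a minimal sketch of the span-corruption objective, assuming T5-style `<extra_id_N>` sentinel tokens; the span-selection heuristic is simplified relative to the actual pre-training pipeline.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """Illustrative T5-style span corruption: replace contiguous spans of the
    input with sentinel tokens and build the target from the dropped-out spans."""
    rng = random.Random(seed)
    n = len(tokens)
    num_to_mask = max(1, int(n * corruption_rate))
    masked = set()
    while len(masked) < num_to_mask:
        start = rng.randrange(n)
        for i in range(start, min(n, start + mean_span_len)):
            masked.add(i)
    inputs, targets = [], []
    sentinel_id = 0
    i = 0
    while i < n:
        if i in masked:
            sentinel = f"<extra_id_{sentinel_id}>"
            inputs.append(sentinel)
            targets.append(sentinel)
            while i < n and i in masked:
                targets.append(tokens[i])
                i += 1
            sentinel_id += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append(f"<extra_id_{sentinel_id}>")  # final sentinel terminates the target
    return inputs, targets

# The encoder sees the corrupted input; the decoder is trained to emit the target.
inp, tgt = span_corrupt("the quick brown fox jumps over the lazy dog".split())
print(" ".join(inp))
print(" ".join(tgt))
```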
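The language-sampling rule can be written as p(L) ∝ |L|^α, where |L| is the amount of data for language L and α < 1 flattens the distribution toward lower-resource languages (the paper adopts α = 0.3). A small worked example with made-up corpus sizes:

```python
# Illustrative exponent-smoothed language sampling: p(L) ∝ |L| ** alpha.
# The corpus sizes below are invented for the example.
corpus_sizes = {"en": 3_000_000_000, "ru": 700_000_000, "sw": 2_000_000}

def sampling_probs(sizes, alpha=0.3):
    weights = {lang: n ** alpha for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

print(sampling_probs(corpus_sizes, alpha=1.0))  # raw proportions: "sw" nearly invisible
print(sampling_probs(corpus_sizes, alpha=0.3))  # smoothed: low-resource "sw" is boosted
```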
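Finally, a sketch of building a comparable vocabulary with the SentencePiece trainer. The input file name is hypothetical and the character-coverage value is an assumption, but the 250k vocabulary size and the byte_fallback flag mirror the note above.

```python
import sentencepiece as spm

# Sketch: train a large multilingual SentencePiece model with byte-fallback.
# "mc4_sample.txt" is a hypothetical file of text sampled from the pre-training mix.
spm.SentencePieceTrainer.train(
    input="mc4_sample.txt",
    model_prefix="mt5_like_sp",
    vocab_size=250_000,
    model_type="unigram",
    character_coverage=0.99999,   # assumed value; keeps rare characters from large scripts
    byte_fallback=True,           # decompose out-of-vocabulary characters into UTF-8 bytes
)

sp = spm.SentencePieceProcessor(model_file="mt5_like_sp.model")
print(sp.encode("Hello, 世界!", out_type=str))
```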
Cited By
Quotes
Abstract
The recent " Text-to-Text Transfer Transformer " (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We detail the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. We also describe a simple technique to prevent " accidental translation " in the zero-shot setting, where a generative model chooses to (partially) translate its prediction into the wrong language. All of the code and model checkpoints used in this work are publicly available.
References
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2021 MT5AMassivelyMultilingualPreTra | Rami Al-Rfou, Noah Constant, Adam Roberts, Colin Raffel, Aditya Siddhant, Linting Xue, Mihir Kale, Aditya Barua | | | mT5: A Massively Multilingual Pre-trained Text-to-text Transformer | | | | 10.48550/arXiv.2010.11934 | | 2021 |