WikiText Language Modeling Dataset
A WikiText Language Modeling Dataset is a Language Modeling Dataset that consists of over 100 million tokens extracted from verified Good and Featured Wikipedia articles.
- AKA: WikiText Dataset.
- Example(s): WikiText-2, WikiText-103.
- Counter-Example(s): the preprocessed Penn Treebank (PTB) dataset.
- See: Language Model, Semantic Word Similarity System, Reading Comprehension System, Natural Language Processing System, NLP Dataset, Machine Learning Dataset.
References
2021
- (DeepAI, 2021) ⇒ https://deepai.org/dataset/wikitext Retrieved:2021-09-19.
- QUOTE: The WikiText language modeling dataset is a collection of 100M+ tokens extracted from a collection of Wikipedia articles that have been verified as "Good" or "Featured".
2016
- (Merity et al., 2016) ⇒ Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. (2016). “Pointer Sentinel Mixture Models.” In: arXiv preprint arXiv:1609.07843.
- QUOTE: The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.
Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation and numbers - all of which are removed in PTB. As it is composed of full articles, the dataset is well suited for models that can take advantage of long term dependencies.
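The size comparison in the quote above can be checked with a little arithmetic. The sketch below is illustrative only: the training-split token counts are assumed from the dataset statistics reported by Merity et al. (2016) and are not stated on this page, and the names `TRAIN_TOKENS` and `size_ratio` are hypothetical helpers introduced for the example.

```python
# Training-split token counts, assumed from the statistics table in
# Merity et al. (2016); they do not appear on this page.
TRAIN_TOKENS = {
    "PTB": 929_590,
    "WikiText-2": 2_088_628,
    "WikiText-103": 103_227_021,
}


def size_ratio(dataset: str, baseline: str = "PTB") -> float:
    """Return how many times larger `dataset` is than `baseline`, by training tokens."""
    return TRAIN_TOKENS[dataset] / TRAIN_TOKENS[baseline]


# Print each WikiText variant's size relative to PTB.
for name in ("WikiText-2", "WikiText-103"):
    print(f"{name}: {size_ratio(name):.1f}x PTB")
```

Under these assumed counts, both ratios come out consistent with the "over 2 times" and "over 110 times" figures in the quote.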