AI2 Dolma (2023)
An AI2 Dolma (2023) is a very large text corpus dataset produced by the Allen Institute for AI (AI2).
- Counter-Example(s):
- See: ....
References
2023
- https://blog.allenai.org/dolma-3-trillion-tokens-open-llm-corpus-9a0ff4b8da64
- NOTES:
- Dolma is a 3-trillion-token open dataset released by the Allen Institute for AI (AI2) in August 2023.
- It contains text from diverse sources including the web, academic publications, code, books, and encyclopedias.
- Its intended use is to train OLMo, the Allen Institute's open language model.
- Dolma aims to be the largest open dataset for language model pretraining to date.
- It was created with principles of openness, representativeness, size, reproducibility, and risk mitigation in mind.
- The data went through both source-specific and general processing, including deduplication, English-only filtering, and risk mitigation, and a small fraction of code was added to the mix.
- Dolma is released under AI2's ImpACT license which requires stating intended use cases and disclosing derivatives.
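The deduplication and English-only filtering steps noted above can be illustrated with a minimal sketch. This is not AI2's actual pipeline: the function names, the stopword list, and the `min_ratio` threshold are all hypothetical, and the stopword heuristic is only a crude stand-in for a real language-identification model.

```python
import hashlib

# Hypothetical stopword list; a crude stand-in for a real language-ID model.
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "in", "is", "a", "that", "it", "for"}


def looks_english(text, min_ratio=0.15):
    """Return True if enough whitespace-separated words are English stopwords."""
    words = text.lower().split()
    if not words:
        return False
    hits = sum(1 for w in words if w in ENGLISH_STOPWORDS)
    return hits / len(words) >= min_ratio


def dedup_and_filter(docs):
    """Drop exact duplicates (after case/whitespace normalization), then non-English docs."""
    seen = set()
    kept = []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            continue  # duplicate of an earlier document
        seen.add(key)
        if looks_english(doc):
            kept.append(doc)
    return kept


docs = [
    "The cat is in the garden and it is happy.",
    "the cat  is in the garden and it is happy.",  # duplicate after normalization
    "Le chat est dans le jardin.",                 # not English
]
result = dedup_and_filter(docs)
```

At corpus scale, exact hashing of this kind would typically be replaced by more memory-efficient or fuzzy methods; the sketch only shows the order of operations on a toy list.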