AI2 Dolma (2023)

From GM-RKB


An AI2 Dolma (2023) is a very large text corpus dataset produced by AI2.

Counter-Example(s):
- Colossal Clean Crawled Corpus (C4).
- The Pile Text Corpus.
See: ....

References

2023

https://blog.allenai.org/dolma-3-trillion-tokens-open-llm-corpus-9a0ff4b8da64
- NOTES:
  - Dolma is a 3-trillion-token open dataset released by the Allen Institute for AI (AI2) in August 2023.
  - It contains text from diverse sources including the web, academic publications, code, books, and encyclopedias.
  - The goal is to use Dolma to train the Allen Institute's open language model called OLMo.
  - Dolma aims to be the largest open dataset for language model pretraining to date.
  - It was created with principles of openness, representativeness, size, reproducibility, and risk mitigation in mind.
  - The data has gone through both source-specific and general processing, including deduplication, English-only filtering, and risk mitigation, with a small fraction of code mixed in.
  - Dolma is released under AI2's ImpACT license which requires stating intended use cases and disclosing derivatives.
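
The processing steps named in the notes above (deduplication, English-only filtering, risk mitigation) can be illustrated with a minimal sketch. This is not Dolma's actual pipeline: the function names `looks_english` and `clean_corpus`, the ASCII-letter heuristic standing in for a real language identifier, the 0.9 threshold, and the email-masking rule are all assumptions chosen only to show the general shape of such a filter.

```python
import hashlib
import re

# Hypothetical risk-mitigation rule: mask anything that looks like an email address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def looks_english(text, min_ascii_ratio=0.9):
    """Crude stand-in for a language identifier (an assumption, not Dolma's method):
    accept the text if most of its letters are ASCII letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    ascii_letters = sum(1 for ch in letters if ch.isascii())
    return ascii_letters / len(letters) >= min_ascii_ratio


def clean_corpus(docs):
    """Apply exact deduplication, the language heuristic, and email masking, in that order."""
    seen = set()
    kept = []
    for text in docs:
        # Exact deduplication on whitespace- and case-normalized text.
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Language filtering.
        if not looks_english(text):
            continue
        # Risk mitigation: replace email addresses with a placeholder.
        kept.append(EMAIL_RE.sub("[EMAIL]", text))
    return kept


sample = [
    "Hello world, contact me at a@b.com",
    "hello   world, contact me at a@b.com",  # duplicate after normalization
    "Привет мир",                            # dropped by the language heuristic
    "Another doc",
]
cleaned = clean_corpus(sample)
```

A production pipeline would typically replace the ASCII heuristic with a trained language identifier and add near-duplicate detection; the sketch only shows how the stages compose.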
