Proof-Pile-2
A Proof-Pile-2 is a large text corpus of mathematical and scientific documents.
- Context:
- ...
- See: The Pile Corpus.
References
2023
- https://huggingface.co/datasets/EleutherAI/proof-pile-2
- The Proof-Pile-2 is a 55 billion token dataset of mathematical and scientific documents. This dataset was created in order to train the Llemma 7B and Llemma 34B models. It consists of three subsets:
- arxiv (29B tokens): The arXiv subset of RedPajama.
- open-web-math (15B tokens): The OpenWebMath dataset, which contains much of the high-quality mathematical text from the internet.
- algebraic-stack (11B tokens): A new dataset of mathematical code, including numerical computing, computer algebra, and formal mathematics.
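
As a quick sanity check on the figures above, the following minimal Python sketch sums the three subset sizes and computes each subset's approximate share of the corpus. The names `SUBSET_TOKENS_B` and `subset_shares` are illustrative and not part of the dataset or any library, and the counts are the rounded values quoted in the list, so the resulting shares are only approximate.

```python
# Rounded token counts in billions, copied from the subset list above.
# The dictionary and function names are hypothetical, chosen for illustration.
SUBSET_TOKENS_B = {
    "arxiv": 29,
    "open-web-math": 15,
    "algebraic-stack": 11,
}


def subset_shares(tokens):
    """Return each subset's fraction of the total token count."""
    total = sum(tokens.values())
    return {name: count / total for name, count in tokens.items()}


total_b = sum(SUBSET_TOKENS_B.values())
shares = subset_shares(SUBSET_TOKENS_B)

if __name__ == "__main__":
    print(f"total: {total_b}B tokens")
    for name, share in shares.items():
        print(f"{name}: {share:.1%}")
```

The three rounded counts add up to the stated 55 billion tokens, with the arxiv subset accounting for slightly more than half of the corpus.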