Proof-Pile-2

References

https://huggingface.co/datasets/EleutherAI/proof-pile-2
- Proof-Pile-2 is a 55-billion-token dataset of mathematical and scientific documents, created to train the Llemma 7B and Llemma 34B models. It consists of three subsets:
  - arxiv (29B tokens): the arXiv subset of RedPajama.
  - open-web-math (15B tokens): the OpenWebMath dataset, which contains much of the high-quality mathematical text from the internet.
  - algebraic-stack (11B tokens): a new dataset of mathematical code, including numerical computing, computer algebra, and formal mathematics.
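- As a quick sanity check on the numbers above, the sketch below sums the three subset sizes and prints each subset's share of the total. The dictionary `SUBSET_TOKENS_B` and the helper `subset_shares` are illustrative names, not part of the dataset or any library, and the counts are the rounded figures quoted above rather than exact token counts.

```python
# Approximate token counts in billions, as quoted in the notes above.
SUBSET_TOKENS_B = {
    "arxiv": 29,
    "open-web-math": 15,
    "algebraic-stack": 11,
}


def subset_shares(tokens):
    """Return each subset's fraction of the total token count."""
    total = sum(tokens.values())
    return {name: count / total for name, count in tokens.items()}


total_b = sum(SUBSET_TOKENS_B.values())
shares = subset_shares(SUBSET_TOKENS_B)

for name, share in shares.items():
    print(f"{name}: {share:.1%} of {total_b}B tokens")
```

The three subsets add up to the stated 55B total, with the arxiv subset making up slightly more than half of the tokens.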