MMLU (Massive Multitask Language Understanding) Benchmark

From GM-RKB

An MMLU (Massive Multitask Language Understanding) Benchmark is an LLM benchmark task that evaluates a language model's ability to perform a wide range of language understanding tasks in zero-shot and few-shot settings.



References

2022

  • https://paperswithcode.com/dataset/mmlu
    • MMLU (Massive Multitask Language Understanding) is a new benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem-solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects make the benchmark ideal for identifying a model's blind spots.
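The few-shot evaluation described above amounts to prepending k solved multiple-choice examples before each test question. A minimal sketch of that prompt construction, using hypothetical sample items rather than the real 57-subject dataset (which is distributed as per-subject files), might look like:

```python
# Sketch of MMLU-style few-shot prompt construction and multiple-choice scoring.
# The items, field names, and header wording here are illustrative assumptions,
# not the benchmark's official loader.

CHOICES = ["A", "B", "C", "D"]


def format_question(item, include_answer=False):
    """Render one multiple-choice item as question, lettered options, and answer line."""
    lines = [item["question"]]
    for letter, option in zip(CHOICES, item["options"]):
        lines.append(f"{letter}. {option}")
    # Dev (in-context) examples show the gold letter; the test item leaves it blank.
    lines.append("Answer:" + (f" {item['answer']}" if include_answer else ""))
    return "\n".join(lines)


def build_prompt(dev_items, test_item, subject):
    """k-shot prompt: k solved dev examples followed by the unanswered test item."""
    header = (
        f"The following are multiple choice questions (with answers) "
        f"about {subject}.\n\n"
    )
    shots = "\n\n".join(format_question(it, include_answer=True) for it in dev_items)
    return header + shots + "\n\n" + format_question(test_item)


def accuracy(predictions, gold):
    """Fraction of predicted answer letters matching the gold letters."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```

A model is then asked to continue the prompt after the final "Answer:" line, and its predicted letter is compared against the gold answer.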

2020

  • (Hendrycks et al., 2020) ⇒ Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. (2020). “Measuring Massive Multitask Language Understanding.” arXiv preprint arXiv:2009.03300.
    • We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.
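The headline multitask accuracy described in the abstract is an average of per-task accuracies over the 57 tasks; with four answer choices, random chance sits near 25%, which is the baseline the largest GPT-3 model improves on by almost 20 percentage points. A small sketch of that aggregation, with hypothetical task names, assuming an unweighted mean over tasks:

```python
# Sketch of multitask accuracy aggregation: score each task separately,
# then average the per-task accuracies. Task names and data are illustrative.

RANDOM_CHANCE = 0.25  # four answer choices per question


def task_accuracy(predictions, gold):
    """Accuracy on a single task's answer letters."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def multitask_accuracy(results):
    """Unweighted mean of per-task accuracies.

    results: {task_name: (predicted_letters, gold_letters)}
    """
    per_task = [task_accuracy(p, g) for p, g in results.values()]
    return sum(per_task) / len(per_task)
```

This macro-averaging is why "lopsided performance" matters: a model can score well overall while remaining near the 25% chance baseline on individual subjects such as morality and law.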
