MMLU (Massive Multitask Language Understanding) Benchmark
An MMLU (Massive Multitask Language Understanding) Benchmark is an LLM benchmark that evaluates a language model's ability to perform a wide range of language understanding tasks in zero-shot and few-shot settings.
- Context:
- It can test language understanding across many domains, including STEM subjects, humanities, social sciences, and professional domains.
- It can contain Elementary NLU Tasks and Advanced NLU Tasks.
- It can contain World Knowledge NLU Tasks and Problem Solving NLU Tasks.
- It can identify model strengths, weaknesses, and blindspots.
- It can aim to resemble how humans are evaluated, by testing models only in zero-shot and few-shot settings.
- It can have over 14,000 questions.
- It can cover 57 subjects.
- ...
- Example(s):
- Counter-Example(s):
- See: AI Benchmark, Zero-shot Learning, Few-shot Learning.
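The zero-shot and few-shot setup described above can be illustrated with a minimal Python sketch. It assumes four answer options per question (labelled A to D), so random guessing would score 25%. The function names, the prompt wording, and the record layout are illustrative assumptions and are not taken from the benchmark's official evaluation code. The unweighted mean across subjects shown here is only one possible way to aggregate scores.

```python
from collections import defaultdict

# Assumption: every question has exactly four options, labelled A-D.
CHOICE_LABELS = ["A", "B", "C", "D"]
RANDOM_CHANCE = 1 / len(CHOICE_LABELS)  # 0.25 under the four-option assumption


def format_question(question, choices, answer=None):
    """Render one multiple-choice item; include the answer letter for solved examples."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in zip(CHOICE_LABELS, choices)]
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)


def build_prompt(subject, solved_examples, test_item, k=5):
    """Build a k-shot prompt (hypothetical wording); k=0 gives the zero-shot variant.

    solved_examples: list of (question, choices, answer_letter)
    test_item: (question, choices), left unanswered for the model to complete
    """
    header = f"The following are multiple choice questions (with answers) about {subject}."
    shots = [format_question(q, c, a) for q, c, a in solved_examples[:k]]
    query = format_question(test_item[0], test_item[1])
    return "\n\n".join([header, *shots, query])


def per_subject_accuracy(records):
    """records: iterable of (subject, predicted_letter, gold_letter)."""
    correct, total = defaultdict(int), defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        correct[subject] += int(predicted == gold)
    return {subject: correct[subject] / total[subject] for subject in total}


def macro_average(accuracies):
    """Unweighted mean of per-subject accuracies (one possible aggregation)."""
    return sum(accuracies.values()) / len(accuracies)
```

Comparing each value returned by `per_subject_accuracy` with `RANDOM_CHANCE` gives a quick way to spot subjects where a model is close to guessing, which is the kind of blind spot the benchmark is meant to expose.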
References
2023
- (Wikipedia, 2023) ⇒ https://en.wikipedia.org/wiki/Language_model#Evaluation_and_benchmarks Retrieved:2023-5-8.
- Evaluation of the quality of language models is mostly done by comparison to human-created sample benchmarks created from typical language-oriented tasks. Other, less established, quality tests examine the intrinsic character of a language model or compare two such models. Since language models are typically intended to be dynamic and to learn from data it sees, some proposed models investigate the learning rate, e.g. through inspection of learning curves. Various data sets have been developed to use to evaluate language processing systems.
- These include:
2022
- https://paperswithcode.com/dataset/mmlu
- MMLU (Massive Multitask Language Understanding) is a new benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a model’s blind spots.
2020
- (Hendrycks et al., 2020) ⇒ Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. (2020). “Measuring Massive Multitask Language Understanding.” arXiv preprint arXiv:2009.03300.
- We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.