MMLU (Massive Multitask Language Understanding) Benchmark
An MMLU (Massive Multitask Language Understanding) Benchmark is an LLM benchmark task that evaluates a language model's ability to perform expert-level reasoning across academic and professional subjects.
- AKA: MMLU Benchmarking Task.
- Context:
- Task Input: Subject-specific multiple-choice question.
- Optional Input: Subject label (e.g., physics, medicine).
- Task Output: One selected answer option.
- Task Performance Measure/Metrics: Accuracy LLM Measure
- Benchmark Dataset and Evaluation Code: https://github.com/hendrycks/test
- It can test language understanding across many domains, including STEM subjects, the humanities, the social sciences, and professional domains.
- It can take a question with multiple-choice options and optionally a subject label, requiring the model to select the correct answer.
- It can evaluate output using the accuracy LLM measure, in either zero-shot or few-shot inference settings (a minimal scoring sketch follows this list).
- It can test generalization, reasoning, and subject-matter expertise of LLMs.
- It can range from factual recall to complex logical deduction.
- It can contain Elementary NLU Tasks and Advanced NLU Tasks.
- It can contain World Knowledge NLU Tasks and Problem Solving NLU Tasks.
- It can identify model strengths, weaknesses, and blindspots.
- It can aim to be more similar to how human language understanding is evaluated.
- It can have over 14,000 test questions.
- It can have 57 different subject domains.
- ...
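The task structure above (a subject-tagged multiple-choice question as input, a single selected option as output, and accuracy as the metric) can be illustrated with a minimal sketch. This is not the official evaluation harness: the field names mirror the published CSV layout (question, four options, answer letter), while `ask_model` is a hypothetical stand-in for whatever LLM is being evaluated.

```python
# Minimal sketch of the MMLU item format and accuracy metric (illustrative only).
CHOICE_LABELS = ["A", "B", "C", "D"]

def format_prompt(question, choices, subject=None):
    """Render one multiple-choice item as a prompt string."""
    header = f"The following is a multiple choice question about {subject}.\n\n" if subject else ""
    options = "\n".join(f"{label}. {text}" for label, text in zip(CHOICE_LABELS, choices))
    return f"{header}{question}\n{options}\nAnswer:"

def accuracy(items, ask_model):
    """Fraction of items where the model's chosen label matches the answer key."""
    correct = 0
    for item in items:
        prompt = format_prompt(item["question"], item["choices"], item.get("subject"))
        predicted = ask_model(prompt)  # expected to return one of "A"/"B"/"C"/"D"
        correct += int(predicted == item["answer"])
    return correct / len(items)

if __name__ == "__main__":
    toy_items = [{"subject": "astronomy",
                  "question": "Which planet is closest to the Sun?",
                  "choices": ["Venus", "Mercury", "Earth", "Mars"],
                  "answer": "B"}]
    print(accuracy(toy_items, ask_model=lambda prompt: "B"))  # 1.0 with this stub
```

In few-shot settings, the prompt is simply prefixed with a handful of worked examples from the same subject before the test question.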
- Example(s):
- GPT-3 evaluated in few-shot mode using MMLU prompts and answer options, with accuracy as the metric (see the answer-selection sketch after this list).
- GPT-4 achieving high scores across MMLU categories using zero-shot chain-of-thought prompting.
- Claude tested across the MMLU benchmark suite using uniform inference templates.
- ...
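When a model exposes token log-probabilities, a common way to implement the answer selection referenced in the examples above is to score each candidate letter and keep the most likely one. The sketch below assumes a hypothetical `letter_logprob(prompt, continuation)` callable; it is not tied to any particular provider's API.

```python
import math

# Sketch of likelihood-based answer selection: compare the model's
# log-probability of each answer letter given the prompt and keep the highest.
# `letter_logprob` is a hypothetical callable standing in for an API call or
# forward pass that returns token log-probabilities.
CHOICE_LABELS = ["A", "B", "C", "D"]

def select_answer(prompt, letter_logprob):
    """Return the answer label the model considers most likely."""
    scores = {label: letter_logprob(prompt, f" {label}") for label in CHOICE_LABELS}
    return max(scores, key=scores.get)

# Usage with a stub model that always prefers " B":
stub = lambda prompt, continuation: 0.0 if continuation == " B" else -math.inf
question = "What is 2 + 2?\nA. 3\nB. 4\nC. 5\nD. 6\nAnswer:"
print(select_answer(question, stub))  # -> "B"
```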
- Counter-Example(s):
- GLUE Benchmarking Task, which focuses on sentence-level classification rather than domain-specific reasoning.
- SQuAD Benchmarking Task, which involves extractive QA rather than multiple-choice inference.
- TriviaQA Benchmark, which tests open-domain QA instead of structured, subject-specific evaluation.
- SuperGLUE Benchmark.
- RACE Benchmark.
- BIG-bench Benchmark.
- See: AI Benchmark, Zero-shot Learning, Few-shot Learning, LLM Inference Evaluation Task, Expert QA, Zero-Shot Evaluation, Benchmarking Task.
References
2023
- (Wikipedia, 2023) ⇒ https://en.wikipedia.org/wiki/Language_model#Evaluation_and_benchmarks Retrieved:2023-5-8.
- Evaluation of the quality of language models is mostly done by comparison to human-created sample benchmarks built from typical language-oriented tasks. Other, less established, quality tests examine the intrinsic character of a language model or compare two such models. Since language models are typically intended to be dynamic and to learn from the data they see, some proposed models investigate the learning rate, e.g., through inspection of learning curves. Various data sets have been developed to evaluate language processing systems.
2022
- https://paperswithcode.com/dataset/mmlu
- MMLU (Massive Multitask Language Understanding) is a new benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a model’s blind spots.
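For readers who want to inspect the data described above, one common route is a Hugging Face Hub mirror. The dataset identifier `cais/mmlu`, the subject configuration name, and the column names in the sketch below are assumptions rather than part of the original release, whose canonical distribution is the download linked from the GitHub repository above.

```python
# Sketch of loading one MMLU subject for inspection. The mirror name
# ("cais/mmlu"), config, and column names are assumptions; the canonical
# distribution is the download linked from https://github.com/hendrycks/test.
from datasets import load_dataset

anatomy_test = load_dataset("cais/mmlu", "anatomy", split="test")
example = anatomy_test[0]
print(example["question"])
print(example["choices"])  # list of four option strings
print(example["answer"])   # integer index of the correct option
```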
2021a
- (Hendrycks et al., 2021) ⇒ Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). "Measuring Massive Multitask Language Understanding". In: International Conference on Learning Representations (ICLR).
- QUOTE: We introduce a new test to measure a model's massive multitask language understanding.
The test covers 57 subjects including mathematics, physics, computer science, law, medicine, and more.
We evaluate numerous pretrained transformers including GPT-3 and find that even the largest models struggle to score much above chance, despite having enormous numbers of parameters.
2021b
- (Hendrycks et al., 2021) ⇒ Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). "Measuring Massive Multitask Language Understanding". In: GitHub Repository.
- QUOTE: This is the repository for Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021).
This repository contains OpenAI API evaluation code, and the test is available for download **here**.
If you want to have your model added to the leaderboard, please reach out to us or submit a pull request.
2020
- (Hendrycks et al., 2020) ⇒ Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. (2020). “Measuring Massive Multitask Language Understanding.” arXiv preprint arXiv:2009.03300.
- QUOTE: We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.
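As a back-of-the-envelope illustration of the aggregation described in the abstract above, overall MMLU performance is usually reported as accuracy macro-averaged over the 57 tasks and compared against the 25% random-chance baseline implied by four answer options. The per-task numbers in the sketch below are made-up placeholders, not reported results.

```python
# Illustrative aggregation only: the per-task accuracies here are placeholders.
RANDOM_CHANCE = 0.25  # uniform guessing over four options

per_task_accuracy = {
    "elementary_mathematics": 0.30,
    "us_history": 0.52,
    "professional_law": 0.34,
}

macro_average = sum(per_task_accuracy.values()) / len(per_task_accuracy)
print(f"macro-averaged accuracy: {macro_average:.3f}")                      # 0.387
print(f"points above chance: {100 * (macro_average - RANDOM_CHANCE):.1f}")  # 13.7
```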