Large Language Model (LLM) Inference Evaluation Task


A Large Language Model (LLM) Inference Evaluation Task is a benchmarking task that can be used to evaluate the performance of an LLM inference system along dimensions such as output quality and robustness.
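
To make the task concrete, a minimal sketch of such an evaluation loop is shown below: a benchmark supplies inputs and reference outputs, the inference system produces predictions, and a metric scores each prediction. The `generate` callable, the record fields, and the stub model are hypothetical placeholders used only for illustration, not part of any specific benchmark.

```python
from typing import Callable, Iterable

def evaluate_llm(
    generate: Callable[[str], str],          # hypothetical LLM inference call: prompt -> text
    benchmark: Iterable[dict],               # each record: {"input": str, "reference": str}
    metric: Callable[[str, str], float],     # scores a (prediction, reference) pair in [0, 1]
) -> float:
    """Run a benchmark through an LLM inference system and return the mean metric score."""
    scores = []
    for record in benchmark:
        prediction = generate(record["input"])
        scores.append(metric(prediction, record["reference"]))
    return sum(scores) / len(scores) if scores else 0.0

# Illustration only: a trivial exact-match metric and a stub "model".
if __name__ == "__main__":
    exact_match = lambda pred, ref: float(pred.strip().lower() == ref.strip().lower())
    stub_model = lambda prompt: "Paris"
    data = [{"input": "What is the capital of France?", "reference": "Paris"}]
    print(evaluate_llm(stub_model, data, exact_match))  # 1.0
```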



References

2025b

Benchmark | Primary Task Type | Input | Optional Input | Output | Performance Metrics | Evaluation Style
GLUE | Classification | Text Pairs | Task metadata | Label | Accuracy, F1 | Automatic
SuperGLUE | NLU Reasoning | Structured Sentences | Task definition | Label or Text | Average Score | Automatic
SQuAD | Extractive QA | Context + Question | N/A | Answer Span | Exact Match, F1 | Automatic
MMLU | Multi-domain MCQ | Subject-Specific Question | Subject label | Answer Option | Accuracy | Automatic
HELM | Multidimensional Evaluation | Scenario Prompt | Scenario metadata | Text Generation | Accuracy, Calibration, Bias | Multi-metric
HotpotQA | Multi-hop QA | Question | Supporting Docs | Answer Span | EM, F1 | Automatic + Reasoning
TruthfulQA | Adversarial QA | Adversarial Question | N/A | Text Answer | Truthfulness Score | Human + Auto
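
The Exact Match and F1 figures listed above for SQuAD-style and HotpotQA-style QA are usually token-level scores computed after light answer normalization. The following sketch approximates that scoring logic; the normalization rules here are simplified assumptions, not the official benchmark scorers.

```python
import re
from collections import Counter

def _normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and articles, and tokenize (simplified normalization)."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return [t for t in text.split() if t not in {"a", "an", "the"}]

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(_normalize(prediction) == _normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens, ref_tokens = _normalize(prediction), _normalize(reference)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Eiffel Tower", "the Eiffel Tower"))        # 1.0
print(token_f1("the Eiffel Tower in Paris", "Eiffel Tower"))  # ≈ 0.67
```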

2022

  • (Hendrycks et al., 2022) ⇒ Hendrycks, D., et al. (2022). "Massive Multitask Test". In: GitHub.
    • QUOTE: The Massive Multitask Test evaluates models across 57 tasks spanning multiple domains such as elementary mathematics, US history, computer science, and law.

      It provides a comprehensive benchmark for assessing general knowledge capabilities.
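
Scoring a multiple-choice benchmark of this kind typically reduces to comparing the model's selected option letter against the keyed answer and averaging per subject. Below is a minimal sketch under that assumption; the record fields (`id`, `subject`, `answer`) are hypothetical placeholders, not the benchmark's actual data format.

```python
from collections import defaultdict

def mcq_accuracy(predictions: dict[str, str], questions: list[dict]) -> dict[str, float]:
    """Per-subject accuracy for multiple-choice questions.

    `questions` is a list of hypothetical records:
        {"id": str, "subject": str, "answer": one of "A"/"B"/"C"/"D"}
    `predictions` maps a question id to the model's chosen option letter.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for q in questions:
        total[q["subject"]] += 1
        if predictions.get(q["id"], "").strip().upper() == q["answer"]:
            correct[q["subject"]] += 1
    return {subject: correct[subject] / total[subject] for subject in total}

# Example usage with two subjects.
qs = [
    {"id": "q1", "subject": "us_history", "answer": "B"},
    {"id": "q2", "subject": "law", "answer": "D"},
]
print(mcq_accuracy({"q1": "B", "q2": "A"}, qs))  # {'us_history': 1.0, 'law': 0.0}
```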