Large Language Model (LLM) Inference Evaluation Task


A Large Language Model (LLM) Inference Evaluation Task is a benchmarking task that can be used to evaluate the performance of an LLM inference system along dimensions such as output quality and robustness.
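
As a minimal, illustrative sketch (not part of this page's sources), the Python functions below show how such a task might score an inference system on two of these dimensions: output quality as exact-match accuracy against reference answers, and robustness as answer stability under a prompt perturbation. The generate and perturb callables and the example record format are assumptions made for illustration.

  from typing import Callable, Dict, List

  def evaluate_output_quality(generate: Callable[[str], str],
                              examples: List[Dict[str, str]]) -> float:
      # Exact-match accuracy of model completions against reference answers.
      correct = sum(1 for example in examples
                    if generate(example["prompt"]).strip().lower()
                       == example["answer"].strip().lower())
      return correct / len(examples) if examples else 0.0

  def evaluate_robustness(generate: Callable[[str], str],
                          examples: List[Dict[str, str]],
                          perturb: Callable[[str], str]) -> float:
      # Fraction of prompts whose answer is unchanged under a caller-supplied
      # perturbation (e.g. added whitespace or a paraphrase).
      stable = sum(1 for example in examples
                   if generate(example["prompt"]).strip().lower()
                      == generate(perturb(example["prompt"])).strip().lower())
      return stable / len(examples) if examples else 0.0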



References

2025

2024

2023a

2023b

2023c

2022

  • (Hendrycks et al., 2022) ⇒ Hendrycks, D., et al. (2022). "Massive Multitask Test". In: GitHub.
    • QUOTE: The Massive Multitask Test evaluates models across 57 tasks spanning multiple domains such as elementary mathematics, US history, computer science, and law. It provides a comprehensive benchmark for assessing general knowledge capabilities.
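
As a hedged illustration of the multiple-choice scoring such a benchmark involves, the sketch below formats one question with lettered options and computes accuracy against gold letter answers. The choose callable and the item record format are hypothetical and are not part of the cited benchmark's tooling.

  from typing import Callable, Dict, List

  # Illustrative multiple-choice scoring in the style of a 57-task knowledge benchmark.
  # Each item is assumed to hold a question, four options, and a gold letter answer (A-D).
  LETTERS = ["A", "B", "C", "D"]

  def format_item(item: Dict) -> str:
      # Render one question as a prompt with lettered answer options.
      options = "\n".join(f"{letter}. {option}"
                          for letter, option in zip(LETTERS, item["options"]))
      return f"{item['question']}\n{options}\nAnswer:"

  def multiple_choice_accuracy(choose: Callable[[str], str], items: List[Dict]) -> float:
      # `choose` is an assumed model wrapper that returns the model's letter choice.
      correct = sum(1 for item in items
                    if choose(format_item(item)).strip().upper().startswith(item["answer"]))
      return correct / len(items) if items else 0.0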