Holistic Evaluation of Language Models (HELM) Benchmarking Task


A Holistic Evaluation of Language Models (HELM) Benchmarking Task is an LLM inference evaluation task that can be used to evaluate language models across multiple dimensions, including accuracy, calibration, robustness, fairness, and toxicity.
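The HELM reference implementation is distributed as the crfm-helm Python package, whose helm-run and helm-summarize commands run benchmark scenarios and aggregate per-metric results (exact flags vary by version). The sketch below is not HELM's actual code; it is a minimal, illustrative example of the underlying idea of multi-metric evaluation, reporting accuracy and calibration side by side, with the remaining dimensions noted as placeholders. All names (EvalInstance, expected_calibration_error, evaluate) are hypothetical.

```python
# Minimal sketch of HELM-style multi-metric evaluation (illustrative, not official HELM code).
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EvalInstance:
    prediction: str      # model's answer
    reference: str       # gold answer
    confidence: float    # model's probability assigned to its answer, in [0, 1]


def accuracy(instances: List[EvalInstance]) -> float:
    """Fraction of instances where the prediction exactly matches the reference."""
    return sum(i.prediction == i.reference for i in instances) / len(instances)


def expected_calibration_error(instances: List[EvalInstance], bins: int = 10) -> float:
    """Binned |confidence - accuracy| gap, a common calibration metric."""
    ece = 0.0
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        bucket = [
            i for i in instances
            if lo <= i.confidence < hi or (b == bins - 1 and i.confidence == 1.0)
        ]
        if not bucket:
            continue
        bucket_acc = accuracy(bucket)
        bucket_conf = sum(i.confidence for i in bucket) / len(bucket)
        ece += (len(bucket) / len(instances)) * abs(bucket_conf - bucket_acc)
    return ece


def evaluate(instances: List[EvalInstance]) -> Dict[str, float]:
    """Report several metrics side by side, as HELM does across its scenarios."""
    return {
        "accuracy": accuracy(instances),
        "expected_calibration_error": expected_calibration_error(instances),
        # Robustness, fairness, and toxicity would additionally require perturbed
        # inputs, demographic metadata, and a toxicity classifier, respectively.
    }


if __name__ == "__main__":
    data = [
        EvalInstance("Paris", "Paris", 0.90),
        EvalInstance("Lyon", "Paris", 0.70),
        EvalInstance("4", "4", 0.95),
    ]
    print(evaluate(data))
```

Reporting every metric for every scenario, rather than a single aggregate score, is the design choice that makes the evaluation "holistic": trade-offs (e.g., high accuracy but poor calibration) remain visible in the results table.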


