HotpotQA Benchmarking Task
A HotpotQA Benchmarking Task is an LLM inference evaluation task that evaluates multi-hop reasoning in question answering by requiring evidence aggregation across multiple documents.
- AKA: Hotpot Question Answering Benchmark.
- Context:
- Task Input: Multi-hop question.
- Optional Input: Supporting documents or retrieved paragraphs.
- Task Output: Answer span or synthesized answer.
- Task Performance Measure/Metrics: Exact Match (EM), F1, Supporting Fact Accuracy.
- It can take a question and a set of supporting documents as input, and expect an answer that requires reasoning across multiple facts.
- It can use Exact Match (EM) and F1 as performance metrics, and evaluate supporting fact prediction as a secondary objective (a minimal metric sketch follows this list).
- It can challenge LLMs to perform factual synthesis rather than retrieval or shallow extraction.
- It can be used for open-domain QA as well as context-grounded inference tasks.
- ...
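The answer-level metrics referenced above (Exact Match and F1) follow the SQuAD-style convention of normalizing answers before comparison. The sketch below illustrates how they are typically computed; it is a minimal sketch, not the official `hotpot_evaluate_v1.py` script (which additionally scores supporting facts and a joint metric), and the exact normalization steps shown are an assumption.

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))

def f1_score(prediction: str, gold: str) -> float:
    """Token-overlap F1 between normalized prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Eiffel Tower", "Eiffel Tower"))  # 1.0 after normalization
print(round(f1_score("Paris, France", "Paris"), 2))     # 0.67
```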
- Example(s):
- GPT-4 evaluated on HotpotQA using retrieval-augmented generation and chain-of-thought prompting (see the prompting sketch after this list).
- FiD (Fusion-in-Decoder) evaluated using HotpotQA for document-aware inference.
- T5 adapted for HotpotQA to combine reasoning and answer span prediction.
- ...
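To illustrate the chain-of-thought setup mentioned in the first example above, the sketch below assembles a prompt from a HotpotQA-style question and its paragraphs. The `generate` call is a placeholder for whichever model client is under evaluation, and the prompt wording is purely illustrative, not a prescribed template.

```python
def build_cot_prompt(question: str, paragraphs: list[tuple[str, list[str]]]) -> str:
    """Concatenate provided/retrieved paragraphs, then ask for step-by-step reasoning."""
    context = "\n\n".join(
        f"[{title}] " + " ".join(sentences) for title, sentences in paragraphs
    )
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Reason step by step across the paragraphs, then state the final answer "
        "on a new line prefixed with 'Answer:'."
    )

# A HotpotQA-style example requiring facts from two different articles.
paragraphs = [
    ("Scott Derrickson", ["Scott Derrickson (born July 16, 1966) is an American director."]),
    ("Ed Wood", ["Edward Davis Wood Jr. (1924-1978) was an American filmmaker."]),
]
prompt = build_cot_prompt(
    "Were Scott Derrickson and Ed Wood of the same nationality?", paragraphs
)
# answer_text = generate(prompt)  # placeholder: call the model under evaluation here
```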
- Counter-Example(s):
- a SQuAD Benchmarking Task, which focuses on single-paragraph context and extractive answers.
- Open-Domain QA Tasks like TriviaQA, which don’t require multi-hop synthesis.
- Entity Recognition Tasks, which identify phrases but don’t require answer generation.
- ...
- See: HotpotQA Dataset, Question Answering, LLM Inference Evaluation Task, Multi-Hop Reasoning, Factuality Evaluation.
References
2018a
- (Yang et al., 2018) ⇒ Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., & Manning, C. D. (2018). "HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering". In: HotpotQA Official Website.
- QUOTE: HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems.
The dataset is collected by a team of NLP researchers at Carnegie Mellon University, Stanford University, and Université de Montréal.
Once you have built your model, you can use the evaluation script we provide to evaluate model performance by running `python hotpot_evaluate_v1.py <path_to_prediction> <path_to_gold>`.
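As an illustration of the evaluation call quoted above, the sketch below writes a prediction file and shows the corresponding command. The prediction layout (answer strings keyed by question id plus `[title, sentence_index]` supporting facts) reflects the format commonly used with the official script, and the ids and values shown are illustrative, not taken from the dataset.

```python
import json

# Illustrative prediction file: answers and supporting facts keyed by question id
# (assumption: this mirrors the layout expected by hotpot_evaluate_v1.py).
predictions = {
    "answer": {"example_question_id": "yes"},
    "sp": {"example_question_id": [["Scott Derrickson", 0], ["Ed Wood", 0]]},
}

with open("pred.json", "w") as f:
    json.dump(predictions, f)

# Then evaluate against the gold file, e.g. the distractor dev set:
#   python hotpot_evaluate_v1.py pred.json hotpot_dev_distractor_v1.json
```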
2018b
- (Yang et al., 2018) ⇒ Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., & Manning, C. D. (2018). "HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering". In: arXiv preprint arXiv:1809.09600.
- QUOTE: This paper introduces HotpotQA, a question answering dataset designed to facilitate the development of multi-hop reasoning and explainable AI systems.
The dataset includes both distractor-based questions and Wikipedia-based questions, enabling researchers to test their models in different environments.
Additionally, it provides annotations for supporting facts to improve the interpretability of model predictions.