Stanford Question Answering (SQuAD) Benchmark Task
A Stanford Question Answering (SQuAD) Benchmark Task is an LLM inference evaluation task (a question answering benchmark task) that assesses a language model's ability to answer factual questions grounded in context passages, using extractive or generative outputs.
- AKA: Stanford Question Answering Dataset Evaluation, SQuAD QA Benchmark.
- Context:
- It can take a context passage and a question, with optional formatting or retrieval guidance, and generate an answer as an extracted span or as free-form text.
- It can measure answer correctness using Exact Match (EM) and token-level F1 score (see the scoring sketch after this list).
- It can evaluate extractive QA (SQuAD v1.1) or include unanswerable questions (SQuAD v2.0).
- It can test zero-shot, few-shot, or chain-of-thought inference abilities of LLMs.
- It can serve as a foundational benchmark for evaluating LLM comprehension and factual recall.
- It can range from simple span extraction to multi-hop reasoning (in extensions or variants).
- ...
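The following is a minimal sketch of the metrics named above; the function names are illustrative, but the logic follows the standard SQuAD scoring convention: answers are lowercased, stripped of punctuation and articles, then compared by exact string match (EM) and by token-overlap F1.
```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold: str) -> int:
    """1 if the normalized prediction equals the normalized gold answer, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(gold))

def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted answer and one gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty (e.g., a correctly predicted "no answer") scores 1.0.
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# SQuAD takes the maximum EM/F1 over all human reference answers per question.
print(f1_score("the Eiffel Tower", "Eiffel Tower"))  # 1.0 after normalization
```
Per-question scores are averaged over the evaluation set; under SQuAD v2.0, an unanswerable question is scored as correct only if the system abstains (predicts an empty answer).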
- Example(s):
- BERT evaluated on SQuAD v1.1 with context-question inputs and span-based answers, achieving high F1 (a minimal inference sketch follows these examples).
- T5 evaluated on SQuAD v2.0 for both answerable and unanswerable queries using generative decoding.
- GPT-3 tested on SQuAD using few-shot prompts to produce free-form answers.
- ...
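As one way to reproduce the extractive setting in the examples above, the sketch below runs a SQuAD-fine-tuned checkpoint through the Hugging Face Transformers question-answering pipeline; the model name and the passage are illustrative choices, not part of the benchmark definition.
```python
from transformers import pipeline

# A publicly available extractive QA checkpoint distilled on SQuAD v1.1
# (illustrative choice; any SQuAD-style QA model could be substituted).
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

context = (
    "The Stanford Question Answering Dataset (SQuAD) is a reading "
    "comprehension dataset built from Wikipedia articles."
)
question = "What is SQuAD built from?"

result = qa(question=question, context=context)
print(result["answer"], result["score"])  # predicted span and its confidence
```
Predicted spans produced this way can then be scored against the gold answers with the EM/F1 functions sketched earlier.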
- Counter-Example(s):
- a TREC QA Task.
- a Jeopardy! Game.
- a VQA Benchmark Task.
- a SQuAD Fine-Tuning Task, which involves training on the dataset rather than evaluating pre-trained inference.
- a HotpotQA Benchmarking Task, which adds multi-hop reasoning, making it a more complex QA task.
- an Information Retrieval Task, which retrieves documents but does not generate natural language answers.
- ...
- See: Automated QA, Reading Comprehension, LLM Inference Evaluation Task, Question Answering, Factuality Evaluation.
References
2023
- (Hugging Face, 2023) ⇒ Hugging Face. (2023). "SQuAD Dataset". In: Hugging Face Datasets.
- QUOTE: Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.
To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
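To illustrate the record structure described in the quote above, the following sketch loads both dataset versions with the Hugging Face datasets library (assuming the public "squad" and "squad_v2" dataset identifiers and an installed datasets package).
```python
from datasets import load_dataset

# SQuAD v1.1 (all questions answerable) and SQuAD v2.0 (adds unanswerable ones).
squad_v1 = load_dataset("squad")      # splits: train, validation
squad_v2 = load_dataset("squad_v2")

example = squad_v1["validation"][0]
# Each record carries an id, title, context paragraph, question, and gold
# answers (answer text plus character-level answer_start offsets).
print(example["question"])
print(example["answers"])   # e.g. {"text": [...], "answer_start": [...]}

# In SQuAD v2.0, unanswerable questions have empty "text"/"answer_start" lists.
```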
2018a
- (Rajpurkar et al., 2018) ⇒ Rajpurkar, P., Jia, R., & Liang, P. (2018). "Know What You Don't Know: Unanswerable Questions for SQuAD". In: Association for Computational Linguistics (ACL).
- QUOTE: SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
To preserve the integrity of test results, we do not release the test set to the public. Instead, we require you to submit your model so that we can run it on the test set for you.
SQuAD2.0 tests the ability of a system to not only answer reading comprehension questions, but also abstain when presented with a question that cannot be answered based on the provided paragraph.
2018b
- (Rajpurkar & Jia et al., 2018) ⇒ Rajpurkar, P., Jia, R., & Liang, P. (2018). "Know What You Don't Know: Unanswerable Questions for SQuAD". In: arXiv preprint arXiv:1806.03822.
- QUOTE: We present SQuAD2.0, the latest version of the Stanford Question Answering Dataset (SQuAD) which combines the 100,000+ questions in SQuAD1.1 with over 50,000 new, unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.
To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
We show that the best systems trained on SQuAD1.1 achieve only 66% F1 on SQuAD2.0 test, a nearly 30% absolute drop in performance, despite reaching near-human performance on SQuAD1.1.
2017
- (Seo et al., 2017) ⇒ Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. (2017). “Bidirectional Attention Flow for Machine Comprehension.” In: Proceedings of ICLR 2017.
- QUOTE: ... SQuAD is a machine comprehension dataset on a large set of Wikipedia articles, with more than 100,000 questions. The answer to each question is always a span in the context. The model is given a credit if its answer matches one of the human written answers. Two metrics are used to evaluate models: Exact Match (EM) and a softer metric, F1 score, which measures the weighted average of the precision and recall rate at character level. The dataset consists of 90k/10k train/dev question-context tuples with a large hidden test set. It is one of the largest available MC datasets with human-written questions and serves as a great test bed for our model. ...
2016
- (Rajpurkar et al., 2016) ⇒ Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). "SQuAD: 100,000+ Questions for Machine Comprehension of Text". In: arXiv preprint arXiv:1606.05250.
- QUOTE: We present Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage.
With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger than previous reading comprehension datasets.
We show that a logistic regression model trained on crowd-sourced data outperforms a strong baseline (the percentage of questions correctly answered by humans), and that neural networks trained on SQuAD can achieve F1 scores of over 75%.