Stanford Question Answering (SQuAD) Benchmark Task
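A Stanford Question Answering (SQuAD) Benchmark Task is a question answering benchmark task that evaluates reading comprehension systems on the Stanford Question Answering Dataset (SQuAD), in which the answer to each question is a text span from a Wikipedia passage.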
- Example(s):
- Counter-Example(s):
- a TREC QA Task.
- a Jeopardy! Game.
- a VQA Benchmark Task.
- See: Automated QA, Reading Comprehension.
References
2017
- (Seo et al., 2017) ⇒ Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. (2017). “Bidirectional Attention Flow for Machine Comprehension.” In: Proceedings of ICLR 2017.
- QUOTE: ... SQuAD is a machine comprehension dataset on a large set of Wikipedia articles, with more than 100,000 questions. The answer to each question is always a span in the context. The model is given a credit if its answer matches one of the human written answers. Two metrics are used to evaluate models: Exact Match (EM) and a softer metric, F1 score, which measures the weighted average of the precision and recall rate at character level. The dataset consists of 90k/10k train/dev question-context tuples with a large hidden test set. It is one of the largest available MC datasets with human-written questions and serves as a great test bed for our model. ...
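The two metrics named in the quote above can be sketched as follows. This is a minimal illustration, not the official scorer: it assumes token-level overlap after light normalization (lowercasing, stripping punctuation and English articles), which is how the commonly used SQuAD v1.1 evaluation script computes F1, whereas the quote describes the overlap at character level. F1 is taken here as the harmonic mean of precision and recall. The function names are hypothetical.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall (an assumption;
    the quoted paper describes a character-level variant)."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric, prediction: str, references: list) -> float:
    """Credit the prediction against its best-matching human answer."""
    return max(metric(prediction, ref) for ref in references)
```

Taking the maximum over references mirrors the statement that a model is credited if its answer matches any one of the human-written answers; a benchmark score would then average this value over all questions.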
2016
- (Github, 2016) ⇒ https://rajpurkar.github.io/SQuAD-explorer/
- QUOTE: Stanford Question Answering Dataset (SQuAD) is a new reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger than previous reading comprehension datasets.
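The span property described in the quote above (every answer is a segment of the reading passage) can be illustrated with a small sketch. The field names `text` and `answer_start` follow the layout of the released SQuAD v1.1 JSON files; the helper functions and the example passage are hypothetical and exist only for illustration.

```python
def make_answer(context: str, answer_text: str) -> dict:
    """Build an answer record, locating the answer as a span of the context."""
    start = context.find(answer_text)
    if start < 0:
        # In a span-based dataset, an answer that is not in the passage is invalid.
        raise ValueError("answer text is not a span of the context")
    return {"text": answer_text, "answer_start": start}


def span_of(context: str, answer: dict) -> str:
    """Recover the answer string from its character offset in the context."""
    start = answer["answer_start"]
    return context[start:start + len(answer["text"])]


# Toy passage and question, invented for this sketch (not taken from SQuAD).
context = "The dataset was built from questions posed on Wikipedia articles."
question = "What were the questions posed on?"
answer = make_answer(context, "Wikipedia articles")

# The stored offset and text must agree with the passage itself.
assert span_of(context, answer) == answer["text"]
```

Storing a character offset alongside the answer text lets a loader check each record against its passage and disambiguate answers whose text occurs more than once.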
- (Rajpurkar et al., 2016) ⇒ Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. (2016). “SQuAD: 100,000+ Questions for Machine Comprehension of Text.” In: arXiv preprint arXiv:1606.05250.
- QUOTE: We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We build a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research. The dataset is freely available at http://stanford-qa.com