CoQA Challenge
A CoQA Challenge is a Machine Learning Benchmark Task that evaluates the performance of Question-Answering Systems on CoQA Datasets.
- Context:
- Resource(s): Official website, dataset downloads, and leaderboard are available at https://stanfordnlp.github.io/coqa/
- Task Input(s): CoQA Datasets (each example pairs a text passage with a sequence of conversational questions; see the data-format sketch after this list).
- Task Output(s): Performance Metrics.
- Task Requirement(s):
- Benchmark Performance Metrics: macro-average word-overlap F1 Score (see Reddy et al., 2019).
- Baseline Models:
- RoBERTa + AT + KD (ensemble);
- TR-MT (ensemble);
- RoBERTa + AT + KD (single model);
- Google SQuAD 2.0 + MMFT (ensemble and single model);
- XLNet + Augmentation (single model);
- ConvBERT (ensemble);
- BERT + MMFT + ADA (ensemble);
- XLNet + MMFT + ADA (single model);
- BERT + AttentionFusionNet (single model);
- BERT + Answer Verification (single model);
- BERT with History Augmented Query (single model);
- BERT Large Fine-tuned Baseline (single model);
- BERT Large Augmented (single model);
- D-AoA + BERT (single model);
- BERT Augmented + AoA (single model);
- CNet (single model);
- SDNet (ensemble);
- CQANet (single model);
- …
- Counter-Example(s):
- a SQuAD Benchmark Task (single-turn, extractive question answering rather than conversational, free-form question answering).
- See: CoQA Dataset, Question-Answering System, Natural Language Processing Task, Natural Language Understanding Task, Natural Language Generation Task.
References
2020
- (CoQA, 2020) ⇒ https://stanfordnlp.github.io/coqa/ Retrieved:2020-06-03.
- QUOTE: CoQA is a large-scale dataset for building Conversational Question Answering systems. The goal of the CoQA challenge is to measure the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation. (...)
CoQA contains 127,000+ questions with answers collected from 8000+ conversations. Each conversation is collected by pairing two crowdworkers to chat about a passage in the form of questions and answers. The unique features of CoQA include 1) the questions are conversational; 2) the answers can be free-form text; 3) each answer also comes with an evidence subsequence highlighted in the passage; and 4) the passages are collected from seven diverse domains. CoQA has a lot of challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning.
2019
- (Reddy et al., 2019) ⇒ Siva Reddy, Danqi Chen, and Christopher D. Manning. (2019). “CoQA: A Conversational Question Answering Challenge.” In: Transactions of the Association for Computational Linguistics, 7. DOI:10.1162/tacl_a_00266.
- QUOTE: ... we introduce CoQA, a Conversational Question Answering dataset for measuring the ability of machines to participate in a question-answering style conversation. In CoQA, a machine has to understand a text passage and answer a series of questions that appear in a conversation. We develop CoQA with three main goals in mind.
The first concerns the nature of questions in a human conversation (...)
The second goal of CoQA is to ensure the naturalness of answers in a conversation (...)
The third goal of CoQA is to enable building QA systems that perform robustly across domains (...)
(...) Following SQuAD, we use macro-average F1 score of word overlap as our main evaluation metric[1].
- ↑ SQuAD also uses exact-match metric, however, we think F1 is more appropriate for our dataset because of the free-form answers.