GPT-2 Benchmark Task
A GPT-2 Benchmark Task is a Natural Language Processing Benchmark Task that evaluates the performance of GPT-2 in solving NLP tasks.
- Context:
- Task Input: text items.
- Task Output:
- Task Requirement(s):
- Benchmark Datasets:
- Language modelling task datasets: WebText, WikiText-2, 1BW, Penn Treebank, LAMBADA, and Children's Book Test (CBT);
- Reading comprehension task dataset: CoQA (Reddy et al., 2018);
- NMT task dataset: WMT-14 Fr-En (Artetxe et al., 2017);
- Text summarization task datasets: CNN and Daily Mail (Hermann et al., 2015);
- Question-answering generation task dataset: Natural Questions (Kwiatkowski et al., 2019).
- Benchmark Performance Metrics:
- Zero-shot task performance metric (language modelling task datasets);
- Winograd Schema Challenge performance metric (WebText dataset);
- Named entity, noun, verb, and preposition language modelling accuracy: cloze test results (CBT datasets);
- Long-range dependencies in text modelling: perplexity metric (LAMBADA dataset), with a minimal perplexity evaluation sketch after this context list;
- Reading comprehension: F1 performance metric (CoQA dataset);
- NMT: BLEU score (WMT-14 Fr-En dataset);
- Text summarization: ROUGE (CNN and Daily Mail datasets);
- Question answering generation: exact match metric (Natural Questions dataset).
- Baseline Models:
- GPT-2 Language Model (reference model);
- GPT-1 Language Model (Radford et al., 2018);
- BERT Language Model (Devlin et al., 2019).
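Most of the language-modelling metrics listed above (perplexity, bits per byte, bits per character) reduce to the model's average negative log-likelihood on held-out text. The following is a minimal sketch of a zero-shot perplexity evaluation; it assumes the Hugging Face transformers and torch packages (not part of the original benchmark code) and uses a placeholder text list in place of an actual benchmark dataset.

```python
# Minimal zero-shot perplexity sketch for a GPT-2 checkpoint.
# Assumes the Hugging Face `transformers` and `torch` packages;
# the evaluation texts are placeholders, not a benchmark dataset.
import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

texts = ["The quick brown fox jumps over the lazy dog."]  # placeholder held-out text

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for text in texts:
        enc = tokenizer(text, return_tensors="pt")
        # The model shifts the labels internally; `loss` is the mean
        # per-token cross-entropy (in nats) over the predicted positions.
        out = model(enc.input_ids, labels=enc.input_ids)
        n_predicted = enc.input_ids.size(1) - 1
        total_nll += out.loss.item() * n_predicted
        total_tokens += n_predicted

print(f"zero-shot perplexity: {math.exp(total_nll / total_tokens):.2f}")
```

The published GPT-2 numbers additionally apply dataset-specific, invertible de-tokenizers and report results per word, character, or byte depending on the dataset; the sketch above only illustrates the core computation.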
- Example(s):
| Model | LAMBADA (PPL) | LAMBADA (ACC) | CBT-CN (ACC) | CBT-NE (ACC) | WikiText2 (PPL) | PTB (PPL) | enwik8 (BPB) | text8 (BPC) | WikiText103 (PPL) | 1BW (PPL) |
|---|---|---|---|---|---|---|---|---|---|---|
| SOTA | 99.8 | 59.23 | 85.7 | 82.3 | 39.14 | 46.54 | 0.99 | 1.08 | 18.3 | 21.8 |
| 117M | 35.13 | 45.99 | 87.65 | 83.4 | 29.41 | 65.85 | 1.16 | 1.17 | 37.50 | 75.20 |
| 345M | 15.60 | 55.48 | 92.35 | 87.1 | 22.76 | 47.33 | 1.01 | 1.06 | 26.37 | 55.72 |
| 762M | 10.87 | 60.12 | 93.45 | 88.0 | 19.93 | 40.31 | 0.97 | 1.02 | 22.05 | 44.575 |
| 1542M | 8.63 | 63.24 | 93.30 | 89.05 | 18.34 | 35.76 | 0.93 | 0.98 | 17.48 | 42.16 |
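The zero-shot table above mixes evaluation units: per-token perplexity (PPL), bits per byte (BPB), bits per character (BPC), and accuracy (ACC). The first three are monotone transforms of the model's average negative log-likelihood; a sketch of the standard definitions (the symbols below are generic and not taken from the source):

```latex
% N = number of predicted tokens, C = number of characters (bytes for BPB).
\mathrm{PPL} = \exp\!\Big( -\tfrac{1}{N} \sum_{i=1}^{N} \ln p_\theta(x_i \mid x_{<i}) \Big)
\qquad
\mathrm{BPC} = -\tfrac{1}{C} \sum_{i=1}^{N} \log_2 p_\theta(x_i \mid x_{<i})
```

Lower is better for PPL, BPB, and BPC; higher is better for the accuracy (ACC) columns.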
| Model | R-1 | R-2 | R-L | R-AVG |
|---|---|---|---|---|
| Bottom-Up Sum | 41.22 | 18.68 | 38.34 | 32.75 |
| Lede-3 | 40.38 | 17.66 | 36.62 | 31.55 |
| Seq2Seq + Attn | 31.33 | 11.81 | 28.83 | 23.99 |
| GPT-2 TL;DR: | 29.34 | 8.27 | 26.58 | 21.40 |
| Random-3 | 28.78 | 8.63 | 25.52 | 20.98 |
| GPT-2 no hint | 21.58 | 4.03 | 19.47 | 15.03 |
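The two GPT-2 rows in the ROUGE table reflect the prompt used at generation time: in the "TL;DR:" setting, Radford et al. (2019) append the text "TL;DR:" after the article, sample 100 tokens with top-k random sampling (k = 2), and keep the first three generated sentences as the summary; "no hint" omits that induced prompt. Below is a minimal sketch of the hinted setting, again assuming the Hugging Face transformers package and a placeholder article.

```python
# Sketch of the "TL;DR:" summarization prompt described in Radford et al. (2019):
# append "TL;DR:" to the article, sample 100 tokens with top-k (k = 2) sampling,
# and keep the first three generated sentences as the summary.
# Assumes Hugging Face `transformers`; the article text is a placeholder.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

article = "..."  # placeholder for a CNN/Daily Mail article
prompt = article + "\nTL;DR:"

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    inputs.input_ids,
    max_new_tokens=100,               # the paper generates 100 tokens
    do_sample=True,
    top_k=2,                          # top-k random sampling with k = 2
    pad_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated continuation, then keep 3 sentences.
generated = tokenizer.decode(
    output_ids[0, inputs.input_ids.size(1):], skip_special_tokens=True
)
summary = " ".join(generated.split(". ")[:3])
print(summary)
```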
- Counter-Example(s):
- See: Computing Software Benchmark Task, OpenAI GPT, GPT-2 Neural Network, WebText Dataset, GPT-2 Web Scraper, Byte Pair Encoding (BPE), Word-level Language Model, Byte-level Language Model.
References
2019a
- (Radford et al., 2019) ⇒ Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. (2019). “Language Models Are Unsupervised Multitask Learners.” In: OpenAI Blog, 1(8).
- QUOTE: We trained and benchmarked four LMs with approximately log-uniformly spaced sizes. The architectures are summarized in Table 2. The smallest model is equivalent to the original GPT, and the second smallest equivalent to the largest model from BERT (Devlin et al., 2019). Our largest model, which we call GPT-2, has over an order of magnitude more parameters than GPT. The learning rate of each model was manually tuned for the best perplexity on a 5% held-out sample of WebText. All models still underfit WebText and held-out perplexity has as of yet improved given more training time.
2019b
- (Devlin et al., 2019) ⇒ Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Volume 1 (Long and Short Papers). DOI:10.18653/v1/N19-1423. arXiv:1810.04805