GPT-2 Benchmark Task
A GPT-2 Benchmark Task is a Natural Language Processing Benchmark Task that evaluates the performance of GPT-2 in solving NLP tasks.
- Context:
- Task Input: text items.
- Task Output:
- Task Requirement(s):
- Benchmark Datasets:
- Language modelling task datasets: WebText, WikiText-2, 1BW, Penn Treebank, LAMBADA, and Children's Book Test (CBT);
- Reading comprehension task dataset: CoQA (Reddy et al., 2018);
- NMT task dataset: WMT-14 Fr-En (Artetxe et al., 2017);
- Text summarization task datasets: CNN and Daily Mail (Hermann et al., 2015);
- Question-answering generation task dataset: Natural Questions (Kwiatkowski et al., 2019).
- Benchmark Performance Metrics:
- Zero-shot task performance metric (language modelling task datasets);
- Winograd Schema Challenge performance metric (WebText dataset);
- Named entity, noun, verb, and preposition language modelling accuracy: cloze test results (CBT datasets);
- Long-range dependencies in text modelling: perplexity metric (LAMBADA dataset), with a minimal perplexity evaluation sketch after this context list;
- Reading comprehension: F1 performance metric (CoQA dataset);
- NMT: BLEU score (WMT-14 Fr-En dataset);
- Text summarization: ROUGE (CNN and Daily Mail datasets);
- Question answering generation: exact match metric (Natural Questions dataset).
- Baseline Models:
- GPT-2 Language Model (reference model);
- GPT-1 Language Model (Radford et al., 2018);
- BERT Language Model (Devlin et al., 2019).
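Most of the language-modelling metrics listed above (perplexity, bits per byte, bits per character) reduce to the model's average negative log-likelihood on held-out text. The following is a minimal sketch of a zero-shot perplexity evaluation; it assumes the Hugging Face transformers and torch packages (not part of the original benchmark code) and uses a placeholder text list in place of an actual benchmark dataset.

```python
# Minimal zero-shot perplexity sketch for a GPT-2 checkpoint.
# Assumes the Hugging Face `transformers` and `torch` packages;
# the evaluation texts are placeholders, not a benchmark dataset.
import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

texts = ["The quick brown fox jumps over the lazy dog."]  # placeholder held-out text

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for text in texts:
        enc = tokenizer(text, return_tensors="pt")
        # The model shifts the labels internally; `loss` is the mean
        # per-token cross-entropy (in nats) over the predicted positions.
        out = model(enc.input_ids, labels=enc.input_ids)
        n_predicted = enc.input_ids.size(1) - 1
        total_nll += out.loss.item() * n_predicted
        total_tokens += n_predicted

print(f"zero-shot perplexity: {math.exp(total_nll / total_tokens):.2f}")
```

The published GPT-2 numbers additionally apply dataset-specific, invertible de-tokenizers and report results per word, character, or byte depending on the dataset; the sketch above only illustrates the core computation.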
- Example(s):
| Model | LAMBADA (PPL) | LAMBADA (ACC) | CBT-CN (ACC) | CBT-NE (ACC) | WikiText2 (PPL) | PTB (PPL) | enwik8 (BPB) | text8 (BPC) | WikiText103 (PPL) | 1BW (PPL) |
|---|---|---|---|---|---|---|---|---|---|---|
| SOTA | 99.8 | 59.23 | 85.7 | 82.3 | 39.14 | 46.54 | 0.99 | 1.08 | 18.3 | 21.8 |
| 117M | 35.13 | 45.99 | 87.65 | 83.4 | 29.41 | 65.85 | 1.16 | 1.17 | 37.50 | 75.20 |
| 345M | 15.60 | 55.48 | 92.35 | 87.1 | 22.76 | 47.33 | 1.01 | 1.06 | 26.37 | 55.72 |
| 762M | 10.87 | 60.12 | 93.45 | 88.0 | 19.93 | 40.31 | 0.97 | 1.02 | 22.05 | 44.575 |
| 1542M | 8.63 | 63.24 | 93.30 | 89.05 | 18.34 | 35.76 | 0.93 | 0.98 | 17.48 | 42.16 |
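The zero-shot table above mixes evaluation units: per-token perplexity (PPL), bits per byte (BPB), bits per character (BPC), and accuracy (ACC). The first three are monotone transforms of the model's average negative log-likelihood; a sketch of the standard definitions (the symbols below are generic and not taken from the source):

```latex
% N = number of predicted tokens, C = number of characters (bytes for BPB).
\mathrm{PPL} = \exp\!\Big( -\tfrac{1}{N} \sum_{i=1}^{N} \ln p_\theta(x_i \mid x_{<i}) \Big)
\qquad
\mathrm{BPC} = -\tfrac{1}{C} \sum_{i=1}^{N} \log_2 p_\theta(x_i \mid x_{<i})
```

Lower is better for PPL, BPB, and BPC; higher is better for the accuracy (ACC) columns.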
| Model | R-1 | R-2 | R-L | R-AVG |
|---|---|---|---|---|
| Bottom-Up Sum | 41.22 | 18.68 | 38.34 | 32.75 |
| Lede-3 | 40.38 | 17.66 | 36.62 | 31.55 |
| Seq2Seq + Attn | 31.33 | 11.81 | 28.83 | 23.99 |
| GPT-2 TL;DR: | 29.34 | 8.27 | 26.58 | 21.40 |
| Random-3 | 28.78 | 8.63 | 25.52 | 20.98 |
| GPT-2 no hint | 21.58 | 4.03 | 19.47 | 15.03 |
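The two GPT-2 rows in the ROUGE table reflect the prompt used at generation time: in the "TL;DR:" setting, Radford et al. (2019) append the text "TL;DR:" after the article, sample 100 tokens with top-k random sampling (k = 2), and keep the first three generated sentences as the summary; "no hint" omits that induced prompt. Below is a minimal sketch of the hinted setting, again assuming the Hugging Face transformers package and a placeholder article.

```python
# Sketch of the "TL;DR:" summarization prompt described in Radford et al. (2019):
# append "TL;DR:" to the article, sample 100 tokens with top-k (k = 2) sampling,
# and keep the first three generated sentences as the summary.
# Assumes Hugging Face `transformers`; the article text is a placeholder.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

article = "..."  # placeholder for a CNN/Daily Mail article
prompt = article + "\nTL;DR:"

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    inputs.input_ids,
    max_new_tokens=100,               # the paper generates 100 tokens
    do_sample=True,
    top_k=2,                          # top-k random sampling with k = 2
    pad_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated continuation, then keep 3 sentences.
generated = tokenizer.decode(
    output_ids[0, inputs.input_ids.size(1):], skip_special_tokens=True
)
summary = " ".join(generated.split(". ")[:3])
print(summary)
```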
- Counter-Example(s):
- See: Computing Software Benchmark Task, OpenAI GPT, GPT-2 Neural Network, WebText Dataset, GPT-2 Web Scraper, Byte Pair Encoding (BPE), Word-level Language Model, Byte-level Language Model.
References
2019a
- (Radford et al., 2019) ⇒ Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. (2019). “Language Models Are Unsupervised Multitask Learners.” In: OpenAI Blog, 1(8).
- QUOTE: We trained and benchmarked four LMs with approximately log-uniformly spaced sizes. The architectures are summarized in Table 2. The smallest model is equivalent to the original GPT, and the second smallest equivalent to the largest model from BERT (Devlin et al., 2019). Our largest model, which we call GPT-2, has over an order of magnitude more parameters than GPT. The learning rate of each model was manually tuned for the best perplexity on a 5% held-out sample of WebText. All models still underfit WebText and held-out perplexity has as of yet improved given more training time.
2019b
- (Devlin et al., 2019) ⇒ Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Volume 1 (Long and Short Papers). DOI:10.18653/v1/N19-1423. arXiv:1810.04805