OpenAI GPT-2 Large Language Model (LLM)

An OpenAI GPT-2 Large Language Model (LLM) is an OpenAI transformer-based autoregressive language model, released in 2019 in sizes ranging from 117 million to 1.5 billion parameters.



References

2024

  • (Karpathy, 2024a) ⇒ Andrej Karpathy. (2024). “Let's Reproduce GPT-2 (124M).” YouTube.
    • NOTES:
      • It covers the entire process of reproducing the GPT-2 (124M) model from scratch, from implementing the architecture to setting up the training run and finally sampling text generations from the trained model.
      • It walks through a PyTorch implementation of the GPT-2 architecture, highlighting its differences from the original Transformer, such as moving layer normalization to the input of each sub-block and adding a final layer normalization after the last block (a minimal sketch follows this entry).
      • It shows how to load the pre-trained GPT-2 model weights via the Hugging Face library, including how the token and positional embeddings are handled, so that the reimplementation can be initialized to match the original GPT-2's behavior.
    • QUOTE: We reproduce the GPT-2 (124M) from scratch. This video covers the whole process: First we build the GPT-2 network, then we optimize its training to be really fast, then we set up the training run following the GPT-2 and GPT-3 paper and their hyperparameters, then we hit run, and come back the next morning to see our results, and enjoy some amusing model generations. ...
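    • ILLUSTRATIVE SKETCH (not from the video): a minimal PyTorch rendering of a GPT-2-style pre-layer-norm transformer block, to make the architectural notes above concrete. The default sizes (n_embd=768, n_head=12) are those of the 124M model; everything else is an assumed simplification.

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class CausalSelfAttention(nn.Module):
          def __init__(self, n_embd=768, n_head=12):
              super().__init__()
              self.n_head = n_head
              self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # fused Q, K, V projection
              self.c_proj = nn.Linear(n_embd, n_embd)      # output projection

          def forward(self, x):
              B, T, C = x.shape
              q, k, v = self.c_attn(x).split(C, dim=2)
              q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
              k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
              v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
              # causal masking: each position attends only to earlier positions
              y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
              y = y.transpose(1, 2).contiguous().view(B, T, C)
              return self.c_proj(y)

      class Block(nn.Module):
          # GPT-2 places LayerNorm *before* each sub-block (pre-LN), unlike the
          # post-LN ordering of the original Transformer.
          def __init__(self, n_embd=768, n_head=12):
              super().__init__()
              self.ln_1 = nn.LayerNorm(n_embd)
              self.attn = CausalSelfAttention(n_embd, n_head)
              self.ln_2 = nn.LayerNorm(n_embd)
              self.mlp = nn.Sequential(
                  nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
              )

          def forward(self, x):
              x = x + self.attn(self.ln_1(x))  # residual around pre-normed attention
              x = x + self.mlp(self.ln_2(x))   # residual around pre-normed MLP
              return x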

2023

  • chat
    • GPT-2, or Generative Pre-trained Transformer 2, is an autoregressive language model developed by OpenAI. It is based on the Transformer architecture introduced by Vaswani et al. in 2017. Like GPT-3, GPT-2 also employs a single stack of Transformer layers without separate encoder and decoder components. The architecture mainly consists of self-attention mechanisms and feed-forward layers.

      The full GPT-2 model has 1.5 billion parameters. However, OpenAI released several smaller versions of GPT-2 with fewer parameters, allowing users to choose a model that best fits their computational resources and performance requirements. Here's a list of the published GPT-2 model versions along with their number of parameters (a loading sketch follows the list):

      • GPT-2 Small (also known as "117M"): 117 million parameters, the smallest released GPT-2 model, designed for lower-resource tasks and faster response times. (DistilGPT-2 is a separate, distilled derivative of GPT-2, not the same model as GPT-2 Small.)
      • GPT-2 Medium (also known as "345M"): 345 million parameters, offering a balance between performance and computational requirements.
      • GPT-2 Large (also known as "774M"): 774 million parameters, a larger model with improved performance compared to the smaller variants.
      • GPT-2 Extra Large (also known as "1.5B"): 1.5 billion parameters, the largest and most powerful GPT-2 model, delivering the highest-quality results for various NLP tasks.
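    • ILLUSTRATIVE SKETCH (not part of the chat answer): loading the four published GPT-2 checkpoints with the Hugging Face transformers library and printing their parameter counts. The model IDs ("gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl") are the Hugging Face Hub names for these checkpoints.

      from transformers import GPT2LMHeadModel

      for model_id in ["gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl"]:
          model = GPT2LMHeadModel.from_pretrained(model_id)
          n_params = sum(p.numel() for p in model.parameters())
          print(f"{model_id}: {n_params / 1e6:.0f}M parameters")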

2019b

  • (OpenAI, 2019) ⇒ https://openai.com/blog/better-language-models/
    • QUOTE: Our model, called GPT-2 (a successor to GPT), was trained simply to predict the next word in 40GB of Internet text. Due to our concerns about malicious applications of the technology, we are not releasing the trained model. As an experiment in responsible disclosure, we are instead releasing a much smaller model for researchers to experiment with, as well as a technical paper.

       GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages (the dataset which emphasizes diversity of content, by scraping content from the Internet. In order to preserve document quality, we used only pages which have been curated/filtered by humans — specifically, we used outbound links from Reddit which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting (whether educational or funny), leading to higher data quality than other similar datasets, such as CommonCrawl.). ... GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.
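    • ILLUSTRATIVE SKETCH (not from the OpenAI post): what "trained simply to predict the next word" looks like at inference time, sampling a continuation from the small released GPT-2 checkpoint via the Hugging Face transformers pipeline API (an assumed tooling choice; the prompt and generation settings are arbitrary).

      from transformers import pipeline, set_seed

      set_seed(42)  # make the sampled continuation reproducible
      generator = pipeline("text-generation", model="gpt2")
      out = generator(
          "GPT-2 is a large transformer-based language model that",
          max_new_tokens=40, do_sample=True, top_k=50,
      )
      print(out[0]["generated_text"])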

2019c

  • (Wikipedia, 2019) ⇒ https://en.wikipedia.org/wiki/OpenAI#GPT2 Retrieved:2019-9-8.
    • GPT2 (2019) is an AI system that generates text matching its input in subject and tone. For example, when fed the first sentence of George Orwell's novel Nineteen Eighty-Four it produces plausible futuristic fiction set in China. Unlike previous OpenAI products, GPT2 has not been released to the public out of concerns of potential misuse, including applications for writing fake news. Much of the academic community is skeptical that GPT2 poses a significant threat. The Allen Institute for Artificial Intelligence followed up with a tool to detect "neural fake news". Other researchers, like Jeremy Howard, warn of "the technology to totally fill Twitter, email, and the web up with reasonable-sounding, context-appropriate prose, which would drown out all other speech and be impossible to filter".
