Decoder-only Transformer-based Large Language Model (LLM)
A Decoder-only Transformer-based Large Language Model (LLM) is a transformer-based LLM that uses only the decoder stack of the Transformer architecture, predicting each token autoregressively from the preceding (left) context; a minimal masking sketch is given after the See-also list below.
- Context:
- It can be optimized for generating coherent and contextually relevant text.
- It can be trained on large-scale datasets, allowing it to develop a broad understanding of language and context.
- It can (often) face challenges such as hardware failures and loss divergences during its extensive and resource-intensive training processes.
- ...
- Example(s):
- GPT LLM.
- PaLM 540B.
- BLOOM model.
- ...
- Counter-Example(s):
- an Encoder-Only Transformer-based Model, such as BERT.
- an Encoder-Decoder Transformer-based Model.
- ...
- See: Auto-Regressive Transformer-based LLM, Self-Attention Mechanism, Large Scale Language Model Training.
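The defining property above (each position may attend only to earlier positions) can be illustrated with a short causal self-attention sketch. This is a minimal illustration only, assuming PyTorch; the single-head, single-sequence setup and the tensor sizes are arbitrary choices, not part of any particular model.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    # q, k, v: (seq_len, d) tensors for a single head and a single sequence.
    seq_len, d = q.shape
    scores = q @ k.T / (d ** 0.5)                      # (seq_len, seq_len) attention scores
    # Mask out positions to the right: token i may only attend to tokens <= i.
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(5, 8)                                  # 5 tokens, hidden size 8
out = causal_self_attention(x, x, x)
print(out.shape)                                       # torch.Size([5, 8])
```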
References
2023
- chat
- Decoder-only models concentrate on autoregressive language modeling and text generation. These models predict the next word in a sequence given the previous words while attending only to the left context. GPT (Generative Pre-trained Transformer) is a well-known example of a decoder-only LLM. GPT uses the Transformer's decoder layers and is trained to generate text by predicting the next token in a sequence. This makes GPT particularly effective for tasks such as text generation, summarization, and translation.
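- The next-token behavior described in the quote above can be sketched in a few lines. This is an illustrative sketch, not part of the cited chat: it assumes the Hugging Face transformers library and uses the public GPT-2 checkpoint as a stand-in for a decoder-only LLM.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is used here only as a small, publicly available decoder-only example.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The decoder-only model predicts the next"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits                   # (1, seq_len, vocab_size)

# The distribution at the last position depends only on the left context.
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode([next_token_id]))
```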
2023
- (Fu et al., 2023) ⇒ Z. Fu, W. Lam, Q. Yu, A. M. C. So, S. Hu, Z. Liu, et al. (2023). “Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder.”
- NOTE: It explores the debate between the traditional Language Model built with a Transformer decoder and the encoder-decoder approach. The authors analyze the concatenation of source and target sequences used to train such a Language Model and its implications at prediction time.
2022
- (Mohsan et al., 2022) ⇒ M. M. Mohsan, M. U. Akram, G. Rasool, N. S. Alghamdi, et al. (2022). “Vision Transformer and Language Model Based Radiology Report Generation.”
- NOTE: It explores the integration of the Transformer model in radiology report generation. It specifically investigates the use of a Transformer with pre-trained weights in the decoder while leveraging a conventional CNN in the encoder pre-trained on CXR images.
2019
- (Maruyama & Yamamoto, 2019) ⇒ T. Maruyama, K. Yamamoto. (2019). “Extremely Low Resource Text Simplification with Pre-trained Transformer Language Model.”
- NOTE: It explores the challenge of text simplification under resource constraints. The authors consider various pre-training configurations for the encoder and decoder, examining the efficacy of each.
2019
- (Vig & Belinkov, 2019) ⇒ J. Vig, Y. Belinkov. (2019). “Analyzing the Structure of Attention in a Transformer Language Model.”
- NOTE: It explores the multi-head self-attention mechanism within the Transformer model. The authors offer insights through visualization of attention at various levels and discuss the encoder-decoder approach in contrast to the decoder-only GPT-2 model.
2020
- https://towardsdatascience.com/gpt-3-transformers-and-the-wild-world-of-nlp-9993d8bb1314
- QUOTE: 2.2 Architecture
- In terms of architecture, transformer models are quite similar. Most of the models follow the same architecture as one of the “founding fathers”, the original transformer, BERT and GPT. They represent three basic architectures: encoder only, decoder only and both.
- Decoder only (GPT): In many ways, an encoder with a CLM head can be considered a decoder. Instead of outputting hidden states, decoders are wired to generate sequences in an auto-regressive way, whereby the previous generated word is used as input to generate the next one.
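- The auto-regressive feedback described in this quote (each generated token is appended to the input and used to generate the next one) can be sketched as a greedy decoding loop. The use of GPT-2 via the Hugging Face transformers library here is an illustrative assumption, not something the quoted article prescribes.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("Decoder-only language models generate text",
                      return_tensors="pt").input_ids

for _ in range(10):                                    # generate 10 tokens greedily
    with torch.no_grad():
        logits = model(input_ids).logits
    next_id = logits[0, -1].argmax().reshape(1, 1)     # most likely next token
    input_ids = torch.cat([input_ids, next_id], dim=1) # feed it back as input

print(tokenizer.decode(input_ids[0]))
```

- In practice, sampling strategies such as top-k or nucleus sampling usually replace the greedy argmax shown in this sketch.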