GPT-J-6B
A GPT-J-6B is a GPT-J model with 6 billion parameters.
- Context:
- It was trained on the Pile, a large-scale curated dataset created by EleutherAI.
- …
- See: GPT-3.
References
2023
- https://huggingface.co/EleutherAI/gpt-j-6B
- GPT-J 6B is a transformer model trained using Ben Wang's Mesh Transformer JAX. "GPT-J" refers to the class of model, while "6B" represents the number of trainable parameters.
| Hyperparameter | Value |
|---|---|
| n_parameters | 6,053,381,344 |
| n_layers | 28* |
| d_model | 4096 |
| d_ff | 16384 |
| n_heads | 16 |
| d_head | 256 |
| n_ctx | 2048 |
| n_vocab | 50257 / 50400† (same tokenizer as GPT-2/3) |
| Positional Encoding | Rotary Position Embedding (RoPE) |
| RoPE Dimensions | 64 |
- * Each layer consists of one feedforward block and one self-attention block.
- † Although the embedding matrix has a size of 50400, only 50257 entries are used by the GPT-2 tokenizer.
- The model consists of 28 layers with a model dimension of 4096, and a feedforward dimension of 16384. The model dimension is split into 16 heads, each with a dimension of 256. Rotary Position Embedding (RoPE) is applied to 64 dimensions of each head. The model is trained with a tokenization vocabulary of 50257, using the same set of BPEs as GPT-2/GPT-3.
- Training data: GPT-J 6B was trained on the Pile, a large-scale curated dataset created by EleutherAI.
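The checkpoint referenced above is published on the Hugging Face Hub, so it can be loaded through the `transformers` library. Below is a minimal sketch of loading the model and sampling a continuation; the dtype, prompt, and generation settings are illustrative assumptions, not part of the model card.

```python
# Minimal sketch: load GPT-J-6B from the Hugging Face Hub and generate text.
# float16 is an assumption to keep the 6B parameters at roughly 12 GB of memory.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    torch_dtype=torch.float16,
)

prompt = "EleutherAI is"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=40,      # length of the sampled continuation (illustrative)
    do_sample=True,
    temperature=0.8,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```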
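The architecture notes above state that Rotary Position Embedding (RoPE) is applied to 64 of the 256 dimensions of each head. The NumPy sketch below illustrates that idea under stated assumptions: it uses the interleaved pairwise-rotation formulation of RoPE, and the function and variable names are hypothetical, not taken from GPT-J's implementation.

```python
# Sketch of RoPE applied to only the first `rotary_dim` dimensions of each head.
import numpy as np

def apply_rope(x, rotary_dim=64, base=10000.0):
    """x: (seq_len, n_heads, d_head) query or key tensor (illustrative helper)."""
    seq_len, n_heads, d_head = x.shape
    x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]

    # One rotation frequency per pair of rotary dimensions.
    inv_freq = 1.0 / (base ** (np.arange(0, rotary_dim, 2) / rotary_dim))
    angles = np.einsum("s,f->sf", np.arange(seq_len), inv_freq)  # (seq_len, rotary_dim/2)
    cos, sin = np.cos(angles)[:, None, :], np.sin(angles)[:, None, :]

    # Rotate each (even, odd) pair of dimensions by its position-dependent angle.
    x_even, x_odd = x_rot[..., 0::2], x_rot[..., 1::2]
    rotated = np.empty_like(x_rot)
    rotated[..., 0::2] = x_even * cos - x_odd * sin
    rotated[..., 1::2] = x_even * sin + x_odd * cos

    # The remaining (d_head - rotary_dim) dimensions pass through unchanged.
    return np.concatenate([rotated, x_pass], axis=-1)

# Example with GPT-J-6B's shapes: context 2048, 16 heads of dimension 256.
q = np.random.randn(2048, 16, 256)
print(apply_rope(q).shape)  # (2048, 16, 256)
```

Because the rotation depends only on token position and dimension index, the same function is applied to both queries and keys before the attention dot product.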