GPT-J-6B
A GPT-J-6B is a GPT-J model with 6 billion parameters.
- Context:
- It was trained on the Pile, a large-scale curated dataset created by EleutherAI.
- …
- See: GPT-3.
References
2023
- https://huggingface.co/EleutherAI/gpt-j-6B
- GPT-J 6B is a transformer model trained using Ben Wang's Mesh Transformer JAX. "GPT-J" refers to the class of model, while "6B" represents the number of trainable parameters.
| Hyperparameter | Value |
|---|---|
| n_parameters | 6,053,381,344 |
| n_layers | 28* |
| d_model | 4096 |
| d_ff | 16384 |
| n_heads | 16 |
| d_head | 256 |
| n_ctx | 2048 |
| n_vocab | 50257 / 50400† (same tokenizer as GPT-2/3) |
| Positional Encoding | Rotary Position Embedding (RoPE) |
| RoPE Dimensions | 64 |
- * Each layer consists of one feedforward block and one self-attention block.
- † Although the embedding matrix has a size of 50400, only 50257 entries are used by the GPT-2 tokenizer.
- The model consists of 28 layers with a model dimension of 4096, and a feedforward dimension of 16384. The model dimension is split into 16 heads, each with a dimension of 256. Rotary Position Embedding (RoPE) is applied to 64 dimensions of each head. The model is trained with a tokenization vocabulary of 50257, using the same set of BPEs as GPT-2/GPT-3.
- Training data: GPT-J 6B was trained on the Pile, a large-scale curated dataset created by EleutherAI.
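The checkpoint referenced above is published on the Hugging Face Hub, so it can be loaded through the `transformers` library. Below is a minimal sketch of loading the model and sampling a continuation; the dtype, prompt, and generation settings are illustrative assumptions, not part of the model card.

```python
# Minimal sketch: load GPT-J-6B from the Hugging Face Hub and generate text.
# float16 is an assumption to keep the 6B parameters at roughly 12 GB of memory.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    torch_dtype=torch.float16,
)

prompt = "EleutherAI is"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=40,      # length of the sampled continuation (illustrative)
    do_sample=True,
    temperature=0.8,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```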
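The architecture notes above state that Rotary Position Embedding (RoPE) is applied to 64 of the 256 dimensions of each head. The NumPy sketch below illustrates that idea under stated assumptions: it uses the interleaved pairwise-rotation formulation of RoPE, and the function and variable names are hypothetical, not taken from GPT-J's implementation.

```python
# Sketch of RoPE applied to only the first `rotary_dim` dimensions of each head.
import numpy as np

def apply_rope(x, rotary_dim=64, base=10000.0):
    """x: (seq_len, n_heads, d_head) query or key tensor (illustrative helper)."""
    seq_len, n_heads, d_head = x.shape
    x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]

    # One rotation frequency per pair of rotary dimensions.
    inv_freq = 1.0 / (base ** (np.arange(0, rotary_dim, 2) / rotary_dim))
    angles = np.einsum("s,f->sf", np.arange(seq_len), inv_freq)  # (seq_len, rotary_dim/2)
    cos, sin = np.cos(angles)[:, None, :], np.sin(angles)[:, None, :]

    # Rotate each (even, odd) pair of dimensions by its position-dependent angle.
    x_even, x_odd = x_rot[..., 0::2], x_rot[..., 1::2]
    rotated = np.empty_like(x_rot)
    rotated[..., 0::2] = x_even * cos - x_odd * sin
    rotated[..., 1::2] = x_even * sin + x_odd * cos

    # The remaining (d_head - rotary_dim) dimensions pass through unchanged.
    return np.concatenate([rotated, x_pass], axis=-1)

# Example with GPT-J-6B's shapes: context 2048, 16 heads of dimension 256.
q = np.random.randn(2048, 16, 256)
print(apply_rope(q).shape)  # (2048, 16, 256)
```

Because the rotation depends only on token position and dimension index, the same function is applied to both queries and keys before the attention dot product.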