GPT-J-6B

From GM-RKB

A GPT-J-6B is a GPT-J model with approximately 6 billion (6,053,381,344) trainable parameters.



References

2023

  • https://huggingface.co/EleutherAI/gpt-j-6B
    • GPT-J 6B is a transformer model trained using Ben Wang's Mesh Transformer JAX. "GPT-J" refers to the class of model, while "6B" represents the number of trainable parameters.
      • Hyperparameter: Value
      • n_parameters: 6,053,381,344
      • n_layers: 28*
      • d_model: 4096
      • d_ff: 16384
      • n_heads: 16
      • d_head: 256
      • n_ctx: 2048
      • n_vocab: 50257 / 50400† (same tokenizer as GPT-2/GPT-3)
      • Positional Encoding: Rotary Position Embedding (RoPE)
      • RoPE Dimensions: 64
      • * Each layer consists of one feedforward block and one self-attention block.
      • † Although the embedding matrix has a size of 50400, only 50257 entries are used by the GPT-2 tokenizer.
    • The model consists of 28 layers with a model dimension of 4096, and a feedforward dimension of 16384. The model dimension is split into 16 heads, each with a dimension of 256. Rotary Position Embedding (RoPE) is applied to 64 dimensions of each head. The model is trained with a tokenization vocabulary of 50257, using the same set of BPEs as GPT-2/GPT-3.
    • Training data: GPT-J 6B was trained on the Pile, a large-scale curated dataset created by EleutherAI.
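
The hyperparameters quoted above can be checked against the published model configuration. The following is a minimal sketch, assuming the Hugging Face transformers library is installed and using the model id from the reference URL above; it fetches only the configuration and tokenizer files, not the full model weights.

```python
# Minimal sketch: verify the GPT-J-6B hyperparameters listed above.
# Assumes the Hugging Face `transformers` library and network access to the
# "EleutherAI/gpt-j-6B" repository referenced above.
from transformers import AutoConfig, AutoTokenizer

# Downloads only config.json, not the multi-gigabyte weight files.
config = AutoConfig.from_pretrained("EleutherAI/gpt-j-6B")

print(config.n_layer)                        # 28    (n_layers)
print(config.n_embd)                         # 4096  (d_model)
print(config.n_inner or 4 * config.n_embd)   # 16384 (d_ff; defaults to 4 * d_model)
print(config.n_head)                         # 16    (n_heads)
print(config.n_embd // config.n_head)        # 256   (d_head)
print(config.rotary_dim)                     # 64    (RoPE dimensions)
print(config.n_positions)                    # 2048  (n_ctx)
print(config.vocab_size)                     # 50400 (embedding rows; only 50257 are used)

# The tokenizer is the GPT-2/GPT-3 BPE tokenizer with 50257 base entries.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
print(tokenizer.vocab_size)                  # 50257
```

Loading the full model for generation works the same way (for example via AutoModelForCausalLM.from_pretrained), but requires downloading the complete weights, roughly 24 GB in fp32.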
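
The Rotary Position Embedding row in the table means that, within each 256-dimensional attention head, only the first 64 dimensions of the query and key vectors are rotated by a position-dependent angle; the remaining 192 dimensions are left untouched. The sketch below is a generic NumPy illustration of that idea under the stated assumptions (adjacent dimension pairs rotated, base frequency 10000), not GPT-J's actual implementation.

```python
# Generic illustration of Rotary Position Embedding (RoPE) applied to the first
# `rotary_dim` dimensions of a single attention head, as in the table above
# (d_head = 256, RoPE dimensions = 64). Assumption: adjacent dimension pairs are
# rotated with the standard base frequency of 10000; this is not GPT-J's code.
import numpy as np

def apply_rope(x: np.ndarray, rotary_dim: int = 64, base: float = 10000.0) -> np.ndarray:
    """x has shape (seq_len, head_dim); returns the same shape with RoPE applied."""
    seq_len, head_dim = x.shape
    x_rot, x_pass = x[:, :rotary_dim], x[:, rotary_dim:]

    # One frequency per rotated dimension pair; lower-index pairs rotate faster.
    inv_freq = 1.0 / (base ** (np.arange(0, rotary_dim, 2) / rotary_dim))
    angles = np.outer(np.arange(seq_len), inv_freq)   # (seq_len, rotary_dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)

    # Rotate each (even, odd) dimension pair by its position-dependent angle.
    x_even, x_odd = x_rot[:, 0::2], x_rot[:, 1::2]
    rotated = np.empty_like(x_rot)
    rotated[:, 0::2] = x_even * cos - x_odd * sin
    rotated[:, 1::2] = x_even * sin + x_odd * cos

    # The last head_dim - rotary_dim dimensions carry no positional information.
    return np.concatenate([rotated, x_pass], axis=-1)

# Example: queries for one head over a 16-token context (d_head = 256).
q = np.random.randn(16, 256)
q_rope = apply_rope(q)   # same shape; first 64 dims now encode token positions
```

Because the rotation is applied to both queries and keys, their dot product depends only on the relative distance between positions, which is what makes RoPE a relative positional encoding.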