Large-Scale Language Model (LLM)

A Large-Scale Language Model (LLM) is a neural language model that is a deep neural network architecture (trained on extensive text corpuses to generate and process human language).

Context:
- Model Input: LLM input tokens formatted as LLM-compatible token sequences.
- Model Output: LLM output tokens forming LLM-generated text.
- ...
- It can (typically) have over 10M LLM parameters organized in LLM neural layers.
- It can (typically) reference an LLM training dataset containing LLM corpus documents from diverse LLM training sources.
- It can (typically) reference an LLM architecture, such as a transformer LLM architecture or diffusion LLM architecture.
- It can (typically) utilize LLM attention mechanisms to capture LLM contextual relationships between LLM tokens.
- It can (typically) process LLM input context of varying LLM context length to maintain LLM semantic understanding.
- It can (typically) have LLM capabilities, such as:
  - LLM natural language understanding through LLM contextual processing.
  - LLM human-like text generation through LLM language modeling.
  - LLM few-shot learning for LLM task adaptation.
  - LLM contextual memory across LLM conversation turns.
  - LLM semantic relationship modeling between LLM language elements.
  - LLM syntactic pattern recognition in LLM language structures.
  - LLM knowledge representation through LLM parameter encoding.
  - LLM logical reasoning via LLM step-by-step processing.
- ...
- It can (often) have LLM features such as LLM sampling parameters that control LLM output characteristics.
- It can (often) generate LLM output through either LLM sequential token prediction (in LLM autoregressive models) or through LLM iterative refinement (in LLM diffusion-based models).
- It can (often) utilize either LLM causal attention (in LLM autoregressive models) or LLM bidirectional attention (in LLM diffusion-based models) for LLM context processing.
- It can (often) trade off between LLM inference speed and LLM output coherence depending on its LLM generation mechanism.
- It can (often) employ LLM parameter-efficient adaptation techniques like LLM prompt tuning or LLM prefix tuning for LLM task specialization.
- It can (often) exhibit LLM emergent capabilities that weren't explicitly LLM trained behaviors.
- It can (often) demonstrate LLM in-context learning through LLM prompt examples.
- It can (often) facilitate LLM transfer learning from LLM pretraining tasks to LLM downstream tasks.
- It can (often) support LLM multimodal integration with LLM non-textual inputs when properly designed.
- It can (often) incorporate LLM alignment techniques to improve LLM output safety and LLM output helpfulness.
- ...
- It can range from being a Base Pretrained LLM to being a Finetuned LLM to being a Reasoning-enhanced LLM, depending on its LLM training approach.
- It can range from being an All-Domain LLM to being a Domain-Specific LLM, depending on its LLM training data domain coverage.
- It can range from being a Short-Context LLM (<=16K) to being a Long-Context LLM (>16K), depending on its LLM context window size.
- It can range from being a Closed-Source LLM to being an Open-Source LLM, depending on its LLM licensing model.
- It can range from being a Unilingual LLM to being a Multilingual LLM, depending on its LLM language support coverage.
- It can range from being an Autoregressive LLM to being a Diffusion-based LLM, depending on its LLM text generation approach.
- It can range from being a Decoder-only LLM to being an Encoder-only LLM to being an Encoder-Decoder LLM, depending on its LLM architectural design.
- It can range from being a Dense LLM to being a Sparse LLM to being a Mixture-of-Experts LLM, depending on its LLM parameter activation pattern.
- It can range from being a Small LLM (<10B) to being a Medium-sized LLM (10B-100B) to being a Large-scale LLM (>100B), depending on its LLM parameter count.
- It can range from being a General-purpose LLM to being a Specialized LLM, depending on its LLM application domain.
- It can range from being a Consumer-grade LLM to being an Enterprise-grade LLM, depending on its LLM performance requirements.
- ...
- It can have LLM generation architectures including LLM autoregressive generation for LLM sequential token prediction or LLM diffusion-based generation for LLM iterative token refinement.
- It can belong to an LLM model family with shared LLM architectural principles.
- It can be an input to an LLM inference task running on LLM deployment infrastructure.
- It can be used by an LLM-based system (solving an LLM-based task) with appropriate LLM integration methods.
- It can be trained using LLM distributed training infrastructure across multiple LLM accelerator devices.
- It can incorporate LLM external knowledge through LLM retrieval-augmented generation techniques.
- It can serve as an LLM foundation model for LLM downstream applications.
- It can engage in LLM multi-turn conversations with LLM coherence maintenance across LLM dialogue context.
- It can be enhanced through LLM reinforcement learning from human feedback (LLM RLHF) using LLM preference data.
- It can be optimized using LLM quantization techniques for LLM deployment efficiency and LLM latency reduction.
- It can support LLM prompt engineering techniques to guide LLM response generation.
- It can implement LLM chain-of-thought reasoning to solve LLM complex problems step-by-step.
- It can exhibit LLM instruction-following behavior when trained with LLM instruction datasets.
- It can utilize LLM caching mechanisms to improve LLM response time for LLM repeated queryies.
- ...
Examples:
- LLM architecture types, such as:
  - Autoregressive LLMs, such as:
    - GPT-4 LLM for LLM sequential text generation with LLM causal attention.
    - LLaMA 3 LLM for LLM unidirectional context processing in LLM next-token prediction.
    - Claude 3 LLM for LLM left-to-right text completion with LLM decoder-only architecture.
    - Mistral 7B LLM for LLM efficient inference with LLM grouped-query attention.
  - Diffusion-based LLMs, such as:
    - LLaDA LLM for LLM parallel token generation with LLM bidirectional attention.
    - Mercury LLM for LLM high-speed inference using LLM iterative refinement.
    - DEEM LLM for LLM multi-modal content generation through LLM diffusion processes.
    - Blended Diffusion LLM for LLM latent space text manipulation with LLM denoising approach.
- LLM performance characteristics, such as:
  - LLM context length capabilities, such as:
    - Short-context LLMs (≤16K tokens), such as:
      - GPT-3.5 LLM with LLM 4K context window.
      - Mistral 7B LLM with LLM 8K base context window.
    - Medium-context LLMs (16K-64K tokens), such as:
      - Claude 3 Sonnet LLM with LLM 48K context window.
      - Gemini Pro LLM with LLM 32K context window.
    - Long-context LLMs (>64K tokens), such as:
      - Claude 3 Opus LLM with LLM 200K context window.
      - GPT-4 Turbo LLM with LLM 128K context window.
      - Anthropic Metamind LLM with LLM 1M context window.
  - LLM size classifications, such as:
    - Small LLMs (<10B parameters), such as:
    - Medium-sized LLMs (10B-100B parameters), such as:
      - Llama 2 70B LLM for LLM balanced performance.
      - Claude 3 Sonnet LLM for LLM commercial applications.
      - Gemini Pro LLM for LLM enterprise deployment.
    - Large-scale LLMs (>100B parameters), such as:
      - GPT-4 LLM for LLM state-of-the-art performance.
      - Claude 3 Opus LLM for LLM complex reasoning tasks.
      - PaLM 2 LLM for LLM multilingual capability.
- Commercial LLMs, such as:
  - Major tech company LLMs, such as:
    - OpenAI LLMs, such as:
      - GPT-4 LLM for LLM advanced reasoning.
      - GPT-4V LLM for LLM multimodal understanding.
      - GPT-4o LLM for LLM real-time interaction.
    - Google LLMs, such as:
      - PaLM 2 LLM for LLM multilingual processing.
      - Gemini Ultra LLM for LLM frontier research.
      - Gemini Pro LLM for LLM commercial applications.
      - Gemini Nano LLM for LLM on-device deployment.
    - Meta LLMs, such as:
      - Llama 3 8B LLM for LLM open research.
      - Llama 3 70B LLM for LLM enterprise deployment.
      - Llama 3 405B LLM for LLM frontier capability.
    - Anthropic LLMs, such as:
      - Claude 3 Haiku LLM for LLM high-speed inference.
      - Claude 3 Sonnet LLM for LLM balanced performance.
      - Claude 3 Opus LLM for LLM maximum capability.
  - Regional tech leader LLMs, such as:
    - Asia-Pacific LLMs, such as:
      - Wu Dao 2.0 LLM: Made-in-China LLM by Beijing Academy of AI.
      - HyperClova X LLM: Made-in-South Korea LLM by Naver Corp.
      - Ernie Bot LLM: Made-in-China LLM by Baidu.
      - DeepSeek LLMs: Made-in-China LLM by DeepSeek.
    - European LLMs, such as:
      - BLOOM LLM: Made-in-EU LLM by Hugging Face collaborative.
      - Falcon LLM: Made-in-UAE LLM by Technology Innovation Institute.
      - Mistral LLM: Made-in-France LLM by Mistral AI.
- Research LLMs, such as:
  - Academic institution LLMs, such as:
    - University research LLMs, such as:
      - Stanford LLMs, such as Alpaca LLM for LLM instruction following.
      - Berkeley LLMs, such as Vicuna LLM for LLM chat applications.
      - CMU LLMs, such as OLMo LLM for LLM open research ecosystem.
  - Open source LLMs, such as:
    - HuggingFace LLMs, such as:
      - BLOOM LLM for LLM multilingual processing.
      - T5 LLM for LLM transfer learning.
      - BART LLM for LLM sequence-to-sequence tasks.
    - EleutherAI LLMs, such as:
      - GPT-Neo LLM for LLM open research.
      - GPT-J LLM for LLM efficient training.
      - Pythia LLM for LLM interpretability research.
- Specialized LLMs, such as:
  - Domain-specific LLMs, such as:
    - Code LLMs, such as:
      - Codex LLM for LLM software development.
      - StarCoder LLM for LLM programming assistance.
      - CodeLlama LLM for LLM code generation and completion.
      - DeepSeek Coder LLM for LLM multilingual programming.
    - Scientific LLMs, such as:
      - Galactica LLM for LLM scientific research.
      - PubMedGPT LLM for LLM biomedical analysis.
      - Med-PaLM LLM for LLM medical domain understanding.
      - BioGPT LLM for LLM biological text processing.
    - Legal LLMs, such as:
      - LexGPT LLM for LLM legal document analysis.
      - Doctrine LLM for LLM case law processing.
      - LawGPT LLM for LLM legal research assistance.
  - Architecture-specific LLMs, such as:
    - Encoder-only LLMs, such as:
      - BERT LLM for LLM bidirectional encoding.
      - RoBERTa LLM for LLM robust preprocessing.
      - DeBERTa LLM for LLM disentangled attention.
    - Encoder-decoder LLMs, such as:
      - T5 LLM for LLM text-to-text transfer.
      - BART LLM for LLM sequence transformation.
      - PEGASUS LLM for LLM abstractive summarization.
    - Hybrid architecture LLMs, such as:
      - Block-wise semi-autoregressive LLM for LLM balanced coherence and speed.
      - Combined diffusion-autoregressive LLM for LLM optimized generation.
- Efficiency-optimized LLMs, such as:
  - Distilled LLMs, such as:
  - Quantized LLMs, such as:
  - Sparse activation LLMs, such as:
- Reasoning-focused LLMs, such as:
  - Chain-of-thought LLMs, such as:
  - Verification-based LLMs, such as:
  - Tool-using LLMs, such as:
- ...
Counter-Examples:
- A visual perception model, which primarily processes image data rather than language token sequences and lacks LLM generation capabilities for human language.
- A multi-modal foundation model, which requires multiple input modality types beyond just text and uses different architectural components for each modality.
- A shallow neural language model, which has insufficient parameter count and architectural depth to qualify as large and cannot capture complex language patterns.
- A rule-based text generator, which uses symbolic rules and explicit grammars rather than neural architectures and lacks LLM learning capabilities.
- A traditional n-gram model, which lacks the deep learning capabilities of an LLM and can only model local word relationships without long-range dependency understanding.
- A retrieval-only language system, which merely retrieves existing text without generative capability and cannot produce novel responses to unseen queryies.
- A statistical machine translation system, which specifically converts text between languages rather than performing general text generation across diverse language tasks.
See: Multi-lingual neural network-based language model (NLM), LLM-based task, Diffusion-based large language model, Autoregressive language model, Transformer architecture, Foundation model, Instruction-tuned LLM, Reinforcement learning from human feedback (RLHF), Neural language model, LLM tokenization, LLM inference optimization, LLM evaluation benchmark, LLM fine-tuning technique, Prompt engineering, LLM alignment, Mixture of experts architecture.

References

2023

(Wikipedia, 2023) ⇒ https://en.wikipedia.org/wiki/Large_language_model#List Retrieved:2023-3-19.

List

Name	Release dateTemplate:Efn	Developer	Number of parametersTemplate:Efn	Corpus size	Training cost (petaFLOP-day)	LicenseTemplate:Efn	Notes
BERT	Template:Dts	Google	Template:Sort^[1]	Template:Sort words^[1]	Template:Sort^[2]	Apache 2.0^[3]	An early and influential language model,^[4] but encoder-only and thus not built to be prompted or generative^[5]
XLNet	Template:Dts	Google	Template:Sort^[6]	Template:Sort words			An alternative to BERT; designed as encoder-only^[7]^[8]
GPT-2	Template:Dts	OpenAI	Template:Sort^[9]	40GB^[10] (~Template:Sort tokens)^[11]		MIT^[12]	general-purpose model based on transformer architecture
GPT-3	Template:Dts	OpenAI	Template:Sort^[13]	Template:Sort tokens^[11]	3640^[14]	proprietary	A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022.^[15]
GPT-Neo	Template:Dts	EleutherAI	Template:Sort^[16]	825 GiB^[17]		MIT^[18]	The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3.^[18]
GPT-J	Template:Dts	EleutherAI	Template:Sort^[19]	825 GiB^[17]	200^[20]	Apache 2.0	GPT-3-style language model
Megatron-Turing NLG	Template:Dts^[21]	Microsoft and Nvidia	Template:Sort^[22]	Template:Sort tokens^[22]		Restricted web access	Standard architecture but trained on a supercomputing cluster.
Ernie 3.0 Titan	Template:Dts	Baidu	Template:Sort^[23]	4 Tb		Proprietary	Chinese-language LLM. Ernie Bot is based on this model.
Claude^[24]	Template:Dts	Anthropic	Template:Sort^[25]	Template:Sort tokens^[25]		Template:Partial success	Fine-tuned for desirable behavior in conversations.^[26]
GLaM (Generalist Language Model)	Template:Dts	Google	Template:Sort^[27]	Template:Sort tokens^[27]	5600^[27]	Proprietary	Sparse mixture-of-experts model, making it more expensive to train but cheaper to run inference compared to GPT-3.
Gopher	Template:Dts	DeepMind	Template:Sort^[28]	Template:Sort tokens^[29]	5833^[30]	Proprietary
LaMDA (Language Models for Dialog Applications)	Template:Dts	Google	Template:Sort^[31]	1.56T words,^[31] Template:Sort tokens^[29]	4110^[32]	Proprietary	Specialized for response generation in conversations.
GPT-NeoX	Template:Dts	EleutherAI	Template:Sort^[33]	825 GiB^[17]	740^[20]	Apache 2.0	based on the Megatron architecture
Chinchilla	Template:Dts	DeepMind	Template:Sort^[34]	Template:Sort tokens^[34]^[29]	6805^[30]	Proprietary	Reduced-parameter model trained on more data. Used in the Sparrow bot.
PaLM (Pathways Language Model)	Template:Dts	Google	Template:Sort^[35]	Template:Sort tokens^[34]	29250^[30]	Proprietary	aimed to reach the practical limits of model scale
OPT (Open Pretrained Transformer)	Template:Dts	Meta	Template:Sort^[36]	Template:Sort tokens^[37]	310^[20]	Template:Partial success Template:Efn	GPT-3 architecture with some adaptations from Megatron
YaLM 100B	Template:Dts	Yandex	Template:Sort^[38]	1.7TB^[38]		Apache 2.0	English-Russian model based on Microsoft's Megatron-LM.
Minerva	Template:Dts	Google	Template:Sort^[39]	38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server^[39]		Proprietary	LLM trained for solving "mathematical and scientific questions using step-by-step reasoning".^[40] Minerva is based on PaLM model, further trained on mathematical and scientific data.
BLOOM	Template:Dts	Large collaboration led by Hugging Face	Template:Sort^[41]	Template:Sort tokens (1.6TB)^[42]		Responsible AI	Essentially GPT-3 but trained on a multi-lingual corpus (30% English excluding programming languages)
Galactica	Template:Dts	Meta	Template:Sort	Template:Sort tokens^[43]	unknown	Template:Partial success	Trained on scientific text and modalities.
AlexaTM (Teacher Models)	Template:Dts	Amazon	Template:Sort^[44]	Template:Sort^[45]		proprietary^[46]	bidirectional sequence-to-sequence architecture
LLaMA (Large Language Model Meta AI)	Template:Dts	Meta	Template:Sort^[47]	Template:Sort^[47]	6300^[48]	Template:Partial success Template:Efn	Trained on a large 20-language corpus to aim for better performance with fewer parameters.^[47] Researchers from Stanford University trained a fine-tuned model based on LLaMA weights, called Alpaca.^[49]
GPT-4	Template:Dts	OpenAI	Exact number unknownTemplate:Efn	Unknown	Unknown	proprietary	Available for ChatGPT Plus users and used in several products.
Cerebras-GPT	Template:Dts	Cerebras	Template:Sort^[50]		270^[20]	Apache 2.0	Trained with Chinchilla formula.
Falcon	Template:Dts	Technology Innovation Institute	Template:Sort^[51]	1 trillion tokens, from RefinedWeb (filtered web text corpus)^[52] plus some "curated corpora".^[53]	2800^[48]	Apache 2.0^[54]	Training cost around 2700 petaFLOP-days, 75% that of GPT-3.
BloombergGPT	Template:Dts	Bloomberg L.P.	Template:Sort	363 billion token dataset based on Bloomberg's data sources, plus 345 billion tokens from general purpose datasets^[55]		Proprietary	LLM trained on financial data from proprietary sources, that "outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks"
PanGu-Σ	Template:Dts	Huawei	Template:Sort	329 billion tokens^[56]		Proprietary
OpenAssistant^[57]	Template:Dts	LAION	Template:Sort	1.5 trillion tokens		Apache 2.0	Trained on crowdsourced open data
Jurassic-2^[58]	Template:Dts	AI21 Labs	Exact size unknown	Unknown		Proprietary	Multilingual^[59]
PaLM 2 (Pathways Language Model 2)	Template:Dts	Google	Template:Sort^[60]	Template:Sort tokens^[60]	85000^[48]	Proprietary	Used in Bard chatbot.^[61]
Llama 2	Template:Dts	Meta	Template:Sort^[62]	Template:Sort tokens^[62]		Template:Partial success	Successor of LLaMA.

2022

(Zhou et al., 2022) ⇒ Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. (2022). “Large Language Models Are Human-level Prompt Engineers.” In: arXiv preprint arXiv:2211.01910.
- QUOTE: ... By conditioning on natural language instructions, large language models (LLMs) have displayed impressive capabilities as general-purpose computers. However, task performance depends significantly on the quality of the prompt used to steer the model, and most effective prompts have been handcrafted by humans. ...

2020

(Liu et al., 2020) ⇒ Xiaodong Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon, and Jianfeng Gao. (2020). “Adversarial Training for Large Neural Language Models.” arXiv preprint arXiv:2004.08994
- QUOTE: … Pre-training a large neural language model such as BERT has proven effective to improve generalization performance in task-specific fine-tuning (Devlin et al.…

↑ ^1.0 ^1.1 Cite error: Invalid <ref> tag; no text was provided for refs named bert-paper
↑ Prickett, Nicole Hemsoth (2021-08-24). "Cerebras Shifts Architecture To Meet Massive AI/ML Models" (in en-US). https://www.nextplatform.com/2021/08/24/cerebras-shifts-architecture-to-meet-massive-ai-ml-models/. Retrieved 2023-06-20.
↑ "BERT". March 13, 2023. https://github.com/google-research/bert.
↑ Cite error: Invalid <ref> tag; no text was provided for refs named Manning-2022
↑ Template:Cite arXiv
↑ "BERT, RoBERTa, DistilBERT, XLNet: Which one to use?". https://www.kdnuggets.com/bert-roberta-distilbert-xlnet-which-one-to-use.html.
↑ Naik, Amit Raja (September 23, 2021). "Google Introduces New Architecture To Reduce Cost Of Transformers". https://analyticsindiamag.com/google-introduces-new-architecture-to-reduce-cost-of-transformers/.
↑ Template:Cite arXiv
↑ Cite error: Invalid <ref> tag; no text was provided for refs named 15Brelease
↑ "Better language models and their implications". https://openai.com/research/better-language-models.
↑ ^11.0 ^11.1 "OpenAI's GPT-3 Language Model: A Technical Overview" (in en). 3 June 2020. https://lambdalabs.com/blog/demystifying-gpt-3.
↑ "gpt-2". GitHub. https://github.com/openai/gpt-2. Retrieved 13 March 2023.
↑ Cite error: Invalid <ref> tag; no text was provided for refs named Wiggers
↑ Table D.1 in Template:Cite arXiv
↑ Cite error: Invalid <ref> tag; no text was provided for refs named chatgpt-blog
↑ "GPT Neo". March 15, 2023. https://github.com/EleutherAI/gpt-neo.
↑ ^17.0 ^17.1 ^17.2 Template:Cite arXiv
↑ ^18.0 ^18.1 Cite error: Invalid <ref> tag; no text was provided for refs named vb-gpt-neo
↑ "GPT-J-6B: An Introduction to the Largest Open Source GPT Model | Forefront" (in en). https://www.forefront.ai/blog-posts/gpt-j-6b-an-introduction-to-the-largest-open-sourced-gpt-model. Retrieved 2023-02-28.
↑ ^20.0 ^20.1 ^20.2 ^20.3 Template:Cite arXiv
↑ Alvi, Ali; Kharya, Paresh (11 October 2021). "Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World's Largest and Most Powerful Generative Language Model". https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/.
↑ ^22.0 ^22.1 Cite error: Invalid <ref> tag; no text was provided for refs named mtnlg-preprint
↑ Template:Cite arXiv
↑ "Product" (in en). https://www.anthropic.com/product. Retrieved 14 March 2023.
↑ ^25.0 ^25.1 Template:Cite arXiv
↑ Template:Cite arXiv
↑ ^27.0 ^27.1 ^27.2 Cite error: Invalid <ref> tag; no text was provided for refs named glam-blog
↑ "Language modelling at scale: Gopher, ethical considerations, and retrieval" (in en). https://www.deepmind.com/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval. Retrieved 20 March 2023.
↑ ^29.0 ^29.1 ^29.2 Template:Cite arXiv
↑ ^30.0 ^30.1 ^30.2 Table 20 of PaLM: Scaling Language Modeling with Pathways
↑ ^31.0 ^31.1 Cite error: Invalid <ref> tag; no text was provided for refs named lamda-blog
↑ Template:Cite arXiv
↑ Template:Cite conference
↑ ^34.0 ^34.1 ^34.2 Cite error: Invalid <ref> tag; no text was provided for refs named chinchilla-blog
↑ Cite error: Invalid <ref> tag; no text was provided for refs named palm-blog
↑ "Democratizing access to large-scale language models with OPT-175B" (in en). https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/.
↑ Template:Cite arXiv
↑ ^38.0 ^38.1 Template:Citation
↑ ^39.0 ^39.1 Template:Cite arXiv
↑ "Minerva: Solving Quantitative Reasoning Problems with Language Models" (in en). 30 June 2022. https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html. Retrieved 20 March 2023.
↑ Ananthaswamy, Anil (8 March 2023). "In AI, is bigger always better?". Nature 615 (7951): 202–205. Bibcode 2023Natur.615..202A. doi:10.1038/d41586-023-00641-w. PMID 36890378. https://www.nature.com/articles/d41586-023-00641-w.
↑ "bigscience/bloom · Hugging Face". https://huggingface.co/bigscience/bloom.
↑ Template:Cite arXiv
↑ "20B-parameter Alexa model sets new marks in few-shot learning" (in en). 2 August 2022. https://www.amazon.science/blog/20b-parameter-alexa-model-sets-new-marks-in-few-shot-learning.
↑ Template:Cite arXiv
↑ "AlexaTM 20B is now available in Amazon SageMaker JumpStart | AWS Machine Learning Blog". 17 November 2022. https://aws.amazon.com/blogs/machine-learning/alexatm-20b-is-now-available-in-amazon-sagemaker-jumpstart/. Retrieved 13 March 2023.
↑ ^47.0 ^47.1 ^47.2 Cite error: Invalid <ref> tag; no text was provided for refs named llama-blog
↑ ^48.0 ^48.1 ^48.2 "The Falcon has landed in the Hugging Face ecosystem". https://huggingface.co/blog/falcon. Retrieved 2023-06-20.
↑ "Stanford CRFM". https://crfm.stanford.edu/2023/03/13/alpaca.html.
↑ Dey, Nolan (March 28, 2023). "Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models". https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/.
↑ "Abu Dhabi-based TII launches its own version of ChatGPT". https://fastcompanyme.com/news/abu-dhabi-based-tii-launches-its-own-version-of-chatgpt/.
↑ Template:Cite arXiv
↑ "tiiuae/falcon-40b · Hugging Face". 2023-06-09. https://huggingface.co/tiiuae/falcon-40b. Retrieved 2023-06-20.
↑ UAE’s Falcon 40B, World’s Top-Ranked AI Model from Technology Innovation Institute, is Now Royalty-Free, 31 May 2023
↑ Template:Cite arXiv
↑ Template:Cite arXiv
↑ Template:Cite arXiv
↑ Wrobel, Sharon. "Tel Aviv startup rolls out new advanced AI language model to rival OpenAI" (in en-US). https://www.timesofisrael.com/ai21-labs-rolls-out-new-advanced-ai-language-model-to-rival-openai/. Retrieved 2023-07-24.
↑ Wiggers, Kyle (2023-04-13). "With Bedrock, Amazon enters the generative AI race" (in en-US). https://techcrunch.com/2023/04/13/with-bedrock-amazon-enters-the-generative-ai-race/. Retrieved 2023-07-24.
↑ ^60.0 ^60.1 Elias, Jennifer (16 May 2023). "Google's newest A.I. model uses nearly five times more text data for training than its predecessor". CNBC. https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predecessor.html. Retrieved 18 May 2023.
↑ "Introducing PaLM 2". May 10, 2023. https://blog.google/technology/ai/google-palm-2-ai-large-language-model/.
↑ ^62.0 ^62.1 "Introducing Llama 2: The Next Generation of Our Open Source Large Language Model" (in en). 2023. https://ai.meta.com/llama/. Retrieved 2023-07-19.

[bert-paper-1] 1.0 ^1.1 Cite error: Invalid <ref> tag; no text was provided for refs named bert-paper

[bHZJ2-2] Prickett, Nicole Hemsoth (2021-08-24). "Cerebras Shifts Architecture To Meet Massive AI/ML Models" (in en-US). https://www.nextplatform.com/2021/08/24/cerebras-shifts-architecture-to-meet-massive-ai-ml-models/. Retrieved 2023-06-20.

[bert-web-3] "BERT". March 13, 2023. https://github.com/google-research/bert.

[Manning-2022-4] Cite error: Invalid <ref> tag; no text was provided for refs named Manning-2022

[Ir545-5] Template:Cite arXiv

[45rAm-6] "BERT, RoBERTa, DistilBERT, XLNet: Which one to use?". https://www.kdnuggets.com/bert-roberta-distilbert-xlnet-which-one-to-use.html.

[gAbNO-7] Naik, Amit Raja (September 23, 2021). "Google Introduces New Architecture To Reduce Cost Of Transformers". https://analyticsindiamag.com/google-introduces-new-architecture-to-reduce-cost-of-transformers/.

[LX3rI-8] Template:Cite arXiv

[15Brelease-9] Cite error: Invalid <ref> tag; no text was provided for refs named 15Brelease

[5T8u5-10] "Better language models and their implications". https://openai.com/research/better-language-models.

[LambdaLabs-11] 11.0 ^11.1 "OpenAI's GPT-3 Language Model: A Technical Overview" (in en). 3 June 2020. https://lambdalabs.com/blog/demystifying-gpt-3.

[Sudbe-12] "gpt-2". GitHub. https://github.com/openai/gpt-2. Retrieved 13 March 2023.

[Wiggers-13] Cite error: Invalid <ref> tag; no text was provided for refs named Wiggers

[:2-14] Table D.1 in Template:Cite arXiv

[chatgpt-blog-15] Cite error: Invalid <ref> tag; no text was provided for refs named chatgpt-blog

[gpt-neo-16] "GPT Neo". March 15, 2023. https://github.com/EleutherAI/gpt-neo.

[Pile-17] 17.0 ^17.1 ^17.2 Template:Cite arXiv

[vb-gpt-neo-18] 18.0 ^18.1 Cite error: Invalid <ref> tag; no text was provided for refs named vb-gpt-neo

[JxohJ-19] "GPT-J-6B: An Introduction to the Largest Open Source GPT Model | Forefront" (in en). https://www.forefront.ai/blog-posts/gpt-j-6b-an-introduction-to-the-largest-open-sourced-gpt-model. Retrieved 2023-02-28.

[:3-20] 20.0 ^20.1 ^20.2 ^20.3 Template:Cite arXiv

[BwnW5-21] Alvi, Ali; Kharya, Paresh (11 October 2021). "Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World's Largest and Most Powerful Generative Language Model". https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/.

[mtnlg-preprint-22] 22.0 ^22.1 Cite error: Invalid <ref> tag; no text was provided for refs named mtnlg-preprint

[qeOB8-23] Template:Cite arXiv

[i8jc4-24] "Product" (in en). https://www.anthropic.com/product. Retrieved 14 March 2023.

[AnthroArch-25] 25.0 ^25.1 Template:Cite arXiv

[RZqhw-26] Template:Cite arXiv

[glam-blog-27] 27.0 ^27.1 ^27.2 Cite error: Invalid <ref> tag; no text was provided for refs named glam-blog

[mD5eE-28] "Language modelling at scale: Gopher, ethical considerations, and retrieval" (in en). https://www.deepmind.com/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval. Retrieved 20 March 2023.

[hoffman-29] 29.0 ^29.1 ^29.2 Template:Cite arXiv

[:4-30] 30.0 ^30.1 ^30.2 Table 20 of PaLM: Scaling Language Modeling with Pathways

[lamda-blog-31] 31.0 ^31.1 Cite error: Invalid <ref> tag; no text was provided for refs named lamda-blog

[DMs9Z-32] Template:Cite arXiv

[gpt-neox-20b-33] Template:Cite conference

[chinchilla-blog-34] 34.0 ^34.1 ^34.2 Cite error: Invalid <ref> tag; no text was provided for refs named chinchilla-blog

[palm-blog-35] Cite error: Invalid <ref> tag; no text was provided for refs named palm-blog

[jlof8-36] "Democratizing access to large-scale language models with OPT-175B" (in en). https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/.

[QjTIc-37] Template:Cite arXiv

[yalm-repo-38] 38.0 ^38.1 Template:Citation

[minerva-paper-39] 39.0 ^39.1 Template:Cite arXiv

[FfCNK-40] "Minerva: Solving Quantitative Reasoning Problems with Language Models" (in en). 30 June 2022. https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html. Retrieved 20 March 2023.

[bigger-better-41] Ananthaswamy, Anil (8 March 2023). "In AI, is bigger always better?". Nature 615 (7951): 202–205. Bibcode 2023Natur.615..202A. doi:10.1038/d41586-023-00641-w. PMID 36890378. https://www.nature.com/articles/d41586-023-00641-w.

[B8wB2-42] "bigscience/bloom · Hugging Face". https://huggingface.co/bigscience/bloom.

[37sY6-43] Template:Cite arXiv

[u5szh-44] "20B-parameter Alexa model sets new marks in few-shot learning" (in en). 2 August 2022. https://www.amazon.science/blog/20b-parameter-alexa-model-sets-new-marks-in-few-shot-learning.

[HaA7l-45] Template:Cite arXiv

[rpehM-46] "AlexaTM 20B is now available in Amazon SageMaker JumpStart | AWS Machine Learning Blog". 17 November 2022. https://aws.amazon.com/blogs/machine-learning/alexatm-20b-is-now-available-in-amazon-sagemaker-jumpstart/. Retrieved 13 March 2023.

[llama-blog-47] 47.0 ^47.1 ^47.2 Cite error: Invalid <ref> tag; no text was provided for refs named llama-blog

[:5-48] 48.0 ^48.1 ^48.2 "The Falcon has landed in the Hugging Face ecosystem". https://huggingface.co/blog/falcon. Retrieved 2023-06-20.

[KBedq-49] "Stanford CRFM". https://crfm.stanford.edu/2023/03/13/alpaca.html.

[D0k2a-50] Dey, Nolan (March 28, 2023). "Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models". https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/.

[falcon-51] "Abu Dhabi-based TII launches its own version of ChatGPT". https://fastcompanyme.com/news/abu-dhabi-based-tii-launches-its-own-version-of-chatgpt/.

[Xb1gq-52] Template:Cite arXiv

[gzTNw-53] "tiiuae/falcon-40b · Hugging Face". 2023-06-09. https://huggingface.co/tiiuae/falcon-40b. Retrieved 2023-06-20.

[Wmlcs-54] UAE’s Falcon 40B, World’s Top-Ranked AI Model from Technology Innovation Institute, is Now Royalty-Free, 31 May 2023

[nGOSu-55] Template:Cite arXiv

[9WSFw-56] Template:Cite arXiv

[JiOl8-57] Template:Cite arXiv

[58] Wrobel, Sharon. "Tel Aviv startup rolls out new advanced AI language model to rival OpenAI" (in en-US). https://www.timesofisrael.com/ai21-labs-rolls-out-new-advanced-ai-language-model-to-rival-openai/. Retrieved 2023-07-24.

[59] Wiggers, Kyle (2023-04-13). "With Bedrock, Amazon enters the generative AI race" (in en-US). https://techcrunch.com/2023/04/13/with-bedrock-amazon-enters-the-generative-ai-race/. Retrieved 2023-07-24.

[cnbc-20230516-60] 60.0 ^60.1 Elias, Jennifer (16 May 2023). "Google's newest A.I. model uses nearly five times more text data for training than its predecessor". CNBC. https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predecessor.html. Retrieved 18 May 2023.

[pWyLA-61] "Introducing PaLM 2". May 10, 2023. https://blog.google/technology/ai/google-palm-2-ai-large-language-model/.

[meta-20230719-62] 62.0 ^62.1 "Introducing Llama 2: The Next Generation of Our Open Source Large Language Model" (in en). 2023. https://ai.meta.com/llama/. Retrieved 2023-07-19.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

[24]

[25]

[26]

[27]

[28]

[29]

[30]

[31]

[32]

[33]

[34]

[35]

[36]

[37]

[38]

[39]

[40]

[41]

[42]

[43]

[44]

[45]

[46]

[47]

[48]

[49]

[50]

[51]

[52]

[53]

[54]

[55]

[56]

[57]

[58]

[59]

[60]

[61]

[62]