2024 TheLlama3HerdofModels
- (AI@Meta Llama Team, 2024) ⇒ AI@Meta Llama Team. (2024). “The Llama 3 Herd of Models.” In: Meta AI Research.
Subject Headings: Llama LLM, Llama 3.1 LLM.
Notes
Cited By
Quotes
Abstract
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.
1. Introduction
- NOTE: Introduces Llama 3, a new family of foundation models for language, including variants with 8B, 70B, and 405B parameters.
2. General Overview
- NOTE: Outlines the two-stage approach of pre-training and post-training, and introduces the compositional approach for multimodal capabilities.
2.1. Language Model Pre-training
2.2. Language Model Post-training
3. Pre-Training
3.1. Pre-Training Data
- NOTE: Describes the approach to training data curation, including web data filtering, de-duplication, resampling, and quality control measures. Also discusses how the data mix was determined and the use of annealing data to improve model performance.
- NOTE: The pre-training data mix consists of roughly 50% general knowledge, 25% mathematical and reasoning data, 17% code data, and 8% multilingual data.
- NOTE: Rigorous quality control measures include de-duplication at URL, document, and line levels, along with heuristic filtering to remove low-quality and repetitive content.
- NOTE: The data curation process includes domain-specific pipelines for extracting high-quality code and reasoning data, ensuring relevant and specialized training material.
- NOTE: Multilingual data processing uses a fasttext-based language identification model and language-specific heuristics to categorize and filter documents across 176 languages (a brief identification sketch follows these notes).
- NOTE: Annealing on high-quality code and mathematical data during the final stages of pre-training boosts model performance on key benchmarks, highlighting the importance of targeted data augmentation.
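A brief sketch of fasttext-based language identification as used in the multilingual pipeline above, assuming the publicly released lid.176.bin model; the confidence threshold and file path are illustrative, and the paper's language-specific heuristics are not reproduced.

```python
import fasttext

# Publicly released fastText language-identification model covering 176 languages.
lid_model = fasttext.load_model("lid.176.bin")

def detect_language(document: str, min_confidence: float = 0.65):
    # fastText predicts on a single line of text, so strip newlines first.
    labels, probs = lid_model.predict(document.replace("\n", " "), k=1)
    lang = labels[0].replace("__label__", "")   # e.g. "en", "de", "hi"
    # Keep the document only if the classifier is reasonably confident.
    return lang if probs[0] >= min_confidence else None
```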
3.2. Model Architecture
- NOTE: Outlines the Llama 3 architecture, a standard dense Transformer with modifications such as Grouped Query Attention (GQA), and discusses the scaling laws used to determine the optimal model size for the 405B-parameter model.
- NOTE: Grouped Query Attention (GQA) algorithm:
  Input: Query, Key, and Value tensors; number of query heads and key-value heads
  Output: Attention output tensor
  - Split queries into more heads than keys and values
  - For each query head:
    - Match it with a key-value head (cycling if necessary)
    - Compute attention scores and output
  - Concatenate outputs from all heads
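A minimal PyTorch sketch of grouped query attention following the steps above; the shapes, head counts, and repeat-based key-value sharing are illustrative rather than the Llama 3 implementation.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_q_heads, n_kv_heads):
    # q: (batch, seq, n_q_heads, head_dim); k, v: (batch, seq, n_kv_heads, head_dim)
    group_size = n_q_heads // n_kv_heads
    # Repeat each key-value head so every group of query heads shares one KV head.
    k = k.repeat_interleave(group_size, dim=2)
    v = v.repeat_interleave(group_size, dim=2)
    # Move heads next to the batch dimension: (batch, heads, seq, head_dim).
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    out = F.softmax(scores, dim=-1) @ v
    # Concatenate head outputs by restoring the (batch, seq, heads, head_dim) layout.
    return out.transpose(1, 2)

# Toy usage: 8 query heads share 2 key-value heads.
q, k, v = torch.randn(1, 16, 8, 64), torch.randn(1, 16, 2, 64), torch.randn(1, 16, 2, 64)
out = grouped_query_attention(q, k, v, n_q_heads=8, n_kv_heads=2)
```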
3.3. Infrastructure, Scaling, and Efficiency
- NOTE: Details the AI infrastructure used for Llama 3 training, including GPU utilization, storage solutions, and networking optimizations. Also discusses the parallelism techniques used to scale training: tensor parallelism, pipeline parallelism, context parallelism, and data parallelism.
- NOTE: Pipeline parallelism for training and inference algorithm:
  Input: Model, input data, number of pipeline stages
  Output: Model outputs and gradients
  - Divide model into stages across devices
  - For each microbatch:
    - Forward pass through stages sequentially
    - Pass output of each stage to next device
  - Collect final outputs from last stage
  - Perform backward pass in reverse order
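A single-process PyTorch sketch of the microbatched pipeline schedule above; in real training each stage sits on its own device and activations are communicated between devices, and Llama 3 uses a more sophisticated interleaved schedule. The four-stage toy model is illustrative.

```python
import torch
import torch.nn as nn

# Toy model split into 4 pipeline stages (each would live on its own GPU).
stages = nn.ModuleList([nn.Linear(32, 32) for _ in range(4)])

def pipeline_forward(batch, n_microbatches=4):
    outputs = []
    for micro in batch.chunk(n_microbatches):   # split the batch into microbatches
        x = micro
        for stage in stages:                    # forward through stages sequentially;
            x = stage(x)                        # each output is handed to the next device
        outputs.append(x)
    return torch.cat(outputs)                   # collect outputs from the last stage

loss = pipeline_forward(torch.randn(16, 32)).pow(2).mean()
loss.backward()                                 # backward runs through the stages in reverse
```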
- NOTE: Tensor parallelism algorithm:
  Input: Large tensor operation, number of devices
  Output: Result of the distributed tensor operation
  - Split large tensors across multiple devices
  - Perform local computations on each device
  - Synchronize results across devices (e.g., all-reduce)
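A toy single-process sketch of tensor parallelism: the weight matrix of one layer is split column-wise across two simulated devices, each computes a local matmul, and the shards are gathered (an all-gather or all-reduce in a real multi-GPU setup). Sizes are illustrative.

```python
import torch

x = torch.randn(4, 8)            # activations replicated on every device
w = torch.randn(8, 16)           # full weight matrix of one layer
w_shards = w.chunk(2, dim=1)     # each device holds half of the output columns

partials = [x @ shard for shard in w_shards]   # local computation on each device
y = torch.cat(partials, dim=1)                 # gather the shards; matches the full matmul
assert torch.allclose(y, x @ w, atol=1e-5)
```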
- NOTE: Context parallelism algorithm:
  Input: Long input sequence, number of devices
  Output: Processed sequence with full context
  - Divide input sequence into chunks
  - Process chunks on different devices in parallel
  - Synchronize for operations requiring full context
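A toy single-process sketch of context parallelism: the long sequence is divided into chunks that would be processed on different devices in parallel, then gathered before an operation that needs the full context. The modules and sizes are placeholders, not Llama 3 layers.

```python
import torch
import torch.nn as nn

embed = nn.Linear(16, 16)                        # per-token work that needs no cross-chunk context
attn = nn.MultiheadAttention(16, 4, batch_first=True)

seq = torch.randn(1, 1024, 16)                   # one long input sequence
chunks = seq.chunk(4, dim=1)                     # each device gets a contiguous chunk

local = [embed(c) for c in chunks]               # chunks are processed independently in parallel
full = torch.cat(local, dim=1)                   # synchronize (gather) before full-context attention
out, _ = attn(full, full, full)                  # attention sees the entire sequence
```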
- NOTE: Fully sharded data parallelism (FSDP) algorithm:
  Input: Model, batch of data, number of devices
  Output: Updated model parameters
  - Shard model parameters, gradients, and optimizer states
  - During forward pass:
    - Gather full parameters for current layer
    - Compute and release immediately after use
  - In backward pass:
    - Recompute forward activations as needed
    - Accumulate gradients locally
    - Synchronize gradients across all devices
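A conceptual single-process sketch of fully sharded data parallelism with two simulated devices: each owns half of a parameter vector, gathers the full parameters for compute, and averages gradients before updating only its own shard. Real FSDP shards per layer and overlaps communication; this only shows the shape of the idea.

```python
import torch

full_w = torch.randn(8)
shards = list(full_w.chunk(2))                    # each device stores only its parameter shard

def local_step(local_batch):
    w = torch.cat(shards).requires_grad_(True)    # gather the full parameters for compute
    loss = (local_batch * w).sum()
    loss.backward()                               # local gradient from this device's batch
    return w.grad

grads = [local_step(torch.randn(8)) for _ in range(2)]   # one backward per simulated device
avg_grad = torch.stack(grads).mean(0)                    # all-reduce: average gradients across devices
shards = [s - 0.1 * g for s, g in zip(shards, avg_grad.chunk(2))]  # each device updates its own shard
```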
3.4. Training Recipe
- NOTE: Describes the three-stage training recipe for Llama 3: initial pre-training, long-context pre-training, and annealing. Includes details on learning rate schedules, batch sizes, and context length increases throughout the training process.
4. Post-Training
- NOTE: Describes the post-training process, including modeling approaches, data preparation, and development of specific capabilities.
4.1. Modeling
- NOTE: Direct Preference Optimization (DPO) algorithm:
  Input: Model, preferred and non-preferred response pairs
  Output: Aligned model
  - Given: preferred and non-preferred responses
  - Maximize likelihood ratio of preferred over non-preferred
  - Update model to align with human preferences
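A minimal PyTorch sketch of a DPO-style preference loss over one preference pair; the log-probabilities are illustrative scalars and beta is a hypothetical temperature, not the paper's hyperparameters.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit reward: beta * (policy log-prob - frozen reference log-prob).
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Push the preferred response's reward above the non-preferred one's.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```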
- NOTE: Rejection sampling algorithm:
  Input: Prompt, language model, reward model
  Output: High-quality response
  - Generate multiple responses for a prompt
  - Score responses using a reward model
  - Select highest-scoring response
  - Add selected response to training data
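A schematic sketch of rejection sampling for post-training data; generate() and reward_model() are hypothetical stand-ins for the paper's generation and reward-model stack.

```python
def rejection_sample(prompt, generate, reward_model, n_samples=8):
    candidates = [generate(prompt) for _ in range(n_samples)]   # sample several responses
    scores = [reward_model(prompt, c) for c in candidates]      # score each with a reward model
    best = candidates[scores.index(max(scores))]                # keep the highest-scoring response
    return {"prompt": prompt, "response": best}                 # add the pair to the training data
```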
- NOTE: Monte Carlo Tree Search (MCTS) algorithm:
  Input: Problem state, computation budget
  Output: Best action or reasoning step
  - Start with root problem state
  - While within computation budget:
    - Select: Traverse tree based on UCB scores
    - Expand: Add new node if not terminal
    - Simulate: Random rollout to end state
    - Backpropagate: Update node statistics
  - Choose best action based on visit counts
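A compact, generic MCTS sketch covering the four phases above; the env object with legal_actions/step/is_terminal/reward is a hypothetical interface, and this is illustrative rather than the search used for Llama 3's reasoning data.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}                 # action -> child Node
        self.visits, self.value = 0, 0.0

def ucb(child, parent_visits, c=1.4):
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

def mcts(root_state, env, budget=200):
    root = Node(root_state)
    for _ in range(budget):
        node = root
        # Select: descend by UCB while the node is fully expanded.
        while node.children and len(node.children) == len(env.legal_actions(node.state)):
            node = max(node.children.values(), key=lambda c: ucb(c, node.visits))
        # Expand: add one untried action if the node is not terminal.
        if not env.is_terminal(node.state):
            untried = [a for a in env.legal_actions(node.state) if a not in node.children]
            action = random.choice(untried)
            node.children[action] = Node(env.step(node.state, action), parent=node)
            node = node.children[action]
        # Simulate: random rollout from the new node to a terminal state.
        state = node.state
        while not env.is_terminal(state):
            state = env.step(state, random.choice(env.legal_actions(state)))
        reward = env.reward(state)
        # Backpropagate: update visit counts and values along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Choose the action with the highest visit count at the root.
    return max(root.children, key=lambda a: root.children[a].visits)
```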
4.2. Post-training Data
4.3. Capabilities
5. Results
- NOTE: Presents comprehensive evaluation results for both pre-trained and post-trained models, including human evaluations and safety assessments.
5.1. Pre-trained Language Model
5.2. Post-trained Language Model
5.3. Human Evaluations
5.4. Safety
- NOTE: Adversarial prompt generation algorithm:
  Input: Target model, diversity dimensions
  Output: Diverse set of adversarial prompts
  - Define diversity dimensions for prompts
  - Generate initial population of prompts
  - Evaluate prompts against target model
  - Select best prompts in each dimension
  - Mutate and crossover to create new prompts
  - Repeat evaluation and evolution process
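A schematic sketch of the evolutionary loop above; unsafe_score(), mutate(), and crossover() are hypothetical helpers, and the explicit per-dimension selection is simplified here to a single ranking.

```python
import random

def evolve_prompts(seed_prompts, unsafe_score, mutate, crossover,
                   generations=10, population_size=40):
    population = list(seed_prompts)
    for _ in range(generations):
        # Evaluate: score every prompt against the target model.
        ranked = sorted(population, key=unsafe_score, reverse=True)
        parents = ranked[: max(2, population_size // 4)]      # keep the strongest prompts
        # Evolve: mutate single parents and cross over pairs to create new prompts.
        children = [mutate(random.choice(parents)) for _ in range(population_size // 2)]
        children += [crossover(*random.sample(parents, 2)) for _ in range(population_size // 4)]
        population = parents + children
    return sorted(population, key=unsafe_score, reverse=True)[:10]
```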
6. Inference
- NOTE: Discusses techniques for efficient inference, including pipeline parallelism and FP8 quantization.
6.1. Pipeline Parallelism
6.2. FP8 Quantization
- NOTE: FP8 quantization algorithm:
  Input: Full-precision model
  Output: Quantized model for efficient inference
  - Convert model weights to 8-bit floating point
  - During inference:
    - Dequantize weights to higher precision
    - Perform computation
    - Requantize results if necessary
  - Use dynamic scaling for better accuracy
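A minimal sketch of dynamic-scaling FP8 weight quantization, assuming a recent PyTorch build that exposes torch.float8_e4m3fn; Llama 3's serving stack uses custom FP8 kernels, so this only illustrates the quantize/dequantize round trip with a per-tensor scale.

```python
import torch

def quantize_fp8(w, fp8_max=448.0):
    scale = w.abs().max() / fp8_max          # dynamic per-tensor scale
    q = (w / scale).to(torch.float8_e4m3fn)  # store weights in 8-bit floating point
    return q, scale

def dequantize_fp8(q, scale):
    return q.to(torch.bfloat16) * scale      # upcast before (or fused into) the matmul

w = torch.randn(4096, 4096)
q, scale = quantize_fp8(w)
w_hat = dequantize_fp8(q, scale)
print((w - w_hat.float()).abs().mean())      # small quantization error
```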
- NOTE: Paged attention cache management algorithm:
  Input: Prompts for multiple generations
  Output: Efficiently generated responses
  - Allocate fixed-size memory pages for attention cache
  - Dynamically assign pages to different requests
  - Reuse pages for multiple generations of same prompt
  - Deallocate pages when no longer needed
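A toy sketch of the paged attention-cache bookkeeping described above: fixed-size pages are assigned to requests on demand and returned to the pool when a generation finishes. Class and method names are illustrative, not part of any serving library.

```python
class PagedKVCache:
    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))      # pool of fixed-size cache pages
        self.page_table = {}                          # request id -> list of assigned page ids
        self.lengths = {}                             # request id -> tokens cached so far

    def append_token(self, request_id):
        n = self.lengths.get(request_id, 0)
        if n % self.page_size == 0:                   # current page is full (or first token)
            page = self.free_pages.pop()              # grab a new page on demand
            self.page_table.setdefault(request_id, []).append(page)
        self.lengths[request_id] = n + 1

    def release(self, request_id):
        # Return the request's pages to the pool so other generations can reuse them.
        self.free_pages.extend(self.page_table.pop(request_id, []))
        self.lengths.pop(request_id, None)

cache = PagedKVCache(num_pages=64)
for _ in range(40):
    cache.append_token("request-0")                   # pages are assigned as the response grows
cache.release("request-0")                            # pages return to the pool when done
```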
7. Vision Experiments
- NOTE: Describes preliminary work on integrating visual capabilities, including image and video recognition.
7.1. Data
7.2. Model Architecture
7.3. Model Scaling
7.4. Pre-training
7.5. Post-Training
7.6. Image Recognition Results
7.7. Video Recognition Results
8. Speech Experiments
- NOTE: Outlines experiments in adding speech understanding and generation capabilities to Llama 3.
8.1. Data
8.2. Model Architecture
8.3. Training Recipe
- NOTE: Self-supervised speech encoder pre-training algorithm:
  Input: Speech spectrogram
  Output: Trained speech encoder
  - Apply random masks to input spectrogram
  - Quantize masked regions into discrete tokens
  - Predict masked tokens using surrounding context
  - Use multiple codebooks
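A simplified PyTorch sketch of the masked-prediction recipe above: spectrogram frames are quantized into discrete targets with a frozen random projection and codebook, and an encoder is trained to predict the targets of masked frames from context. A single codebook and a placeholder Transformer stand in for the paper's multi-codebook setup and actual speech encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Frozen random projection and codebook used only to derive discrete targets.
codebook = torch.randn(1024, 80)
projection = torch.randn(80, 80)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=80, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(80, 1024)                       # predicts a codebook index per frame

spectrogram = torch.randn(2, 100, 80)            # (batch, frames, mel bins)
mask = torch.rand(2, 100) < 0.3                  # randomly mask ~30% of the frames

# Targets: the nearest codebook entry of each projected (unmasked) frame.
projected = (spectrogram @ projection).reshape(-1, 80)
targets = torch.cdist(projected, codebook).argmin(dim=-1).reshape(2, 100)

masked_input = spectrogram.masked_fill(mask.unsqueeze(-1), 0.0)
logits = head(encoder(masked_input))                  # predict tokens from surrounding context
loss = F.cross_entropy(logits[mask], targets[mask])   # loss only on the masked frames
loss.backward()
```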
8.4. Speech Understanding Results
8.5. Speech Generation Results
9. Related Work
- NOTE: Provides context by discussing related work in language modeling and multimodal AI.
9.1. Language
- NOTE: Discusses advancements in large language models, including LLM scaling trends, LLM architectural innovations, and LLM open-source developments.
- Scale: Mentions the trend of increasing model size and compute, referencing works like PaLM LLM (Chowdhery et al., 2022) and Gopher LLM (Rae et al., 2021).
- Small LLM models: Highlights efficient smaller models like Phi (LLM) (Abdin et al., 2024).
- LLM architectures: Discusses innovations like mixture-of-experts models (Shazeer et al., 2017, Fedus et al., 2022).
- Open source: Mentions other open-source LLM models like Mistral (Jiang et al., 2023) and Falcon (Almazrouei et al., 2023).
- Post-training: Discusses instruction tuning and alignment techniques (Ouyang et al., 2022, Bai et al., 2022).
9.2. Multimodality
- NOTE: Explores recent developments in multimodal AI, focusing on vision-language models and speech-language integration.
- Vision-language models: Discusses works like CLIP (Radford et al., 2021), Flamingo (Alayrac et al., 2022), and GPT-4V (OpenAI, 2023).
- Video understanding: Mentions approaches like VideoChat (Li et al., 2023) and Video-LLaMA (Zhang et al., 2023).
- Speech integration: References models like AudioPaLM (Rubenstein et al., 2023) and VioLA (Wang et al., 2023).
- Compositional multimodal approaches: Highlights methods for integrating multiple modalities, similar to the approach used in Llama 3.
10. Conclusion
- NOTE: Summarizes the key contributions of Llama 3 and discusses future directions for research.
11. Contributors and Acknowledgements
- NOTE: Lists the core contributors and acknowledges various individuals and teams involved in the development of Llama 3.
References
Author | Title | Year
---|---|---
AI@Meta Llama Team | The Llama 3 Herd of Models | 2024