2024 TheLlama3HerdofModels
- (AI@Meta Llama Team, 2024) ⇒ AI@Meta Llama Team. (2024). “The Llama 3 Herd of Models.” In: Meta AI Research.
Subject Headings: Llama LLM, Llama 3.1 LLM.
Notes
Cited By
Quotes
Abstract
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.
1. Introduction
- NOTE: Introduces Llama 3, a new family of foundation models for language, including variants with 8B, 70B, and 405B parameters.
2. General Overview
- NOTE: Outlines the two-stage approach of pre-training and post-training, and introduces the compositional approach for multimodal capabilities.
2.1. Language Model Pre-training
2.2. Language Model Post-training
3. Pre-Training
3.1. Pre-Training Data
- NOTE: Describes the approach to training data curation, including web data filtering, de-duplication, resampling, and quality control measures. Also discusses how the data mix was determined and the use of annealing data to improve model performance.
- NOTE: The pre-training data mix consists of roughly 50% general knowledge, 25% mathematical and reasoning data, 17% code data, and 8% multilingual data.
- NOTE: Rigorous quality control measures include de-duplication at URL, document, and line levels, along with heuristic filtering to remove low-quality and repetitive content.
- NOTE: The data curation process includes domain-specific pipelines for extracting high-quality code and reasoning data, ensuring relevant and specialized training material.
- NOTE: Multilingual data processing uses a fasttext-based language identification model and language-specific heuristics to categorize and filter documents across 176 languages (a brief identification sketch follows these notes).
- NOTE: Annealing on high-quality code and mathematical data during the final stages of pre-training boosts model performance on key benchmarks, highlighting the importance of targeted data augmentation.
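A brief sketch of fasttext-based language identification as used in the multilingual pipeline above, assuming the publicly released lid.176.bin model; the confidence threshold and file path are illustrative, and the paper's language-specific heuristics are not reproduced.

```python
import fasttext

# Publicly released fastText language-identification model covering 176 languages.
lid_model = fasttext.load_model("lid.176.bin")

def detect_language(document: str, min_confidence: float = 0.65):
    # fastText predicts on a single line of text, so strip newlines first.
    labels, probs = lid_model.predict(document.replace("\n", " "), k=1)
    lang = labels[0].replace("__label__", "")   # e.g. "en", "de", "hi"
    # Keep the document only if the classifier is reasonably confident.
    return lang if probs[0] >= min_confidence else None
```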
3.2. Model Architecture
- NOTE: Outlines the Llama 3 architecture, a standard dense Transformer with modifications such as Grouped Query Attention (GQA), and discusses the scaling laws used to determine the optimal model size for the 405B-parameter model.
- NOTE: Grouped Query Attention (GQA) algorithm:
  Input: Query, Key, and Value tensors; number of query heads and key-value heads
  Output: Attention output tensor
  - Split queries into more heads than keys and values
  - For each query head:
    - Match it with a key-value head (cycling if necessary)
    - Compute attention scores and output
  - Concatenate outputs from all heads
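A minimal PyTorch sketch of grouped query attention following the steps above; the shapes, head counts, and repeat-based key-value sharing are illustrative rather than the Llama 3 implementation.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_q_heads, n_kv_heads):
    # q: (batch, seq, n_q_heads, head_dim); k, v: (batch, seq, n_kv_heads, head_dim)
    group_size = n_q_heads // n_kv_heads
    # Repeat each key-value head so every group of query heads shares one KV head.
    k = k.repeat_interleave(group_size, dim=2)
    v = v.repeat_interleave(group_size, dim=2)
    # Move heads next to the batch dimension: (batch, heads, seq, head_dim).
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    out = F.softmax(scores, dim=-1) @ v
    # Concatenate head outputs by restoring the (batch, seq, heads, head_dim) layout.
    return out.transpose(1, 2)

# Toy usage: 8 query heads share 2 key-value heads.
q, k, v = torch.randn(1, 16, 8, 64), torch.randn(1, 16, 2, 64), torch.randn(1, 16, 2, 64)
out = grouped_query_attention(q, k, v, n_q_heads=8, n_kv_heads=2)
```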
3.3. Infrastructure, Scaling, and Efficiency
- NOTE: Details the AI infrastructure used for Llama 3 training, including GPU utilization, storage solutions, and networking optimizations. Also discusses the parallelism techniques used to scale training: tensor parallelism, pipeline parallelism, context parallelism, and data parallelism.
- NOTE: Pipeline parallelism for training and inference algorithm:
  Input: Model, input data, number of pipeline stages
  Output: Model outputs and gradients
  - Divide model into stages across devices
  - For each microbatch:
    - Forward pass through stages sequentially
    - Pass output of each stage to next device
  - Collect final outputs from last stage
  - Perform backward pass in reverse order
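A single-process PyTorch sketch of the microbatched pipeline schedule above; in real training each stage sits on its own device and activations are communicated between devices, and Llama 3 uses a more sophisticated interleaved schedule. The four-stage toy model is illustrative.

```python
import torch
import torch.nn as nn

# Toy model split into 4 pipeline stages (each would live on its own GPU).
stages = nn.ModuleList([nn.Linear(32, 32) for _ in range(4)])

def pipeline_forward(batch, n_microbatches=4):
    outputs = []
    for micro in batch.chunk(n_microbatches):   # split the batch into microbatches
        x = micro
        for stage in stages:                    # forward through stages sequentially;
            x = stage(x)                        # each output is handed to the next device
        outputs.append(x)
    return torch.cat(outputs)                   # collect outputs from the last stage

loss = pipeline_forward(torch.randn(16, 32)).pow(2).mean()
loss.backward()                                 # backward runs through the stages in reverse
```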
- NOTE: Tensor parallelism algorithm:
  Input: Large tensor operation, number of devices
  Output: Result of the distributed tensor operation
  - Split large tensors across multiple devices
  - Perform local computations on each device
  - Synchronize results across devices (e.g., all-reduce)
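A toy single-process sketch of tensor parallelism: the weight matrix of one layer is split column-wise across two simulated devices, each computes a local matmul, and the shards are gathered (an all-gather or all-reduce in a real multi-GPU setup). Sizes are illustrative.

```python
import torch

x = torch.randn(4, 8)            # activations replicated on every device
w = torch.randn(8, 16)           # full weight matrix of one layer
w_shards = w.chunk(2, dim=1)     # each device holds half of the output columns

partials = [x @ shard for shard in w_shards]   # local computation on each device
y = torch.cat(partials, dim=1)                 # gather the shards; matches the full matmul
assert torch.allclose(y, x @ w, atol=1e-5)
```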
- NOTE: Context parallelism algorithm:
  Input: Long input sequence, number of devices
  Output: Processed sequence with full context
  - Divide input sequence into chunks
  - Process chunks on different devices in parallel
  - Synchronize for operations requiring full context
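A toy single-process sketch of context parallelism: the long sequence is divided into chunks that would be processed on different devices in parallel, then gathered before an operation that needs the full context. The modules and sizes are placeholders, not Llama 3 layers.

```python
import torch
import torch.nn as nn

embed = nn.Linear(16, 16)                        # per-token work that needs no cross-chunk context
attn = nn.MultiheadAttention(16, 4, batch_first=True)

seq = torch.randn(1, 1024, 16)                   # one long input sequence
chunks = seq.chunk(4, dim=1)                     # each device gets a contiguous chunk

local = [embed(c) for c in chunks]               # chunks are processed independently in parallel
full = torch.cat(local, dim=1)                   # synchronize (gather) before full-context attention
out, _ = attn(full, full, full)                  # attention sees the entire sequence
```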
- NOTE: Fully sharded data parallelism (FSDP) algorithm:
  Input: Model, batch of data, number of devices
  Output: Updated model parameters
  - Shard model parameters, gradients, and optimizer states
  - During forward pass:
    - Gather full parameters for current layer
    - Compute and release immediately after use
  - In backward pass:
    - Recompute forward activations as needed
    - Accumulate gradients locally
    - Synchronize gradients across all devices
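A conceptual single-process sketch of fully sharded data parallelism with two simulated devices: each owns half of a parameter vector, gathers the full parameters for compute, and averages gradients before updating only its own shard. Real FSDP shards per layer and overlaps communication; this only shows the shape of the idea.

```python
import torch

full_w = torch.randn(8)
shards = list(full_w.chunk(2))                    # each device stores only its parameter shard

def local_step(local_batch):
    w = torch.cat(shards).requires_grad_(True)    # gather the full parameters for compute
    loss = (local_batch * w).sum()
    loss.backward()                               # local gradient from this device's batch
    return w.grad

grads = [local_step(torch.randn(8)) for _ in range(2)]   # one backward per simulated device
avg_grad = torch.stack(grads).mean(0)                    # all-reduce: average gradients across devices
shards = [s - 0.1 * g for s, g in zip(shards, avg_grad.chunk(2))]  # each device updates its own shard
```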
3.4. Training Recipe
- NOTE: Describes the three-stage training recipe for Llama 3: initial pre-training, long-context pre-training, and annealing. Includes details on learning rate schedules, batch sizes, and context length increases throughout the training process.
4. Post-Training
- NOTE: Describes the post-training process, including modeling approaches, data preparation, and development of specific capabilities.
4.1. Modeling
- NOTE: Direct Preference Optimization (DPO) algorithm:
  Input: Model, preferred and non-preferred response pairs
  Output: Aligned model
  - Given: preferred and non-preferred responses
  - Maximize likelihood ratio of preferred over non-preferred
  - Update model to align with human preferences
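A minimal PyTorch sketch of a DPO-style preference loss over one preference pair; the log-probabilities are illustrative scalars and beta is a hypothetical temperature, not the paper's hyperparameters.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit reward: beta * (policy log-prob - frozen reference log-prob).
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Push the preferred response's reward above the non-preferred one's.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```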
- NOTE: Rejection sampling algorithm:
  Input: Prompt, language model, reward model
  Output: High-quality response
  - Generate multiple responses for a prompt
  - Score responses using a reward model
  - Select highest-scoring response
  - Add selected response to training data
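A schematic sketch of rejection sampling for post-training data; generate() and reward_model() are hypothetical stand-ins for the paper's generation and reward-model stack.

```python
def rejection_sample(prompt, generate, reward_model, n_samples=8):
    candidates = [generate(prompt) for _ in range(n_samples)]   # sample several responses
    scores = [reward_model(prompt, c) for c in candidates]      # score each with a reward model
    best = candidates[scores.index(max(scores))]                # keep the highest-scoring response
    return {"prompt": prompt, "response": best}                 # add the pair to the training data
```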
- NOTE: Monte Carlo Tree Search (MCTS) algorithm:
  Input: Problem state, computation budget
  Output: Best action or reasoning step
  - Start with root problem state
  - While within computation budget:
    - Select: Traverse tree based on UCB scores
    - Expand: Add new node if not terminal
    - Simulate: Random rollout to end state
    - Backpropagate: Update node statistics
  - Choose best action based on visit counts
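A compact, generic MCTS sketch covering the four phases above; the env object with legal_actions/step/is_terminal/reward is a hypothetical interface, and this is illustrative rather than the search used for Llama 3's reasoning data.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}                 # action -> child Node
        self.visits, self.value = 0, 0.0

def ucb(child, parent_visits, c=1.4):
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

def mcts(root_state, env, budget=200):
    root = Node(root_state)
    for _ in range(budget):
        node = root
        # Select: descend by UCB while the node is fully expanded.
        while node.children and len(node.children) == len(env.legal_actions(node.state)):
            node = max(node.children.values(), key=lambda c: ucb(c, node.visits))
        # Expand: add one untried action if the node is not terminal.
        if not env.is_terminal(node.state):
            untried = [a for a in env.legal_actions(node.state) if a not in node.children]
            action = random.choice(untried)
            node.children[action] = Node(env.step(node.state, action), parent=node)
            node = node.children[action]
        # Simulate: random rollout from the new node to a terminal state.
        state = node.state
        while not env.is_terminal(state):
            state = env.step(state, random.choice(env.legal_actions(state)))
        reward = env.reward(state)
        # Backpropagate: update visit counts and values along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Choose the action with the highest visit count at the root.
    return max(root.children, key=lambda a: root.children[a].visits)
```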
4.2. Post-training Data
4.3. Capabilities
5. Results
- NOTE: Presents comprehensive evaluation results for both pre-trained and post-trained models, including human evaluations and safety assessments.
5.1. Pre-trained Language Model
5.2. Post-trained Language Model
5.3. Human Evaluations
5.4. Safety
- NOTE: Adversarial prompt generation algorithm:
  Input: Target model, diversity dimensions
  Output: Diverse set of adversarial prompts
  - Define diversity dimensions for prompts
  - Generate initial population of prompts
  - Evaluate prompts against target model
  - Select best prompts in each dimension
  - Mutate and crossover to create new prompts
  - Repeat evaluation and evolution process
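A schematic sketch of the evolutionary loop above; unsafe_score(), mutate(), and crossover() are hypothetical helpers, and the explicit per-dimension selection is simplified here to a single ranking.

```python
import random

def evolve_prompts(seed_prompts, unsafe_score, mutate, crossover,
                   generations=10, population_size=40):
    population = list(seed_prompts)
    for _ in range(generations):
        # Evaluate: score every prompt against the target model.
        ranked = sorted(population, key=unsafe_score, reverse=True)
        parents = ranked[: max(2, population_size // 4)]      # keep the strongest prompts
        # Evolve: mutate single parents and cross over pairs to create new prompts.
        children = [mutate(random.choice(parents)) for _ in range(population_size // 2)]
        children += [crossover(*random.sample(parents, 2)) for _ in range(population_size // 4)]
        population = parents + children
    return sorted(population, key=unsafe_score, reverse=True)[:10]
```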
6. Inference
- NOTE: Discusses techniques for efficient inference, including pipeline parallelism and FP8 quantization.
6.1. Pipeline Parallelism
6.2. FP8 Quantization
- NOTE: FP8 quantization algorithm:
  Input: Full-precision model
  Output: Quantized model for efficient inference
  - Convert model weights to 8-bit floating point
  - During inference:
    - Dequantize weights to higher precision
    - Perform computation
    - Requantize results if necessary
  - Use dynamic scaling for better accuracy
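A minimal sketch of dynamic-scaling FP8 weight quantization, assuming a recent PyTorch build that exposes torch.float8_e4m3fn; Llama 3's serving stack uses custom FP8 kernels, so this only illustrates the quantize/dequantize round trip with a per-tensor scale.

```python
import torch

def quantize_fp8(w, fp8_max=448.0):
    scale = w.abs().max() / fp8_max          # dynamic per-tensor scale
    q = (w / scale).to(torch.float8_e4m3fn)  # store weights in 8-bit floating point
    return q, scale

def dequantize_fp8(q, scale):
    return q.to(torch.bfloat16) * scale      # upcast before (or fused into) the matmul

w = torch.randn(4096, 4096)
q, scale = quantize_fp8(w)
w_hat = dequantize_fp8(q, scale)
print((w - w_hat.float()).abs().mean())      # small quantization error
```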
- NOTE: Paged attention cache management algorithm:
  Input: Prompts for multiple generations
  Output: Efficiently generated responses
  - Allocate fixed-size memory pages for attention cache
  - Dynamically assign pages to different requests
  - Reuse pages for multiple generations of same prompt
  - Deallocate pages when no longer needed
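A toy sketch of the paged attention-cache bookkeeping described above: fixed-size pages are assigned to requests on demand and returned to the pool when a generation finishes. Class and method names are illustrative, not part of any serving library.

```python
class PagedKVCache:
    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))      # pool of fixed-size cache pages
        self.page_table = {}                          # request id -> list of assigned page ids
        self.lengths = {}                             # request id -> tokens cached so far

    def append_token(self, request_id):
        n = self.lengths.get(request_id, 0)
        if n % self.page_size == 0:                   # current page is full (or first token)
            page = self.free_pages.pop()              # grab a new page on demand
            self.page_table.setdefault(request_id, []).append(page)
        self.lengths[request_id] = n + 1

    def release(self, request_id):
        # Return the request's pages to the pool so other generations can reuse them.
        self.free_pages.extend(self.page_table.pop(request_id, []))
        self.lengths.pop(request_id, None)

cache = PagedKVCache(num_pages=64)
for _ in range(40):
    cache.append_token("request-0")                   # pages are assigned as the response grows
cache.release("request-0")                            # pages return to the pool when done
```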
7. Vision Experiments
- NOTE: Describes preliminary work on integrating visual capabilities, including image and video recognition.
7.1. Data
7.2. Model Architecture
7.3. Model Scaling
7.4. Pre-training
7.5. Post-Training
7.6. Image Recognition Results
7.7. Video Recognition Results
8. Speech Experiments
- NOTE: Outlines experiments in adding speech understanding and generation capabilities to Llama 3.
8.1. Data
8.2. Model Architecture
8.3. Training Recipe
- NOTE: Self-supervised speech encoder pre-training algorithm:
  Input: Speech spectrogram
  Output: Trained speech encoder
  - Apply random masks to input spectrogram
  - Quantize masked regions into discrete tokens
  - Predict masked tokens using surrounding context
  - Use multiple codebooks
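A simplified PyTorch sketch of the masked-prediction recipe above: spectrogram frames are quantized into discrete targets with a frozen random projection and codebook, and an encoder is trained to predict the targets of masked frames from context. A single codebook and a placeholder Transformer stand in for the paper's multi-codebook setup and actual speech encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Frozen random projection and codebook used only to derive discrete targets.
codebook = torch.randn(1024, 80)
projection = torch.randn(80, 80)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=80, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(80, 1024)                       # predicts a codebook index per frame

spectrogram = torch.randn(2, 100, 80)            # (batch, frames, mel bins)
mask = torch.rand(2, 100) < 0.3                  # randomly mask ~30% of the frames

# Targets: the nearest codebook entry of each projected (unmasked) frame.
projected = (spectrogram @ projection).reshape(-1, 80)
targets = torch.cdist(projected, codebook).argmin(dim=-1).reshape(2, 100)

masked_input = spectrogram.masked_fill(mask.unsqueeze(-1), 0.0)
logits = head(encoder(masked_input))                  # predict tokens from surrounding context
loss = F.cross_entropy(logits[mask], targets[mask])   # loss only on the masked frames
loss.backward()
```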
8.4. Speech Understanding Results
8.5. Speech Generation Results
9. Related Work
- NOTE: Provides context by discussing related work in language modeling and multimodal AI.
9.1. Language
- NOTE: Discusses advancements in large language models, including LLM scaling trends, LLM architectural innovations, and LLM open-source developments.
- Scale: Mentions the trend of increasing model size and compute, referencing works like PaLM LLM (Chowdhery et al., 2022) and Gopher LLM (Rae et al., 2021).
- Small LLM models: Highlights efficient smaller models like Phi (LLM) (Abdin et al., 2024).
- LLM architectures: Discusses innovations like mixture-of-experts models (Shazeer et al., 2017, Fedus et al., 2022).
- Open source: Mentions other open-source LLM models like Mistral (Jiang et al., 2023) and Falcon (Almazrouei et al., 2023).
- Post-training: Discusses instruction tuning and alignment techniques (Ouyang et al., 2022, Bai et al., 2022).
9.2. Multimodality
- NOTE: Explores recent developments in multimodal AI, focusing on vision-language models and speech-language integration.
- Vision-language models: Discusses works like CLIP (Radford et al., 2021), Flamingo (Alayrac et al., 2022), and GPT-4V (OpenAI, 2023).
- Video understanding: Mentions approaches like VideoChat (Li et al., 2023) and Video-LLaMA (Zhang et al., 2023).
- Speech integration: References models like AudioPaLM (Rubenstein et al., 2023) and VioLA (Wang et al., 2023).
- Compositional multimodal approaches: Highlights methods for integrating multiple modalities, similar to the approach used in Llama 3.
10. Conclusion
- NOTE: Summarizes the key contributions of Llama 3 and discusses future directions for research.
11. Contributors and Acknowledgements
- NOTE: Lists the core contributors and acknowledges various individuals and teams involved in the development of Llama 3.
References
Author | Title | Year
---|---|---
AI@Meta Llama Team | The Llama 3 Herd of Models | 2024