2024 TheLlama3HerdofModels

From GM-RKB

Subject Headings: Llama LLM, Llama 3.1 LLM.

Notes

Cited By

Quotes

Abstract

Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.

1. Introduction

  • NOTE: Introduces Llama 3, a new family of foundation models for language, including variants with 8B, 70B, and 405B parameters.

2. General Overview

  • NOTE: Outlines the two-stage approach of pre-training and post-training, and introduces the compositional approach for multimodal capabilities.

2.1. Language Model Pre-training

2.2. Language Model Post-training

3. Pre-Training

3.1. Pre-Training Data

3.2. Model Architecture

  Algorithm: Grouped-query attention (GQA)
  Input: Query, Key, and Value tensors; number of query heads and key-value heads
  Output: Attention output tensor
  - Split queries into more heads than keys and values
  - For each query head:
    - Match it with a key-value head (cycling if necessary)
    - Compute attention scores and output
  - Concatenate outputs from all heads
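The grouped-query attention steps above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the nested-list tensor layout, the function names, and the causal mask are assumptions, and query heads are mapped to key-value heads in consecutive groups (`h // group`) rather than by cycling.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def grouped_query_attention(q, k, v):
    """q: [n_q_heads][seq][d]; k, v: [n_kv_heads][seq][d].
    Returns [seq][n_q_heads * d] with the head outputs concatenated."""
    n_q, n_kv = len(q), len(k)
    assert n_q % n_kv == 0, "query heads must be a multiple of key-value heads"
    group = n_q // n_kv
    seq, d = len(q[0]), len(q[0][0])
    out = [[] for _ in range(seq)]
    for h in range(n_q):
        kv = h // group  # consecutive query heads share one key-value head
        for i in range(seq):
            # Causal mask (assumed): position i attends to positions 0..i only.
            scores = [
                sum(a * b for a, b in zip(q[h][i], k[kv][j])) / math.sqrt(d)
                for j in range(i + 1)
            ]
            weights = softmax(scores)
            head_out = [
                sum(weights[j] * v[kv][j][c] for j in range(i + 1))
                for c in range(d)
            ]
            out[i].extend(head_out)  # concatenate this head's output
    return out
```

Because several query heads read the same key and value tensors, the key-value cache shrinks by the factor `n_q_heads / n_kv_heads` relative to standard multi-head attention.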

3.3. Infrastructure, Scaling, and Efficiency

  Algorithm: Pipeline parallelism
  Input: Model, input data, number of pipeline stages
  Output: Model outputs and gradients
  - Divide model into stages across devices
  - For each microbatch:
    - Forward pass through stages sequentially
    - Pass output of each stage to next device
  - Collect final outputs from last stage
  - Perform backward pass in reverse order
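The pipeline-parallel steps above can be illustrated with a toy, single-process sketch. Each hypothetical `Stage` is a scalar multiplication standing in for a group of layers on one device, and the loss is assumed to be the sum of the final outputs. A real schedule overlaps microbatches across devices; this sketch only shows the data flow and the reversed backward order.

```python
class Stage:
    """One pipeline stage: y = weight * x (a stand-in for a group of layers)."""

    def __init__(self, weight):
        self.weight = weight
        self.grad = 0.0
        self.saved_inputs = []  # one saved input per in-flight microbatch

    def forward(self, x):
        self.saved_inputs.append(x)
        return self.weight * x

    def backward(self, grad_out):
        x = self.saved_inputs.pop()       # input of the most recent microbatch
        self.grad += grad_out * x         # accumulate d(loss)/d(weight)
        return grad_out * self.weight     # gradient for the previous stage

def pipeline_step(stages, microbatches):
    outputs = []
    for x in microbatches:
        for stage in stages:              # forward through stages in order
            x = stage.forward(x)
        outputs.append(x)                 # collected from the last stage
    for _ in reversed(microbatches):      # backward, last microbatch first
        grad = 1.0                        # d(sum of outputs)/d(output)
        for stage in reversed(stages):
            grad = stage.backward(grad)
    return outputs
```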
  Algorithm: Tensor parallelism
  Input: Large tensor operation, number of devices
  Output: Result of the distributed tensor operation
  - Split large tensors across multiple devices
  - Perform local computations on each device
  - Synchronize results across devices (e.g., all-reduce)
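As a small sketch of the tensor-parallel steps above, the following simulates a row-sharded matrix-vector product in one process. The function names are hypothetical, each loop iteration plays the role of one device, and `all_reduce_sum` stands in for a collective that a real system would run over the interconnect.

```python
def matvec(x, w):
    """x: [k], w: [k][n] -> [n]."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def all_reduce_sum(partials):
    """Element-wise sum of the per-device partial results."""
    return [sum(values) for values in zip(*partials)]

def tensor_parallel_matvec(x, w, num_devices):
    k = len(x)
    assert k % num_devices == 0, "inner dimension must split evenly"
    shard = k // num_devices
    partials = []
    for dev in range(num_devices):
        lo, hi = dev * shard, (dev + 1) * shard
        # Each device holds a slice of the input and the matching rows of w.
        partials.append(matvec(x[lo:hi], w[lo:hi]))
    return all_reduce_sum(partials)
```

Splitting along the inner dimension means each device produces a full-size partial output, so one all-reduce is enough to recover the unsharded result.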
  Algorithm: Context parallelism
  Input: Long input sequence, number of devices
  Output: Processed sequence with full context
  - Divide input sequence into chunks
  - Process chunks on different devices in parallel
  - Synchronize for operations requiring full context
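One way to picture the context-parallel steps above: each device keeps its own chunk of queries and gathers keys and values from all devices before computing attention. The sketch below assumes scalar tokens, a causal mask, and an all-gather style exchange; the function names are hypothetical and the devices are simulated by a loop.

```python
import math

def attend(q_chunk, q_offset, keys, values):
    """Causal attention for scalar queries whose absolute positions start at q_offset."""
    out = []
    for i, q in enumerate(q_chunk):
        pos = q_offset + i
        scores = [q * keys[j] for j in range(pos + 1)]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append(sum(wj * values[j] for j, wj in enumerate(weights)) / z)
    return out

def context_parallel_attention(q, k, v, num_devices):
    n = len(q)
    assert n % num_devices == 0, "sequence must split evenly"
    size = n // num_devices
    chunks = [(d * size, (d + 1) * size) for d in range(num_devices)]
    # All-gather (simulated): every device receives all key and value chunks.
    k_full = [x for lo, hi in chunks for x in k[lo:hi]]
    v_full = [x for lo, hi in chunks for x in v[lo:hi]]
    out = []
    for lo, hi in chunks:
        # Each device processes only its own query chunk.
        out.extend(attend(q[lo:hi], lo, k_full, v_full))
    return out
```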
  Algorithm: Fully sharded data parallelism (FSDP)
  Input: Model, batch of data, number of devices
  Output: Updated model parameters
  - Shard model parameters, gradients, and optimizer states
  - During forward pass:
    - Gather full parameters for current layer
    - Compute and release immediately after use
  - In backward pass:
    - Recompute forward activations as needed
    - Accumulate gradients locally
  - Synchronize gradients across all devices
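The sharding steps above can be sketched for a single linear layer trained with squared error. This is a simplified, single-process illustration under stated assumptions: `fsdp_step` is a hypothetical name, devices are simulated by list entries, the optimizer is plain SGD, and activation recomputation and per-layer gather and release are omitted.

```python
def fsdp_step(param_shards, data_shards, lr):
    """param_shards[d]: device d's slice of the weight vector.
    data_shards[d]: device d's local batch as a list of (x, y) pairs."""
    num_devices = len(param_shards)
    # All-gather: rebuild the full weight vector for the forward pass.
    full = [w for shard in param_shards for w in shard]
    local_grads = []
    for batch in data_shards:
        grad = [0.0] * len(full)
        for x, y in batch:
            err = sum(w * xi for w, xi in zip(full, x)) - y
            for i, xi in enumerate(x):
                grad[i] += 2.0 * err * xi / len(batch)  # local gradient accumulation
        local_grads.append(grad)
    # Reduce-scatter: average gradients, each device keeps only its own slice.
    new_shards, start = [], 0
    for shard in param_shards:
        end = start + len(shard)
        avg = [sum(g[i] for g in local_grads) / num_devices for i in range(start, end)]
        new_shards.append([w - lr * g for w, g in zip(shard, avg)])
        start = end
    return new_shards
```

Each device stores and updates only its slice of the parameters, which is where the memory saving over plain data parallelism comes from.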

3.4. Training Recipe

4. Post-Training

  • NOTE: Describes the post-training process, including modeling approaches, data preparation, and development of specific capabilities.

4.1. Modeling

  Algorithm: Direct Preference Optimization (DPO)
  Input: Model, preferred and non-preferred response pairs
  Output: Aligned model
  - Given: preferred and non-preferred responses
  - Maximize likelihood ratio of preferred over non-preferred
  - Update model to align with human preferences
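The preference objective above is commonly written as the DPO loss, which compares the policy's log-probabilities with those of a frozen reference model. The sketch below computes that loss for one response pair; the argument names and the default `beta` are illustrative assumptions, and the log-probabilities are taken as given rather than computed from a model.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given sequence log-probabilities under the
    policy (pi_*) and the frozen reference model (ref_*)."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)), written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

The loss equals log 2 when the policy matches the reference and falls as the policy raises the preferred response's likelihood relative to the non-preferred one.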
  Algorithm: Rejection sampling
  Input: Prompt, language model, reward model
  Output: High-quality response
  - Generate multiple responses for a prompt
  - Score responses using a reward model
  - Select highest-scoring response
  - Add selected response to training data
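The rejection-sampling loop above reduces to a best-of-N selection. In this minimal sketch, `generate` and `reward` are hypothetical callables standing in for the language model and the reward model.

```python
import random

def rejection_sample(prompt, generate, reward, num_samples, rng):
    """Draw num_samples candidates and return the one with the highest reward."""
    candidates = [generate(prompt, rng) for _ in range(num_samples)]
    return max(candidates, key=lambda response: reward(prompt, response))

def build_training_data(prompts, generate, reward, num_samples, seed=0):
    """Pair each prompt with its best-scoring sampled response."""
    rng = random.Random(seed)
    return [
        (prompt, rejection_sample(prompt, generate, reward, num_samples, rng))
        for prompt in prompts
    ]
```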
  Algorithm: Monte Carlo tree search (MCTS)
  Input: Problem state, computation budget
  Output: Best action or reasoning step
  - Start with root problem state
  - While within computation budget:
    - Select: Traverse tree based on UCB scores
    - Expand: Add new node if not terminal
    - Simulate: Random rollout to end state
    - Backpropagate: Update node statistics
  - Choose best action based on visit counts
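The four search phases above can be sketched on a toy problem. The problem itself is an assumption made for illustration: a state is a tuple of up to three bits, and the reward is the fraction of ones. The exploration constant 1.4 is likewise an arbitrary choice.

```python
import math
import random

DEPTH = 3          # toy problem: choose three bits
ACTIONS = (0, 1)

class Node:
    def __init__(self, state, parent=None):
        self.state = state        # tuple of bits chosen so far
        self.parent = parent
        self.children = {}        # action -> Node
        self.visits = 0
        self.total_reward = 0.0

def is_terminal(state):
    return len(state) == DEPTH

def reward(state):
    return sum(state) / DEPTH

def ucb(parent, child, c=1.4):
    mean = child.total_reward / child.visits
    return mean + c * math.sqrt(math.log(parent.visits) / child.visits)

def mcts(budget, rng):
    root = Node(())
    for _ in range(budget):
        node = root
        # Select: descend while the node is fully expanded.
        while not is_terminal(node.state) and len(node.children) == len(ACTIONS):
            node = max(node.children.values(), key=lambda ch: ucb(node, ch))
        # Expand: add one untried child if the node is not terminal.
        if not is_terminal(node.state):
            action = rng.choice([a for a in ACTIONS if a not in node.children])
            child = Node(node.state + (action,), parent=node)
            node.children[action] = child
            node = child
        # Simulate: random rollout to a terminal state.
        state = node.state
        while not is_terminal(state):
            state = state + (rng.choice(ACTIONS),)
        r = reward(state)
        # Backpropagate: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += r
            node = node.parent
    # Choose the root action with the most visits.
    return max(root.children, key=lambda a: root.children[a].visits)
```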

4.2. Post-training Data

4.3. Capabilities

5. Results

  • NOTE: Presents comprehensive evaluation results for both pre-trained and post-trained models, including human evaluations and safety assessments.

5.1. Pre-trained Language Model

5.2. Post-trained Language Model

5.3. Human Evaluations

5.4. Safety

   Algorithm: Rainbow Teaming
   Input: Target model, diversity dimensions
   Output: Diverse set of adversarial prompts
   - Define diversity dimensions for prompts
   - Generate initial population of prompts
   - Evaluate prompts against target model
   - Select best prompts in each dimension
   - Mutate and crossover to create new prompts
   - Repeat evaluation and evolution process
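The evolutionary loop above resembles a quality-diversity search that keeps the best prompt found in each cell of a grid of diversity dimensions. The sketch below is a toy version under heavy assumptions: prompts are abstract tuples rather than text, the category and style names are placeholders, `score` is a hypothetical callable standing in for evaluation against the target model, and crossover is omitted in favor of mutation only.

```python
import random

CATEGORIES = ("cat_a", "cat_b")   # placeholder diversity dimension 1
STYLES = ("style_x", "style_y")   # placeholder diversity dimension 2

def evolve_prompts(score, iterations, rng):
    """Return an archive mapping (category, style) -> (best prompt, its score)."""
    archive = {}
    for _ in range(iterations):
        if archive and rng.random() < 0.8:
            # Mutate one field of a prompt already in the archive.
            (category, style, strength), _ = rng.choice(list(archive.values()))
            field = rng.randrange(3)
            if field == 0:
                category = rng.choice(CATEGORIES)
            elif field == 1:
                style = rng.choice(STYLES)
            else:
                strength = max(0, strength + rng.choice((-1, 1)))
            candidate = (category, style, strength)
        else:
            # Seed the population with a fresh random prompt.
            candidate = (rng.choice(CATEGORIES), rng.choice(STYLES), rng.randrange(3))
        s = score(candidate)
        cell = candidate[:2]
        # Keep only the best-scoring prompt per cell.
        if cell not in archive or s > archive[cell][1]:
            archive[cell] = (candidate, s)
    return archive
```

Keeping one elite per cell is what preserves diversity: a strong prompt in one category cannot crowd out weaker prompts in the others.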

6. Inference

  • NOTE: Discusses techniques for efficient inference, including pipeline parallelism and FP8 quantization.

6.1. Pipeline Parallelism

6.2. FP8 Quantization

   Algorithm: FP8 quantization
   Input: Full-precision model
   Output: Quantized model for efficient inference
   - Convert model weights to 8-bit floating point
   - During inference:
     - Dequantize weights to higher precision
     - Perform computation
     - Requantize results if necessary
   - Use dynamic scaling for better accuracy
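Python has no native 8-bit float, so the quantization steps above are simulated here by rounding to three stored mantissa bits. This is a rough sketch that assumes the E4M3 format (maximum value 448) and a per-row dynamic scale; it ignores subnormals and special encodings, and all function names are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (assumed format)

def round_to_fp8(x):
    """Round x to the nearest value with 3 stored mantissa bits, clamped to range."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    m, e = math.frexp(x)        # x = m * 2**e with 0.5 <= |m| < 1
    m = round(m * 16) / 16      # 4 significant bits: 1 implicit + 3 stored
    return math.ldexp(m, e)

def quantize_row(row):
    """Scale the row so its largest magnitude maps to the FP8 maximum."""
    amax = max(abs(value) for value in row)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_fp8(value / scale) for value in row], scale

def dequantize_row(quantized, scale):
    return [value * scale for value in quantized]
```

Computing the scale from each row's own maximum keeps large and small rows equally well resolved, which is the point of dynamic scaling.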
   Algorithm: Paged attention (paged key-value cache)
   Input: Prompts for multiple generations
   Output: Efficiently generated responses
   - Allocate fixed-size memory pages for attention cache
   - Dynamically assign pages to different requests
   - Reuse pages for multiple generations of same prompt
   - Deallocate pages when no longer needed
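The page-management steps above amount to a small allocator with reference counting. The sketch below tracks page ownership only and stores no actual key or value tensors; the class and method names are hypothetical.

```python
class PagedKVCache:
    """Bookkeeping for fixed-size cache pages shared between requests."""

    def __init__(self, num_pages):
        self.free = list(range(num_pages))
        self.refcount = {}   # page id -> number of requests using it
        self.tables = {}     # request id -> list of page ids

    def allocate(self, request_id, num_pages):
        """Assign fresh pages to a request."""
        if num_pages > len(self.free):
            raise MemoryError("not enough free pages")
        pages = [self.free.pop() for _ in range(num_pages)]
        for page in pages:
            self.refcount[page] = 1
        self.tables.setdefault(request_id, []).extend(pages)

    def fork(self, parent_id, child_id):
        """Share the parent's pages (for example, the prompt) with a new generation."""
        pages = list(self.tables[parent_id])
        for page in pages:
            self.refcount[page] += 1
        self.tables[child_id] = pages

    def release(self, request_id):
        """Drop a request; pages return to the free list once unreferenced."""
        for page in self.tables.pop(request_id):
            self.refcount[page] -= 1
            if self.refcount[page] == 0:
                del self.refcount[page]
                self.free.append(page)
```

A short usage example: allocate pages for a prompt, `fork` it once per generation so the prompt pages are stored only once, then `release` each request as it finishes.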

7. Vision Experiments

  • NOTE: Describes preliminary work on integrating visual capabilities, including image and video recognition.

7.1. Data

7.2. Model Architecture

7.3. Model Scaling

7.4. Pre-training

7.5. Post-Training

7.6. Image Recognition Results

7.7. Video Recognition Results

8. Speech Experiments

  • NOTE: Outlines experiments in adding speech understanding and generation capabilities to Llama 3.

8.1. Data

8.2. Model Architecture

8.3. Training Recipe

  Algorithm: BEST-RQ self-supervised pre-training
  Input: Speech spectrogram
  Output: Trained speech encoder
  - Apply random masks to input spectrogram
  - Quantize masked regions into discrete tokens
  - Predict masked tokens using surrounding context
  - Use multiple codebooks

8.4. Speech Understanding Results

8.5. Speech Generation Results

9. Related Work

9.1. Language

9.2. Multimodality

10. Conclusion

  • NOTE: Summarizes the key contributions of Llama 3 and discusses future directions for research.

11. Contributors and Acknowledgements

  • NOTE: Lists the core contributors and acknowledges various individuals and teams involved in the development of Llama 3.

References

Author: AI@Meta Llama Team
Title: The Llama 3 Herd of Models
Year: 2024