2025 DeepSeekR1IncentivizingReasonin
- (DeepSeek-AI, 2025) ⇒ DeepSeek-AI. (2025). “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.”
Subject Headings: DeepSeek-R1, DeepSeek-R1-Zero, Reinforcement Learning for LLM Reasoning, Reasoning Model Distillation.
Notes
- The paper introduces DeepSeek-R1, a reasoning-focused language model developed through reinforcement learning (RL).
- The paper demonstrates two major innovations: DeepSeek-R1-Zero (pure RL without supervised fine-tuning) and DeepSeek-R1 (RL with cold-start data).
- The paper reports a focus on outcome-based RL, where rewards are granted solely based on the correctness of the final answer, rather than scoring each intermediate reasoning step as process-based reward approaches do (see the outcome-reward sketch at the end of these notes).
- The paper shows DeepSeek-R1 achieves performance comparable to OpenAI's o1-1217, with 79.8% on AIME 2024 and 97.3% on MATH-500.
- The paper reveals how DeepSeek-R1-Zero developed sophisticated emergent behaviors, such as stepwise self-verification, reflection on earlier steps, and backtracking from errors, demonstrating meta-reasoning without explicit SFT supervision.
- The paper documents challenges with DeepSeek-R1-Zero, including poor readability and language mixing, which were addressed in DeepSeek-R1.
- The paper details a multi-stage training pipeline combining cold-start data, RL, rejection sampling, and supervised fine-tuning.
- The paper demonstrates successful distillation of reasoning capabilities to smaller models, with the 32B version outperforming many larger models.
- The paper identifies limitations in engineering tasks, prompt sensitivity, and non-English/Chinese language handling.
- The paper provides extensive benchmark results across multiple domains including mathematics, coding, and general knowledge.
- The paper describes unsuccessful attempts with Process Reward Models and Monte Carlo Tree Search, offering insights for future research.
- The paper highlights the use of GRPO (Group Relative Policy Optimization), an RL algorithm that replaces PPO's learned critic with a baseline computed from a group of sampled responses, reducing the cost of RL training on multi-step reasoning tasks (see the GRPO sketch at the end of these notes).
- The paper emphasizes that strong reasoning does not require massive scale: dense models distilled from DeepSeek-R1 (which is built on the 671B-parameter DeepSeek-V3-Base) retain strong reasoning performance at sizes from 1.5B to 70B parameters.
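The outcome-based reward noted above can be illustrated with a minimal rule-based sketch: only the final answer (plus a simple format check) is scored, never the intermediate reasoning steps. The tag names, weights, and answer-matching rule below are illustrative assumptions, not the paper's exact implementation.

```python
import re

def outcome_reward(response: str, gold_answer: str) -> float:
    """Rule-based, outcome-only reward: score the final answer, not the reasoning steps."""
    reward = 0.0
    # Format reward: reasoning should be wrapped in <think>...</think> tags.
    if re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        reward += 0.1  # illustrative weight
    # Accuracy reward: check only the answer text after the reasoning block.
    final_part = re.split(r"</think>", response, maxsplit=1)[-1]
    if gold_answer.strip() in final_part:
        reward += 1.0
    return reward

print(outcome_reward("<think>2x = 8, so x = 4.</think> The answer is 4.", "4"))  # 1.1
```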
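For the GRPO note above, a minimal sketch of the group-relative advantage that replaces PPO's learned critic: sample a group of responses per prompt, score each with an outcome reward, and normalize the rewards within the group. The surrounding sampling and clipped policy-update machinery is omitted.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: A_i = (r_i - mean(rewards)) / std(rewards), per sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1e-8  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# Example: outcome rewards for four sampled responses to one prompt.
print(group_relative_advantages([1.1, 0.1, 1.1, 0.0]))
```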
Cited By
Quotes
Abstract
We introduce our first-generation Reasoning Models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a base model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the Research Community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six Dense Models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.
1. Introduction of DeepSeek-R1
- "In this work, we share our journey in enhancing model reasoning abilities through reinforcement learning. DeepSeek-R1-Zero represents a pure RL approach without relying on cold-start data, achieving strong performance across various tasks. DeepSeek-R1 is more powerful, leveraging cold-start data alongside iterative rl fine-tuning."
2. Major Innovations
- "DeepSeek-R1-Zero, a base model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities... To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL."
- "DeepSeek-R1 achieves a score of 79.8% Pass@1 on AIME 2024, slightly surpassing OpenAI-o1-1217. On MATH-500, it attains an impressive score of 97.3%, performing on par with OpenAI-o1-1217 and significantly outperforming other base models."
- "One of the most remarkable aspects of this self-evolution is the emergence of sophisticated behavior patterns as the test-time computation increases. Behavior Patterns such as reflection—where the base model revisits and reevaluates its previous steps—and the exploration of alternative approaches to problem solving arise spontaneously."
5. Challenges with R1-Zero
- "Although DeepSeek-R1-Zero exhibits strong reasoning capabilities and autonomously develops unexpected and powerful reasoning behaviors, it faces several issues. For instance, DeepSeek-R1-Zero struggles with challenges like poor readability and language mixing."
- "The pipeline system consists of four stages... cold start... reasoning-oriented reinforcement learning... rejection sampling and supervised fine-tuning... reinforcement learning for all scenarios."
- "As shown in Table 5, simply distilling DeepSeek-R1's outputs enables the efficient DeepSeek-R1-7B to outperform non-Reasoning Models like GPT-4o-0513 across the board. DeepSeek-R1-14B surpasses QwQ-32B-Preview on all evaluation metrics."
8. Limitations Identified
- "Currently, the capabilities of DeepSeek-R1 fall short of DeepSeek-V3 in tasks such as function calling, multi-turn, complex role-playing, and JSON output... DeepSeek-R1 is currently optimized for Chinese and English, which may result in language mixing issues when handling queries in other languages."
- "We evaluate base models on MMLU, MMLU-Redux, MMLU-Pro, C-Eval, CMMLU, IFEval, FRAMES, GPQA Diamond, SimpleQA, C-SimpleQA, SWE-Bench Verified, Aider, LiveCodeBench, Codeforces, Chinese National High School Mathematics Olympiad (CNMO 2024), and American Invitational Mathematics Examination 2024 (AIME 2024)."
10. Failed Attempts
- "In the early stages of developing DeepSeek-R1, we also encountered failures and setbacks along the way. We share our failure experiences here to provide insights... Process Reward Model (PRM)... Monte Carlo Tree Search (MCTS)."
References
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2025 DeepSeekR1IncentivizingReasonin | DeepSeek-AI | | 2025 | DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | | | | | | 2025 |