2025 DeepSeekR1IncentivizingReasonin


Subject Headings:

Notes

  1. The paper introduces DeepSeek-R1, a reasoning-focused language model developed through reinforcement learning (RL).
  2. The paper demonstrates two major innovations: DeepSeek-R1-Zero (pure RL without supervised fine-tuning) and DeepSeek-R1 (RL with cold-start data).
  3. The paper reports on their focus on outcome-based RL, where rewards are granted solely based on the correctness of the final answer (unlike OpenAI's process-based RL, which likely rewards each reasoning step); a minimal reward-function sketch follows this list.
  4. The paper shows DeepSeek-R1 achieves performance comparable to OpenAI's o1-1217, with 79.8% on AIME 2024 and 97.3% on MATH-500.
  5. The paper reveals how DeepSeek-R1-Zero developed sophisticated emergent behaviors, such as stepwise self-verification, reflection on earlier steps, and error backtracking, demonstrating meta-reasoning without explicit SFT supervision.
  6. The paper documents challenges with DeepSeek-R1-Zero, including poor readability and language mixing, which were addressed in DeepSeek-R1.
  7. The paper details a multi-stage training pipeline combining cold-start data, RL, rejection sampling, and supervised fine-tuning.
  8. The paper demonstrates successful distillation of reasoning capabilities into smaller models, with the distilled 32B model outperforming many larger models; a distillation-example sketch follows this list.
  9. The paper identifies limitations in software-engineering tasks, prompt sensitivity, and the handling of languages other than English and Chinese.
  10. The paper provides extensive benchmark results across multiple domains including mathematics, coding, and general knowledge.
  11. The paper describes unsuccessful attempts with Process Reward Models and Monte Carlo Tree Search, offering insights for future research.
  12. The paper highlights the use of Group Relative Policy Optimization (GRPO), an RL algorithm that forgoes PPO's separate critic model and instead estimates the baseline from the scores of a group of sampled responses, reducing the cost of large-scale RL for multi-step reasoning; a group-relative advantage sketch follows this list.
  13. The paper emphasizes the scalability of its RL-and-distillation framework, showing reasoning gains from the 671B-parameter DeepSeek-R1 down to distilled models as small as 1.5B, challenging the assumption that strong reasoning capability requires massive scale.
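
The outcome-based reward described in note 3 can be sketched minimally as a rule-based check on the final answer plus a format check on the paper's `<think>`/`<answer>` prompt template. This is an illustrative sketch, not DeepSeek's implementation; the exact matching rules and the 1.0/0.0 reward values are assumptions.

```python
import re

def outcome_reward(response: str, reference_answer: str) -> float:
    """Rule-based outcome reward: score only the final answer, never the steps.

    Minimal sketch (not DeepSeek's code). Assumes the paper's prompt template,
    in which reasoning is wrapped in <think>...</think> and the answer in
    <answer>...</answer>; the reward weights here are illustrative.
    """
    # Format reward: the response must follow the think-then-answer template.
    has_format = re.search(r"<think>.*?</think>\s*<answer>.*?</answer>",
                           response, flags=re.DOTALL) is not None
    format_reward = 1.0 if has_format else 0.0

    # Accuracy reward: compare only the extracted final answer to the reference.
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    final_answer = match.group(1).strip() if match else ""
    accuracy_reward = 1.0 if final_answer == reference_answer.strip() else 0.0

    # No per-step (process) reward is granted; only the outcome is scored.
    return accuracy_reward + format_reward
```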
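
Note 12's GRPO can be illustrated by its core step: the rewards of a group of responses sampled for the same prompt are normalized by the group mean and standard deviation, so no learned critic is needed. A minimal sketch with illustrative names, not the authors' code:

```python
from statistics import mean, pstdev

def group_relative_advantages(group_rewards: list[float]) -> list[float]:
    """GRPO-style advantages: baseline each reward against its own group.

    For one prompt, several responses are sampled and scored (e.g. with an
    outcome reward like the sketch above); each advantage is the reward minus
    the group mean, divided by the group standard deviation. The group
    statistics play the role of PPO's critic/value model.
    """
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    if sigma == 0.0:  # every response scored the same; no learning signal
        return [0.0 for _ in group_rewards]
    return [(r - mu) / sigma for r in group_rewards]

# Example: four sampled responses, two correct with proper format (2.0),
# one well-formatted but wrong (1.0), one unformatted and wrong (0.0).
print(group_relative_advantages([2.0, 2.0, 1.0, 0.0]))
```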
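
The distillation in note 8 amounts to plain supervised fine-tuning of smaller Qwen and Llama base models on reasoning traces generated by DeepSeek-R1. The sketch below only shows how one such training example might be assembled; the field names and exact formatting are assumptions, not the released data format.

```python
def build_distillation_example(question: str,
                               teacher_trace: str,
                               teacher_answer: str) -> dict:
    """Assemble one SFT example from a teacher (DeepSeek-R1) generation.

    Hypothetical format: the student model (e.g. a Qwen or Llama base model)
    is fine-tuned to reproduce the teacher's full reasoning trace and final
    answer; no RL is applied to the student.
    """
    completion = (
        f"<think>\n{teacher_trace}\n</think>\n"
        f"<answer>\n{teacher_answer}\n</answer>"
    )
    return {"prompt": question, "completion": completion}
```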


Cited By

Quotes

Abstract

We introduce our first-generation Reasoning Models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a base model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the Research Community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six Dense Models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.


1. Introduction of DeepSeek-R1

2. Major Innovations

3. Performance Comparison

4. Behavior Development

5. Challenges with R1-Zero

6. Multi-Stage Pipeline

7. Model Distillation

8. Limitations Identified

9. Benchmark Categories

10. Failed Attempts

References

DeepSeek-AI (2025). "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning."