2025 LLMPostTrainingADeepDiveIntoRea
- (Kumar et al., 2025) ⇒ Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip H. S. Torr, Salman Khan, and Fahad Shahbaz Khan. (2025). “LLM Post-Training: A Deep Dive Into Reasoning Large Language Models.” doi:10.48550/arXiv.2502.21321
Subject Headings: LLM Training Task, LLM Training Method, LLM Training System.
Notes
- Post-Training LLM Algorithm Taxonomy: The paper establishes a clear taxonomy of post-training algorithms (Figure 1), showing how LLM training extends beyond initial pre-training to include supervised fine-tuning (SFT), reinforcement learning (PPO, DPO, GRPO), and test-time scaling, covering the full optimization lifecycle of LLM parameters.
- Parameter-Efficient Training Algorithms: The paper's coverage of LoRA, QLoRA, and adapter methods (Section 4.7 and Table 2) illustrates how modern LLM training algorithms can optimize a small subset of parameters rather than all weights, directly supporting the categorization of "Parameter-Efficient Training Algorithms" (see the LoRA sketch after this list).
- Reinforcement Learning for Sequential Decision-Making: The paper's explanation of how RL algorithms (Sections 3.1-3.2) adapt to token-by-token generation frames LLM training as a sequential decision process with specialized advantage functions and credit-assignment mechanisms, going beyond standard supervised gradient-descent objectives (see the group-relative advantage sketch after this list).
- Process vs. Outcome Reward Optimization: The comparison between Process Reward Models (PRMs) and Outcome Reward Models (ORMs) (Sections 3.1.3-3.1.4) demonstrates a distinctive aspect of LLM training algorithms: optimization can target either intermediate reasoning steps or only the final output (see the PRM/ORM contrast after this list).
- Hybrid Training-Inference Algorithms: The paper's extensive coverage of test-time scaling methods (Section 5) shows that modern LLM optimization can span the traditional training-inference boundary, with techniques such as Monte Carlo Tree Search and Chain-of-Thought prompting improving reasoning by allocating additional compute at inference time rather than by updating model parameters (see the best-of-N sketch after this list).
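The following minimal sketch (not from the paper; names, dimensions, and initialization are illustrative) shows the core LoRA idea referenced above: a frozen pretrained weight matrix is augmented with a trainable low-rank update, so only a small fraction of the parameters is optimized during fine-tuning.

```python
import numpy as np

# Minimal LoRA-style forward pass (illustrative sketch, not the paper's code).
# The frozen weight W is augmented with a trainable low-rank update A @ B,
# so only r*(d_in + d_out) parameters are trained instead of d_in*d_out.

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 512, 512, 8, 16

W = rng.normal(size=(d_in, d_out))          # frozen pretrained weight
A = rng.normal(scale=0.01, size=(d_in, r))  # trainable low-rank factor
B = np.zeros((r, d_out))                    # trainable, zero-init so the update starts at zero

def lora_forward(x):
    """y = x W + (alpha/r) * x A B; only A and B would receive gradients."""
    return x @ W + (alpha / r) * (x @ A) @ B

x = rng.normal(size=(4, d_in))              # a batch of 4 hidden states
y = lora_forward(x)
full, low_rank = d_in * d_out, r * (d_in + d_out)
print(f"trainable params: {low_rank} vs full fine-tuning: {full} "
      f"({100 * low_rank / full:.1f}%)")
```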
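A minimal sketch of the group-relative advantage used by GRPO-style methods, assuming one scalar reward per sampled completion; the helper names are hypothetical, and broadcasting the sequence-level advantage to every token is the simplest possible credit-assignment choice, not the only one the survey discusses.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage (sketch): standardize each completion's reward
    against the group sampled for the same prompt, avoiding a learned critic."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def broadcast_to_tokens(advantage, num_tokens):
    """Simplest credit assignment: every generated token in a completion
    shares the sequence-level advantage (outcome-supervised setting)."""
    return np.full(num_tokens, advantage)

# Example: 4 completions sampled for one prompt, scored 0/1 for correctness.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)
print("per-completion advantages:", np.round(adv, 3))
print("per-token advantages (completion 0, 5 tokens):",
      broadcast_to_tokens(adv[0], 5))
```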
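A toy contrast between outcome and process reward scoring for a chain-of-thought solution split into steps; the step scorer is a stand-in for a learned reward model, not an API from the paper.

```python
from typing import Callable, List

def outcome_reward(steps: List[str], final_answer_correct: bool) -> float:
    """ORM: a single scalar judged only from the final output."""
    return 1.0 if final_answer_correct else 0.0

def process_reward(steps: List[str], step_scorer: Callable[[str], float]) -> float:
    """PRM: score every intermediate step, then aggregate (here: the mean),
    so sound intermediate reasoning earns credit even if the final step fails."""
    step_scores = [step_scorer(s) for s in steps]
    return sum(step_scores) / len(step_scores)

steps = ["Let x be the unknown.",
         "2x + 3 = 11, so 2x = 8.",
         "Therefore x = 5."]                      # last step is wrong (x should be 4)
toy_scorer = lambda s: 0.0 if "x = 5" in s else 1.0  # stand-in for a learned step verifier
print("ORM reward:", outcome_reward(steps, final_answer_correct=False))
print("PRM reward:", round(process_reward(steps, toy_scorer), 3))
```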
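Best-of-N sampling with verifier reranking is one of the simpler test-time scaling strategies in this family; the sketch below uses toy stand-ins for the generator and the verifier (MCTS-based search follows the same spend-more-compute-at-inference idea but with a tree policy rather than independent samples).

```python
import random
from typing import Callable

def best_of_n(generate: Callable[[str], str],
              score: Callable[[str, str], float],
              prompt: str,
              n: int = 8) -> str:
    """Best-of-N test-time scaling (sketch): sample n candidate responses and
    return the one a verifier/reward model scores highest. No parameter updates."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

# Toy stand-ins for a sampling LLM and a verifier (not real model calls).
random.seed(0)
def toy_generate(prompt: str) -> str:
    return f"answer={random.randint(0, 9)}"

def toy_score(prompt: str, candidate: str) -> float:
    return 1.0 if candidate == "answer=7" else random.random() * 0.5

print(best_of_n(toy_generate, toy_score, "What is 3 + 4?", n=8))
```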
Cited By
Quotes
Abstract
Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications. Pretraining on vast web-scale data has laid the foundation for these models, yet the research community is now increasingly shifting focus toward post-training techniques to achieve further breakthroughs. While pretraining provides a broad linguistic foundation, post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations. Fine-tuning, reinforcement learning, and test-time scaling have emerged as critical strategies for optimizing LLM performance, ensuring robustness, and improving adaptability across various real-world tasks. This survey provides a systematic exploration of post-training methodologies, analyzing their role in refining LLMs beyond pretraining, addressing key challenges such as catastrophic forgetting, reward hacking, and inference-time trade-offs. We highlight emerging directions in model alignment, scalable adaptation, and inference-time reasoning, and outline future research directions. We also provide a public repository to continually track developments in this fast-evolving field: this https URL.
References
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2025 LLMPostTrainingADeepDiveIntoRea | Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip H. S. Torr, Salman Khan, Fahad Shahbaz Khan | | | LLM Post-Training: A Deep Dive Into Reasoning Large Language Models | | | | 10.48550/arXiv.2502.21321 | | 2025 |