2024 Qwen25CoderTechnicalReport
- (Hui, Yang et al., 2024) ⇒ Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, Junyang Lin, et al. (2024). “Qwen2.5 Coder Technical Report.” doi:10.48550/arXiv.2409.12186
Subject Headings: Software Programming-Focused LLM, Qwen2.5 Coder LLM.
Notes
- Code LLM Data Mixing Ratios: The proportional balance of different data types in code LLM training; the report finds a 70% code : 20% text : 10% math mix to work best (see the mixing-ratio sketch after this list). Notable insight: higher code ratios don't necessarily improve performance; balance is crucial. Connected concepts: LLM Training Data Quality, Code-Specialized LLM Model Performance, LLM Model Domain Transfer.
- Repo-Level Code Understanding: Capability of LLMs to process and understand entire code repositories rather than just individual files. Involves special tokens and 128K context length support. Key implementation: file relationship tokens and repository structure awareness (see the repo-formatting sketch after this list). Connected concepts: LLM Context Length, Code Repository Structure, Cross-File Code Completion.
- Code LLM Architecture Scaling: Systematic patterns in how code LLM architectures change across different model sizes (0.5B-32B). Key patterns: hidden size scaling, attention head ratios, and intermediate size relationships. Connected concepts: LLM Architecture Design, Model Parameter Scaling, Attention Head Configuration.
- Code Instruction Tuning Pipeline: Multi-stage process for converting base code LLMs into instruction-following assistants. Includes synthetic data generation, checklist-based evaluation (see the checklist-scoring sketch after this list), and DPO alignment. Connected concepts: LLM Instruction Tuning, Direct Preference Optimization, Code Quality Assessment.
- Code Model Decontamination Strategy: Methodical approach to preventing test-set leakage in code LLM evaluation, using 10-gram overlap detection and benchmark filtering (see the 10-gram decontamination sketch after this list). Connected concepts: LLM Evaluation Integrity, Test Set Contamination, Code Benchmark Design.
- Three-Stage Code LLM Training: Progressive training strategy moving from file-level to repo-level to instruction tuning, enabling comprehensive code understanding. Connected concepts: LLM Pretraining Strategy, Repo-Level Code Understanding, Code Instruction Tuning.
- Code Quality Validation Framework: Multi-component system for assessing generated code quality through static analysis, runtime verification, and style checking (see the quality-gate sketch after this list). Connected concepts: Code Static Analysis, Runtime Verification, Code Style Assessment.
- Multilingual Code Generation Capability: Design patterns enabling a single LLM to generate code across 40+ programming languages with consistent quality. Connected concepts: Cross-Language Code Generation, Programming Language Tokens, Language-Specific Benchmarking.
- Code Context Length Extension: Technical approaches for extending code LLM context from 8K to 128K tokens using RoPE and YaRN mechanisms (see the YaRN configuration sketch after this list). Connected concepts: RoPE Position Embedding, YaRN Scaling, Long Context Processing.
- Code LLM Evaluation Matrix: Comprehensive framework for assessing code LLM capabilities across generation, completion, reasoning, and editing tasks. Connected concepts: Code Generation Metrics, Code Completion Assessment, Code Reasoning Evaluation.
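The following is a minimal sketch of how the reported 70:20:10 code:text:math mix could be turned into per-source token budgets. The function name and the total-token figure are illustrative, not taken from the report.

```python
# Minimal sketch: split a pretraining token budget by the 70/20/10 mix.
MIX_RATIOS = {"code": 0.70, "text": 0.20, "math": 0.10}

def token_budgets(total_tokens: int, ratios: dict[str, float]) -> dict[str, int]:
    """Split a total pretraining token budget according to the mixing ratios."""
    assert abs(sum(ratios.values()) - 1.0) < 1e-9, "ratios must sum to 1"
    return {source: int(total_tokens * share) for source, share in ratios.items()}

# Example: an illustrative 5 trillion-token budget (not the report's exact figure).
print(token_budgets(5_000_000_000_000, MIX_RATIOS))
```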
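Below is a hedged sketch of assembling a repository into a single repo-level training sequence. The `<|repo_name|>` and `<|file_sep|>` special tokens follow the repo-level format described in the report; treat the exact strings and the whitespace layout as assumptions of this sketch.

```python
# Sketch: concatenate a repository's files into one repo-level sample.
def format_repo_sample(repo_name: str, files: list[tuple[str, str]]) -> str:
    """Join (path, content) pairs into a single repo-level training sequence."""
    parts = [f"<|repo_name|>{repo_name}"]
    for path, content in files:
        parts.append(f"<|file_sep|>{path}\n{content}")
    return "\n".join(parts)

sample = format_repo_sample(
    "example/project",
    [
        ("utils.py", "def add(a, b):\n    return a + b\n"),
        ("main.py", "from utils import add\nprint(add(1, 2))\n"),
    ],
)
print(sample)
```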
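The checklist-based evaluation step in the instruction-tuning pipeline can be pictured as scoring each candidate response against binary criteria and filtering on the average. The criteria, threshold, and function names below are hypothetical illustrations, not the report's actual checklist.

```python
# Hypothetical checklist scoring for instruction-tuning data curation.
CHECKLIST = [
    "follows the stated instruction",
    "code parses or compiles",
    "includes a usage example",
    "explanation matches the code",
]

def checklist_score(judgments: dict[str, bool]) -> float:
    """Average of binary checklist judgments (e.g., produced by an LLM judge)."""
    return sum(judgments[item] for item in CHECKLIST) / len(CHECKLIST)

def keep_for_sft(judgments: dict[str, bool], threshold: float = 0.75) -> bool:
    """Retain a candidate for supervised fine-tuning only above the threshold."""
    return checklist_score(judgments) >= threshold

# Example usage with made-up judgments.
example = {item: True for item in CHECKLIST}
example["includes a usage example"] = False
print(checklist_score(example), keep_for_sft(example))  # 0.75 True
```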
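A simple way to read the 10-gram decontamination rule: drop any training document whose word-level 10-grams overlap a benchmark test set. The whitespace tokenization below is a simplification; the report does not specify this exact implementation.

```python
# Sketch of 10-gram overlap decontamination against benchmark test sets.
def ngrams(text: str, n: int = 10) -> set:
    """Word-level n-grams after lowercasing and whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(doc: str, benchmark_ngrams: set, n: int = 10) -> bool:
    """A document is contaminated if it shares any n-gram with the benchmarks."""
    return not ngrams(doc, n).isdisjoint(benchmark_ngrams)

def decontaminate(corpus: list, benchmark_texts: list, n: int = 10) -> list:
    """Drop every corpus document that overlaps the benchmark test sets."""
    bench = set()
    for text in benchmark_texts:
        bench |= ngrams(text, n)
    return [doc for doc in corpus if not is_contaminated(doc, bench, n)]
```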
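The quality validation idea can be sketched as a three-part gate: a static parse check, a simple style check, and runtime verification against bundled tests. The concrete checks here are stand-ins for whatever the report's framework actually uses.

```python
# Illustrative three-stage quality gate for generated code.
import ast
import os
import subprocess
import sys
import tempfile

def static_check(code: str) -> bool:
    """Static analysis stage: reject code that does not even parse."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def style_check(code: str, max_line_len: int = 120) -> bool:
    """Toy style stage: a stand-in for a real linter."""
    return all(len(line) <= max_line_len for line in code.splitlines())

def runtime_check(code: str, tests: str, timeout: float = 10.0) -> bool:
    """Runtime verification stage: run the code plus its tests in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

def validate(code: str, tests: str) -> bool:
    """Accept a sample only if all three stages pass."""
    return static_check(code) and style_check(code) and runtime_check(code, tests)
```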
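For the context-length extension, here is a hedged sketch of enabling YaRN-style RoPE scaling when loading a checkpoint with Hugging Face transformers. The factor of 4.0 corresponds to stretching a 32K RoPE window toward roughly 128K positions; the exact config keys follow the rope_scaling convention used in Qwen2.5 model cards and should be verified against the specific checkpoint.

```python
# Hedged sketch: switch on YaRN RoPE scaling for long-context inference.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")
config.rope_scaling = {
    "type": "yarn",
    "factor": 4.0,                                # 32K x 4 ~= 128K positions
    "original_max_position_embeddings": 32768,
}
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    config=config,
)
```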
Cited By
Quotes
Abstract
In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes six models: Qwen2.5-Coder-(0.5B/1.5B/3B/7B/14B/32B). As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continues to be pretrained on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities while retaining general and math skills. These models have been evaluated on a wide range of code-related tasks, achieving state-of-the-art (SOTA) performance across more than 10 benchmarks, including code generation, completion, reasoning, and repair, consistently outperforming larger models of the same model size. We believe that the release of the Qwen2.5-Coder series will advance research in code intelligence and, with its permissive licensing, support wider adoption by developers in real-world applications.
References