2018 ASurveyofMachineLearningforBigC
- (Allamanis et al., 2018) ⇒ Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. (2018). “A Survey of Machine Learning for Big Code and Naturalness.” In: ACM Computing Surveys (CSUR) Journal, 51(4). doi:10.1145/3212695
Subject Headings: Code Auto-Completion Task; Naturalness Hypothesis.
Notes
Cited By
- Google Scholar: ~ 111 Citations (Retrieved 2019-10-11).
- ACM DL: ~ 16 Citations (Retrieved 2019-10-11).
- Semantic Scholar: ~ 93 Citations (Retrieved 2019-10-11).
Quotes
Abstract
Research at the intersection of machine learning, programming languages, and software engineering has recently taken important steps in proposing learnable probabilistic models of source code that exploit the abundance of patterns of code. In this article, we survey this work. We contrast programming languages against natural languages and discuss how these similarities and differences drive the design of probabilistic models. We present a taxonomy based on the underlying design principles of each model and use it to navigate the literature. Then, we review how researchers have adapted these models to application areas and discuss cross-cutting and application-specific challenges and opportunities.
1. Introduction
2. The Naturalness Hypothesis
3. Text, Code And Machine Learning
4. Probabilistic Models Of Code
4.1 Code-generating Probabilistic Models of Source Code
4.2 Representational Models of Source Code
4.3 Pattern Mining Models of Source Code
5. Applications
5.1 Recommender Systems
5.2 Inferring Coding Conventions
5.3 Code Defects
5.4 Code Translation, Copying, and Clones
5.5 Code to Text and Text to Code
5.6 Documentation, Traceability and Information Retrieval
5.7 Program Synthesis
5.8 Program Analysis
6. Challenges And Future Directions
6.1 The Third Wave of Machine Learning
6.2 New Domains
(...)
Code Completion and Synthesis. Code completion and synthesis using machine learning are two heavily researched and interrelated areas. Despite this fact, to our knowledge, there has been no full-scale comparison between LM-based [87, 144, 166] and structured prediction-based autocompletion models [33, 159]. Although both types of systems target the same task, the lack of a well-accepted benchmark, evaluation methodology, and metrics has led to the absence of a quantitative comparison that highlights the strengths and weaknesses of each approach. This highlights the necessity of widely accepted, high-quality benchmarks, shared tasks, and evaluation metrics that can lead to comparable and measurable improvements to tasks of interest. NLP and computer vision follow such a paradigm with great success[1].
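As an illustration only (not taken from the survey), the following minimal sketch shows what an LM-based completer and a shared evaluation metric might look like. It assumes a bigram token model and mean reciprocal rank (MRR) as the metric; the function names `train_bigram`, `suggest`, and `mean_reciprocal_rank` are hypothetical, and real systems in the cited work use far richer models.

```python
from collections import Counter, defaultdict


def train_bigram(token_sequences):
    """Count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for tokens in token_sequences:
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def suggest(counts, prev_token, k=3):
    """Return up to k candidate next tokens, most frequent first."""
    followers = counts.get(prev_token, Counter())
    return [tok for tok, _ in followers.most_common(k)]


def mean_reciprocal_rank(counts, token_sequences, k=3):
    """Average 1/rank of the true next token (0 if it is not in the top k)."""
    total, n = 0.0, 0
    for tokens in token_sequences:
        for prev, nxt in zip(tokens, tokens[1:]):
            ranked = suggest(counts, prev, k)
            if nxt in ranked:
                total += 1.0 / (ranked.index(nxt) + 1)
            n += 1
    return total / n if n else 0.0


# Toy training corpus of pre-tokenized code (an assumption for illustration).
train = [
    ["for", "i", "in", "range", "(", "n", ")"],
    ["for", "i", "in", "items"],
    ["for", "j", "in", "range", "(", "m", ")"],
]
model = train_bigram(train)
held_out = [["for", "i", "in", "items"]]
score = mean_reciprocal_rank(model, held_out)
```

Because the metric only needs a ranked list of suggestions, any completer exposing a `suggest`-like interface could be scored on the same held-out tokens, which is the kind of like-for-like comparison the passage says is missing.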
Omar et al. [149] discuss the challenges that arise from the fact that program editors usually deal with incomplete, partial programs. Although they discuss how formal semantics can extend to these cases, inherently any reasoning about partial code requires reasoning about the programmer’s intent. Lu et al. [125] used information-retrieval methods for synthesizing code completions showing that simply retrieving snippets from “big code” can be useful when reasoning about code completion, even without a learnable probabilistic component. This suggests a fruitful area for probabilistic models of code that can assist editing tools when reasoning about incomplete code’s semantics, by modeling how code could be completed.
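The retrieval idea can be sketched in a few lines. This is a hedged toy example, not the method of Lu et al.: it assumes a crude regex tokenizer and Jaccard token overlap as the similarity measure, and the names `tokenize`, `jaccard`, and `retrieve_completion` are hypothetical.

```python
import re


def tokenize(code):
    """Split code into identifiers, numbers, and single symbols (a crude lexer)."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_completion(partial_code, corpus):
    """Return the corpus snippet whose tokens overlap most with the partial code."""
    query = tokenize(partial_code)
    return max(corpus, key=lambda snippet: jaccard(query, tokenize(snippet)))


# Toy "big code" corpus (an assumption for illustration).
corpus = [
    "with open(path) as f:\n    data = f.read()",
    "for i in range(n):\n    total += i",
]
best = retrieve_completion("with open(path", corpus)
```

Note that the query here is an incomplete, unparseable fragment; a token-level retriever tolerates that, whereas reasoning about the fragment's semantics would require the kind of modeling the passage describes.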
(...)
7. Related Research Areas
8. Conclusions
Probabilistic models of source code have exciting potential to support new tools in almost every area of program analysis and software engineering. We reviewed existing work in the area, presenting a taxonomy of probabilistic machine learning source code models and their applications. The reader may appreciate that most of the research contained in this review was conducted within the past few years, indicating a growth of interest in this area among the machine learning, programming languages and software engineering communities. Probabilistic models of source code raise the exciting opportunity of learning from existing code, probabilistically reasoning about new source code artifacts and transferring knowledge between developers and projects.
Footnotes
- ↑ See https://qz.com/1034972/ for a popular account of the effect of large-scale datasets in computer vision.
References
Page | Author | title | doi | year |
---|---|---|---|---|
2018 ASurveyofMachineLearningforBigC | Charles Sutton, Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu | A Survey of Machine Learning for Big Code and Naturalness | 10.1145/3212695 | 2018 |