2024 MLEBenchEvaluatingMachineLearni
- (Chan, Chowdhury et al., 2024) ⇒ Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Mądry. (2024). “MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering.”
Subject Headings: MLE-Bench, Autonomous AI Agent Evaluation, Machine Learning Engineering Task Evaluation, AI System Benchmarking, Resource Utilization in AI, Scaling Effects in AI Models.
Notes
- The paper introduces MLE-bench, a benchmark for evaluating Autonomous AI Agents on complex Machine Learning Engineering (MLE) tasks, built from 75 curated Kaggle competitions.
- The paper designs and implements a novel Benchmark for AI Systems, measuring agent capabilities in training, debugging, and optimizing machine learning models.
- The paper explores the impact of Scaling and Resource Utilization on AI agents by varying time limits, compute resources, and evaluation strategies.
- The paper addresses the risks of Dataset Contamination and Overfitting by introducing obfuscation techniques and analyzing the correlation between model familiarity and task performance.
- The paper highlights the strengths and limitations of language models in Autonomous Code Generation and debugging, showing that even frontier models struggle with iterative refinement.
- The paper utilizes multiple Agent Scaffolding Techniques, such as AIDE and MLAB, to evaluate and optimize AI agents’ performance in structured competition environments.
- The paper conducts a Comparative Analysis of Language Models in MLE, demonstrating that OpenAI’s o1-preview outperforms GPT-4o, Llama 3.1, and Claude 3.5 on practical engineering tasks.
- The paper evaluates the role of AI in solving Data Science and ML Competitions by positioning AI agents as competitors in 75 high-impact Kaggle challenges, grading their submissions against each competition's public leaderboard (a minimal medal-grading sketch follows this list).
- The paper discusses the Ethics and Safety in Autonomous AI Engineering, highlighting potential risks of rapidly advancing AI agents in industrial and research settings.
- The paper incorporates Plagiarism Detection and Model Integrity tools, such as Dolos, to ensure that agent solutions are not derived from pre-existing Kaggle notebooks.
- The paper underscores the potential for AI to Accelerate Scientific and Industrial Research by automating routine tasks in fields like healthcare, chemistry, and material science.
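As referenced above, agents in this setup are scored against each competition's Kaggle leaderboard and credited with a medal when they clear the corresponding threshold. The following is a minimal sketch of that kind of medal check, not code from the MLE-bench repository: the function names (`leaderboard_rank`, `earns_bronze`) and the flat 40% bronze cutoff are illustrative assumptions, whereas Kaggle's actual thresholds vary with the number of competing teams.

```python
from typing import List

def leaderboard_rank(agent_score: float,
                     leaderboard_scores: List[float],
                     higher_is_better: bool = True) -> int:
    """1-indexed rank the agent's score would earn on a Kaggle-style leaderboard."""
    if higher_is_better:
        better = sum(1 for s in leaderboard_scores if s > agent_score)
    else:
        better = sum(1 for s in leaderboard_scores if s < agent_score)
    return better + 1

def earns_bronze(agent_score: float,
                 leaderboard_scores: List[float],
                 higher_is_better: bool = True,
                 bronze_fraction: float = 0.40) -> bool:
    """Illustrative bronze check: award a medal if the agent lands in the top
    `bronze_fraction` of teams (Kaggle's real cutoffs depend on team count)."""
    rank = leaderboard_rank(agent_score, leaderboard_scores, higher_is_better)
    cutoff = max(1, int(bronze_fraction * len(leaderboard_scores)))
    return rank <= cutoff

if __name__ == "__main__":
    # Toy private leaderboard of ten human teams (higher score is better).
    humans = [0.91, 0.89, 0.88, 0.86, 0.85, 0.83, 0.80, 0.78, 0.75, 0.70]
    print(earns_bronze(0.87, humans))  # True: rank 4 of 10 falls within the top 40%
```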
Cited By
Quotes
Abstract
We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup--OpenAI's o1-preview with AIDE scaffolding--achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code (https://github.com/openai/mle-bench) to facilitate future research in understanding the ML engineering capabilities of AI agents.
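The headline figure in the abstract (a bronze medal or better in 16.9% of competitions) is an aggregate over per-competition medal outcomes, averaged across repeated runs. Below is a minimal sketch of that kind of aggregation using made-up data; the names `any_medal_rate` and `MEDAL_RANKS` are hypothetical and not taken from the MLE-bench codebase.

```python
from statistics import mean

# Ordering of outcomes; None means the run earned no medal in that competition.
MEDAL_RANKS = {"gold": 3, "silver": 2, "bronze": 1, None: 0}

def any_medal_rate(results_by_seed, at_least="bronze"):
    """Fraction of competitions in which the agent reaches at least `at_least`,
    averaged over seeds (a 'bronze or better'-style headline metric)."""
    threshold = MEDAL_RANKS[at_least]
    per_seed = []
    for run in results_by_seed:  # run: dict of competition id -> medal or None
        hits = sum(1 for medal in run.values() if MEDAL_RANKS[medal] >= threshold)
        per_seed.append(hits / len(run))
    return mean(per_seed)

if __name__ == "__main__":
    # Two toy seeds over four competitions (values invented for illustration).
    seeds = [
        {"comp-a": "bronze", "comp-b": None, "comp-c": "silver", "comp-d": None},
        {"comp-a": None, "comp-b": None, "comp-c": "gold", "comp-d": None},
    ]
    print(f"{any_medal_rate(seeds):.1%}")  # 37.5% of attempts reach bronze or better
```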
References
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2024 MLEBenchEvaluatingMachineLearni | Jun Shern Chan; Neil Chowdhury; Oliver Jaffe; James Aung; Dane Sherburn; Evan Mays; Giulio Starace; Kevin Liu; Leon Maksin; Tejal Patwardhan; Lilian Weng; Aleksander Mądry | | | MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering | | | | | | 2024 |