2024 TheAgentCompanyBenchmarkingLLMA
- (Xu et al., 2024) ⇒ Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, and Graham Neubig. (2024). “TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks.” doi:10.48550/arXiv.2412.14161
Subject Headings: TheAgentCompany, Agentic AI Benchmarking, Agentic AI Evaluation, Agentic AI-based System.
Notes
- LLM Agent Multi-Platform API Authentication: How TheAgentCompany manages agent authentication across GitLab, RocketChat, ownCloud, and Plane simultaneously. The paper demonstrates patterns for secure token management, session persistence, and handling multiple authentication contexts within a single agent execution environment (see the auth-context sketch after this list).
- Checkpoint-Based LLM Task Success Metrics: The paper's scoring system combines weighted partial completion (50%) with a full-completion bonus (50%). It covers checkpoint validation, progression tracking, and the formula for the final score, S_partial = 0.5 · (result / total) + 0.5 · S_full, where S_full is 1 only when every checkpoint is passed (a worked example follows this list).
- Enterprise Tool Simulation Using Docker Compose: TheAgentCompany's architecture for creating a reproducible corporate environment from containerized services. It details container networking, data-persistence layers, and reset mechanisms for keeping test conditions consistent across experiments (a reset-helper sketch appears after this list).
- LLM Token Cost vs Task Completion Analysis: Comprehensive analysis of token usage across different models (Claude, GPT-4, Gemini) in relation to task success rates. Provides concrete cost calculations, step counts, and completion percentages for various task types.
- Agent UI Navigation: Implementation details for robust web interaction using Playwright in corporate environments, including handling of dynamic UI elements, popup management, and error-recovery strategies demonstrated through concrete task examples (see the navigation sketch after this list).
- Corporate NPC Simulation: A framework for implementing realistic corporate personas, with Claude-3.5-Sonnet as the backbone model. Shows specific prompt-engineering techniques, conversation state management, and role-based knowledge boundaries (a persona-prompt sketch appears after this list).
- Software Company Task Taxonomy for LLM Testing: Structured categorization of the benchmark's 175 corporate tasks across domains including SDE, PM, HR, and Finance. Includes classification criteria, complexity metrics, and systematic task-design principles based on the O*NET database.
- LLM Agent Browser-Terminal Integration: Technical patterns for combining web browsing, terminal operations, and Python execution in a single agent. Details state management, context switching, and error handling between the different interaction modes (a minimal dispatcher sketch closes the examples after this list).
- Real-World LLM Benchmark Design: Methodology for creating reproducible corporate environment benchmarks. Covers data generation, task creation workflow, and implementation of realistic constraints based on actual corporate practices.
- LLM Agent Failure Analysis: Systematic approach to categorizing and analyzing agent failures in corporate tasks. Includes specific failure patterns (UI navigation, social interaction, technical execution), frequency analysis, and correlation with task types.
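The multi-platform authentication note above can be illustrated with a small sketch of per-service authentication contexts. This is a hypothetical simplification, not TheAgentCompany's actual credential handling: the intranet URLs, token placeholders, and the `ServiceAuth` helper are invented, and the header conventions are only indicative of how the real services pass tokens.

```python
# Hypothetical sketch: per-service authentication contexts for one agent run.
# URLs, token placeholders, and header names are illustrative.
from dataclasses import dataclass, field

import requests


@dataclass
class ServiceAuth:
    """One service's base URL, token, and a persistent HTTP session."""
    base_url: str
    token: str
    header: str = "Authorization"   # header that carries the token
    scheme: str = "Bearer"          # e.g. "Bearer <token>"; "" for raw tokens
    session: requests.Session = field(default_factory=requests.Session)

    def get(self, path: str, **kwargs):
        # Keep the token on the session so it persists across calls.
        self.session.headers[self.header] = f"{self.scheme} {self.token}".strip()
        return self.session.get(f"{self.base_url}{path}", timeout=30, **kwargs)


# One context per intranet service; the agent picks the right one per step.
AUTH_CONTEXTS = {
    "gitlab": ServiceAuth("http://gitlab.local/api/v4", "GITLAB_PAT",
                          header="PRIVATE-TOKEN", scheme=""),
    "rocketchat": ServiceAuth("http://rocketchat.local/api/v1", "RC_AUTH_TOKEN",
                              header="X-Auth-Token", scheme=""),
    "owncloud": ServiceAuth("http://owncloud.local", "OC_APP_PASSWORD"),
    "plane": ServiceAuth("http://plane.local/api/v1", "PLANE_API_KEY",
                         header="X-API-Key", scheme=""),
}

if __name__ == "__main__":
    # Example: list projects using only the GitLab context.
    print(AUTH_CONTEXTS["gitlab"].get("/projects").status_code)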
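For the checkpoint-based scoring note, here is a worked example of the partial-completion formula. The function name and the checkpoint values are illustrative; the formula is the one quoted in the note.

```python
def partial_completion_score(result: float, total: float) -> float:
    """S_partial = 0.5 * (result / total) + 0.5 * S_full,
    where S_full is 1 only if every checkpoint point is earned."""
    if total <= 0:
        raise ValueError("total checkpoint points must be positive")
    s_full = 1.0 if result >= total else 0.0
    return 0.5 * (result / total) + 0.5 * s_full


if __name__ == "__main__":
    # Task worth 4 checkpoint points, agent earns 3 of them:
    print(partial_completion_score(3, 4))  # 0.5 * 0.75 + 0.5 * 0 = 0.375
    # All checkpoints passed:
    print(partial_completion_score(4, 4))  # 0.5 * 1.00 + 0.5 * 1 = 1.0
```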
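For the Docker Compose note, a minimal sketch of an environment-reset helper built on the standard `docker compose` CLI: tear the stack down (including volumes) and bring it back up so each experiment starts from the same state. The compose file path and project name are assumptions, and this is not the benchmark's actual reset script.

```python
# Hypothetical environment-reset helper around the standard docker compose CLI.
# The compose file path and project name are placeholders.
import subprocess


def run(cmd: list[str]) -> None:
    """Run a command, echoing it first, and fail loudly on error."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


def reset_environment(compose_file: str = "docker-compose.yml",
                      project: str = "agent-company") -> None:
    base = ["docker", "compose", "-f", compose_file, "-p", project]
    # Stop containers and drop named volumes so stateful services
    # (GitLab, RocketChat, ownCloud, Plane) lose any agent-made changes.
    run(base + ["down", "--volumes", "--remove-orphans"])
    # Recreate the stack from images plus seeded data.
    run(base + ["up", "-d", "--wait"])


if __name__ == "__main__":
    reset_environment()
```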
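For the UI-navigation note, a small Playwright sketch showing retried navigation, dismissal of a popup if one appears, and recovery from timeouts. The URL and CSS selectors are placeholders; this is not the benchmark's actual browsing code.

```python
# Illustrative Playwright sketch: robust navigation with popup handling and
# timeout recovery. The URL and CSS selectors are placeholders.
from playwright.sync_api import TimeoutError as PlaywrightTimeout
from playwright.sync_api import sync_playwright


def open_board(url: str, retries: int = 2) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        for attempt in range(retries + 1):
            try:
                page.goto(url, wait_until="domcontentloaded", timeout=15_000)
                # Dismiss a blocking dialog if the app shows one on first load.
                close_button = page.locator("button:has-text('Close')")
                if close_button.count() > 0:
                    close_button.first.click()
                # Wait for a dynamically rendered element before reading it.
                page.wait_for_selector(".board-title", timeout=10_000)
                title = page.inner_text(".board-title")
                browser.close()
                return title
            except PlaywrightTimeout:
                if attempt == retries:
                    browser.close()
                    raise
                page.reload()
    return ""


if __name__ == "__main__":
    print(open_board("http://plane.local/projects/demo/boards/1"))
```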
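For the NPC-simulation note, a sketch of a role-scoped persona prompt sent through the Anthropic messages API. The persona, guardrail wording, and conversation-state handling are invented, and the model id is an assumption; check the paper for the exact backbone and prompting setup.

```python
# Illustrative NPC persona sketch using the Anthropic messages API.
# The persona, guardrails, and state handling are invented.
import anthropic

PERSONA_SYSTEM_PROMPT = """\
You are Sarah Chen, HR manager at a small software company.
Stay in character as a coworker; be brief and professional.
Role-based knowledge boundary: you know HR policies and the org chart,
but you do NOT know engineering details - redirect such questions."""


def npc_reply(history: list[dict], user_message: str) -> str:
    """Append the incoming chat message and return the NPC's reply,
    keeping the running history as the conversation state."""
    client = anthropic.Anthropic()           # reads ANTHROPIC_API_KEY from env
    history.append({"role": "user", "content": user_message})
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed model id
        max_tokens=300,
        system=PERSONA_SYSTEM_PROMPT,
        messages=history,
    )
    reply = response.content[0].text
    history.append({"role": "assistant", "content": reply})
    return reply


if __name__ == "__main__":
    state: list[dict] = []
    print(npc_reply(state, "Hi Sarah, who should approve my vacation request?"))
```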
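Finally, for the browser-terminal integration note, a minimal dispatcher that routes an agent's chosen action to a shell, Python, or (stubbed) browser executor and returns a uniform observation string. The action format and executor functions are hypothetical simplifications of what a real agent harness does.

```python
# Hypothetical action dispatcher: routes an agent step to a browser, shell,
# or Python executor and normalizes the observation that comes back.
import contextlib
import io
import subprocess


def run_shell(command: str) -> str:
    """Terminal mode: run a shell command and capture stdout + stderr."""
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    return proc.stdout + proc.stderr


def run_python(code: str) -> str:
    """Python mode: execute a snippet and capture what it prints."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})                       # isolated globals per call
    return buffer.getvalue()


def run_browser(url: str) -> str:
    """Browser mode: placeholder; a real harness would drive Playwright here."""
    return f"[would navigate to {url}]"


EXECUTORS = {"shell": run_shell, "python": run_python, "browse": run_browser}


def dispatch(action: dict) -> str:
    """Each action is {'mode': ..., 'arg': ...}; unknown modes and failures
    become observations the agent can see instead of crashing the run."""
    executor = EXECUTORS.get(action.get("mode"))
    if executor is None:
        return f"error: unknown mode {action.get('mode')!r}"
    try:
        return executor(action["arg"])
    except Exception as exc:
        return f"error: {exc}"


if __name__ == "__main__":
    print(dispatch({"mode": "shell", "arg": "echo hello"}))
    print(dispatch({"mode": "python", "arg": "print(2 + 2)"}))
    print(dispatch({"mode": "browse", "arg": "http://gitlab.local"}))
```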
Cited By
Quotes
Abstract
We interact with computers on an everyday basis, be it in everyday life or work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. But how performant are AI agents at helping to accelerate or even autonomously perform work-related tasks? The answer to this question has important implications for both industry looking to adopt AI into their workflows, and for economic policy to understand the effects that adoption of AI may have on the labor market. To measure the progress of these LLM agents' performance on performing real-world professional tasks, in this paper, we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that with the most competitive agent, 24% of the tasks can be completed autonomously. This paints a nuanced picture on task automation with LM agents -- in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems.
References
Author | Title | DOI | Year
---|---|---|---
Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, and Graham Neubig | TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks | 10.48550/arXiv.2412.14161 | 2024