2023 APIBankAComprehensiveBenchmarkf

(Li, Zhao et al., 2023) ⇒ Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. (2023). “API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs.” doi:10.48550/arXiv.2304.08244

Subject Headings: Tool-Augmented LLM.

Notes

This research paper introduces API-Bank, a comprehensive benchmark for evaluating and enhancing the capabilities of tool-augmented large language models (LLMs).
The main points of the paper are:
1. Motivation: The authors aim to address three key questions regarding tool-augmented LLMs:
  1. How effective are current LLMs at utilizing tools?
  2. How can we enhance LLMs' ability to utilize tools?
  3. What obstacles need to be overcome for LLMs to effectively leverage tools?
2. Design Principles: Based on interviews with 500 users, the authors establish design principles for API-Bank, focusing on the LLMs' abilities to plan, retrieve, and call API tools. They also emphasize the importance of domain diversity, API diversity, API authenticity, and evaluation authenticity.
3. Evaluation System: The authors implement an evaluation system consisting of 73 API tools and 314 manually annotated tool-use dialogues with 753 API calls. This system assesses the existing LLMs' capabilities in planning, retrieving, and calling APIs.
4. Training Set: To enhance LLMs' tool utilization abilities, the authors construct a training set containing 1,888 tool-use dialogues from 2,138 APIs spanning 1,000 distinct domains. They introduce a novel multi-agent method using LLMs to automatically generate this training data, reducing annotation costs by 98%.
5. Experiments: The authors fine-tune Lynx, a model initialized from Alpaca, on the API-Bank training set. Experiments show that GPT-3.5 exhibits improved tool utilization compared to GPT-3, while GPT-4 excels in planning. Lynx surpasses Alpaca's tool utilization performance by more than 26 points and approaches the effectiveness of GPT-3.5. An error analysis highlights the key challenges for future research in this field.
6. Contributions: The authors present API-Bank as the most comprehensive benchmark for tool-augmented LLMs available at the time of publication, encompassing the highest diversity of domains and APIs, realistic multi-turn dialogues, and thorough coverage of essential tool usage abilities.
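
To make the evaluation setup in item 3 more concrete, the following is a minimal sketch of how predicted API calls could be scored against annotated ones. It is not the authors' evaluation code: the `ApiCall` type, the `call_accuracy` function, and the tool names `QueryWeather` and `AddAlarm` are hypothetical, and exact matching on API name and arguments is an assumed scoring rule chosen only for illustration. The dialogue and call counts are the figures reported in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiCall:
    """One tool invocation: an API name plus its keyword arguments (hypothetical type)."""
    name: str
    args: tuple  # sorted (key, value) pairs, so comparison ignores argument order

    @staticmethod
    def of(name: str, **kwargs) -> "ApiCall":
        return ApiCall(name, tuple(sorted(kwargs.items())))


def call_accuracy(predicted: list[ApiCall], annotated: list[ApiCall]) -> float:
    """Fraction of annotated calls whose prediction at the same position matches exactly.

    Assumed scoring rule for illustration; missing predictions count as wrong.
    """
    if not annotated:
        return 0.0
    correct = sum(1 for p, a in zip(predicted, annotated) if p == a)
    return correct / len(annotated)


# Benchmark size figures reported in the paper.
NUM_DIALOGUES = 314
NUM_API_CALLS = 753
calls_per_dialogue = NUM_API_CALLS / NUM_DIALOGUES  # about 2.4 calls per dialogue

# Hypothetical annotated and predicted calls for a single dialogue.
annotated = [
    ApiCall.of("QueryWeather", city="Beijing"),
    ApiCall.of("AddAlarm", time="07:30"),
]
predicted = [
    ApiCall.of("QueryWeather", city="Beijing"),
    ApiCall.of("AddAlarm", time="08:30"),  # wrong argument value
]

print(f"calls per dialogue: {calls_per_dialogue:.2f}")
print(f"call accuracy: {call_accuracy(predicted, annotated):.2f}")
```

In this toy dialogue one of the two predicted calls has a wrong argument value, so only half of the annotated calls are matched. A real evaluation would also need to execute the calls and judge the model's natural-language responses, which this sketch leaves out.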

Cited By

http://scholar.google.com/scholar?q=%222023%22+API-Bank%3A+A+Comprehensive+Benchmark+for+Tool-Augmented+LLMs

Quotes

Abstract

Recent research has demonstrated that Large Language Models (LLMs) can enhance their capabilities by utilizing external tools. However, three pivotal questions remain unanswered: (1) How effective are current LLMs in utilizing tools? (2) How can we enhance LLMs' ability to utilize tools? (3) What obstacles need to be overcome to leverage tools? To address these questions, we introduce API-Bank, a groundbreaking benchmark, specifically designed for tool-augmented LLMs. For the first question, we develop a runnable evaluation system consisting of 73 API tools. We annotate 314 tool-use dialogues with 753 API calls to assess the existing LLMs' capabilities in planning, retrieving, and calling APIs. For the second question, we construct a comprehensive training set containing 1,888 tool-use dialogues from 2,138 APIs spanning 1,000 distinct domains. Using this dataset, we train Lynx, a tool-augmented LLM initialized from Alpaca. Experimental results demonstrate that GPT-3.5 exhibits improved tool utilization compared to GPT-3, while GPT-4 excels in planning. However, there is still significant potential for further improvement. Moreover, Lynx surpasses Alpaca's tool utilization performance by more than 26 pts and approaches the effectiveness of GPT-3.5. Through error analysis, we highlight the key challenges for future research in this field to answer the third question.

References


	Author	title	doi	year
2023 APIBankAComprehensiveBenchmarkf	Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, Yongbin Li	API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs	10.48550/arXiv.2304.08244	2023