Huan Wang, Shelby Heinecke, Juan Carlos Niebles, Caiming Xiong
TL;DR: We release xLAM, a series of LLMs optimized for function calling and AI agents. The series offers several variants designed to serve different application domains, from mobile usage to high-demand performance contexts, and shows competitive performance across key agent benchmarks.
Large Language Models in the Agent Environment
In a traditional Reinforcement Learning (RL) framework, the notion of the “Agent” plays a key role. This framework comprises pivotal concepts such as:
- Environment: Accepts a sequence of actions from agents as input and, in turn, provides them with rewards and observations. Many environments maintain an internal state and perform state transitions based on the actions executed by the agents.
- Agents: These entities receive rewards and observations from the environments and subsequently produce actions. Most agents also maintain an internal state and update it based on the actions they take.
The advent of Large Language Models (LLMs) soon led to their application in agent-related scenarios. It was discovered that, with the right prompting, an LLM could generate structured text with high probability. Since the output is structured, it can be readily parsed into callable functions/actions. In particular, if the environment can be described in text, all the observations and rewards can be encapsulated within the prompt. Instead of a conventional RL agent modeling the conditional action distributions, a generic LLM combined with an output parser can be employed to determine the next action.
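To make this concrete, here is a minimal sketch of such a loop in Python. The LLM call and the environment are stubbed out with placeholders (`call_llm` and `env_step` are hypothetical names, not part of any xLAM API); the point is only to show how structured LLM output can be parsed into the next action.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call that returns structured (JSON) text."""
    return json.dumps({"action": "search", "arguments": {"query": "weather today"}})

def parse_action(llm_output: str) -> dict:
    """Parse the LLM's structured output into a callable action."""
    return json.loads(llm_output)

def env_step(action: dict):
    """Placeholder environment: returns a text observation and a scalar reward."""
    return f"Executed {action['action']} with {action['arguments']}", 0.0

observation, reward = "User asks: what's the weather today?", 0.0
for _ in range(3):  # a short rollout
    prompt = f"Observation: {observation}\nReward: {reward}\nRespond with a JSON action."
    action = parse_action(call_llm(prompt))
    observation, reward = env_step(action)
```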
Large Language Models Optimized for Function-Calling
Function calling is one of the most sought-after agent applications: the agent is tasked with completing user commands via a series of function calls. A task can involve a wide array of potential functions/APIs that might be used to fulfill the user's requirements, each with its own description, arguments, and return values. The applicable functions are presented to the LLM in the prompt; the LLM then selects the appropriate functions based on the context and objective, fills in the corresponding arguments, and obtains the outputs of the chosen functions.
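As an illustration (not the exact prompt format used by xLAM), the available tools might be presented to the model as JSON-schema-style descriptions, with the model expected to reply with a chosen function name and concrete arguments:

```python
import json

# Hypothetical tool description; real systems vary in the exact schema they use.
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    }
]

prompt = (
    "You can call the following functions:\n"
    + json.dumps(tools, indent=2)
    + "\n\nUser: What's the weather in New York in fahrenheit?\n"
    + 'Respond with a JSON object: {"name": ..., "arguments": {...}}'
)

# A well-behaved model would then produce something like:
expected_output = {"name": "get_weather",
                   "arguments": {"city": "New York", "unit": "fahrenheit"}}
```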
The widespread appeal of function-calling applications calls for LLMs that handle them well, yet generic LLMs are not specifically tailored to function-calling contexts. To address this, we have compiled one of the most extensive collections of function-calling environments and data, ensuring a uniform format across all datasets. The idea is that, as more data from diverse function-calling environments is used to train a base foundation model, the model should, in principle, be able to adapt to unseen function-calling environments.
xLAM: A Solution For All
We are launching three variants of xLAM to cater to different scenarios:
- xLAM-7b-r: A 7b model designed for swift academic exploration with limited GPU resources.
- xLAM-8x7b-r: An 8x7b mixture-of-experts model, ideal for industrial applications striving for a balanced combination of latency, resource consumption, and performance.
- xLAM-8x22b-r: A large mixture-of-experts model for users with ample computational resources who want to pursue the best performance.
These three variants of xLAM are designed for both single-turn and multi-turn application scenarios across diverse environments and benchmarks. Previously, we released two versions of our xLAM models specifically trained for single-turn function-calling tasks: xLAM-1b-fc-r and xLAM-7b-fc-r. Notably, xLAM-7b-fc-r held second place on the previous Berkeley Function Calling Leaderboard V1. Currently it is ranked #16 on the Berkeley Function Calling Leaderboard V2 Live. Its compact counterpart, xLAM-1b-fc-r, nicknamed the ‘Tiny Giant,’ features just 1 billion parameters, making it ideal for mobile applications.
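For those who want to try the models, here is a minimal sketch of loading an xLAM checkpoint with Hugging Face transformers. The model ID is assumed from the Hugging Face collection linked below; swap in the variant you need, and refer to the model cards for the recommended chat/tool-calling templates.

```python
# Minimal sketch: load an xLAM checkpoint and run a single generation.
# "Salesforce/xLAM-7b-r" is assumed from the linked Hugging Face collection.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Salesforce/xLAM-7b-r"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "What's the weather in New York in fahrenheit?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```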
Results
Our xLAM series models perform consistently well across several key benchmarks, including ToolBench, the Berkeley Function Calling Leaderboard, Webshop, and AgentBoard. Below is an overview of the results:
Due to a recent service interruption at ToolBench, the ToolBench results for xLAM-8x22b-r are still pending. Even so, it is clear that the xLAM model series offers performance comparable to OpenAI’s GPT models at a much smaller model size. Notably, the xLAM-8x22b-r model tops the Berkeley Function Calling Leaderboard in our evaluation.
Model Performance vs. Size on the Berkeley Function Calling Leaderboard
While smaller models can perform adequately in specific scenarios, we’ve observed that larger models generally achieve better overall accuracy. This improved performance, however, comes at the cost of increased computational resources and latency. Detailed performance results can be found in our arXiv paper.
Technology
Data and Environments
Our function-calling research assembled a diverse set of datasets, including notable ones such as ToolBench, Webshop, ToolAlpaca, HotpotQA, AlfWorld, APIBank, Mind2Web, AgentBoard, and AgentBench. In addition to these existing function-calling datasets, we enriched our data pool with synthetic data from APIGEN, SpecTool (coming soon), and an in-house large action model simulation (LAM Simulator) environment.
Unified Format
All the datasets are combined and unified into a consistent format for model fine-tuning: to obtain more consistent performance, we designed a unified format and processed all the data into it before training.
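Purely as a hypothetical illustration (the exact schema and field names are described in the xLAM technical report), a unified training record might look like this:

```python
# Illustrative only: the field names here are assumptions, not the exact xLAM schema.
unified_sample = {
    "task_instruction": "You are an assistant that can call the provided tools.",
    "tools": [
        {
            "name": "search_flights",
            "description": "Search flights between two cities.",
            "parameters": {"origin": "string", "destination": "string", "date": "string"},
        }
    ],
    "query": "Find me a flight from SFO to JFK next Monday.",
    "steps": [
        {
            "thought": "I should search for flights matching the request.",
            "tool_calls": [
                {
                    "name": "search_flights",
                    "arguments": {"origin": "SFO", "destination": "JFK", "date": "next Monday"},
                }
            ],
            "observation": "3 matching flights found.",
        }
    ],
}
```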
Data Augmentation and Purging
The unified data format is critical for the systematic augmentation and purification of our large-scale dataset. Using a carefully crafted augmentation pipeline, we enhanced our raw data through several techniques, including Instruction and Format Diversification, Special Token Augmentation, Order/Sequence Shuffling, and Data Synthesis from Failure Case Analysis.
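As a small example of one such technique (a sketch, not the actual xLAM pipeline), order/sequence shuffling can be as simple as permuting the list of available tools in a sample so the model does not learn to rely on tool position:

```python
import random

def shuffle_tool_order(sample: dict, seed: int = 0) -> dict:
    """Return a copy of the sample with its tool list randomly reordered."""
    rng = random.Random(seed)
    augmented = dict(sample)
    augmented["tools"] = list(sample["tools"])  # copy before shuffling
    rng.shuffle(augmented["tools"])
    return augmented

sample = {"query": "Book a flight to Paris",
          "tools": ["search_flights", "book_flight", "get_weather"]}
print(shuffle_tool_order(sample, seed=42))
```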
Through our experiments, we found that sheer data volume does not guarantee superior model performance; data quality is the pivotal factor. Datasets such as ToolBench, despite their large size, are often noisy, and indiscriminately adding these noisy datasets to training can significantly degrade overall data quality. To maintain data integrity and control, we employed OpenAI’s GPT models to purge samples from datasets with potential quality issues.
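Below is a hedged sketch of what such LLM-based purging can look like, not the exact pipeline used for xLAM: each sample is sent to a judge model with a quality rubric, and samples judged low quality are dropped. The model name and rubric are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_clean(sample_text: str) -> bool:
    """Ask a judge model whether a training sample looks well-formed and correct."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; any capable model works
        messages=[
            {"role": "system",
             "content": ("You review function-calling training samples. "
                         "Answer YES if the sample is well-formed and correct, otherwise NO.")},
            {"role": "user", "content": sample_text},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

raw_samples = ["...samples loaded from a noisy source such as ToolBench..."]
cleaned = [s for s in raw_samples if is_clean(s)]
```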
Data Synthesis
Beyond existing environments and datasets, we constructed several pipelines and environments to generate synthetic data, which is then used to further enrich the function-calling training set. Along this line of effort, we are releasing multiple papers/reports, including:
- APIGEN: an automated data generation pipeline designed to produce verifiable high-quality datasets for function-calling applications.
- xLAM-Simulator [coming soon]: our simulator for language agents focused on tool use, which can provide feedback on plans/actions.
- SpecTool [coming soon]: a small-scale benchmark to identify error patterns in LLM output on tool-use tasks.
Conclusion
We are excited to unveil a collection of large action models, named xLAM, designed to cater to a diverse range of application scenarios, from mobile devices to performance-demanding applications. Our models show competitive performance across major agent benchmarks. Alongside these models, we are providing comprehensive technical reports covering model training, evaluation, and synthetic data generation, as well as detailed insights from our new environments and simulations.
Acknowledgment
Full author list: Jianguo Zhang∗, Tian Lan∗, Ming Zhu∗, Zuxin Liu∗, Thai Hoang∗, Shirley Kokane†, Weiran Yao†, Juntao Tan, Akshara Prabhakar, Zhiwei Liu, Haolin Chen, Yihao Feng, Tulika Awalgaonkar, Rithesh Murthy, Eric Hu, Zeyuan Chen, Ran Xu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong
Resources:
- Technical Reports:
- xLAM Technical Report: https://arxiv.org/abs/2409.03215
- APIGEN: https://arxiv.org/abs/2406.18518
- Github: https://github.com/SalesforceAIResearch/xLAM
- Webpage: https://www.salesforceairesearch.com/projects/xlam-large-action-models
- xLAM Intro Blog: https://blog.salesforceairesearch.com/xlam-large-action-models/
- Models on HuggingFace: https://huggingface.co/collections/Salesforce/xlam-models-65f00e2a0a63bbcd1c2dade4
- Synthetic Data Generated by APIGEN: https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k
- Related work:
- AgentOhana: paper
- Multi-agent Ensembles on SWEBench: paper, web
- AgentLite: paper, code