Modern proprietary LLMs acquire creativity by digesting world knowledge, and they have achieved great success in content generation. When it comes to building agents with these LLMs, however, reliability in workflow task execution remains a major pain point for adoption.
We are excited to see that in a recent rigorous third-party research benchmark [1], the NexusRaven-V2 function calling LLM attains zero tool-use hallucination cases across 840 test samples, outperforming GPT-3.5 Turbo and GPT-4, which hallucinate in 50 and 23 cases respectively. In this blog post, we explain the design principles that enable NexusRaven-V2 to attain such superior reliability.
The reliability dilemma for conventional proprietary LLMs
Across many GenAI agent applications, we often observe the following two desiderata simultaneously, and they are seemingly in tension.
- Extractive reasoning in tool use enables reliable synthesis of the explicit intent and information presented in the user input, which is then acted upon with tools (e.g., a browser or domain-specific software) to accomplish tasks.
- Abstractive reasoning in content generation allows creative outcomes drawing on world knowledge beyond the information presented in the user input, such as writing poems or producing engaging chitchat.
In the conventional agent design paradigm, a single proprietary LLM powers both tool use and content generation for agents. This naturally imposes a tension between reliable synthesis and creative outcomes on the underlying LLM, because abstractive reasoning toward creative outcomes tends to produce hallucinations, which harm reliable synthesis.
NexusRaven-V2: Supercharging reliability for tool-use
To break out of this dilemma, we advocate that tool use should become the overarching orchestration task, while content generation is offloaded to tools (which may or may not themselves be LLMs) under orchestration. As shown in Figure 2, we leverage large-scale tuning to optimize NexusRaven-V2 for extractive reasoning in reliable tool use (specifically, in the form of function calling). This opens the opportunity to attain superior tool-use reliability at a model size two orders of magnitude smaller than GPT-4's.
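This orchestration pattern can be illustrated with a minimal sketch: the function-calling model emits only a function call extracted from the user's intent, and the agent dispatches it to a tool that performs the actual content generation. All names below (`generate_call`, `write_poem`, `search_web`) are hypothetical stand-ins for illustration, not the NexusRaven-V2 API.

```python
import ast

# Tools under orchestration; a tool may itself wrap a generation LLM.
def write_poem(topic: str) -> str:
    return f"(poem about {topic} from a generation model)"

def search_web(query: str) -> str:
    return f"(search results for '{query}')"

TOOLS = {"write_poem": write_poem, "search_web": search_web}

def generate_call(user_input: str) -> str:
    # Stand-in for the function-calling LLM: it extracts the explicit
    # intent from the input and emits a single function call string.
    if "poem" in user_input:
        return 'write_poem(topic="autumn")'
    return 'search_web(query="latest AAPL price")'

def run_agent(user_input: str) -> str:
    call = generate_call(user_input)
    # Parse the emitted call safely rather than eval()-ing raw model output.
    node = ast.parse(call, mode="eval").body
    name = node.func.id
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    if name not in TOOLS:  # reject hallucinated tool names outright
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)
```

The key design point is the separation of concerns: the orchestrating model is only graded on choosing the right tool and arguments, while creative output comes from the tool it invokes.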
Validation in third-party independent research
In recent third-party research on the robustness of function calling LLMs for agent applications, the NexusRaven-V2 LLM demonstrated leading reliability in turning human instructions into software tool operations. We refer readers to the appendix for details of the performance measurements.
- Attains zero cases of hallucination out of 840 test samples when determining which tools to use and how to use them, outperforming both GPT-3.5 Turbo (50 hallucinations) and GPT-4 (23 hallucinations).
- Achieves a 9% higher success rate than GPT-4 in information-seeking applications with technical tools (e.g., financial transaction understanding and real-time search) that require close attention and faithfulness to detail.
- Achieves a 4% higher success rate than GPT-4 in adversarial settings requiring strong comprehension of tool documentation, including cases with uninformative tool and API argument names, which are often observed in practice.
NexusRaven-V2 is one piece of the puzzle in the stack built at Nexusflow. We thank the RoTBench team for their great effort on the rigorous robustness evaluation of tool-use capabilities. We are excited to work with the community to push the boundary of reliable agents for workflows.
Reference
[1] Ye et al. RoTBench: A Multi-Level Benchmark for Evaluating the Robustness of Large Language Models in Tool Learning
Appendix
Zero tool-use hallucination
| | NexusRaven-V2 | GPT-3.5 Turbo | GPT-4 |
|---|---|---|---|
| Tool-use hallucination cases (out of 840) | 0 | 50 | 23 |
Information seeking with technical tools
| Success rate (%) | NexusRaven-V2 | GPT-4 |
|---|---|---|
| Information Retrieval | 68.22 | 46.22 |
| Application Manipulation | 47.33 | 42.89 |
| Financial Transaction Processing | 53.11 | 44.89 |
| Real-Time Search | 52.67 | 51.11 |
| Average | 55.33 | 46.28 |
Cases requiring strong tool document comprehension
| Success rate (%) | NexusRaven-V2 | GPT-4 |
|---|---|---|
| Tool Selection | 62.38 | 60.00 |
| Parameter Identification | 37.62 | 32.86 |
| Content Filling | 30.00 | 25.24 |
| Average | 43.33 | 39.36 |