Modern proprietary LLMs derive their creativity from digesting world knowledge, and they have achieved great success in content generation. When it comes to building agents on top of these LLMs, however, reliability in workflow task execution remains a major pain point for adoption.

We are excited to see that in a recent rigorous third-party research benchmark [1], the NexusRaven-V2 function calling LLM attains zero tool-use hallucinations across 840 test samples, outperforming GPT-3.5 Turbo and GPT-4, which hallucinate in 50 and 23 cases respectively. In this blog post, we explain the design principles that enable NexusRaven-V2 to attain such superior reliability.

The reliability dilemma for conventional proprietary LLMs

Across many GenAI agent applications, we often simultaneously observe the following two desiderata, which are seemingly in tension.

  • Extractive reasoning in tool use enables reliable synthesis of the explicit intent and information presented in the user input, which the agent can then act on by invoking tools (e.g., a browser or domain-specific software) to accomplish tasks (a code sketch follows this list).
  • Abstractive reasoning in content generation allows creative outcomes drawn from world knowledge beyond the information presented in the user input, such as writing poems or producing engaging chitchat.
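
To make the distinction concrete, here is a minimal sketch; the tool, the user query, and both outputs below are hypothetical illustrations, not actual model transcripts.

    # Hypothetical domain tool and query, for illustration only.
    def get_flight_status(flight_number: str, date: str) -> dict:
        """Look up the live status of a flight (stubbed domain tool)."""
        return {"flight": flight_number, "date": date, "status": "on time"}

    user_input = "Is flight UA 1523 on time today, 2024-01-15?"

    # Extractive tool use: every argument is lifted verbatim from the user
    # input, so the call can be checked against the request word for word.
    extractive_call = 'get_flight_status(flight_number="UA 1523", date="2024-01-15")'

    # Abstractive content generation: the reply draws on world knowledge beyond
    # the input (style, speculation), which is where hallucinations creep in.
    abstractive_reply = "UA 1523 usually runs on time, and O'Hare is lovely in winter..."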

In the conventional agent design paradigm, a single proprietary LLM powers both tool use and content generation for agents. This naturally imposes a tension between reliable synthesis and creative outcomes on the underlying LLM, because the abstractive reasoning required for creative outcomes tends to produce hallucinations, which harm reliable synthesis.

Figure 1. (Left) The conventional agent paradigm uses a single proprietary LLM for tool use and content generation, which require reliable synthesis and creative outcomes respectively. Because creative outcomes require abstractive reasoning, which tends to produce hallucinations, this imposes a challenge on reliable synthesis for the underlying LLM. (Right) NexusRaven-V2 powers agents with tool use as the overarching task, optimized for reliability, while content generation is treated as a capability provided by tools.

NexusRaven-V2: Supercharging reliability for tool use

To break out of this dilemma, we advocate making tool use the overarching orchestration task, while content generation is offloaded to tools (which may or may not themselves be LLMs) under orchestration. As shown in Figure 2, we leverage large-scale tuning to optimize NexusRaven-V2 for extractive reasoning in reliable tool use (more specifically, in the form of function calling). This opens the opportunity to attain superior tool-use reliability at a model size two orders of magnitude smaller than GPT-4.

Figure 2. NexusRaven-V2 is tuned at large scale to optimize for reliable tool use (function calling), outperforming GPT-4 on tool-use reliability at a model size two orders of magnitude smaller.
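
As a simplified sketch of this paradigm: the orchestrator only ever emits function calls, and even creative prose is produced by a tool under its control. The tool stubs, the ToolCall type, and the orchestrator_llm stand-in below are all hypothetical, not NexusRaven-V2's actual interface.

    from dataclasses import dataclass

    @dataclass
    class ToolCall:
        name: str
        arguments: dict

    def search_flights(origin: str, destination: str, date: str) -> list:
        """Technical tool: deterministic lookup against a flight API (stubbed)."""
        return [{"flight": "UA 1523", "status": "on time"}]

    def write_summary(facts: str, tone: str = "friendly") -> str:
        """Content-generation tool: may itself wrap a creative LLM (stubbed)."""
        return f"A {tone} summary: {facts}"

    TOOLS = {"search_flights": search_flights, "write_summary": write_summary}

    def orchestrator_llm(query: str, tools: dict) -> ToolCall:
        """Stand-in for the function-calling model: given the query and the
        tool documentation, it emits exactly one call over the provided tools."""
        return ToolCall("search_flights",
                        {"origin": "SFO", "destination": "ORD", "date": "2024-01-15"})

    def run_agent(query: str) -> str:
        call = orchestrator_llm(query, TOOLS)
        result = TOOLS[call.name](**call.arguments)  # dispatch to the chosen tool
        # Creative prose comes from a *tool*, never from the orchestrator itself.
        return write_summary(str(result))

    print(run_agent("Is there a flight from SFO to ORD today?"))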

Validation by independent third-party research

In a recent third-party research study on the robustness of function calling LLMs for agent applications, the NexusRaven-V2 LLM demonstrated leading reliability in turning human instructions into software tool operations. We refer readers to the appendix for details of the performance measurements.

  • Attains zero hallucination cases out of 840 test samples when determining which tools to leverage and how to use them (a sketch of this notion follows this list). This outperforms both GPT-3.5 Turbo, with 50 hallucinations, and GPT-4, with 23.
  • Outperforms GPT-4 with a 9% higher success rate in information-seeking applications with technical tools (e.g., financial transaction processing and real-time search) that require close attention and faithfulness to details.
  • Outperforms GPT-4 with a 4% higher success rate in adversarial settings requiring strong comprehension of tool documentation, including cases with uninformative tool and API argument names, as often observed in practice.
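
As a rough illustration of what counting tool-use hallucinations can look like, a predicted call can be flagged when it references a tool, or argument names, absent from the provided schemas. The exact scoring protocol is defined in [1]; the schemas and the is_hallucinated helper below are our own hypothetical sketch.

    import ast

    # Hypothetical tool schemas: tool name -> set of allowed argument names.
    TOOL_SCHEMAS = {
        "get_flight_status": {"flight_number", "date"},
        "search_flights": {"origin", "destination", "date"},
    }

    def is_hallucinated(call_str: str, schemas: dict) -> bool:
        """Flag a predicted call that references a tool or argument names
        absent from the provided schemas (one illustrative criterion)."""
        node = ast.parse(call_str, mode="eval").body
        if not isinstance(node, ast.Call):
            return True  # not even a function call
        name = getattr(node.func, "id", None)
        if name not in schemas:
            return True  # fabricated tool name
        return any(kw.arg not in schemas[name] for kw in node.keywords)

    print(is_hallucinated('get_flight_status(flight_number="UA 1523", date="2024-01-15")',
                          TOOL_SCHEMAS))  # False: known tool, known arguments
    print(is_hallucinated('book_hotel(city="Chicago")', TOOL_SCHEMAS))  # True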

NexusRaven-V2 is one piece of the puzzle in the stack we are building at Nexusflow. We thank the RoTBench team for their great effort on the rigorous robustness evaluation of tool-use capabilities. We are excited to work with the community to push the boundary of reliable agents for workflows.

Reference

[1] Ye et al. RoTBench: A Multi-Level Benchmark for Evaluating the Robustness of Large Language Models in Tool Learning.

Appendix

Zero tool-use hallucinations

                                      NexusRaven-V2   GPT-3.5 Turbo   GPT-4
  # of tool-use hallucination cases   0               50              23
Table 1. NexusRaven-V2 attains 0 tool-use hallucinations across 840 test cases, compared to 50 hallucination cases for GPT-3.5 Turbo and 23 for GPT-4. Derived from Table 6 in [1].

Information seeking with technical tools

                                      NexusRaven-V2   GPT-4
  Information Retrieval               68.22           46.22
  Application Manipulation            47.33           42.89
  Financial Transaction Processing    53.11           44.89
  Real-Time Search                    52.67           51.11
  Average                             55.33           46.28
Table 2. Success rates (%) in information-seeking applications using technical tools, averaged across all noise perturbation levels; NexusRaven-V2 outperforms GPT-4. Operating these tools requires faithful attention to detail. Derived from Tables 15, 16, 18, and 19 in [1].

Cases requiring strong tool document comprehension

                              NexusRaven-V2   GPT-4
  Tool Selection              62.38           60.00
  Parameter Identification    37.62           32.86
  Content Filling             30.00           25.24
  Average                     43.33           39.36
Table 3. Success rates (%) at the heavy noise level; NexusRaven-V2 outperforms GPT-4 across all scenarios. Operating under heavy noising requires a strong understanding of the tool documentation, as the function names are perturbed. Derived from Table 2 in [1] by averaging heavy-noise performance across Tool Selection, Parameter Identification, and Content Filling.
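
To illustrate the kind of name perturbation involved: under heavy noising, informative tool and argument names are replaced with opaque identifiers, so the model must rely on the documentation alone to select and fill the call. The exact noising scheme is specified in [1]; the schema and heavy_noise helper below are our own hypothetical sketch.

    # Hypothetical tool schema; only the documentation strings survive noising.
    original = {
        "name": "get_flight_status",
        "description": "Look up the live status of a flight by number and date.",
        "parameters": {"flight_number": "IATA flight code", "date": "YYYY-MM-DD"},
    }

    def heavy_noise(tool: dict, index: int = 0) -> dict:
        """Replace informative tool/argument names with opaque identifiers,
        keeping the documentation intact."""
        params = {f"arg_{i}": doc
                  for i, doc in enumerate(tool["parameters"].values())}
        return {"name": f"func_{index}",
                "description": tool["description"],
                "parameters": params}

    print(heavy_noise(original))
    # {'name': 'func_0', 'description': 'Look up the live status of a flight by
    #  number and date.', 'parameters': {'arg_0': 'IATA flight code',
    #  'arg_1': 'YYYY-MM-DD'}}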