We are thrilled to open source NexusRaven-V2, a 13B LLM that outperforms GPT-4 in zero-shot function calling: the capability of turning natural language instructions into executable code that uses tools. Function calling lies at the core of the OpenAI Assistants API and is the key to enabling copilots and agents to use software tools. With the goal of advancing open source models for copilots and agents, NexusRaven-V2 marks an exciting step in collaborating with the community to expand the open model ecosystem for technological and societal impact. The highlights of this release include:

  • State-of-the-art & Generalizable Capability. NexusRaven-V2 surpasses GPT-4 by up to 7% in function calling success rates in human-generated use cases involving nested and composite functions. NexusRaven-V2 has never been trained on the functions used in evaluation.
  • Open-source and Commercially Permissive. NexusRaven-V2 is further instruction-tuned on Meta's CodeLlama-13B-instruct, leveraging curated data generated through Nexusflow's pipeline, exclusively sourced from open-code corpora without using proprietary LLMs. It is commercially permissive for both community developers and enterprises.
  • Ease of Integration. We release open-source utility artifacts that enable users to seamlessly replace mainstream proprietary function calling APIs with NexusRaven-V2 in their software workflow. We also provide online demos and Colab notebooks for onboarding and integration demonstration.
  • Function Calling Benchmark and Leaderboard. We open source our evaluation benchmark, Nexus-Function-Calling, and establish a Hugging Face leaderboard. The benchmark includes an extensive collection of real-life, human-curated function-calling examples covering a diverse range of use cases and difficulties. These hundreds of examples, spread across 9 tasks, have been curated with input from domain experts, and their ground truth has been meticulously checked. We open source 8 of the 9 tasks, keeping one as an internal benchmark for testing new models.

Check out our model and leaderboard on Hugging Face and our code on GitHub.

Interact with NexusRaven-V2 through our Colab notebook and our application demo.

Join the community on Discord.

Figure 1: NexusRaven-V2 provides the function calling capability that enables copilots and agents to use software tools. Given a human instruction prompt and software documentation, the model generates executable code that runs the functions/APIs.
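To make the flow in Figure 1 concrete, here is a minimal Python sketch, assuming a toy tool and a hard-coded model output. The function `get_weather`, the prompt layout in `build_prompt`, and the helper `run_call` are all hypothetical names invented for this illustration; they are not NexusRaven-V2's actual prompt template or API.

```python
import inspect


def get_weather(city: str, unit: str = "celsius") -> str:
    """Return a short weather summary for a city (stub data for illustration)."""
    return f"Weather in {city}: 20 degrees {unit}"


def build_prompt(functions, instruction):
    """Join each function's signature and docstring with the user instruction.

    The layout is an assumption made for this sketch, not an official template.
    """
    parts = []
    for fn in functions:
        signature = inspect.signature(fn)
        doc = inspect.getdoc(fn)
        parts.append(f'Function:\ndef {fn.__name__}{signature}:\n    """{doc}"""')
    parts.append(f"User Query: {instruction}")
    return "\n\n".join(parts)


def run_call(call_text, functions):
    """Evaluate a generated call with only the listed functions in scope.

    Emptying __builtins__ narrows the namespace but is not a security sandbox.
    """
    namespace = {fn.__name__: fn for fn in functions}
    namespace["__builtins__"] = {}
    return eval(call_text, namespace)


prompt = build_prompt([get_weather], "What is the weather in Paris?")

# Hypothetical model output, hard-coded here in place of a real model call.
generated_call = "get_weather(city='Paris')"

result = run_call(generated_call, [get_weather])
print(result)
```

In a real integration the hard-coded `generated_call` would come from the model's response to `prompt`, and the generated code should be validated before it is executed.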

Evaluation with Human-curated Benchmark

Figure 2: (Top) NexusRaven-V2 achieves a 4% higher function calling success rate than GPT-4 on average across human-generated benchmarks. On tasks requiring nested and composite functions, NexusRaven-V2 shows an advantage of up to 7% over GPT-4. (Bottom) We include 9 tasks on operating real-world software to diversify the use cases and difficulties.

On our human-curated benchmark, NexusRaven-V2 achieves a 4% higher function calling success rate on average than the latest GPT-4 model. In the 4 challenging tasks that require nested and composite function calls, NexusRaven-V2 reaches success rates up to 7% higher than the latest GPT-4 model. NexusRaven-V2 is also more robust than GPT-4 to variations in how developers describe their functions.

These observations underscore the potential of utilizing open-source models to develop tool-using copilots and agents that can match or surpass the quality of proprietary LLM APIs in terms of both accuracy and robustness.

Releasing the Function Calling Benchmark and Leaderboard

To ensure reproducibility and help standardize function calling evaluations, we release our benchmark and its associated leaderboard along with the model weights. We follow two guiding principles when designing the evaluation benchmark:

  • Human-generated Samples with Meticulous Verification: We use human-generated samples whose executability has been meticulously checked, so that the evaluation reflects developer and user experiences. LLM-generated evaluation samples in recent literature often fall short because the code is syntactically non-executable or misaligned with the instruction.
  • Diverse Representation of Function Calling Use Cases and Difficulties: Our benchmark encompasses a diverse range of function calling use cases including single function calls, parallel function calls, and nested (composite) function calls, thus effectively representing a broad spectrum of tasks and complexities in software operations.
Figure 3: Examples of calling a) a single function, b) parallel functions and c) nested & composite functions.
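The three call patterns can be sketched in plain Python as follows. The tools `get_price` and `apply_discount` and the price table are made up for this illustration and are not taken from the benchmark.

```python
# Made-up data and tools, used only to illustrate the three call patterns.
PRICES = {"apple": 2.0, "pear": 3.0}


def get_price(item: str) -> float:
    """Look up the price of an item in the toy price table."""
    return PRICES[item]


def apply_discount(price: float, percent: float) -> float:
    """Reduce a price by the given percentage."""
    return price * (1 - percent / 100)


# a) Single function call: one tool answers the instruction.
single = get_price("apple")

# b) Parallel function calls: independent calls whose results do not
#    depend on each other.
parallel = [get_price("apple"), get_price("pear")]

# c) Nested (composite) function call: the output of one call is the
#    input of another.
nested = apply_discount(get_price("pear"), 50)

print(single, parallel, nested)
```

The nested case is the hardest of the three, because the model has to plan the order of the calls and pass an intermediate result along correctly.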

We release a Hugging Face leaderboard seeded with evaluations of representative models. If you are interested in evaluating and submitting new models that generalize to software tools unseen during training, please contact us via our Discord channel.

Artifacts for Developer Adoption

We released a Python package, "nexusraven", to help you integrate the model with your copilots or agents. With this package, you can quickly ingest your API function descriptions and send natural language queries to the model with a single line of code. To ease integration with downstream software, the nexusraven package can also convert the generated function calling code to JSON format. We hope these artifacts, together with the onboarding Colab notebook, help developers try out NexusRaven-V2 quickly.
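As an illustration of what converting function calling code to JSON can look like, here is a minimal sketch built only on Python's standard `ast` and `json` modules. It does not use the nexusraven package and makes no claim about its API; the helper names `call_to_dict` and `call_string_to_json` and the `{"name": ..., "arguments": ...}` layout are assumptions chosen for this example.

```python
import ast
import json


def call_to_dict(node):
    """Recursively convert an ast.Call node into a JSON-serializable dict."""
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError("expected a plain function call such as f(x=1)")
    if node.args:
        raise ValueError("this sketch only handles keyword arguments")
    arguments = {}
    for kw in node.keywords:
        if kw.arg is None:
            raise ValueError("**kwargs unpacking is not supported")
        if isinstance(kw.value, ast.Call):
            # Nested call: recurse so the inner call keeps its own structure.
            arguments[kw.arg] = call_to_dict(kw.value)
        else:
            # Literal argument (string, number, list, dict, ...).
            arguments[kw.arg] = ast.literal_eval(kw.value)
    return {"name": node.func.id, "arguments": arguments}


def call_string_to_json(call_text):
    """Parse a single call expression and serialize it as a JSON string."""
    expression = ast.parse(call_text, mode="eval").body
    return json.dumps(call_to_dict(expression))


# Hypothetical generated call containing a nested function call.
converted = call_string_to_json(
    "apply_discount(price=get_price(item='pear'), percent=50)"
)
print(converted)
```

Parsing with `ast` rather than executing the string means the call can be inspected, validated, or forwarded to another system without running any generated code.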