Introducing NexusRaven-13B, the most capable open-source Large Language Model (LLM) to date for function calling to operate software tools, and it is fully commercially permissive.

TL;DR:

📊 Performance Highlights: With our demonstration retrieval system, NexusRaven-13B achieves a 95% success rate in using cybersecurity tools such as CVE/CPE Search and VirusTotal, while prompting the GPT-4 function calling API achieves 64%. NexusRaven-13B also has significantly lower cost and faster inference speed than GPT-4.

🔧 Generalization to the Unseen: NexusRaven-13B generalizes to tools never seen during model training, achieving a success rate comparable to GPT-3.5 in the zero-shot setting and significantly outperforming all other open-source LLMs of similar size.

🔥 Commercially Permissive: The training of NexusRaven-13B does not involve any data generated by proprietary LLMs such as GPT-4, so you retain full control of the model when deploying it in commercial applications.

Please check out our model on HuggingFace and our code on GitHub, and join our Discord server.

Behind the numbers are our two key techniques: data curation via multi-step refinement, and demonstration retrieval augmentation. If you are interested, please read the full blog below.


The Need for a Small, Powerful, and Commercially Permissive Solution

The rise of open-source commercially permissive models is transforming generative AI, providing organizations with greater control, reduced risks for sensitive data, and cost savings compared to proprietary models like OpenAI's GPT-3.5/4. These benefits are especially valuable for enterprise generative AI adoption, rather than consumer applications.

Many existing open-source LLMs focusing on tool usage, such as Gorilla, ToolLLAMA, and ToolAlpaca, rely heavily on proprietary LLMs like OpenAI's GPT-3.5/4 to generate large amounts of quality training data. However, legal constraints, like those in OpenAI's terms, prevent the use of such data for building models competitive with the proprietary LLMs in commercial use cases.

Furthermore, LLMs designed for function calling can serve a crucial role in real-world business scenarios, where operating software is a common task. This demands a high degree of reliability and accuracy while keeping costs low. Unfortunately, flagship models with general code generation capabilities, such as CodeLLaMA-34B and GPT-4, are all excessively large from an efficiency standpoint. This motivates us to ask: can we build a commercially permissive, compact, open-source LLM designed for function calling, and use it to deliver enterprise-grade solutions with quality competitive with or better than what proprietary LLMs, such as the GPT-4 function calling API, offer?

Introducing NexusRaven-13B

Figure 1. NexusRaven-13B translates user queries into executable function calling code, accomplishing tasks based on plain English input.

As illustrated in Figure 1, NexusRaven-13B processes human instructions alongside the documentation of candidate API functions. In a zero-shot or few-shot in-context learning setting, NexusRaven-13B generates executable function calling code that invokes the selected API with the appropriate argument values.
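
To make this contract concrete, here is a minimal, hypothetical sketch (not the official NexusRaven prompt template) of how such an input might be assembled and what a correct output looks like; the function names and prompt layout are invented for illustration only.

```python
# Hypothetical candidate functions exposed to the model as documentation.
CANDIDATE_FUNCTIONS = '''
def searchcve(keyword: str, limit: int = 10):
    """Search the CVE database for vulnerabilities matching a keyword."""

def vt_get_ip_report(ip: str):
    """Fetch the VirusTotal report for an IP address."""
'''

def build_prompt(instruction: str) -> str:
    # Concatenate the candidate function documentation with the user instruction.
    return f"{CANDIDATE_FUNCTIONS}\nUser query: {instruction}\nCall:"

print(build_prompt("Find the ten most recent CVEs mentioning OpenSSL"))
# A correct completion is executable Python, e.g.:
#   searchcve(keyword="OpenSSL", limit=10)
```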

NexusRaven-13B springs from the open-source lineage of the CodeLLAMA-13B model. We curated its training dataset via multi-step refinement using CodeLLAMA-34B-instruct and LLaMA-70B-chat, a technique we will detail in the next section. This guarantees its commercial permissiveness.

NexusRaven-13B achieves the following:

  • When integrated with our demonstration retrieval system, NexusRaven-13B achieves up to a 30% higher success rate than OpenAI GPT-4 at function calling for operating cybersecurity software, as seen in Figure 2. These software tools, such as CVE/CPE Search and VirusTotal, were not included in the model training data and often feature sophisticated argument lists or extensive arrays of functions.
  • NexusRaven-13B excels in zero-shot function calling for software not encountered during its training. Notably, NexusRaven-13B achieves up to a 60% higher function calling success rate than Gorilla, ToolLLAMA, and ToolAlpaca in the cybersecurity domain, and a 4% higher rate than GPT-3.5, without being trained on any of the functions, as shown in Figure 3.
  • NexusRaven-13B achieves robust function calling capabilities without relying on high-latency search or iterative reasoning techniques, which typically require observing the outcomes of executing incorrect function calls. This guarantees truly interactive response times for applications built on NexusRaven-13B and eliminates the risks associated with incorrect function calls. 
Figure 2. Comparison between the prototype system powered by NexusRaven-13B and the GPT-4 function calling API on CVE and CPE Search and VirusTotal V3, two challenging cybersecurity tools in our benchmark.
Figure 3. Zero-shot comparison between NexusRaven-13B and representative function-calling models in both the generic and cybersecurity domains.

Data Curation with Multi-step Refinement

When building powerful compact models, it is standard practice to distill the generations of larger, more powerful models into smaller ones. The power of such distillation is especially pronounced in models such as Gorilla, ToolLLAMA, and ToolAlpaca, which distill GPT-3.5/4 generations to refine base open-source models; the resulting models are not commercially permissive due to OpenAI's terms of use.

While generation distillation in the literature may appear straightforward with GPT-4, we have found that these methods do not produce high-quality data when we attempt to distill function calling generations from commercially permissive models. In particular, we observe that CodeLLaMA-34B-instruct achieves only a 48.3% success rate in generating API function calls for VirusTotal, while GPT-4 attains 80.8%. This gap is likely due to the difference in reasoning capability between CodeLLaMA and GPT-4.

To navigate through these challenges, we pivoted to a new data curation methodology. The principle of this new approach is to decompose the data generation into multiple steps, each requiring simpler and more primitive reasoning, as shown in Figure 4.

Figure 4. Illustration of our multi-step refinement pipeline.

Our data curation starts by mining tuples of function definitions, docstrings, and the code context in which these functions are called. We expect this curation pipeline to deliver a large volume of generated pairs of plain-English queries and function calling code, along with the Chain-of-Thought (CoT) reasoning that maps the query/instruction to the code.

  • Function Explanation. We first feed the raw mined tuples into CodeLLaMA-34B-instruct to generate a capability description for each function. These function explanations are intended to help LLMs better understand the functions during data generation.
  • Query Generation. With the mined function calling code and the function explanation, we elicit LLaMA-70B-chat to generate a natural language query description for the code, which we find results in a higher quality natural language query than using CodeLLaMA-34B-instruct.
  • Chain-of-Thought Enhancement. To explicitly improve the reasoning capability for function calling, we further leverage CodeLLaMA-34B-instruct to generate CoT traces elaborating on how the values for arguments are derived. We additionally use these CoT traces and the query to regenerate the function call code to further improve the compatibility between queries and the code.
  • Candidate Function List Generation. Since selecting the right function from a list of candidates is also part of function calling capability, we use embedding models to augment each curated training data sample with a list of functions similar to the intended one (see the sketch below).

This four-step pipeline empirically yields high quality data samples consisting of queries/instructions, candidate function documentations, CoT reasoning, and the executable code for function calling. We further instruction-tune the CodeLLaMA-13B model with this generated data to supercharge the function calling capabilities.
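
To make the final step concrete, the following is a minimal sketch of candidate-function-list augmentation, assuming a placeholder embed() helper in place of a real embedding model; everything here is illustrative rather than our exact pipeline code.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding for illustration only; substitute any sentence-embedding model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.standard_normal(384)
    return vec / np.linalg.norm(vec)

def with_candidate_list(target_doc: str, all_docs: list[str], k: int = 4) -> list[str]:
    # Rank every other function docstring by similarity to the ground-truth one.
    query_vec = embed(target_doc)
    distractors = sorted(
        (doc for doc in all_docs if doc != target_doc),
        key=lambda doc: float(np.dot(embed(doc), query_vec)),
        reverse=True,
    )
    # Ground-truth function plus its k most similar "distractor" functions.
    return [target_doc] + distractors[:k]
```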

Demonstration Retrieval Augmentation 

The data curation procedure above, combined with instruction tuning, bumps the CodeLLaMA-13B function call accuracy on the VirusTotal dataset from 38% to 72%. Although a commendable increase, this accuracy is not sufficient for the robustness required in real-world scenarios. Our approach to achieving the highest level of accuracy is demonstration retrieval augmentation.

When integrated with LLMs, conventional retrieval systems primarily serve as a component of caching systems: they help answer questions from extensive knowledge repositories and extract the most relevant functions from vast API documentation collections. Our approach differs from these conventional uses: we use retrievers to directly source demonstration examples from a corpus of existing query-response pairs. The corpus contains 16 examples per API function on average, and we use four-shot prompting to generate function calls. We find that this significantly boosts the function calling success rate from 72% to 94%.
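
The sketch below illustrates the idea under simplifying assumptions: a tiny in-memory corpus, similarity scored by plain token overlap instead of a learned retriever, and invented function names. It is meant only to show the shape of the four-shot prompt construction, not the production system.

```python
# A small corpus of existing (query, function_call) pairs; in practice there are
# roughly 16 examples per API function.
CORPUS = [
    ("Get the VirusTotal report for the IP 8.8.8.8", 'vt_get_ip_report(ip="8.8.8.8")'),
    ("Search CVEs mentioning OpenSSL", 'searchcve(keyword="OpenSSL")'),
    # ... more query/response pairs
]

def similarity(a: str, b: str) -> float:
    # Token-overlap (Jaccard) similarity as a stand-in for an embedding retriever.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def retrieve_demonstrations(query: str, k: int = 4):
    return sorted(CORPUS, key=lambda ex: similarity(query, ex[0]), reverse=True)[:k]

def build_fewshot_prompt(query: str) -> str:
    # Prepend the retrieved demonstrations to the new query (four-shot prompting).
    demos = "\n\n".join(f"Query: {q}\nCall: {c}" for q, c in retrieve_demonstrations(query))
    return f"{demos}\n\nQuery: {query}\nCall:"
```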

Figure 5. Illustration of our demonstration retrieval augmentation system. When presented with a new query, our system scans an existing corpus to identify demonstration examples that can enhance the quality of responses to that query.

We acknowledge that fine-tuning models on the corpus could boost model performance. Yet, our retrieval system offers two distinct advantages:

  • Generalization. While fine-tuning a model directly can be expensive and risks weakening some of its capabilities, updating the retrieval corpus offers the best of both worlds: it preserves the model's ability to generalize to diverse and unseen tools while still significantly boosting accuracy for specific software.
  • Personalization. As we add more prompt-response pairs to the retrieval corpus, the system becomes better equipped to find closely matched responses to incoming queries, thereby significantly enhancing performance. Maintaining a corpus of past use cases also lets the system adapt to a user's personal style.

Evaluating NexusRaven for Function Calling Capability

We provide a comprehensive benchmark comparing NexusRaven-13B and existing function-calling models. We show that NexusRaven-13B is comparable to GPT-3.5 in the zero-shot setting, and surpasses the accuracy of GPT-4 when equipped with demonstration retrieval, in both cybersecurity and generic domains.

Evaluation Dataset and Pipeline

We evaluate the function calling capability of different models by sending natural language instructions to all models, along with the function definitions and docstrings for all available function calls. Each test sample provides multiple candidate functions: the ground truth function that should be used for the instruction, plus noisy functions that are similar but not intended for it. We evaluate a model's output by executing the generated function call and checking that the selected function and its arguments match the ground truth exactly.
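
As a rough illustration of this execution-based check (a sketch, not our actual evaluation code), each candidate function can be replaced by a stub that records its arguments; the model output and the ground truth are each executed, and the sample counts as correct only when the recorded calls match exactly. The function names below are invented.

```python
def _recording_namespace(function_names):
    # Build a namespace where every candidate function just records how it was called.
    calls = []
    namespace = {
        name: (lambda n: (lambda *args, **kwargs: calls.append((n, args, kwargs))))(name)
        for name in function_names
    }
    return namespace, calls

def call_matches(model_code: str, ground_truth_code: str, function_names) -> bool:
    pred_ns, pred_calls = _recording_namespace(function_names)
    gold_ns, gold_calls = _recording_namespace(function_names)
    try:
        exec(model_code, pred_ns)        # run the generated call against the stubs
        exec(ground_truth_code, gold_ns)
    except Exception:
        return False                     # unexecutable output counts as a failure
    return pred_calls == gold_calls

print(call_matches('searchcve(keyword="OpenSSL", limit=10)',
                   'searchcve(keyword="OpenSSL", limit=10)',
                   ["searchcve", "vt_get_ip_report"]))   # True
```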

For the cybersecurity domain, our full benchmark consists of human queries for operating CVE and CPE Search, VirusTotal V3, and EmailRep. We collect functions from their API documentation and curate the ground truth answers together with cybersecurity domain experts, making sure the ground truth is consistent with expert judgment in real-world use cases.

For the generic domain, we consider two popular benchmarks from the literature, the ToolAlpaca-Simulated dataset and the ToolLLM dataset. Both were generated using GPT-3.5, which inherently produced data samples with noisy ground truth annotations that are difficult to fully account for in a perfectly reliable evaluation. Nonetheless, we conducted extensive filtering to remove data samples that clearly contain incorrect function calling code.

The details of the evaluation pipeline can be found here on GitHub.

Performance of the Retrieval-augmented Model

We benchmarked the performance of the retrieval-augmented model. As shown in Figure 2, our prototype system demonstrates an up to 30% higher function calling success rate on average than GPT-4 on CVE/CPE, providing enterprise-grade function calling capabilities at the last mile of quality.

Performance of the Zero-shot Model

As shown in Figure 3, when prompting the models in a zero-shot setting, NexusRaven-13B demonstrates competitive performance in the cybersecurity domain and a 4% higher success rate than GPT-3.5 in the generic domain. It also beats all representative open-source function-calling models in both the generic and cybersecurity domains, showing a 16% improvement over CodeLLaMA-13B-instruct and a 60% improvement over other open-source models in the cybersecurity domain.

In our zero-shot evaluation, we assume the LLMs cannot search by observing the outcomes of potentially incorrect function calls, because incorrect or unintentional function calls could have unexpected, detrimental effects on software systems. Nonetheless, we turned on the search feature of ToolLLM and ToolAlpaca without executing the potentially incorrect function calls, and in this setting we consider function calling successful if any trial in the search is correct. We adopted this design to be extra fair to ToolLLM and ToolAlpaca. We also note that Gorilla attains a low success rate because it does not generalize to software unseen during training. In contrast, NexusRaven-13B excels in zero-shot function calling for generic-domain software APIs not encountered during its training.

Open Sourcing the Evaluation Framework

As LLMs for tools are a new area, we find that the evaluation datasets, methods, and code bases are fragmented and not necessarily compatible with each other. In addition, the formatting of tool descriptions varies widely, from OpenAPI specifications to simple JSON descriptions. To facilitate a unified evaluation pipeline, we first standardize the description of tools as Python functions, whether they are APIs or simple functions, as we found this representation to be the most intuitive, the most compatible with existing code and models, and the easiest to standardize on. We also open-sourced an evaluation framework on GitHub that integrates the current evaluation benchmarks, as referenced in the evaluation section. Users can directly plug in their function definitions, docstrings, instruction lists, and ground truth data for one-click evaluation without any extra effort.
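
For instance, a tool might be described to the model as a plain Python function with type hints and a docstring, regardless of whether it wraps a REST API or a local utility; the name and parameters below are purely illustrative, not the exact definitions used in our benchmark.

```python
def vt_get_ip_report(ip: str, verbose: bool = False) -> dict:
    """
    Retrieve the VirusTotal analysis report for an IP address.

    Args:
        ip: The IPv4 or IPv6 address to look up.
        verbose: If True, include the full raw report in the response.
    """
    ...
```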

We also upload all of our evaluation datasets and results to HuggingFace. Admittedly, our dataset can never cover every use case in practice. We are excited to work with the open-source and research community to create a better evaluation pipeline together.

Conclusion and Future Steps

  • This version of NexusRaven-13B primarily emphasizes single-round interactions with humans through natural language instructions. We are eager to collaborate with the OSS community to enhance NexusRaven's capability for multi-round interactions.
  • We plan to further refine our evaluation benchmark as a comprehensive function calling benchmark. We are excited to collaborate with the OSS community to standardize the evaluation of tool-using LLMs.
  • We will continue releasing new models in the commercially permissive NexusWise Open-source Model Suite for a wide range of tasks. Please stay tuned.

Acknowledgements

  • We extend our gratitude to the LLAMA 2 and CodeLLAMA team for empowering the open model community with their powerful pretrained models. The creation of NexusRaven-13B would not have been possible without the foundation laid by these open models.
  • We also express our appreciation to the open scientific community for their pioneering efforts in code generation, exemplified by projects like The Stack, Starcoder and CodeGen. These open-source models and data have significantly accelerated our progress in the field of code generation.
  • Over the past year, the research community has achieved remarkable strides in tool-augmented LLMs. Works such as Gorilla, ToolLLM, ToolLlama, and Toolbench have provided the bedrock upon which we built NexusRaven.