We’re thrilled to announce Athene-V2, our latest 72B model suite. Fine-tuned from Qwen 2.5 72B, Athene-V2 competes with GPT-4o across key capabilities, powered by a meticulously designed data and RLHF pipeline. As the industry recognizes the slowdown of scaling laws, where increasing model size alone no longer delivers universal capability improvements, there is a growing need for specialized customization to enhance specific capabilities. Our post-training process illustrates this shift, demonstrating how our data and tuning solutions allow us to finely optimize for distinct skills and use cases.
Here’s a look at the unique specializations that position Athene-V2 models along the Pareto frontier of LLM post-training:
- Athene-V2-Chat-72B: A state-of-the-art chat model that matches GPT-4o across multiple benchmarks. It outperforms GPT-4o in chat helpfulness (Arena-Hard), excels in code completion (ranking #2 on bigcode-bench-hard) and mathematics (MATH), and handles long log extraction with higher precision (our internal benchmark).
- Athene-V2-Agent-72B: Striking a balance between chat and agent capabilities, this model offers concise, directive chat responses and surpasses GPT-4o on our latest Nexus-V2 function calling benchmarks, which focus on hard, enterprise-level use cases.
The Pareto Frontier of LLM Post-training
In LLM post-training, it's common to expect universal improvements across all tracked metrics as the model is trained on more high-quality data. However, our observations reveal that as models approach a certain "Pareto frontier"—the point where a balance between multiple performance metrics is optimized—achieving further improvement requires a strategic shift. Beyond this point, the most effective way to realize substantial gains is by refining specific capabilities, trading off certain aspects for focused enhancements along the frontier. This approach enables us to achieve targeted, meaningful improvements rather than universal changes across all metrics.
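To make the frontier idea concrete, here is a minimal sketch in Python that identifies which checkpoints are Pareto-optimal with respect to a chat metric and an agent metric; the checkpoint names and scores are purely illustrative, not actual benchmark results.

```python
# Minimal sketch: identify which checkpoints lie on the Pareto frontier
# of two post-training metrics (e.g., a chat score and an agent score).
# Checkpoint names and scores below are illustrative, not real results.

def pareto_frontier(points):
    """Return the points not dominated by any other point.

    A point (chat, agent) is dominated if another point is at least as
    good on both metrics and strictly better on at least one.
    """
    frontier = []
    for name, chat, agent in points:
        dominated = any(
            (c >= chat and a >= agent) and (c > chat or a > agent)
            for n, c, a in points if n != name
        )
        if not dominated:
            frontier.append((name, chat, agent))
    return frontier

checkpoints = [
    ("ckpt-chat-heavy", 86.0, 71.0),   # strong chat, weaker agent
    ("ckpt-balanced",   83.5, 78.0),
    ("ckpt-agent-heavy", 79.0, 84.5),  # strong agent, weaker chat
    ("ckpt-early",      78.0, 70.0),   # dominated by the others
]

print(pareto_frontier(checkpoints))
# -> the first three checkpoints; further tuning moves a model *along*
#    this frontier (trading chat for agent skill) rather than beyond it.
```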
A good example of such a trade-off is the progression of GPT-4. Initially, some users perceived GPT-4-0613 as a regression from GPT-4-0314, despite improvements made by OpenAI based on user feedback, showcasing the trade-off dynamic at play. This research paper studies and tracks these changes. Similarly, we’re seeing selective customization efforts, such as Harvey’s collaboration with OpenAI to tailor models for legal applications, among other domains.
We observe a similar trend in our own post-training process. As shown in Figure 2, the quality of post-training data and RLHF strategies defines a hidden Pareto frontier that governs the balance between chat and agent capabilities. Customization with state-of-the-art post-training pipelines moves a model along this frontier rather than beyond it. For example, the Athene-V2-Agent model emphasizes agent-oriented capabilities while slightly sacrificing general chat flexibility, whereas the Athene-V2-Chat model excels in dialogue yet shows some limitations on agent-related tasks.
Building AI Agents in Production Requires Deeper Customization
Deploying production-ready agents demands deeper customization than what standard benchmarks can measure. Traditional benchmarks often fall short because they can't fully capture model performance within complex, real-world systems. Instead, actionable insights emerge from analyzing system execution results holistically. For instance, examining precision-recall tradeoffs highlights how customization enhances model effectiveness.
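One simple way to surface that precision-recall view from execution traces is to score the filters (or tool arguments) a model actually triggered against those a reference solution would use. Below is a minimal sketch with hypothetical filter names and data, not our production evaluation code.

```python
# Minimal sketch of execution-level scoring: compare the filters an agent
# triggered against the filters a reference solution considers relevant,
# and report precision/recall. Field names and data are hypothetical.

def filter_precision_recall(predicted: set, relevant: set):
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(relevant) if relevant else 1.0
    return precision, recall

# One evaluation case: the model triggered an extra "priority" filter
# that the query never asked for (over-filtering) and missed one filter.
predicted_filters = {"urgency", "location", "priority"}
relevant_filters = {"urgency", "location", "approval_required"}

p, r = filter_precision_recall(predicted_filters, relevant_filters)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.67
```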
Excelling on these challenging metrics requires even deeper customization than the current approaches used in our Athene-V2 pipeline. Nexusflow provides the expertise and tools to further optimize agents for real-world complexities, unlocking their full potential in production settings.
As a concrete example, consider a ticket management and search system with 200 filter options, including customer name, category, location, time, priority, access, urgency, and status. Feeding all options to a model may cause over-filtering and empty search results. As illustrated in Figure 3, a request for "urgent customer threads in SF which require manager approval" might incorrectly trigger both the urgency and priority filters, even when the model is prompted to treat them differently.
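The sketch below mocks up a hypothetical search_tickets tool with a handful of the filters mentioned above and contrasts an over-filtered call with the intended one; the tool name, parameter names, and calls are illustrative assumptions, not the actual system.

```python
# Hypothetical function schema for the ticket search tool; only a few of
# the ~200 filters are shown. Names and structure are illustrative.
search_tickets_schema = {
    "name": "search_tickets",
    "parameters": {
        "customer_name": {"type": "string"},
        "location": {"type": "string"},
        "urgency": {"type": "string", "enum": ["low", "medium", "urgent"]},
        "priority": {"type": "string", "enum": ["p0", "p1", "p2"]},
        "requires_manager_approval": {"type": "boolean"},
        "status": {"type": "string"},
        # ... many more filters in the real system
    },
}

query = "urgent customer threads in SF which require manager approval"

# Over-filtered call: the model also sets `priority`, even though the
# query never mentions it; the conjunction of filters returns no tickets.
over_filtered_call = {
    "name": "search_tickets",
    "arguments": {
        "location": "SF",
        "urgency": "urgent",
        "priority": "p0",                      # spurious filter
        "requires_manager_approval": True,
    },
}

# Intended call: only the filters the query actually asks for.
intended_call = {
    "name": "search_tickets",
    "arguments": {
        "location": "SF",
        "urgency": "urgent",
        "requires_manager_approval": True,
    },
}
```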
This over-filtering behavior requires balancing the model's precision and recall: reducing unrelated filter triggers while still applying the relevant ones. As shown in Figure 4, fine-tuning on carefully curated datasets to balance precision and recall can move the model to a regime with higher overall accuracy, and usually achieves better results than pure prompt tuning. The results are directly reflected in our search-FC benchmark numbers in Table 5, which show a significant gap between our specially customized Athene model and other models. Interestingly, by further tuning on more function calling data, we also observe additional gains in agent performance.
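As a rough illustration of what such curation can look like, the following sketch builds supervised fine-tuning records that pair each query with the minimal correct tool call, so spurious filters never appear in the targets; the record format, file name, and examples are hypothetical, not our actual pipeline.

```python
import json

# Hypothetical curation step: each record pairs a user query with the
# minimal correct tool call, so the model learns to drop spurious filters
# (better precision) without missing requested ones (preserving recall).
curated_examples = [
    {
        "query": "urgent customer threads in SF which require manager approval",
        "target_call": {
            "name": "search_tickets",
            "arguments": {
                "location": "SF",
                "urgency": "urgent",
                "requires_manager_approval": True,
                # note: no `priority` filter, even though "urgent" sounds like one
            },
        },
    },
    {
        "query": "open tickets for Acme Corp",
        "target_call": {
            "name": "search_tickets",
            "arguments": {"customer_name": "Acme Corp", "status": "open"},
        },
    },
]

# Serialize to a JSONL file in a generic chat-style SFT format.
with open("search_fc_sft.jsonl", "w") as f:
    for ex in curated_examples:
        record = {
            "messages": [
                {"role": "user", "content": ex["query"]},
                {"role": "assistant", "content": json.dumps(ex["target_call"])},
            ]
        }
        f.write(json.dumps(record) + "\n")
```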
Based on our engagements with enterprise customers, we believe that deep customization and continuous learning from user interactions are essential to improving agent model quality. We provide tuning recipes and pipelines to our enterprise partners, enabling them to build robust agent models tailored to their specific systems with continuous quality improvement. Contact our team for a demo!