We are excited to announce the release of Athene-Llama3-70B by Nexusflow, a strong open-weights chat model fine-tuned from Meta AI’s Llama-3-70B. Athene-70B has achieved an impressive Arena-Hard-Auto score of 77.8%, placing it close to leading proprietary models such as GPT-4o (79.2%) and Claude-3.5-Sonnet (79.3%). This represents a significant leap from its predecessor, Llama-3-70B-Instruct, which scored 46.6%. Athene-70B is now under public testing on Chatbot Arena. The improvement comes from Nexusflow’s targeted post-training pipeline to enhance desired model behaviors.

Figure 1: Comparison of Arena-Hard-Auto performance. Arena-Hard is an automatic benchmark that has the highest correlation with human testing on the chatbot arena.

Harnessing the Power of Targeted Post-Training

To push Llama-3-70B to its limits, we first developed rigorous internal benchmarks that evaluate LLM capabilities in instruction following, coding, creative writing and multilingual tasks. Based on the evaluation results, we curated high-quality preference data for targeted Reinforcement Learning from Human Feedback (RLHF). Our pipeline has yielded significant performance improvements with respect to Llama-3-70B-Instruct. Following a similar line of analysis in [1], the figure below illustrates the model's notable enhancements across key aspects:

Figure 2 (Left): Comparative analysis of Athene-70B's improvements over Llama-3-70B-Instruct based on Ensemble-as-Judges. The number represents the win-rate of Athene-70B over Llama-3-70B-Instruct.
Figure 3 (Right): Performance comparison between Athene-70B and Llama-3-70B-Instruct across common languages. The number represents multilingual score of Athene-70B and Llama-3-70B-Instruct.
  1. Precise Instruction Following: Accurately responds to intricate prompts with clear, controllable, and concise outputs.
  2. Math and Reasoning: Strong capability for solving reasoning-heavy questions.
  3. Comprehensive Coding Assistance: Offers thorough, executable code suggestions, streamlining generation and implementation.
  4. Inspired Creative Writing: Generates innovative, engaging answers that showcase heightened imagination and flair.
  5. Multilingual Mastery: Excels across diverse language tasks, including precise translation, nuanced understanding, and more.

Building the Frontier Models for our Customers

Athene-70B highlights Nexusflow's ability to tailor models for specific enterprise needs through targeted post-training. Building on our successes with Starling-7B (the leading open 7B chat model) and NexusRaven-V2 (the top-performing open function calling model), we are dedicated to pushing our models to the frontier of enterprise-grade applications, offering customized enterprise-grade solutions that help businesses thrive in GenAI copilot and agent. Discover how Athene-70B and Nexusflow.ai can elevate your organization's AI initiatives by contacting us today!

[1] From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline. https://arxiv.org/abs/2406.11939  

[2]  What’s up with Llama 3? Arena data analysis. https://lmsys.org/blog/2024-05-08-llama3/

[3] Starling-7B: Increasing LLM Helpfulness & Harmlessness with RLAIF. https://starling.cs.berkeley.edu/ 

[4] NexusRaven-V2: Surpassing GPT-4 for Zero-shot Function Calling. https://nexusflow.ai/blogs/ravenv2