Ganqu Cui$^{\dagger}$, Lifan Yuan$^{\dagger}$, Zefan Wang$^*$, Hanbin Wang$^*$, Wendi Li$^*$, Bingxiang He$^*$, Yuchen Fan$^*$, Tianyu Yu$^*$, Qixin Xu$^*$, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding$^{*\dagger}$

$^{\dagger}$: Project lead

$^*$: Core contributors

GitHub: https://github.com/PRIME-RL/PRIME

Tell me and I forget, teach me and I remember, involve me and I learn.
— Benjamin Franklin

Introduction

Our Eurus-2-7B-PRIME excels at competition-level mathematics benchmarks, outperforming advanced math models and larger models. Notably, PRIME brings a substantial performance gain (+16.7% on average) over Eurus-2-7B-SFT.

While the advanced reasoning of large language models (LLMs) can be improved through data-driven imitation, this creates a fundamental scalability barrier: better reasoning requires exponentially more high-quality examples to imitate, making continuous improvement increasingly intractable. We believe the key to overcoming this challenge lies in transforming data-driven approaches into exploration-based methods, as exemplified by reinforcement learning (RL). To achieve this transformation, two critical challenges need to be addressed: (1) how to obtain precise reward signals efficiently and scalably, especially dense ones; and (2) how to build effective RL algorithms that fully unleash the potential of these signals.

In this blog, we seek a scalable path toward advanced reasoning capabilities through efficient reward modeling and reinforcement learning.

Our recent study introduced the implicit process reward modeling (PRM) objective. Without the need for any process labels, an implicit PRM is trained as an outcome reward model (ORM) and then used as a PRM. Inspired by this captivating property, we find that beyond improving model performance through inference-time scaling, the true power of the implicit PRM is unveiled in online RL training. Specifically, it brings three benefits to RL:

- Dense reward: the implicit PRM yields token-level rewards without training an extra value model, alleviating reward sparsity.
- Scalability: the implicit PRM can be updated online with only outcome labels, mitigating distribution shift and reward hacking during training.
- Simplicity: the implicit PRM is itself a language model, so it can be initialized directly from the SFT model without a separate reward training stage.

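To make the reward computation concrete, below is a minimal sketch of how implicit process rewards can be derived as token-level log-likelihood ratios between the implicit PRM and a frozen reference model. The model paths, the `beta` value, and the helper names are illustrative, not our exact implementation.

```python
# A minimal sketch of implicit process rewards, assuming Hugging Face
# `transformers` causal LMs. Model paths, `beta`, and helper names are
# illustrative rather than the exact training setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def token_log_probs(model, input_ids):
    """Log-probability the model assigns to each realized token (positions 1..T-1)."""
    with torch.no_grad():
        logits = model(input_ids).logits[:, :-1]  # predictions for the next token
    log_probs = torch.log_softmax(logits, dim=-1)
    return log_probs.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)


def implicit_process_rewards(prm, ref, input_ids, beta=0.05):
    """Token-level reward r_t = beta * log [ pi_prm(y_t | y_<t) / pi_ref(y_t | y_<t) ]."""
    return beta * (token_log_probs(prm, input_ids) - token_log_probs(ref, input_ids))


# Usage (illustrative; both models share the reference model's tokenizer):
# tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-7B")
# prm = AutoModelForCausalLM.from_pretrained("path/to/implicit-prm")  # trained with outcome labels only
# ref = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Math-7B")
# ids = tok(prompt + response, return_tensors="pt").input_ids
# rewards = implicit_process_rewards(prm, ref, ids)                   # shape: [1, seq_len - 1]
```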
We then dive into RL training to figure out its key algorithm designs and implementation techniques. To this end, we present Process Reinforcement through IMplicit rEwards (PRIME), which effectively incorporates and updates PRMs during RL.
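To make this concrete, here is a conceptual sketch of a single PRIME iteration built on the `implicit_process_rewards` helper above; the `policy`, `prm`, `verifier`, and `encode` objects and their methods are hypothetical placeholders rather than our released implementation, and the full algorithm is detailed later in this blog.

```python
# Conceptual sketch of one PRIME iteration. `policy`, `prm`, `ref`,
# `verifier`, and `encode` are hypothetical placeholders.
def prime_step(policy, prm, ref, verifier, prompts, num_samples=4, beta=0.05):
    for prompt in prompts:
        # 1. Sample several on-policy rollouts for each prompt.
        responses = [policy.generate(prompt) for _ in range(num_samples)]

        # 2. Sparse outcome rewards from a rule-based verifier (e.g., answer matching).
        outcome_rewards = [verifier(prompt, resp) for resp in responses]

        # 3. Dense process rewards from the implicit PRM (token-level log-ratios).
        process_rewards = [
            implicit_process_rewards(prm, ref, encode(prompt, resp), beta)
            for resp in responses
        ]

        # 4. Combine outcome and process rewards into advantages and update the policy.
        policy.update(prompt, responses, outcome_rewards, process_rewards)

        # 5. Update the implicit PRM online on the same rollouts,
        #    using only the outcome labels.
        prm.update(prompt, responses, outcome_rewards)
```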

As an intermediate result, through PRIME, we achieve substantial improvements on key reasoning benchmarks over the SFT version of the model, leading to a 16.7% improvement on average and over 20% on the AMC and AIME competitions. Our final model, Eurus-2-7B-PRIME, based on Qwen2.5-Math-7B-Base, surpasses its instruct version on 5 key reasoning benchmarks. We then train a PRM with the implicit PRM objective for inference-time scaling, which further boosts the model's reasoning capability.

The evaluation results shown in the opening figure are detailed below:

|  | Eurus-2-7B-PRIME | Eurus-2-7B-SFT | Qwen-2.5-Math-7B-Instruct | Llama-3.1-70B-Instruct | GPT-4o |
| --- | --- | --- | --- | --- | --- |
| AIME 2024 | 26.7 (+23.3) | 3.3 | 13.3 | 16.7 | 9.3 |
| MATH-500 | 79.2 (+14.1) | 65.1 | 79.8 | 64.6 | 76.4 |
| AMC | 57.8 (+27.7) | 30.1 | 50.6 | 30.1 | 45.8 |
| Minerva Math | 38.6 (+5.9) | 32.7 | 34.6 | 35.3 | 36.8 |
| OlympiadBench | 42.1 (+12.3) | 29.8 | 40.7 | 31.9 | 43.3 |
| Avg. | 48.9 (+16.7) | 32.2 | 43.8 | 35.7 | 43.3 |

We achieve this with only 1/10 of the data resources used by Qwen2.5-Math-7B-Instruct. The following is a comparison of resource requirements between Eurus-2-7B-PRIME and Qwen2.5-Math-7B-Instruct.

|  | Eurus-2-7B-PRIME | Qwen2.5-Math-7B-Instruct |
| --- | --- | --- |
| Base Model | Qwen2.5-Math-7B | Qwen2.5-Math-7B |
| SFT Data | 230K (open-source) | 2.5M (open-source and in-house) |
| RM Data | 0 | 618K (in-house) |
| RM | Eurus-2-7B-SFT | Qwen2.5-Math-RM (72B) |
| RL Data | 150K queries $\times$ 4 samples | 66K queries $\times$ 32 samples |

This blog will introduce: