Ganqu Cui$^{\dagger}$, Lifan Yuan$^{\dagger}$, Zefan Wang$^*$, Hanbin Wang$^*$, Wendi Li$^*$, Bingxiang He$^*$, Yuchen Fan$^*$, Tianyu Yu$^*$, Qixin Xu$^*$, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding$^{*\dagger}$

$^{\dagger}$: Project lead

$^*$: Core contributors

GitHub: https://github.com/PRIME-RL/PRIME

Tell me and I forget, teach me and I remember, involve me and I learn.
— Benjamin Franklin

Introduction

Our Eurus-2-7B-PRIME excels at competition-level mathematics benchmarks, outperforming advanced math models and larger models. Notably, PRIME brings a substantial performance gain (+16.7% on average) over Eurus-2-7B-SFT.

While the advanced reasoning of large language models (LLMs) can be improved through data-driven imitation, this creates a fundamental scalability barrier: better reasoning requires exponentially more high-quality examples to imitate, making continuous improvement increasingly intractable. We believe the key to overcoming this challenge lies in transforming data-driven approaches into exploration-based methods, as exemplified by reinforcement learning (RL). To achieve this transformation, two critical challenges need to be addressed: (1) how to obtain precise reward signals efficiently and scalably, especially dense ones; and (2) how to build effective RL algorithms that fully unleash the potential of these signals.

In this blog, we seek a scalable path toward advanced reasoning capabilities through efficient reward modeling and reinforcement learning.

Our recent study introduced the implicit process reward modeling (PRM) objective. Without the need for any process labels, an implicit PRM is trained as an outcome reward model (ORM) and then used as a PRM. Inspired by this captivating property, we find that beyond improving model performance through inference-time scaling, the true power of the implicit PRM is unveiled in online RL training. Specifically, it brings three benefits to RL:

- Dense reward: the implicit PRM yields token-level rewards without training an extra value model, alleviating reward sparsity.
- Scalability: the implicit PRM can be updated online with only outcome labels, mitigating distribution shift and reward hacking during training.
- Simplicity: the implicit PRM is itself a language model, so it can be initialized directly from the SFT model without a separate reward training stage.

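To make the reward computation concrete, below is a minimal sketch of how implicit process rewards can be derived as token-level log-likelihood ratios between the implicit PRM and a frozen reference model. The model paths, the `beta` value, and the helper names are illustrative, not our exact implementation.

```python
# A minimal sketch of implicit process rewards, assuming Hugging Face
# `transformers` causal LMs. Model paths, `beta`, and helper names are
# illustrative rather than the exact training setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def token_log_probs(model, input_ids):
    """Log-probability the model assigns to each realized token (positions 1..T-1)."""
    with torch.no_grad():
        logits = model(input_ids).logits[:, :-1]  # predictions for the next token
    log_probs = torch.log_softmax(logits, dim=-1)
    return log_probs.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)


def implicit_process_rewards(prm, ref, input_ids, beta=0.05):
    """Token-level reward r_t = beta * log [ pi_prm(y_t | y_<t) / pi_ref(y_t | y_<t) ]."""
    return beta * (token_log_probs(prm, input_ids) - token_log_probs(ref, input_ids))


# Usage (illustrative; both models share the reference model's tokenizer):
# tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-7B")
# prm = AutoModelForCausalLM.from_pretrained("path/to/implicit-prm")  # trained with outcome labels only
# ref = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Math-7B")
# ids = tok(prompt + response, return_tensors="pt").input_ids
# rewards = implicit_process_rewards(prm, ref, ids)                   # shape: [1, seq_len - 1]
```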
We then dive into RL training to figure out its key algorithm designs and implementation techniques. To this end, we present Process Reinforcement through IMplicit rEwards (PRIME), which effectively incorporates and updates PRMs during RL.
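To make this concrete, here is a conceptual sketch of a single PRIME iteration built on the `implicit_process_rewards` helper above; the `policy`, `prm`, `verifier`, and `encode` objects and their methods are hypothetical placeholders rather than our released implementation, and the full algorithm is detailed later in this blog.

```python
# Conceptual sketch of one PRIME iteration. `policy`, `prm`, `ref`,
# `verifier`, and `encode` are hypothetical placeholders.
def prime_step(policy, prm, ref, verifier, prompts, num_samples=4, beta=0.05):
    for prompt in prompts:
        # 1. Sample several on-policy rollouts for each prompt.
        responses = [policy.generate(prompt) for _ in range(num_samples)]

        # 2. Sparse outcome rewards from a rule-based verifier (e.g., answer matching).
        outcome_rewards = [verifier(prompt, resp) for resp in responses]

        # 3. Dense process rewards from the implicit PRM (token-level log-ratios).
        process_rewards = [
            implicit_process_rewards(prm, ref, encode(prompt, resp), beta)
            for resp in responses
        ]

        # 4. Combine outcome and process rewards into advantages and update the policy.
        policy.update(prompt, responses, outcome_rewards, process_rewards)

        # 5. Update the implicit PRM online on the same rollouts,
        #    using only the outcome labels.
        prm.update(prompt, responses, outcome_rewards)
```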

As an intermediate result, through PRIME, we achieve substantial improvements on key reasoning benchmarks over the SFT version of the model, leading to a 16.7% improvement on average and over 20% on the AMC and AIME competitions. Our final model, Eurus-2-7B-PRIME, based on Qwen2.5-Math-7B-Base, surpasses its instruct version on 5 key reasoning benchmarks. We then train a PRM with the implicit PRM objective for inference-time scaling, which further boosts the model's reasoning capability.

The evaluation results shown in the opening figure are detailed below:

|  | Eurus-2-7B-PRIME | Eurus-2-7B-SFT | Qwen-2.5-Math-7B-Instruct | Llama-3.1-70B-Instruct | GPT-4o |
| --- | --- | --- | --- | --- | --- |
| AIME 2024 | 26.7 (+23.3) | 3.3 | 13.3 | 16.7 | 9.3 |
| MATH-500 | 79.2 (+14.1) | 65.1 | 79.8 | 64.6 | 76.4 |
| AMC | 57.8 (+27.7) | 30.1 | 50.6 | 30.1 | 45.8 |
| Minerva Math | 38.6 (+5.9) | 32.7 | 34.6 | 35.3 | 36.8 |
| OlympiadBench | 42.1 (+12.3) | 29.8 | 40.7 | 31.9 | 43.3 |
| Avg. | 48.9 (+16.7) | 32.2 | 43.8 | 35.7 | 43.3 |

We achieve this with only 1/10 of the data resources used by Qwen2.5-Math-7B-Instruct. The following is a comparison of resource requirements between Eurus-2-7B-PRIME and Qwen2.5-Math-7B-Instruct.

|  | Eurus-2-7B-PRIME | Qwen2.5-Math-7B-Instruct |
| --- | --- | --- |
| Base Model | Qwen2.5-Math-7B | Qwen2.5-Math-7B |
| SFT Data | 230K (open-source) | 2.5M (open-source and in-house) |
| RM Data | 0 | 618K (in-house) |
| RM | Eurus-2-7B-SFT | Qwen2.5-Math-RM (72B) |
| RL Data | 150K queries $\times$ 4 samples | 66K queries $\times$ 32 samples |

This blog will introduce: