Process Reinforcement through Implicit Rewards

Submitted by Style Pass, 2025-01-03 05:30:03

Ganqu Cui$^{\dagger}$, Lifan Yuan$^{\dagger}$, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding*$^{\dagger}$

Our Eurus-2-7B-PRIME excels at competition-level mathematics benchmarks, outperforming both advanced math models and larger models. Notably, PRIME brings a substantial performance gain (+16.7%) over Eurus-2-7B-SFT.

Data-driven imitation can improve the advanced reasoning of large language models (LLMs), but it creates fundamental scalability barriers: better reasoning requires exponentially more high-quality examples to imitate, which makes continuous improvement increasingly intractable. We believe the key to overcoming these challenges lies in transforming data-driven approaches into exploration-based methods, as exemplified by reinforcement learning (RL). Making this transformation requires addressing two critical challenges: (1) how to obtain precise reward signals, especially dense ones, efficiently and scalably; and (2) how to build effective RL algorithms that fully unleash the potential of these signals.

In this blog, we seek a scalable path towards advanced reasoning capabilities through efficient reward modeling and reinforcement learning.
