Decentralized Training in the Inference-Time-Compute Paradigm

OpenAI’s o1 and o3 represent a major shift in how language models are trained and used - by spending more time thinking about problems during inference, they can better handle complex tasks like solving math problems or writing code.

Reasoning models like these are trained differently from traditional chat models: given a set of problems, they generate candidate solutions that are assigned rewards based on their correctness, which lets the models learn from their own answers and iteratively self-improve without training on human task demonstrations. This iterative process is called online reinforcement learning (RL).
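
Concretely, a single step of such an online RL loop can be sketched roughly as follows; generate, verify_answer and rl_update are hypothetical placeholders for the model's sampler, an answer checker, and the policy update, not the API of any particular framework:

    # Schematic of one online RL training step for a reasoning model.
    # generate, verify_answer and rl_update are hypothetical placeholders;
    # this sketches the loop described above, not a specific library's API.
    def training_step(model, problems, n_samples=4):
        # Sample several candidate solutions from the current model.
        candidates = [[generate(model, p) for _ in range(n_samples)]
                      for p in problems]
        # Reward each candidate by checking whether its final answer is correct.
        rewards = [[1.0 if verify_answer(p, c) else 0.0 for c in cands]
                   for p, cands in zip(problems, candidates)]
        # Update the policy on its own rewarded samples (e.g. a policy-gradient step).
        rl_update(model, problems, candidates, rewards)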

PRIME-RL, a recent piece of work from Tsinghua University and UIUC, introduces an online RL algorithm that leverages rewards from an implicit process reward model. Using this approach, the authors trained Eurus-2-7B-PRIME, which (prior to the release of DeepSeek-R1) substantially outperformed all other 7B-parameter models, as well as larger models like GPT-4o, on mathematical reasoning benchmarks.
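
The "implicit" part refers to deriving dense, per-token process rewards from the log-likelihood ratio between the reward model and a frozen reference model, without needing step-level human labels. A minimal sketch of that idea looks like the following; the variable names and the beta value are illustrative, not taken from the PRIME-RL code:

    # Sketch of the implicit process reward idea (illustrative, not PRIME-RL's
    # exact implementation): per-token rewards are a scaled log-likelihood
    # ratio between the implicit PRM and a frozen reference model.
    def implicit_process_rewards(prm_logprobs, ref_logprobs, beta=0.05):
        # prm_logprobs / ref_logprobs: per-token log-probabilities of the
        # sampled solution under the implicit PRM and the reference model.
        # beta is a scaling coefficient; the value here is only illustrative.
        return [beta * (lp - lr) for lp, lr in zip(prm_logprobs, ref_logprobs)]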

While PRIME-RL and related methods achieve strong performance, online reinforcement learning tends to be unstable and considerably more expensive than supervised training: because solutions have to be sampled from the model at every training step, the whole process takes a very long time - reproducing Eurus-2-7B-PRIME takes more than 5 days on an 8xH100 node and costs more than $2700.
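
Taking those numbers at face value, 5 days on 8 H100s is roughly 5 × 24 × 8 ≈ 960 GPU-hours, so the quoted $2700 works out to about $2.8 per H100-hour.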
