SFT is bad RL


There have been a few papers recently observing that learning from incorrect examples helps, even in verifiable domains. Here, verifiable just means we have access to some ground-truth reward function, usually binary (0/1) or something in between. These papers found that training on a larger number of incorrect examples can result in better performance than training on just the positive examples. This is puzzling. Why would we ever want to train on incorrect examples? My gut says that correct/incorrect is the wrong distinction. Instead, we should be asking: What’s the advantage of any given datapoint?

In this blog post, we show that you can do better than training directly on incorrect examples. We first show that supervised learning on incorrect examples is an instance of reinforcement learning (RL), and then that actually following RL basics leads to a better approach.

The goal of supervised learning is to train a student policy $p(x)$ to clone the behaviour of a teacher $p^*(x)$. The teacher provides sample trajectories which serve as examples for the student. SFT searches over student policies to minimize the KL-divergence from the teacher to the student, which is equivalent to maximizing the log-likelihood of the teacher’s samples:
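To make that equivalence concrete, here is one standard way to write it out (a sketch; the notation $x_1, \dots, x_N$ for the teacher’s samples is ours, not the post’s):

$$
\min_{p}\;\mathrm{KL}\!\left(p^{*}\,\|\,p\right)
= \min_{p}\;\mathbb{E}_{x\sim p^{*}}\!\left[\log\frac{p^{*}(x)}{p(x)}\right]
= \max_{p}\;\mathbb{E}_{x\sim p^{*}}\!\left[\log p(x)\right]
\;\approx\; \max_{p}\;\frac{1}{N}\sum_{i=1}^{N}\log p(x_{i}),
$$

where $x_1,\dots,x_N$ are samples drawn from the teacher. The $\mathbb{E}_{x\sim p^{*}}[\log p^{*}(x)]$ term inside the KL is the teacher’s negative entropy, which does not depend on $p$, so dropping it leaves exactly the log-likelihood objective that SFT optimizes.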
