SFT is bad RL


There have been a few papers recently observing that learning from incorrect examples helps, even in verifiable domains. Here, verifiable just means we have access to some ground-truth reward function, usually binary (0/1) or something in between. These papers found that training on a larger number of incorrect examples can result in better performance than training on just the positive examples. This is puzzling. Why would we ever want to train on incorrect examples? My gut says that correct/incorrect is the wrong distinction. Instead, we should be asking: What’s the advantage of any given datapoint?

In this blog post, we show that you can do better than training directly on incorrect examples. We first show that supervised learning on incorrect examples is an instance of reinforcement learning (RL), and then that actually following RL basics leads to a better approach.

The goal of supervised learning is to train a student policy $p(x)$ to clone the behaviour of a teacher $p^*(x)$. The teacher provides sample trajectories which serve as examples for the student. SFT searches over student policies to minimize the KL-divergence from the teacher to the student, which is equivalent to maximizing the log-likelihood of the teacher’s samples:
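To make that equivalence concrete, here is one standard way to write it out (a sketch; the notation $x_1, \dots, x_N$ for the teacher’s samples is ours, not the post’s):

$$
\min_{p}\;\mathrm{KL}\!\left(p^{*}\,\|\,p\right)
= \min_{p}\;\mathbb{E}_{x\sim p^{*}}\!\left[\log\frac{p^{*}(x)}{p(x)}\right]
= \max_{p}\;\mathbb{E}_{x\sim p^{*}}\!\left[\log p(x)\right]
\;\approx\; \max_{p}\;\frac{1}{N}\sum_{i=1}^{N}\log p(x_{i}),
$$

where $x_1,\dots,x_N$ are samples drawn from the teacher. The $\mathbb{E}_{x\sim p^{*}}[\log p^{*}(x)]$ term inside the KL is the teacher’s negative entropy, which does not depend on $p$, so dropping it leaves exactly the log-likelihood objective that SFT optimizes.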
