Diffusion Forcing enjoys key strengths of both next-token autoregressive models and full-sequence diffusion models. By training Diffusion Forcing once, one can flexibly control its behavior at sampling time, simultaneously performing flexible and compositional generation like next-token models and sequence-level guidance like full-sequence diffusion models.
Diffusion Forcing achieves this by training sequence diffusion while allowing each token to carry its own noise level. One can view diffusion noise as a varying degree of masking, which yields a unified view: full-sequence diffusion denoises all frames at once at a single shared noise level, while next-token prediction denoises one frame at a time with zero noise on its past tokens.
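The per-token noise levels are the only change to an otherwise standard diffusion training loop. Below is a minimal sketch of this idea, assuming a DDPM-style forward process and a hypothetical causal `model(x_noisy, k)` that predicts the per-token noise; all names are illustrative rather than the paper's actual API.

```python
import torch
import torch.nn.functional as F

T_DIFF = 1000                                    # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T_DIFF)       # standard DDPM noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative signal retention

def diffusion_forcing_loss(model, x):
    """x: clean token sequence of shape (batch, seq_len, dim)."""
    B, L, D = x.shape
    # Key idea: sample an INDEPENDENT noise level for every token,
    # rather than one shared level for the whole sequence.
    k = torch.randint(0, T_DIFF, (B, L))              # per-token noise level
    a = alpha_bar[k].unsqueeze(-1)                    # (B, L, 1)
    eps = torch.randn_like(x)
    x_noisy = a.sqrt() * x + (1 - a).sqrt() * eps     # forward process, per token
    eps_pred = model(x_noisy, k)                      # model is told each token's level
    return F.mse_loss(eps_pred, eps)
```

Setting all entries of `k` equal recovers ordinary full-sequence diffusion training, while setting `k` to zero on a prefix and maximal noise on the next token recovers next-token prediction, which is what makes the unified view concrete.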
As a result, one can assign different noise levels across a sequence at sampling time to achieve flexible behaviors such as stabilizing autoregressive rollouts, applying guidance over long horizons, or planning under causal uncertainty.
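These sampling behaviors can all be expressed as a 2D grid of noise levels, where entry `K[m, t]` is the noise level of token `t` at sampling step `m`. The sketch below, an illustration rather than the paper's exact scheduler, contrasts two such grids: a uniform one that reproduces full-sequence diffusion, and a pyramid-shaped one that resolves earlier tokens first for an autoregressive-like sweep.

```python
import torch

def full_sequence_schedule(n_steps, seq_len):
    # All tokens share one noise level at every step:
    # standard full-sequence diffusion.
    levels = torch.linspace(n_steps - 1, 0, n_steps).long()
    return levels.unsqueeze(1).expand(n_steps, seq_len)

def pyramid_schedule(n_steps, seq_len, delay=1):
    # Token t begins denoising `t * delay` steps after token t-1, so the
    # past resolves before the future. Keeping residual noise on not-yet-
    # resolved future tokens is what enables stable long rollouts.
    rows = n_steps + (seq_len - 1) * delay
    m = torch.arange(rows).unsqueeze(1)         # sampling step index
    t = torch.arange(seq_len).unsqueeze(0)      # token index
    return (n_steps - 1 - m + t * delay).clamp(0, n_steps - 1)
```

Sampling then walks the grid row by row, partially denoising each token `t` from level `K[m-1, t]` to `K[m, t]` at step `m`; since every row is a valid noise configuration seen during training, sequence-level guidance can be applied at any intermediate step.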