I’ve been thinking about Reinforcement Learning from Human Feedback (RLHF) a lot lately, mostly as a result of my AGISF capstone project attempting to use it to teach a language model to write better responses to Reddit writing prompts, a la Learning to summarize from human feedback.
RLHF has generated some impressive outputs lately, but there seems to be a significant amount of disagreement regarding its potential as a partial or complete solution to alignment: some are excited to extend the promising results we have so far, while others are more pessimistic and perhaps even opposed to further work along these lines. I find myself optimistic about the usefulness of RLHF work, but far from confident that all of the method’s shortcomings can be overcome.
At a high level, RLHF learns a reward model for a certain task based on human feedback and then trains a policy to optimize the reward received from the reward model. In practice, the reward model learned is likely overfit - the policy can thus benefit from interpolating between a policy that optimizes the reward model’s reward and a policy trained through pure imitation learning.