Direct Preference Optimization Explained In-depth

With my first blog post, I want to cover an excellent paper that was published last year: Direct Preference Optimization: Your Language Model is Secretly a Reward Model by Rafailov et al.

Commonly referred to as DPO, this method of preference tuning is an alternative to Reinforcement Learning from Human Feedback (RLHF) that avoids the actual reinforcement learning. In this blog post, I will explain DPO from first principles; readers do not need an understanding of RLHF. However, fair warning that there will be some math involved - mostly probability, algebra, and optimization - but I will do my best to explain everything clearly.

To contextualize DPO, and preference-tuning in general, let’s review the modern process for creating language models such as ChatGPT or Claude. The following steps are sequential, with each one building upon the previous:

Pre-train a base model on internet-scale data. Given a snippet of text, this model is trained to predict the immediate next word (see the sketch below). This conceptually simple task scales up extremely well and allows LLMs to encode a huge amount of knowledge from their training data. Examples of base models include GPT-3, Llama 3, and Mistral.
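
To make the next-word-prediction objective a little more concrete, here is a minimal sketch of the pre-training loss in PyTorch. Everything here is illustrative rather than taken from the paper: `model` stands for any network that maps token ids to per-position vocabulary logits, and the shapes and names are assumptions of mine.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    # token_ids: (batch, seq_len) integer tensor of tokenized text.
    inputs = token_ids[:, :-1]   # tokens the model conditions on
    targets = token_ids[:, 1:]   # the "immediate next word" at each position
    logits = model(inputs)       # assumed shape: (batch, seq_len - 1, vocab_size)

    # Cross-entropy between the model's predicted distribution over the
    # vocabulary and the token that actually came next, averaged over all
    # positions in the batch.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```

Minimizing this loss over billions of text snippets is all that pre-training does; everything else in the pipeline builds on the model it produces.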
