Direct Preference Optimization Explained In-depth

With my first blog post, I want to cover an excellent paper that was published last year: Direct Preference Optimization: Your Language Model is Secretly a Reward Model by Rafailov et al.

Commonly referred to as DPO, this method of preference tuning is an alternative to Reinforcement Learning from Human Feedback (RLHF) that avoids the actual reinforcement learning. In this blog post, I will explain DPO from first principles; readers do not need an understanding of RLHF. However, fair warning that there will be some math involved - mostly probability, algebra, and optimization - but I will do my best to explain everything clearly.

To contextualize DPO, and preference-tuning in general, let’s review the modern process for creating language models such as ChatGPT or Claude. The following steps are sequential, with each one building upon the previous:

Pre-train a base model on internet-scale data. Given a snippet of text, this model is trained to predict the immediate next word (see the sketch below). This conceptually simple task scales up extremely well and allows LLMs to encode a huge amount of knowledge from their training data. Examples of base models include GPT-3, Llama 3, and Mistral.
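
To make the next-word-prediction objective a little more concrete, here is a minimal sketch of the pre-training loss in PyTorch. Everything here is illustrative rather than taken from the paper: `model` stands for any network that maps token ids to per-position vocabulary logits, and the shapes and names are assumptions of mine.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    # token_ids: (batch, seq_len) integer tensor of tokenized text.
    inputs = token_ids[:, :-1]   # tokens the model conditions on
    targets = token_ids[:, 1:]   # the "immediate next word" at each position
    logits = model(inputs)       # assumed shape: (batch, seq_len - 1, vocab_size)

    # Cross-entropy between the model's predicted distribution over the
    # vocabulary and the token that actually came next, averaged over all
    # positions in the batch.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```

Minimizing this loss over billions of text snippets is all that pre-training does; everything else in the pipeline builds on the model it produces.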
