Using Reinforcement Learning and $4.80 of GPU Time to Find the Best HN Post Ever (RLHF Part 1) - OpenPipe

2024-10-28 17:30:09

Background: I’m Kyle, the founder of OpenPipe. OpenPipe is a managed fine-tuning service that makes it easy to build your own LLMs that achieve very high accuracy on a specific task. In this post we’ll go under the covers and explain RLHF, which is one of the techniques we use to accomplish this.

None reached the front page; in fact none of them even got any upvotes! But they were all identified by a fine-tuned model as being likely to do well on HN. And subjectively (as someone who spends far more time on HN than I should) I actually agree with the model on this one; those all look like stories that deserved more attention than they got.

In this post we’ll discuss how to build a reward model that can predict the upvote count that a specific HN story will get. And in follow-up posts in this series, we’ll use that reward model along with reinforcement learning to create a model that can write high-value HN stories!
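To make the reward-model idea concrete, here's a minimal sketch of what "a model that predicts the score a story will get" means in code. This is hypothetical illustration, not OpenPipe's implementation: a trivial bag-of-words linear regressor trained by gradient descent on made-up (title, score) pairs. The real reward model is a fine-tuned LLM, but the shape of the problem — map a title to a predicted score, trained on historical examples — is the same.

```python
# Hedged sketch of a reward model: title -> predicted score.
# The training data below is made up purely for illustration.

TRAIN = [  # hypothetical (title, score) pairs, not real HN data
    ("Show HN: I built a database in Rust", 4.0),
    ("My weekend project", 0.5),
    ("Why Rust is taking over systems programming", 3.5),
    ("Random thoughts", 0.2),
]

def features(title):
    # Trivial featurization: lowercase words in the title
    return title.lower().split()

def predict(weights, bias, title):
    # Linear model: bias + sum of per-word weights
    return bias + sum(weights.get(w, 0.0) for w in features(title))

def train(data, lr=0.01, epochs=500):
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for title, target in data:
            err = predict(weights, bias, title) - target
            # Gradient step on squared error for bias and each word weight
            bias -= lr * err
            for w in features(title):
                weights[w] = weights.get(w, 0.0) - lr * err
    return weights, bias

w, b = train(TRAIN)
# A title resembling the high-scoring examples should score higher
# than one resembling the low-scoring examples
print(predict(w, b, "Show HN: I built a database in Rust") >
      predict(w, b, "Random thoughts"))
```

The follow-up posts use a model like this as the "reward" signal in reinforcement learning: the generator proposes titles, and the reward model scores them.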

Reinforcement learning (RL) is a set of ML techniques that improves a model's performance by letting it take actions in an environment and then rewarding or penalizing it for those actions. Based on those rewards and penalties, the model's behavior is updated over time so that it (hopefully) does more of the actions that are rewarded and avoids those that are penalized.
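That loop — act, observe reward, update behavior — can be sketched in a few lines. This toy (not anything OpenPipe runs) is a two-armed bandit: the agent chooses between two actions, the environment rewards only action 1, and a simple value update shifts the agent's behavior toward the rewarded action over time.

```python
import random

def train(steps=500, lr=0.1, eps=0.1, seed=0):
    """Toy RL loop: learn which of two actions earns reward."""
    rng = random.Random(seed)  # fixed seed so the run is deterministic
    q = [0.0, 0.0]  # estimated reward for each action
    for _ in range(steps):
        # Occasionally explore a random action; otherwise exploit
        # the action with the higher current estimate
        if rng.random() < eps:
            action = rng.randrange(2)
        else:
            action = 0 if q[0] >= q[1] else 1
        # The environment rewards action 1 and not action 0
        reward = 1.0 if action == 1 else 0.0
        # Update: move the estimate toward the observed reward
        q[action] += lr * (reward - q[action])
    return q

q = train()
print(q)  # the estimate for action 1 ends up well above action 0
```

Real RLHF replaces the two actions with token sequences and the hard-coded reward with the reward model's score, but the core feedback loop is the same.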
