
Aligning an LLM with Human Preferences


To better align the responses that instruction-tuned LLMs generate with what humans would prefer, we can train LLMs against a reward model or a dataset of human preferences in a process known as RLHF (Reinforcement Learning from Human Feedback).
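To make this concrete, a preference dataset is just a collection of records in which each prompt is paired with a preferred response and a rejected one. A minimal sketch of one such record is shown below; the prompt/chosen/rejected field names follow a common convention, but actual column names vary by dataset.

```python
# One illustrative human-preference record (field names and text are placeholders).
preference_example = {
    "prompt": "Explain what a reward model is in one sentence.",
    "chosen": "A reward model scores candidate responses by how well they match human preferences.",
    "rejected": "It is a model.",
}
```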

DataDreamer makes this process simple and straightforward. We demonstrate it below, using DPO with LoRA so that only a fraction of the model's weights are trained.
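The post itself walks through DataDreamer's own trainer; as a rough, library-agnostic sketch of the same idea, the example below uses Hugging Face TRL's `DPOTrainer` together with a PEFT `LoraConfig`, so only the LoRA adapter weights are updated during DPO. The model name, dataset, and hyperparameters are placeholders, and the exact TRL API surface (e.g. `DPOConfig`, `processing_class`) differs between versions.

```python
# A minimal DPO + LoRA sketch with Hugging Face TRL and PEFT (not DataDreamer's API).
# Model, dataset, and hyperparameters below are illustrative placeholders.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2-0.5B-Instruct"  # any instruction-tuned causal LM
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference data: each example pairs a prompt with a chosen and a rejected response.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# LoRA: train a small set of adapter weights instead of the full model.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

training_args = DPOConfig(
    output_dir="./dpo-lora-output",
    beta=0.1,                      # strength of the preference (KL) penalty
    per_device_train_batch_size=1,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,                # with LoRA, the frozen base weights serve as the reference
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,    # older TRL versions take `tokenizer=` instead
    peft_config=peft_config,
)
trainer.train()
```

Passing `ref_model=None` together with a `peft_config` avoids holding a second full copy of the model in memory: disabling the adapters recovers the frozen reference policy that DPO compares against.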
