Reinforcement Learning from Human Feedback, Explained Simply

The appearance of ChatGPT in 2022 changed how the world perceives artificial intelligence. Its strong performance led to the rapid development of other powerful LLMs.

We could roughly say that ChatGPT is an upgraded version of GPT-3. Compared with previous GPT versions, however, OpenAI developers did not just use more data or a more complex model architecture. Instead, they designed a technique that enabled a breakthrough.

In this article, we will talk about RLHF (reinforcement learning from human feedback), a fundamental algorithm at the core of ChatGPT that overcomes the limits of human annotation for LLMs. Although the algorithm is based on proximal policy optimization (PPO), we will keep the explanation simple and will not go into the details of reinforcement learning, which is not the focus of this article.

NLP development before ChatGPT

To set the context, let us recall how LLMs were developed before ChatGPT. In most cases, LLM development consisted of two stages:

Pre-training includes language modeling — a task in which a model tries to predict a hidden token in the context. The probability distribution produced by the model for the hidden token is then compared to the ground truth distribution for loss calculation and further backpropagation. In this way, the model learns the semantic structure of the language and the meaning behind words.
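To make the loss calculation above concrete, here is a minimal sketch for a single hidden token. The four-word vocabulary and the model scores are made up for illustration, and the ground truth is assumed to be a one-hot distribution, in which case the cross-entropy reduces to the negative log-probability of the true token.

```python
import math

def softmax(logits):
    """Turn raw model scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Loss for one hidden token: -log of the probability given to the true token."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical 4-token vocabulary and made-up model scores for one hidden position.
vocab = ["cat", "dog", "sat", "mat"]
logits = [2.0, 0.5, 0.1, -1.0]

loss_if_cat = cross_entropy(logits, vocab.index("cat"))  # true token got a high score
loss_if_mat = cross_entropy(logits, vocab.index("mat"))  # true token got a low score
print(loss_if_cat, loss_if_mat)
```

The loss is small when the model already assigns a high probability to the true token and large otherwise, which is the signal backpropagation uses to update the weights.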

If you want to learn more about the pre-training and fine-tuning framework, check out my article about BERT.

After that, the model is fine-tuned on a downstream task, which might include different objectives: text summarization, text translation, text generation, question answering, etc. In many situations, fine-tuning requires a human-labeled dataset, which should preferably contain enough text samples to allow the model to generalize its learning well and avoid overfitting.

This is where the limits of fine-tuning appear. Data annotation is usually a time-consuming task performed by humans. Let us take a question-answering task, for example. To construct training samples, we would need a manually labeled dataset of questions and answers. For every question, we would need a precise answer provided by a human. For instance:

During data annotation, providing full answers to prompts requires a lot of human time.

In reality, for training an LLM, we would need millions or even billions of such (question, answer) pairs. This annotation process is very time-consuming and does not scale well.

RLHF

Having understood the main problem, now is the perfect moment to dive into the details of RLHF.

If you have already used ChatGPT, you have probably encountered a situation in which ChatGPT asks you to choose the answer that better suits your initial prompt:

*The ChatGPT interface asks a user to rate two possible answers.*

This information is actually used to continuously improve ChatGPT. Let us understand how.

First of all, it is important to notice that choosing the better of two answers is a much simpler task for a human than writing an exact answer to an open question. The idea we are going to look at is based exactly on that: to create the annotated dataset, we only ask the human to choose one of two possible answers.

*Choosing between two options is an easier task than asking someone to write the best possible response.*

Response generation

In LLMs, there are several possible ways to generate a response from the distribution of predicted token probabilities:

Greedy decoding: given an output distribution p over tokens, the model always deterministically chooses the token with the highest probability.

*The model always selects the token with the highest softmax probability.*

Sampling: given an output distribution p over tokens, the model randomly samples a token according to its assigned probability.

The model randomly chooses a token each time. The highest probability does not guarantee that the corresponding token will be chosen. When the generation process is run again, the results can be different.
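The two strategies can be compared in a few lines. This is only a sketch over a made-up three-token distribution; the function names are hypothetical, and the random generator is seeded purely to make the example reproducible.

```python
import random

def greedy_pick(probs):
    """Always return the index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sample_pick(probs, rng):
    """Draw an index at random, weighted by the token probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Made-up vocabulary and output distribution for a single generation step.
vocab = ["yes", "no", "maybe"]
probs = [0.6, 0.3, 0.1]

rng = random.Random(0)  # seeded only so the sketch is reproducible
greedy_tokens = [vocab[greedy_pick(probs)] for _ in range(5)]
sampled_tokens = [vocab[sample_pick(probs, rng)] for _ in range(5)]
print(greedy_tokens)   # the same token every time
print(sampled_tokens)  # can differ from draw to draw
```

Greedy selection returns the same token on every run, while sampling can return any token with non-zero probability, which is what makes it possible to obtain two different responses to the same prompt.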

This second sampling method results in more randomized model behavior, which allows the generation of diverse text sequences. For now, let us suppose that we generate many pairs of such sequences. The resulting dataset of pairs is labeled by humans: for every pair, a human is asked which of the two output sequences fits the input sequence better. The annotated dataset is used in the next step.

In the context of RLHF, the annotated dataset created in this way is called “Human Feedback”.
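One record of such a dataset could be stored as follows. The `PreferencePair` class, the `label_pair` helper and the example texts are assumptions made for illustration, not a format prescribed by any particular library.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # the response the annotator preferred
    rejected: str  # the other response

def label_pair(prompt, response_a, response_b, a_is_better):
    """Turn one human judgment into a training record."""
    if a_is_better:
        return PreferencePair(prompt, chosen=response_a, rejected=response_b)
    return PreferencePair(prompt, chosen=response_b, rejected=response_a)

# Made-up prompt and responses; the annotator only says which one is better.
pair = label_pair(
    "What is the capital of France?",
    "Paris is the capital of France.",
    "France is a country in Europe.",
    a_is_better=True,
)
print(pair.chosen)
```

Note that the annotator supplies a single boolean per pair and never has to write an answer or assign a score.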

Reward Model

After the annotated dataset is created, we use it to train a so-called “reward” model, whose goal is to learn to numerically estimate how good or bad a given answer is for an initial prompt. Ideally, we want the reward model to generate positive values for good responses and negative values for bad responses.

The reward model's architecture is the same as that of the initial LLM, except for the last layer: instead of producing text, the model outputs a single float value, an estimate of how good the answer is.

It is necessary to pass both the initial prompt and the generated response as input to the reward model.
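A minimal sketch of such a last layer is shown below. It assumes that the LLM body has already turned the concatenated prompt and response into one hidden vector per token, and that the reward is a linear function of the final token's vector. Both are common choices but are assumptions here, as are all the numbers.

```python
def reward_head(hidden_states, weights, bias):
    """Map the final token's hidden vector to a single float reward."""
    last = hidden_states[-1]  # assumed to summarize prompt + response
    return sum(h * w for h, w in zip(last, weights)) + bias

# Made-up 3-dimensional hidden states for a 2-token (prompt + response) input.
hidden_states = [[0.1, -0.2, 0.4], [0.5, 0.3, -0.1]]
weights = [1.0, -2.0, 0.5]  # hypothetical learned parameters of the head
bias = 0.1

reward = reward_head(hidden_states, weights, bias)
print(reward)
```

Everything before the head is shared with the language model; only this final mapping to a scalar is new.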

Loss function

You might logically ask how the reward model will learn this regression task if there are no numerical labels in the annotated dataset. This is a reasonable question. To address it, we are going to use an interesting trick: we will pass both a good and a bad answer through the reward model, which will output two different estimates (rewards).

Then we will construct a loss function that compares the two rewards relative to each other.

Loss function used in the RLHF algorithm. R₊ refers to the reward assigned to the better response while R₋ is a reward estimated for the worse response.
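The formula itself is not reproduced in this text. A form consistent with the behavior described in this section (values between 0 and about 0.69 when the difference is positive, roughly linear growth when it is negative) is the pairwise logistic loss, which we assume here:

```latex
L(R_+, R_-) = -\log \sigma(R_+ - R_-), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}
```

Under this assumption, equal rewards give L = log 2 ≈ 0.69, a large positive difference pushes the loss toward 0, and a large negative difference d gives L ≈ |d|.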

Let us plug in some argument values for the loss function and analyze its behavior. Below is a table with the plugged-in values:

A table of loss values depending on the difference between R₊ and R₋.
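Such a table can be reproduced with a short script. The sketch assumes the loss has the form -log σ(R₊ - R₋), which matches the (0, 0.69) bound quoted in this section; the chosen differences are arbitrary.

```python
import math

def pairwise_loss(r_plus, r_minus):
    """-log(sigmoid(r_plus - r_minus)), computed in a numerically stable way."""
    d = r_plus - r_minus
    # log(1 + exp(-d)), rewritten so that exp never receives a large positive argument
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

# Print the loss for a few arbitrary reward differences R+ - R-.
for diff in [-5, -2, -1, 0, 1, 2, 5]:
    print(f"{diff:+d}  {pairwise_loss(diff, 0.0):.3f}")
```

Only the difference between the two rewards enters the loss, so shifting both rewards by the same constant leaves it unchanged.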

We can immediately observe two interesting insights:

If the difference between R₊ and R₋ is negative, i.e. the better response received a lower reward than the worse one, then the loss grows roughly in proportion to the magnitude of the reward difference, meaning that the model needs to be significantly adjusted.
If the difference between R₊ and R₋ is positive, i.e. the better response received a higher reward than the worse one, then the loss stays within the much lower interval (0, 0.69), which indicates that the model is already distinguishing good and bad responses well.

A nice thing about using such a loss function is that the model learns appropriate rewards for generated texts by itself, and we (humans) do not have to explicitly evaluate every response numerically — just provide a binary value: is a given response better or worse.
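To see how rewards can emerge from binary preferences alone, here is a toy training loop. It is a sketch under strong assumptions: the reward model is a plain linear function of two hypothetical features per response, the loss is assumed to be -log σ(R₊ - R₋), and the data is made up.

```python
import math

def score(w, x):
    """Toy reward model: a linear function of the response features."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, better, worse, lr=0.5):
    """One gradient step on -log(sigmoid(score(better) - score(worse)))."""
    d = score(w, better) - score(w, worse)
    g = -(1.0 - sigmoid(d))  # derivative of the loss with respect to d
    # d is linear in w, so the gradient with respect to w is g * (better - worse)
    return [wi - lr * g * (b - c) for wi, b, c in zip(w, better, worse)]

# Hypothetical features per response: [helpfulness cue, rudeness cue].
# Each pair is (better response, worse response) as chosen by a human.
pairs = [([1.0, 0.0], [0.0, 1.0]), ([0.8, 0.1], [0.2, 0.9])]

w = [0.0, 0.0]
for _ in range(50):
    for better, worse in pairs:
        w = train_step(w, better, worse)
print(w)
```

No numerical label appears anywhere in the data, yet after training the model scores every preferred response above its rejected counterpart.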

Training the original LLM

The trained reward model is then used to train the original LLM. For that, we can feed a series of new prompts to the LLM, which will generate output sequences. Then the input prompts, along with the output sequences, are fed to the reward model to estimate how good those responses are.

After generating numerical estimates, that information is used as feedback to the original LLM, which then performs weight updates. A very simple but elegant approach!

In this last step, the model weights are usually adjusted with a reinforcement learning algorithm, most often proximal policy optimization (PPO).

Although it is not technically accurate, if you are not familiar with reinforcement learning or PPO, you can roughly think of this step as backpropagation, as in ordinary machine learning algorithms.
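The generate, score and update loop described in this section can be sketched on a toy problem. Everything here is a simplification: the "LLM" is a policy that picks one of three canned responses, the reward model is replaced by a fixed table of made-up scores, and the update is plain REINFORCE rather than PPO, which adds further safeguards not shown here.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy "LLM": a policy that picks one of three canned responses to a single prompt.
responses = ["helpful answer", "vague answer", "rude answer"]
# Stand-in for the trained reward model: fixed, made-up scores per response.
toy_reward = {"helpful answer": 1.0, "vague answer": 0.0, "rude answer": -1.0}

logits = [0.0, 0.0, 0.0]  # policy parameters, starting from a uniform policy
rng = random.Random(0)    # seeded only so the sketch is reproducible
lr = 0.1

for _ in range(500):
    probs = softmax(logits)
    i = rng.choices(range(3), weights=probs, k=1)[0]  # the policy generates a response
    r = toy_reward[responses[i]]                      # the reward model scores it
    # REINFORCE: the gradient of log prob(i) w.r.t. the logits is one_hot(i) - probs
    for j in range(3):
        grad_logp = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * r * grad_logp               # reinforce well-rewarded responses

final_probs = softmax(logits)
print(final_probs)
```

After training, the policy assigns most of its probability to the response that the reward table scores highest, which is the same feedback mechanism, in miniature, that RLHF applies to a full LLM.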

Inference

During inference, only the original trained model is used. At the same time, the model can be continuously improved in the background by collecting user prompts and periodically asking users to rate which of two responses is better.

Conclusion

In this article, we have studied RLHF — a highly efficient and scalable technique to train modern LLMs. An elegant combination of an LLM with a reward model allows us to significantly simplify the annotation task performed by humans, which required huge efforts in the past when done through raw fine-tuning procedures.