Reinforcement Learning from Human Feedback, Explained Simply

by AiNEWS2025
2025-06-24


The appearance of ChatGPT in 2022 completely changed how the world perceives artificial intelligence. Its impressive performance spurred the rapid development of other powerful LLMs.

We could roughly say that ChatGPT is an upgraded version of GPT-3. But unlike with the previous GPT versions, this time OpenAI developers did not simply use more data or a more complex model architecture. Instead, they designed a technique that enabled a breakthrough.

In this article, we will talk about RLHF (reinforcement learning from human feedback), a fundamental algorithm at the core of ChatGPT that overcomes the limits of human annotation for LLMs. Though the algorithm relies on proximal policy optimization (PPO), we will keep the explanation simple and skip the details of reinforcement learning, which are not the focus of this article.

NLP development before ChatGPT

To set the context, let us recall how LLMs were developed before ChatGPT. In most cases, LLM development consisted of two stages:

Pre-training & fine-tuning framework

Pre-training includes language modeling — a task in which a model tries to predict a hidden token in the context. The probability distribution produced by the model for the hidden token is then compared to the ground truth distribution for loss calculation and further backpropagation. In this way, the model learns the semantic structure of the language and the meaning behind words.
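To make the loss calculation concrete, here is a minimal sketch of how a predicted distribution for one hidden token can be compared with the ground truth. The vocabulary, the scores, and the example sentence are made up for illustration, and the cross-entropy loss used here is the common choice for language modeling rather than something the article specifies.

```python
import math

def softmax(logits):
    """Turn raw model scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Loss against a one-hot ground truth: -log p(correct token)."""
    return -math.log(probs[target_index])

# Hypothetical toy vocabulary and model scores for the hidden token in
# "the cat sat on the [hidden]". These numbers are illustrative only.
vocab = ["mat", "dog", "sky", "car"]
logits = [2.0, 0.5, 0.1, -1.0]

probs = softmax(logits)
loss = cross_entropy(probs, vocab.index("mat"))

print({tok: round(p, 3) for tok, p in zip(vocab, probs)})
print("loss:", round(loss, 3))
```

The more probability the model puts on the correct token, the smaller the loss, which is the signal that backpropagation then pushes through the network.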

If you want to learn more about the pre-training & fine-tuning framework, check out my article about BERT.

After that, the model is fine-tuned on a downstream task, which might include different objectives: text summarization, text translation, text generation, question answering, etc. In many situations, fine-tuning requires a human-labeled dataset, which should preferably contain enough text samples to allow the model to generalize its learning well and avoid overfitting.

This is where the limits of fine-tuning appear. Data annotation is usually a time-consuming task performed by humans. Let us take a question-answering task, for example. To construct training samples, we would need a manually labeled dataset of questions and answers. For every question, we would need a precise answer provided by a human.

During data annotation, providing full answers to prompts requires a lot of human time.

In reality, for training an LLM, we would need millions or even billions of such (question, answer) pairs. This annotation process is very time-consuming and does not scale well.

RLHF

Having understood the main problem, now is the perfect moment to dive into the details of RLHF.

If you have already used ChatGPT, you have probably encountered a situation in which ChatGPT asks you to choose which of two answers better suits your initial prompt.

The ChatGPT interface asks a user to rate two possible answers.

This information is actually used to continuously improve ChatGPT. Let us understand how.

First of all, it is important to notice that choosing the better of two answers is a much simpler task for a human than providing an exact answer to an open question. The idea we are going to look at is based exactly on that: to create the annotated dataset, we only ask the human to choose between two possible answers.

Choosing between two options is an easier task than asking someone to write the best possible response.

Response generation

In LLMs, there are several possible ways to generate a response from the distribution of predicted token probabilities:

  • Greedy decoding: given an output distribution p over tokens, the model always deterministically chooses the token with the highest softmax probability.
  • Sampling: given an output distribution p over tokens, the model randomly samples a token according to its assigned probability. The highest probability does not guarantee that the corresponding token will be chosen, so running the generation process again can produce different results.
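The two strategies above can be sketched in a few lines. This is only an illustration for a single generation step: the vocabulary and the probabilities are invented, and a real LLM would repeat the step token by token.

```python
import random

def greedy_pick(probs):
    """Deterministic choice: index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sample_pick(probs, rng):
    """Random choice: draw an index in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical distribution p over a toy vocabulary.
vocab = ["great", "good", "fine", "bad"]
probs = [0.60, 0.25, 0.10, 0.05]

rng = random.Random(0)  # fixed seed only to make the demo repeatable

greedy_tokens = [vocab[greedy_pick(probs)] for _ in range(5)]
sampled_tokens = [vocab[sample_pick(probs, rng)] for _ in range(5)]

print("greedy: ", greedy_tokens)   # the same token every time
print("sampled:", sampled_tokens)  # may contain less likely tokens
```

Because sampling can return different tokens on each run, it is the natural way to obtain two different responses to the same prompt.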

This second sampling method results in more randomized model behavior, which allows the generation of diverse text sequences. For now, let us suppose that we generate many pairs of such sequences. The resulting dataset of pairs is labeled by humans: for every pair, a human is asked which of the two output sequences fits the input sequence better. The annotated dataset is used in the next step.

In the context of RLHF, the annotated dataset created in this way is called “Human Feedback”.
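One possible way to store a single piece of human feedback is shown below. The class name, the field names, and the example texts are assumptions made for this sketch; the article does not prescribe a storage format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    """One annotated sample: a prompt plus the preferred and rejected response."""
    prompt: str
    chosen: str
    rejected: str

def label_pair(prompt, response_a, response_b, a_is_better):
    """Turn an annotator's binary choice into a PreferencePair."""
    if a_is_better:
        return PreferencePair(prompt, chosen=response_a, rejected=response_b)
    return PreferencePair(prompt, chosen=response_b, rejected=response_a)

# Hypothetical example: the annotator only clicks on the better answer.
pair = label_pair(
    prompt="Explain what a reward model is.",
    response_a="It is a model.",
    response_b="It is a model that scores how good a response is for a prompt.",
    a_is_better=False,
)
print(pair.chosen)
```

Note that the annotator never writes text and never assigns a number: a single boolean per pair is all the labeling effort that is required.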

Reward Model

After the annotated dataset is created, we use it to train a so-called “reward” model, whose goal is to learn to numerically estimate how good or bad a given answer is for an initial prompt. Ideally, we want the reward model to generate positive values for good responses and negative values for bad responses.

Speaking of the reward model, its architecture is exactly the same as the initial LLM, except for the last layer, where instead of outputting a text sequence, the model outputs a float value — an estimate for the answer.

It is necessary to pass both the initial prompt and the generated response as input to the reward model.
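The structural change can be illustrated with a toy example in which the same hidden vector feeds either the original output layer or a reward head. Everything here is a stand-in: `backbone` is a hypothetical placeholder for the transformer body, the sizes are arbitrary, and the weights are random rather than trained.

```python
import random

HIDDEN_SIZE = 8
VOCAB_SIZE = 5
rng = random.Random(0)

def backbone(prompt, response):
    """Placeholder for the shared LLM body. A real model would run attention
    layers over the tokens; this toy derives a repeatable pseudo-random
    hidden vector from the combined text instead."""
    text_rng = random.Random(prompt + "\n" + response)
    return [text_rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_SIZE)]

def lm_head(hidden, weights):
    """Original last layer: one score per vocabulary token."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

def reward_head(hidden, weights, bias=0.0):
    """Replacement last layer: a single float for the whole response."""
    return sum(w * h for w, h in zip(weights, hidden)) + bias

# Untrained random weights, for shape illustration only.
lm_weights = [[rng.gauss(0, 1) for _ in range(HIDDEN_SIZE)]
              for _ in range(VOCAB_SIZE)]
reward_weights = [rng.gauss(0, 1) for _ in range(HIDDEN_SIZE)]

hidden = backbone("What is RLHF?", "A way to train LLMs from human preferences.")
token_scores = lm_head(hidden, lm_weights)   # list of VOCAB_SIZE floats
reward = reward_head(hidden, reward_weights)  # one float

print(len(token_scores), "token scores vs. one reward:", round(reward, 3))
```

The prompt and the response both go into `backbone`, which mirrors the requirement that the reward model sees the two together.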

Loss function

You might logically ask how the reward model will learn this regression task if there are no numerical labels in the annotated dataset. This is a reasonable question. To address it, we are going to use an interesting trick: we will pass both a good and a bad answer through the reward model, which will output two different estimates (rewards).

Then we construct a loss function that compares the two rewards relative to each other.

The loss function used in the RLHF algorithm is L = −log σ(R₊ − R₋), where σ(x) = 1 / (1 + e⁻ˣ) is the sigmoid function, R₊ is the reward assigned to the better response, and R₋ is the reward estimated for the worse response.

Let us plug in some argument values for the loss function and analyze its behavior. Below is a table with the plugged-in values:

A table of loss values depending on the difference between R₊ and R₋. 

We can immediately observe two interesting insights:

  • If the difference between R₊ and R₋ is negative, i.e. the better response received a lower reward than the worse one, the loss grows roughly in proportion to the size of the gap, meaning that the model needs to be significantly adjusted.
  • If the difference between R₊ and R₋ is positive, i.e. the better response received a higher reward than the worse one, the loss stays within the much lower interval (0, 0.69), where 0.69 ≈ ln 2 is the loss at a difference of zero. This indicates that the model does its job well at distinguishing good and bad responses.
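The two observations can be checked numerically. The sketch below assumes the loss has the form −log σ(R₊ − R₋), which is the standard pairwise preference loss and matches the (0, 0.69) bound mentioned above; the list of differences is arbitrary.

```python
import math

def pairwise_loss(r_better, r_worse):
    """-log(sigmoid(r_better - r_worse)), computed in a numerically stable way."""
    diff = r_better - r_worse
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# Arbitrary reward differences R+ minus R-, from clearly wrong to clearly right.
print("diff   loss")
for diff in [-5, -2, -1, 0, 1, 2, 5]:
    print(f"{diff:+d}    {pairwise_loss(float(diff), 0.0):.3f}")
```

For negative differences the printed loss is close to the absolute value of the difference, while for positive differences it shrinks toward zero and never exceeds ln 2.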

A nice thing about using such a loss function is that the model learns appropriate rewards for generated texts by itself, and we (humans) do not have to explicitly evaluate every response numerically — just provide a binary value: is a given response better or worse.
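To see how rewards can emerge from binary labels alone, here is a toy training loop. It is a sketch under strong simplifications: the reward model is a linear function of three hand-made features, the feature values and the learning rate are invented, and the loss is assumed to be −log σ(R₊ − R₋). A real reward model would be a full LLM with a scalar output.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward(weights, features):
    """Toy reward model: a weighted sum of response features."""
    return sum(w * f for w, f in zip(weights, features))

# Hypothetical features per response: [on_topic, polite, verbosity].
# Each pair is (better response, worse response) as chosen by a human.
pairs = [
    ([1.0, 1.0, 0.2], [0.0, 1.0, 0.1]),
    ([1.0, 0.0, 0.3], [0.0, 0.0, 0.9]),
    ([1.0, 1.0, 0.5], [1.0, 0.0, 0.8]),
]

weights = [0.0, 0.0, 0.0]
learning_rate = 0.5

for _ in range(200):
    for better, worse in pairs:
        diff = reward(weights, better) - reward(weights, worse)
        # Gradient of -log(sigmoid(diff)) w.r.t. the weights is
        # -(1 - sigmoid(diff)) * (better - worse); step against it.
        scale = 1.0 - sigmoid(diff)
        for i in range(len(weights)):
            weights[i] += learning_rate * scale * (better[i] - worse[i])

for better, worse in pairs:
    print(round(reward(weights, better), 2), ">", round(reward(weights, worse), 2))
```

No numeric target ever appears in the data; the rewards are shaped only by which response in each pair was preferred.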

Training the original LLM

The trained reward model is then used to train the original LLM. For that, we can feed a series of new prompts to the LLM, which will generate output sequences. Then the input prompts, along with the output sequences, are fed to the reward model to estimate how good those responses are.

After generating numerical estimates, that information is used as feedback to the original LLM, which then performs weight updates. A very simple but elegant approach!

RLHF training diagram

Most of the time, the last step that adjusts the model weights uses a reinforcement learning algorithm, usually proximal policy optimization (PPO).

Even if it’s not technically correct, if you are not familiar with reinforcement learning or PPO, you can roughly think of it as backpropagation, like in normal machine learning algorithms.
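For intuition only, the feedback loop can be imitated with a very small policy that chooses among three canned responses. This sketch uses a plain policy-gradient (REINFORCE-style) update, which is a simpler relative of PPO and not PPO itself. The responses, the reward values standing in for the trained reward model, and the hyperparameters are all invented.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate responses and the scores a trained reward model
# might assign to them. Both are placeholders for this demo.
responses = ["helpful answer", "off-topic answer", "rude answer"]
reward_scores = {"helpful answer": 1.0, "off-topic answer": -0.5, "rude answer": -1.0}

logits = [0.0, 0.0, 0.0]  # the "LLM": starts with no preference
learning_rate = 0.1
rng = random.Random(0)

for _ in range(2000):
    probs = softmax(logits)
    # 1. The policy generates a response by sampling.
    i = rng.choices(range(len(responses)), weights=probs, k=1)[0]
    # 2. The reward model scores it.
    r = reward_scores[responses[i]]
    # 3. Policy-gradient update: d log p(i) / d logit_j = 1[j == i] - p_j.
    for j in range(len(logits)):
        indicator = 1.0 if j == i else 0.0
        logits[j] += learning_rate * r * (indicator - probs[j])

final_probs = softmax(logits)
for text, p in zip(responses, final_probs):
    print(f"{text}: {p:.3f}")
```

After training, most of the probability mass sits on the response that the reward model scores highest. PPO adds safeguards on top of this idea, such as limiting how far each update can move the policy.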

Inference

During inference, only the original trained model is used. At the same time, the model can be continuously improved in the background by collecting user prompts and periodically asking users to rate which of two responses is better.

Conclusion

In this article, we have studied RLHF — a highly efficient and scalable technique to train modern LLMs. An elegant combination of an LLM with a reward model allows us to significantly simplify the annotation task performed by humans, which required huge efforts in the past when done through raw fine-tuning procedures.

RLHF is used at the core of many popular models like ChatGPT, Claude, Gemini, or Mistral.

All images unless otherwise noted are by the author
