Will this book talk about RLHF? #226

Answered by rasbt
jingedawang asked this question in Q&A

Great book! I've read all the notebooks in this repo, and I have a question.

I heard that RLHF (Reinforcement Learning from Human Feedback) is the core technique behind ChatGPT. Does this book cover it?

I see there will be extra material about DPO for preference fine-tuning. Is it equivalent to RLHF? What's the popular practice in industry these days after instruction finetuning?

Thanks!


Thanks! Regarding DPO, I had actually implemented it for Chapter 7, but then I removed it for two reasons:

  1. The chapter got way, way too long and exceeded the page limit.
  2. I was not very happy with the DPO results.

DPO is a nice and relatively simple technique for preference finetuning, but it didn't quite clear the bar of being a fundamental, established technique that works well. I will be busy finishing up the book itself over the next few weeks, but after that I plan to polish up the DPO part and perhaps share it here or on my blog. Then I plan to do the same for RLHF with a dedicated reward model.
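For anyone curious what the DPO objective boils down to, here is a minimal, illustrative sketch of the standard DPO loss (Rafailov et al., 2023) in PyTorch. This is not the Chapter 7 code; it assumes the inputs are per-response sums of token log-probabilities from the policy and a frozen reference model:

```python
# Minimal DPO loss sketch; inputs are assumed to be summed per-token
# log-probabilities of complete chosen/rejected responses (shape: [batch_size]).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps, beta=0.1):
    # Log-ratios between the policy being trained and the frozen reference model
    chosen_logratios = policy_chosen_logps - reference_chosen_logps
    rejected_logratios = policy_rejected_logps - reference_rejected_logps

    # DPO objective: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))
    loss = -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()

    # Implicit rewards, often logged to monitor training progress
    chosen_rewards = beta * chosen_logratios.detach()
    rejected_rewards = beta * rejected_logratios.detach()
    return loss, chosen_rewards, rejected_rewards

# Example call with dummy log-probabilities for a batch of two preference pairs
policy_chosen = torch.tensor([-12.0, -15.0])
policy_rejected = torch.tensor([-14.0, -15.5])
ref_chosen = torch.tensor([-12.5, -15.2])
ref_rejected = torch.tensor([-13.8, -15.4])
loss, _, _ = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
```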

In the meantime, you might like my two articles here:

Comment options

Got it. Looking forward to your upcoming work on DPO and RLHF!

Comment options

Yeah, I think it would be nice to include something related to RLHF in the bonus materials.

Comment options

Great book and amazing resource @rasbt! I saw that you included the DPO appendix. Were you able to finish the RLHF appendix?

Comment options

Hi @henrythe9th,

Thank you for your question. Just so you know, unfortunately, @rasbt is currently recovering from an injury, so he might need a bit more time to respond. He will get back to you about the RLHF appendix as soon as he's feeling better. 🙂

Comment options

If it can shed some light, @henrythe9th @d-kleine: in his latest blog post, Sebastian mentioned working on chapters about "reasoning/thinking" models. (Yes, purists will say it's not "RLHF," it's RLVR.) But honestly, having tried implementing GRPO for preference tuning (and an unfinished attempt at RLVR), it's approximately the same pipeline and uses the same RL algorithms (hybrids like PPO, or pure policy-gradient methods like GRPO and its variants), with one of the main differences being the reward shaping.

What I mean is, I'm pretty sure everyone will learn a lot from these new chapters/book for pure RLHF alignment too.
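To make the "same pipeline, different reward" point concrete, here is a hypothetical sketch of the group-relative advantage step used in GRPO-style training; the reward sources named in the comments are placeholders, and the only RLHF-vs-RLVR difference in this sketch is where the rewards come from:

```python
# Hypothetical sketch of a GRPO-style group-relative advantage computation.
# In RLHF the rewards would come from a learned reward model; in RLVR they
# would come from a verifier (e.g., 1.0 if the answer checks out, else 0.0).
# The policy update that consumes these advantages is essentially the same.
import torch

def group_relative_advantages(rewards, eps=1e-8):
    # rewards: (num_prompts, group_size) scores for sampled completions per prompt
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],   # verifiable 0/1 rewards (RLVR-like)
                        [0.2, 0.8, 0.5, 0.9]])  # reward-model scores (RLHF-like)
advantages = group_relative_advantages(rewards)
```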

Answer selected by rasbt
Comment options

Sorry to hear and hope he has a speedy recovery. Yes, I am very excited for the new book on reasoning/thinking models!
Comment options

Hi @rasbt, will you add some usage of verl? It seems to be very hot in LLM RLHF right now.

Comment options

rasbt Jul 2, 2025
Maintainer

Thanks for your interest. In this repo, I focus more on from-scratch implementations, so it will likely be RLHF from scratch rather than using another library like verl.

Category: Q&A
Labels: question (further information is requested)

This discussion was converted from issue #225 on June 19, 2024 11:08.
