Will this book talk about RLHF? #226

Answered by rasbt
jingedawang asked this question in Q&A

Great book! I've read all the notebooks in this repo, and I have a question.

I heard that RLHF (Reinforcement Learning from Human Feedback) is the core technique behind ChatGPT. Does this book cover it?

I see there will be extra material about DPO for preference fine-tuning. Is it equivalent to RLHF? What's the popular practice in industry these days after instruction finetuning?

Thanks!


Thanks! Regarding DPO, I had actually implemented it for Chapter 7, but then I removed it for two reasons:

  1. The chapter got way, way too long and exceeded the page limit.
  2. I was not very happy with the DPO results.

DPO is a nice and relatively simple technique for preference finetuning, but it didn't quite clear the bar of being a fundamental, established technique that works well. I will be busy finishing up the book itself over the next few weeks, but after that I plan to polish up the DPO part and perhaps share it here or on my blog. Then I plan to do the same for RLHF with a dedicated reward model.
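For anyone curious what the DPO objective boils down to, here is a minimal, illustrative sketch of the standard DPO loss (Rafailov et al., 2023) in PyTorch. This is not the Chapter 7 code; it assumes the inputs are per-response sums of token log-probabilities from the policy and a frozen reference model:

```python
# Minimal DPO loss sketch; inputs are assumed to be summed per-token
# log-probabilities of complete chosen/rejected responses (shape: [batch_size]).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps, beta=0.1):
    # Log-ratios between the policy being trained and the frozen reference model
    chosen_logratios = policy_chosen_logps - reference_chosen_logps
    rejected_logratios = policy_rejected_logps - reference_rejected_logps

    # DPO objective: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))
    loss = -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()

    # Implicit rewards, often logged to monitor training progress
    chosen_rewards = beta * chosen_logratios.detach()
    rejected_rewards = beta * rejected_logratios.detach()
    return loss, chosen_rewards, rejected_rewards

# Example call with dummy log-probabilities for a batch of two preference pairs
policy_chosen = torch.tensor([-12.0, -15.0])
policy_rejected = torch.tensor([-14.0, -15.5])
ref_chosen = torch.tensor([-12.5, -15.2])
ref_rejected = torch.tensor([-13.8, -15.4])
loss, _, _ = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
```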

In the meantime, you might like my two articles here:

Comment options

Got it. Looking forward to your upcoming work on DPO and RLHF!

Comment options

Yeah, I think it would be nice to include something related to RLHF in the bonus materials.

Comment options

Great book and amazing resource @rasbt! I saw that you included the DPO appendix. Were you able to finish the RLHF appendix?

Comment options

Hi @henrythe9th,

Thank you for your question. Just so you know, unfortunately, @rasbt is currently recovering from an injury, so he might need a bit more time to respond. He will get back to you about the RLHF appendix as soon as he's feeling better. 🙂

Comment options

If it can shed some light, @henrythe9th @d-kleine: in his latest blog post, Sebastian mentioned working on chapters about "reasoning/thinking" models. (Yes, purists will say it's not "RLHF," it's RLVR.) But honestly, having tried implementing GRPO for preference tuning (and an unfinished attempt at RLVR), it's approximately the same pipeline and uses the same RL algorithms (hybrids like PPO, or pure policy-gradient methods like GRPO and its variants), with one of the main differences being the reward shaping.

What I mean is, I'm pretty sure everyone will learn a lot from these new chapters/book for pure RLHF alignment too.
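To make the "same pipeline, different reward" point concrete, here is a hypothetical sketch of the group-relative advantage step used in GRPO-style training; the reward sources named in the comments are placeholders, and the only RLHF-vs-RLVR difference in this sketch is where the rewards come from:

```python
# Hypothetical sketch of a GRPO-style group-relative advantage computation.
# In RLHF the rewards would come from a learned reward model; in RLVR they
# would come from a verifier (e.g., 1.0 if the answer checks out, else 0.0).
# The policy update that consumes these advantages is essentially the same.
import torch

def group_relative_advantages(rewards, eps=1e-8):
    # rewards: (num_prompts, group_size) scores for sampled completions per prompt
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],   # verifiable 0/1 rewards (RLVR-like)
                        [0.2, 0.8, 0.5, 0.9]])  # reward-model scores (RLHF-like)
advantages = group_relative_advantages(rewards)
```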

Answer selected by rasbt
Comment options

Sorry to hear and hope he has a speedy recovery. Yes, I am very excited for the new book on reasoning/thinking models!
Comment options

Hi @rasbt, will you add some usage of verl? It seems to be very hot in LLM RLHF right now.

Comment options

rasbt Jul 2, 2025
Maintainer

Thanks for your interest. In this repo, I focus more on from-scratch implementations, so it will likely be RLHF from scratch rather than using another library like verl.

Category: Q&A
Labels: question (further information is requested)

This discussion was converted from issue #225 on June 19, 2024 11:08.
