Will this book talk about RLHF? #226
-
Great book! I read all the notebooks in this repo and here is a question.
I heard that RLHF (Reinforcement Learning from Human Feedback) is the core technique behind ChatGPT. Does this book cover it?
I see there will be extra material about DPO for preference fine-tuning. Is it equivalent to RLHF? What's the common practice in industry today after instruction fine-tuning?
Thanks!
-
Thanks! Regarding DPO, I've actually implemented it for Chapter 7, but then removed it for two reasons:
- The chapter got way, way too long and exceeded the page limits
- I am not very happy with the DPO results
DPO is a nice and relatively simple technique for preference fine-tuning, but it didn't quite meet the bar of being a fundamental, established technique that works well. I'll be busy finishing up the book itself over the next few weeks, but after that I plan to polish up the DPO part and share it either here or on my blog. Then I plan to do the same for RLHF with a dedicated reward model.
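To give a rough idea of what the DPO part boils down to: the whole method is essentially one loss term computed from the policy's and a frozen reference model's log-probabilities of the chosen vs. rejected responses. A minimal sketch (this is not the removed Chapter 7 code; the log-probability tensors and the beta value are placeholder inputs):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Each argument is a tensor of summed log-probabilities, one value per
    # response in the batch; the reference model stays frozen during training.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response over the rejected one,
    # scaled by beta, via a logistic (Bradley-Terry-style) loss.
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()
```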
In the meantime, you might like my two articles here:
-
Got it. Looking forward to your upcoming work on DPO and RLHF!
-
Yeah, I think it would be nice to include something related to RLHF in the bonus materials.
-
Great book and amazing resource @rasbt! I saw that you included the DPO appendix. Were you able to finish the RLHF appendix?
-
Hi @henrythe9th,
Thank you for your question. Just so you know, @rasbt is unfortunately recovering from an injury at the moment, so he might need a bit more time to respond. He will answer your question about the RLHF appendix as soon as he's feeling better. 🙂
-
If it can shed some light, @henrythe9th @d-kleine: in his latest blog post, Sebastian mentioned working on chapters about "reasoning/thinking" models. (Yes, purists will say that's not "RLHF", it's RLVR.) But honestly, having tried implementing GRPO for preference tuning (and started but not finished it for RLVR), it's approximately the same pipeline and uses the same RL algorithms (hybrid ones like PPO or pure policy-gradient ones like GRPO and its variants), with reward shaping being one of the main differences.
What I mean is, I'm pretty sure everyone will learn a lot from these new chapters/books for pure RLHF alignment too.
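To make the "same pipeline" point a bit more concrete, here is a minimal sketch of the group-relative advantage step at the heart of GRPO (the reward tensor and its shape are placeholders, not code from any particular implementation):

```python
import torch

def grpo_advantages(rewards, eps=1e-8):
    # rewards: tensor of shape (num_prompts, group_size), one scalar reward per
    # sampled response. Whether those rewards come from a preference/reward
    # model (RLHF) or a verifier (RLVR) only changes how this tensor is filled,
    # not the policy-gradient update that consumes the advantages.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # Group-relative advantage: each response is scored against its siblings
    # sampled for the same prompt, which replaces a learned value/critic model.
    return (rewards - mean) / (std + eps)
```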
-
Hi @rasbt, will you add some examples of using verl? It seems to be very popular for LLM RLHF right now.
-
Thanks for your interest. In this repo, I focus more on from-scratch implementations, so it will likely be RLHF from scratch rather than using another library like verl.
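To give a rough idea of what a from-scratch version would cover, the RLHF-specific piece is mostly a reward model trained on preference pairs, which then feeds a PPO- or GRPO-style update. A minimal sketch of that reward-model part (assuming a backbone that returns per-token hidden states rather than vocabulary logits; the class and argument names here are illustrative, not code from the book):

```python
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    # Wraps a pretrained transformer backbone and maps the hidden state of the
    # last token to a single scalar score.
    def __init__(self, backbone, hidden_dim):
        super().__init__()
        self.backbone = backbone
        self.reward_head = nn.Linear(hidden_dim, 1)

    def forward(self, input_ids):
        hidden = self.backbone(input_ids)          # (batch, seq_len, hidden_dim)
        return self.reward_head(hidden[:, -1, :])  # (batch, 1) scalar scores

def reward_loss(chosen_scores, rejected_scores):
    # Pairwise ranking loss: the chosen response should score higher than the
    # rejected one for the same prompt.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```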