- Robust DPO with provable redescending property.
- Principled data valuation and cleaning method.
We will tidy up the code soon. If you already have a DPO code base, simply replace its loss as follows:

    import torch
    import torch.nn.functional as F

    # pi_logps   : policy logprobs, shape (B,)
    # ref_logps  : reference model logprobs, shape (B,)
    # yw_idxs    : preferred completion indices, shape (T,)
    # yl_idxs    : dispreferred completion indices, shape (T,)
    # self.beta  : regularization coefficient
    # self.gamma : exponent used by the Hölder-DPO loss

    # Split the log-probabilities into preferred / dispreferred completions.
    pi_yw_logps = pi_logps[yw_idxs]
    pi_yl_logps = pi_logps[yl_idxs]
    ref_yw_logps = ref_logps[yw_idxs]
    ref_yl_logps = ref_logps[yl_idxs]

    # Log-ratio of policy to reference for each completion, and their margin.
    reward_win = pi_yw_logps - ref_yw_logps
    reward_lose = pi_yl_logps - ref_yl_logps
    g_theta = reward_win - reward_lose

    if self.method == "dpo":
        loss = -F.logsigmoid(self.beta * g_theta).mean()
    elif self.method == "holder_dpo":
        p = torch.sigmoid(self.beta * g_theta)
        loss = (-(1.0 + self.gamma) * p.pow(self.gamma).mean()
                + self.gamma * p.pow(self.gamma + 1).mean())
    else:
        raise ValueError(f"unknown method: {self.method}")
    return loss

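To get a feel for what the swap changes, the per-sample gradient with respect to `g_theta` can be compared for the two losses. The sketch below is a rough numerical illustration in plain Python, not the paper's formal statement: the closed-form gradients are derived by hand from the two loss expressions in the snippet above, the function names are hypothetical, and the values of `beta` and `gamma` are arbitrary choices for illustration.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dpo_loss(g: float, beta: float) -> float:
    # Per-sample DPO loss: -log(sigmoid(beta * g)).
    return -math.log(sigmoid(beta * g))


def holder_dpo_loss(g: float, beta: float, gamma: float) -> float:
    # Per-sample version of the "holder_dpo" branch above.
    p = sigmoid(beta * g)
    return -(1.0 + gamma) * p**gamma + gamma * p ** (gamma + 1.0)


def dpo_grad(g: float, beta: float) -> float:
    # d/dg of dpo_loss: -beta * (1 - p), with p = sigmoid(beta * g).
    return -beta * (1.0 - sigmoid(beta * g))


def holder_dpo_grad(g: float, beta: float, gamma: float) -> float:
    # d/dg of holder_dpo_loss: -gamma * (1 + gamma) * beta * p**gamma * (1 - p)**2.
    p = sigmoid(beta * g)
    return -gamma * (1.0 + gamma) * beta * p**gamma * (1.0 - p) ** 2


beta, gamma = 0.1, 2.0  # illustrative values only
for g in (-200.0, -20.0, 0.0, 20.0, 200.0):
    print(
        f"g={g:7.1f}  "
        f"dpo_grad={dpo_grad(g, beta):+.6f}  "
        f"holder_dpo_grad={holder_dpo_grad(g, beta, gamma):+.6f}"
    )
```

For a strongly negative margin (a pair the model considers badly mislabeled), the DPO gradient magnitude approaches `beta`, whereas the Hölder-DPO gradient shrinks toward zero because of the `p**gamma` factor. Under these assumptions, that is the behavior the word "redescending" refers to.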
Please cite this work as:
@inproceedings{fujisawa2025scalable,
title={Scalable Valuation of Human Feedback through Provably Robust Model Alignment},
author={Fujisawa, Masahiro and Adachi, Masaki and Osborne, Michael A},
booktitle={Advances in Neural Information Processing Systems},
doi={10.48550/arXiv.2505.17859},
year={2025}
}