
Preference Distillation via Value based Reinforcement Learning

Author(s)
Kwon, Minchan; Ko, Junwon; Kim, Kangil; Kim, Junmo
Type
Conference Paper
Citation
The Thirty-Ninth Annual Conference on Neural Information Processing Systems
Issued Date
2025-12-04
Abstract
Direct Preference Optimization (DPO) is a powerful paradigm for aligning language models with human preferences using pairwise comparisons. However, its binary win-or-loss supervision often proves insufficient for training small models with limited capacity. Prior works attempt to distill information from large teacher models using behavior cloning or KL divergence, but these methods tend to focus on mimicking the teacher's current behavior and overlook distilling its reward modeling. To address this issue, we propose Teacher Value-based Knowledge Distillation (TVKD), which introduces an auxiliary reward derived from the teacher model's value function as a soft guide. This auxiliary reward is formulated to satisfy potential-based reward shaping, ensuring that the global reward structure and optimal policy of DPO are preserved. TVKD integrates into the standard DPO training framework and requires no additional rollouts. Our experimental results show that TVKD consistently improves performance across various benchmarks and model sizes.
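The abstract describes adding an auxiliary reward, derived from the teacher's value function and constrained to the potential-based reward-shaping form, on top of the standard DPO objective. The PyTorch sketch below illustrates one way such a shaped pairwise loss could look; the function name, the coefficient alpha, and the way the teacher value estimates enter the reward margin are illustrative assumptions, not the paper's exact TVKD formulation.

import torch
import torch.nn.functional as F

def dpo_loss_with_teacher_shaping(
    policy_chosen_logps,     # log pi_theta(y_w | x), shape (batch,)
    policy_rejected_logps,   # log pi_theta(y_l | x), shape (batch,)
    ref_chosen_logps,        # log pi_ref(y_w | x), shape (batch,)
    ref_rejected_logps,      # log pi_ref(y_l | x), shape (batch,)
    teacher_value_chosen,    # teacher value estimate for (x, y_w), shape (batch,)
    teacher_value_rejected,  # teacher value estimate for (x, y_l), shape (batch,)
    beta: float = 0.1,       # standard DPO temperature
    alpha: float = 0.1,      # weight on the auxiliary teacher signal (assumed)
):
    """Hypothetical DPO loss with a teacher-value shaping term (not the paper's exact loss)."""
    # Standard DPO implicit rewards: beta * log-ratio against the reference policy.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)

    # Auxiliary term in the spirit of potential-based reward shaping: a difference
    # of teacher value estimates between the two responses acts as a soft guide.
    # Its exact form in TVKD is not given on this page; this is an assumption.
    shaping = alpha * (teacher_value_chosen - teacher_value_rejected)

    # Bradley-Terry style pairwise objective on the shaped reward margin.
    margin = (chosen_reward - rejected_reward) + shaping
    return -F.logsigmoid(margin).mean()

# Example usage with placeholder tensors (stand-ins for per-sequence log-probs and values):
if __name__ == "__main__":
    rand = lambda: torch.randn(4)
    loss = dpo_loss_with_teacher_shaping(rand(), rand(), rand(), rand(), rand(), rand())
    print(loss.item())

Because the shaping term only shifts the pairwise reward margin by a teacher-derived potential difference, it adds guidance without requiring extra rollouts, consistent with how the abstract describes integrating TVKD into standard DPO training.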
Publisher
NeurIPS
Conference Place
San Diego, US
URI
https://scholar.gist.ac.kr/handle/local/33452
Access and License
  • Access type: Open
File List
  • No related files are available.

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.