TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization

April 19, 2025

Direct Preference Optimization (DPO) has been widely adopted for preference alignment of Large Language Models (LLMs) due to its simplicity and effectiveness. However, DPO is derived as a bandit problem in which the whole response is treated as a single arm, ignoring the importance differences between tokens, which may affect optimization efficiency and make it difficult to achieve optimal results. In this work, we propose that the optimal data for DPO has equal expected rewards for each token in winning and losing responses, as there is no difference in token importance. However, since the…
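To make the "whole response as a single arm" point concrete, here is a minimal sketch of the standard sequence-level DPO loss in plain Python. The per-token log-probabilities are summed with equal weight, which is exactly the uniform treatment of tokens described above. The function names and the optional `win_weights` / `lose_weights` arguments are assumptions added for illustration: they only show where a per-token importance weight could enter, and are not the estimator proposed in the paper.

```python
import math


def sequence_logratio(policy_logps, ref_logps, weights=None):
    """Weighted sum over tokens of log pi(y_t | ...) - log pi_ref(y_t | ...)."""
    if weights is None:
        # Standard DPO: every token contributes equally to the response score.
        weights = [1.0] * len(policy_logps)
    return sum(w * (p - r) for w, p, r in zip(weights, policy_logps, ref_logps))


def dpo_loss(win_policy, win_ref, lose_policy, lose_ref,
             beta=0.1, win_weights=None, lose_weights=None):
    """-log sigmoid(beta * (winning log-ratio - losing log-ratio)).

    The weight arguments are hypothetical placeholders for token-level
    importance weights; leaving them as None gives the standard DPO loss.
    """
    margin = beta * (
        sequence_logratio(win_policy, win_ref, win_weights)
        - sequence_logratio(lose_policy, lose_ref, lose_weights)
    )
    # -log sigmoid(m) = softplus(-m), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Toy per-token log-probabilities (made-up numbers, for illustration only).
win_policy, win_ref = [-0.5, -1.0], [-1.0, -2.0]
lose_policy, lose_ref = [-0.5], [-0.5]
loss = dpo_loss(win_policy, win_ref, lose_policy, lose_ref)
```

When the policy equals the reference model the margin is zero and the loss is log 2; raising the policy's log-probability on any token of the winning response lowers the loss by the same amount per unit of log-probability, whichever token it is, unless non-uniform weights are supplied.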
