DPO Loss
Direct Preference Optimization (DPO) and its variants are used for aligning models with human preferences without requiring a separate reward model.
DPOLoss
The standard DPO loss supports multiple loss types and optional reference-free mode.
from twinkle.loss import DPOLoss
loss_fn = DPOLoss(
loss_type='sigmoid', # 'sigmoid', 'hinge', 'ipo', 'kto'
beta=0.1,
sft_weight=0.0, # optional SFT regularization weight
reference_free=False,
)
model.set_loss(loss_fn)
Parameters:
loss_type: DPO variant —sigmoid(default),hinge,ipo, orktobeta: Temperature parameter controlling preference strengthsft_weight: Weight for an additional SFT loss term on chosen responsesreference_free: IfTrue, skips reference model log-probabilities
The loss expects interleaved chosen/rejected pairs in the batch. It computes sequence-level log-probabilities and optimizes the policy to prefer chosen over rejected responses.
SimPOLoss
Simplified Preference Optimization that removes the need for a reference model by using length-normalized log-probabilities.
from twinkle.loss import SimPOLoss
loss_fn = SimPOLoss(beta=2.0, gamma=1.0)
Parameters:
beta: Scaling factor for the logit differencegamma: Margin term added to preference gap
CPOLoss
Contrastive Preference Optimization combines preference learning with behavior cloning.
from twinkle.loss import CPOLoss
loss_fn = CPOLoss(beta=0.1, cpo_alpha=1.0)
Parameters:
beta: Temperature for the preference losscpo_alpha: Weight of the behavior cloning (NLL) loss on chosen responses
ORPOLoss
Odds Ratio Preference Optimization unifies SFT and preference alignment in a single loss.
from twinkle.loss import ORPOLoss
loss_fn = ORPOLoss(beta=0.1)
The loss combines a standard NLL term on chosen responses with a log-odds-ratio penalty that pushes the model away from rejected responses.
All preference losses inherit shared utilities from
PreferenceLossBase, including log-probability computation, chosen/rejected splitting, and sequence-level aggregation.