# DPO Loss

Direct Preference Optimization (DPO) and its variants are used for aligning models with human preferences without requiring a separate reward model.

## DPOLoss

The standard DPO loss supports multiple loss types and an optional reference-free mode.

```python
from twinkle.loss import DPOLoss

loss_fn = DPOLoss(
    loss_type='sigmoid',  # 'sigmoid', 'hinge', 'ipo', 'kto'
    beta=0.1,
    sft_weight=0.0,       # optional SFT regularization weight
    reference_free=False,
)
model.set_loss(loss_fn)
```

**Parameters:**

- `loss_type`: DPO variant: `sigmoid` (default), `hinge`, `ipo`, or `kto`
- `beta`: Temperature parameter controlling preference strength
- `sft_weight`: Weight for an additional SFT loss term on chosen responses
- `reference_free`: If `True`, skips reference model log-probabilities

The loss expects interleaved chosen/rejected pairs in the batch. It computes sequence-level log-probabilities and optimizes the policy to prefer chosen over rejected responses.

## SimPOLoss

Simplified Preference Optimization that removes the need for a reference model by using length-normalized log-probabilities.

```python
from twinkle.loss import SimPOLoss

loss_fn = SimPOLoss(beta=2.0, gamma=1.0)
```

**Parameters:**

- `beta`: Scaling factor for the logit difference
- `gamma`: Margin term added to preference gap

## CPOLoss

Contrastive Preference Optimization combines preference learning with behavior cloning.

```python
from twinkle.loss import CPOLoss

loss_fn = CPOLoss(beta=0.1, cpo_alpha=1.0)
```

**Parameters:**

- `beta`: Temperature for the preference loss
- `cpo_alpha`: Weight of the behavior cloning (NLL) loss on chosen responses

## ORPOLoss

Odds Ratio Preference Optimization unifies SFT and preference alignment in a single loss.

```python
from twinkle.loss import ORPOLoss

loss_fn = ORPOLoss(beta=0.1)
```

The loss combines a standard NLL term on chosen responses with a log-odds-ratio penalty that pushes the model away from rejected responses.

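To make the two ORPO terms concrete, here is a minimal per-pair sketch in plain Python. It is not the `ORPOLoss` implementation: the function names are hypothetical, the inputs are assumed to be length-normalized (average per-token) log-probabilities, and `beta` is assumed to weight the odds-ratio penalty, which may differ from how the library uses it.

```python
import math


def log_odds(avg_logp: float) -> float:
    """Return log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def orpo_loss_sketch(chosen_avg_logp: float,
                     rejected_avg_logp: float,
                     beta: float = 0.1) -> float:
    """Hypothetical per-pair ORPO loss: NLL on chosen + beta * odds-ratio penalty."""
    # Standard NLL term on the chosen response.
    nll = -chosen_avg_logp
    # Log odds ratio between chosen and rejected responses.
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    # Penalty is small when chosen is much more likely than rejected.
    penalty = -log_sigmoid(log_odds_ratio)
    return nll + beta * penalty


# The loss is lower when the model already prefers the chosen response.
print(orpo_loss_sketch(-0.5, -2.0))
print(orpo_loss_sketch(-2.0, -0.5))
```

When both responses are equally likely, the log odds ratio is zero and the penalty reduces to `log(2)`, so only the NLL term distinguishes good from bad chosen-response fits.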
> All preference losses inherit shared utilities from `PreferenceLossBase`, including log-probability computation, chosen/rejected splitting, and sequence-level aggregation.
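
The shared steps named above can be sketched in plain Python. This is an illustration only, not the `PreferenceLossBase` code: the helper names are hypothetical, and the interleaved layout is assumed to be `[chosen_0, rejected_0, chosen_1, rejected_1, ...]`. The last function shows how a sigmoid-style DPO loss could be built on top of these helpers.

```python
import math
from typing import List, Sequence, Tuple


def sequence_logps(token_logps: Sequence[Sequence[float]],
                   mask: Sequence[Sequence[int]]) -> List[float]:
    """Sum per-token log-probs over positions where mask == 1 (response tokens)."""
    return [
        sum(lp for lp, m in zip(row, mask_row) if m)
        for row, mask_row in zip(token_logps, mask)
    ]


def split_interleaved(values: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Split an interleaved batch into (chosen, rejected); assumes even length."""
    return list(values[0::2]), list(values[1::2])


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_sigmoid_loss_sketch(policy_logps: Sequence[float],
                            ref_logps: Sequence[float],
                            beta: float = 0.1) -> float:
    """Mean sigmoid DPO loss over interleaved sequence-level log-probs."""
    pol_chosen, pol_rejected = split_interleaved(policy_logps)
    ref_chosen, ref_rejected = split_interleaved(ref_logps)
    losses = []
    for pc, pr, rc, rr in zip(pol_chosen, pol_rejected, ref_chosen, ref_rejected):
        # How much more the policy prefers chosen over rejected than the reference does.
        logits = (pc - pr) - (rc - rr)
        losses.append(-log_sigmoid(beta * logits))
    return sum(losses) / len(losses)


# One pair: two sequences of three tokens; the last token of the first is padding.
token_logps = [[-0.1, -0.2, -9.0], [-1.0, -1.0, -1.0]]
mask = [[1, 1, 0], [1, 1, 1]]
policy = sequence_logps(token_logps, mask)
print(policy)
print(dpo_sigmoid_loss_sketch(policy, ref_logps=[-1.0, -2.0]))
```

Masking before summation keeps prompt and padding tokens out of the sequence-level score, so only response tokens influence the preference gap.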