# GRPO Loss

Group Relative Policy Optimization (GRPO) and its variants implement policy gradient losses with PPO-style clipping and KL regularization.

## GRPOLoss

The standard GRPO loss with importance sampling, PPO clipping, and optional KL penalty.

```python
from twinkle.loss import GRPOLoss

loss_fn = GRPOLoss(
    clip_range=0.2,
    beta=0.01,  # KL penalty coefficient
)
model.set_loss(loss_fn)
```

**Parameters:**

- `clip_range`: PPO clipping range for importance weights (default: 0.2)
- `beta`: KL divergence penalty coefficient. Set to 0 to disable KL regularization

The loss handles both standard batches and packed sequences (detected via `position_ids`). It computes per-token importance weights, applies PPO clipping, and optionally adds a KL penalty term against the reference policy.

## Variants

Twinkle provides several GRPO variants:

### GSPOLoss

Sequence-level importance sampling variant that computes importance weights at the sequence level rather than token level.

```python
from twinkle.loss import GSPOLoss

loss_fn = GSPOLoss(clip_range=0.2, beta=0.01)
```

### SAPOLoss

Soft-gated Advantage Policy Optimization applies a sigmoid gate on the advantage to control the optimization direction.

```python
from twinkle.loss import SAPOLoss

loss_fn = SAPOLoss(clip_range=0.2, beta=0.01, tau=1.0)
```

### CISPOLoss

Clipped Importance Sampling Policy Optimization applies explicit clipping to importance weights before multiplying with advantages.

```python
from twinkle.loss import CISPOLoss

loss_fn = CISPOLoss(clip_range=0.2, beta=0.01)
```

### BNPOLoss

Batch-Normalized Policy Optimization normalizes per-token loss across the batch before aggregation.

```python
from twinkle.loss import BNPOLoss

loss_fn = BNPOLoss(clip_range=0.2, beta=0.01)
```

### DRGRPOLoss

Dynamic Ratio GRPO with fixed normalization that uses a fixed denominator for importance weight computation.


```python
from twinkle.loss import DRGRPOLoss

loss_fn = DRGRPOLoss(clip_range=0.2, beta=0.01)
```

> All GRPO variants share the same base pipeline for packed-sequence handling, log-probability alignment, and KL penalty computation. They differ primarily in how importance weights and advantages are combined.
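
To make the shared pipeline concrete, here is a minimal pure-Python sketch of the two ways importance weights are described above: token-level weights with PPO clipping and a KL penalty (as in `GRPOLoss`), and a single sequence-level weight (as in `GSPOLoss`). This is not Twinkle's implementation. The function names `grpo_token_loss` and `sequence_level_ratio` are hypothetical, the KL term assumes the non-negative estimator commonly used with GRPO, and the sequence-level weight assumes length normalization; the library may differ on all of these.

```python
import math


def grpo_token_loss(logp_new, logp_old, logp_ref, advantage,
                    clip_range=0.2, beta=0.01):
    """Hypothetical helper: mean clipped surrogate loss over one sequence.

    logp_new / logp_old / logp_ref are per-token log-probabilities under the
    current, sampling, and reference policies. `advantage` is the scalar
    group-relative advantage assigned to the whole sequence.
    """
    losses = []
    for lp, lp_old, lp_ref in zip(logp_new, logp_old, logp_ref):
        # Per-token importance weight between current and sampling policy.
        ratio = math.exp(lp - lp_old)
        # PPO clipping keeps the weight inside [1 - clip_range, 1 + clip_range].
        clipped = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
        # Pessimistic (minimum) surrogate, negated because we minimize.
        loss = -min(ratio * advantage, clipped * advantage)
        if beta > 0:
            # Assumed KL estimator: exp(d) - d - 1 with d = logp_ref - logp_new.
            d = lp_ref - lp
            loss += beta * (math.exp(d) - d - 1.0)
        losses.append(loss)
    return sum(losses) / len(losses)


def sequence_level_ratio(logp_new, logp_old):
    """Hypothetical helper: one length-normalized weight per sequence."""
    total = sum(a - b for a, b in zip(logp_new, logp_old))
    return math.exp(total / len(logp_new))


# Usage with made-up log-probabilities for a three-token completion.
new = [-1.0, -0.5, -2.0]
old = [-1.1, -0.7, -1.9]
ref = [-1.0, -0.6, -2.1]
print(grpo_token_loss(new, old, ref, advantage=1.0))
print(sequence_level_ratio(new, old))
```

With a positive advantage, the clipped surrogate stops rewarding a token once its weight exceeds `1 + clip_range`; with a negative advantage, the unclipped term dominates, so large weights are still penalized. Setting `beta=0` drops the KL term entirely, matching the parameter description above.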