GRPO Loss
Group Relative Policy Optimization (GRPO) and its variants implement policy gradient losses with PPO-style clipping and KL regularization.
GRPOLoss
The standard GRPO loss with importance sampling, PPO clipping, and optional KL penalty.
from twinkle.loss import GRPOLoss
loss_fn = GRPOLoss(
    clip_range=0.2,  # PPO clipping range for importance weights
    beta=0.01,       # KL penalty coefficient
)
model.set_loss(loss_fn)
Parameters:
clip_range: PPO clipping range for importance weights (default: 0.2).
beta: KL divergence penalty coefficient. Set to 0 to disable KL regularization.
The loss handles both standard batches and packed sequences (detected via position_ids). It computes per-token importance weights, applies PPO clipping, and optionally adds a KL penalty term against the reference policy.
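The three steps named above (importance weight, clipping, KL penalty) can be sketched for a single completion in plain Python. This is an illustrative sketch, not Twinkle's implementation: the function name `grpo_token_losses`, its argument layout, and the choice of the `exp(d) - d - 1` KL estimator are assumptions made for the example.

```python
import math

def grpo_token_losses(logp_new, logp_old, logp_ref, advantage,
                      clip_range=0.2, beta=0.01):
    """Per-token clipped surrogate loss with an optional KL penalty.

    Hypothetical helper: the log-prob arguments are lists of per-token
    log-probabilities for one completion; `advantage` is its scalar advantage.
    """
    losses = []
    for lp_new, lp_old, lp_ref in zip(logp_new, logp_old, logp_ref):
        # Importance weight between the current and the sampling policy.
        ratio = math.exp(lp_new - lp_old)
        # PPO clipping: keep the ratio inside [1 - clip_range, 1 + clip_range].
        clipped = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
        # Pessimistic (minimum) surrogate objective, negated to form a loss.
        loss = -min(ratio * advantage, clipped * advantage)
        if beta > 0.0:
            # Assumed estimator of KL(policy || reference): exp(d) - d - 1,
            # with d = log pi_ref - log pi. It is always non-negative.
            d = lp_ref - lp_new
            loss += beta * (math.exp(d) - d - 1.0)
        losses.append(loss)
    return losses
```

With `beta=0` the KL branch is skipped entirely, which matches the documented way of disabling KL regularization.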
Variants
Twinkle provides several GRPO variants:
GSPOLoss
Computes one importance weight per sequence rather than one per token.
from twinkle.loss import GSPOLoss
loss_fn = GSPOLoss(clip_range=0.2, beta=0.01)
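To make the sequence-level weighting concrete, here is a minimal sketch. It assumes the length-normalized form of the ratio (the exponential of the mean per-token log-ratio); whether GSPOLoss uses exactly this form is not stated here, and both function names are hypothetical.

```python
import math

def sequence_importance_weight(logp_new, logp_old):
    """One ratio per sequence: exp(mean over tokens of logp_new - logp_old)."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))

def gspo_sequence_loss(logp_new, logp_old, advantage, clip_range=0.2):
    """Clipped surrogate loss applied to the single sequence-level ratio."""
    ratio = sequence_importance_weight(logp_new, logp_old)
    clipped = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
    return -min(ratio * advantage, clipped * advantage)
```

Because the per-token log-ratios are averaged before exponentiation, a single outlier token moves the weight far less than it would under token-level importance sampling.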
SAPOLoss
Soft-gated Advantage Policy Optimization applies a sigmoid gate on the advantage to control the optimization direction.
from twinkle.loss import SAPOLoss
loss_fn = SAPOLoss(clip_range=0.2, beta=0.01, tau=1.0)
CISPOLoss
Clipped Importance Sampling Policy Optimization applies explicit clipping to importance weights before multiplying with advantages.
from twinkle.loss import CISPOLoss
loss_fn = CISPOLoss(clip_range=0.2, beta=0.01)
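The practical difference from PPO-style clipping shows up in the weight each token places on the gradient of its log-probability. The sketch below compares the two under the assumption that the clipped weight is treated as a constant (stop-gradient) and that the clip range is symmetric; both function names are hypothetical and this is not Twinkle's code.

```python
def ppo_grad_weight(ratio, advantage, clip_range=0.2):
    """Weight on grad(log pi) under the PPO clipped surrogate.

    When the clipped branch is the strict minimum, the objective is constant
    in the policy parameters, so the token contributes no gradient.
    """
    clipped = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
    if clipped * advantage < ratio * advantage:
        return 0.0
    return ratio * advantage

def cispo_grad_weight(ratio, advantage, clip_range=0.2):
    """Weight on grad(log pi) when the importance weight itself is clipped.

    The clipped ratio acts as a fixed coefficient, so every token keeps a
    bounded, non-zero contribution.
    """
    clipped = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
    return clipped * advantage
```

For a token whose ratio has drifted to 1.5 with a positive advantage, the PPO surrogate drops it from the update, while the clipped-weight form still updates it with a weight capped at `1 + clip_range`.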
BNPOLoss
Batch-Normalized Policy Optimization normalizes per-token loss across the batch before aggregation.
from twinkle.loss import BNPOLoss
loss_fn = BNPOLoss(clip_range=0.2, beta=0.01)
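One common reading of batch-level normalization is that all token losses in the batch are summed and divided by the total token count, instead of averaging within each sequence first. The sketch below contrasts the two aggregations; it is an assumption about what "across the batch" means here, and the function names are hypothetical.

```python
def per_sequence_mean(token_losses):
    """Average within each sequence, then average across sequences."""
    seq_means = [sum(seq) / len(seq) for seq in token_losses]
    return sum(seq_means) / len(seq_means)

def batch_token_mean(token_losses):
    """Sum every token loss in the batch, then divide by the total token count."""
    total = sum(sum(seq) for seq in token_losses)
    count = sum(len(seq) for seq in token_losses)
    return total / count
```

The two differ whenever sequence lengths differ: under per-sequence averaging each sequence has equal weight, while under batch-level normalization each token has equal weight.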
DRGRPOLoss
Dynamic Ratio GRPO applies fixed normalization, using a fixed denominator for importance weight computation.
from twinkle.loss import DRGRPOLoss
loss_fn = DRGRPOLoss(clip_range=0.2, beta=0.01)
All GRPO variants share the same base pipeline for packed-sequence handling, log-probability alignment, and KL penalty computation. They differ primarily in how importance weights and advantages are combined.
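As an illustration of the shared packed-sequence step, sequence boundaries can be recovered from position_ids alone. The sketch assumes that position IDs restart at 0 at the start of every packed sequence; the helper `split_packed` is hypothetical and is not part of Twinkle's API.

```python
def split_packed(position_ids):
    """Return (start, end) index pairs, end exclusive, one per packed sequence.

    Assumes each packed sequence begins with position id 0.
    """
    starts = [i for i, p in enumerate(position_ids) if p == 0]
    ends = starts[1:] + [len(position_ids)]
    return list(zip(starts, ends))

# Three sequences of lengths 3, 2, and 4 packed into one row.
segments = split_packed([0, 1, 2, 0, 1, 0, 1, 2, 3])
```

Per-sequence quantities, such as a sequence-level importance weight or a per-sequence mean loss, can then be computed by slicing the per-token values with each `(start, end)` pair.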