DPOMetric

The DPOMetric tracks preference optimization-specific statistics during DPO training.

from twinkle.metric import DPOMetric

metric = DPOMetric(device_mesh=..., process_group=...)

# Accumulate after each forward pass
metric.accumulate(inputs, outputs, ref_outputs=ref_outputs)

# Calculate aggregated metrics
result = metric.calculate()

Tracked metrics:

chosen_logps: Average log-probability of chosen responses
rejected_logps: Average log-probability of rejected responses
ref_chosen_logps: Reference model log-probability of chosen responses
ref_rejected_logps: Reference model log-probability of rejected responses
rewards/chosen: Implicit reward for chosen responses
rewards/rejected: Implicit reward for rejected responses
accuracy: Fraction of pairs where chosen is preferred over rejected
margin: Average reward margin between chosen and rejected

DPOMetric performs DP-aware aggregation across all data-parallel ranks. It supports both interleaved and separate chosen/rejected batch formats.