DPOMetric

DPOMetric tracks statistics specific to preference optimization during DPO training.

from twinkle.metric import DPOMetric

metric = DPOMetric(device_mesh=..., process_group=...)

# Accumulate after each forward pass
metric.accumulate(inputs, outputs, ref_outputs=ref_outputs)

# Calculate aggregated metrics
result = metric.calculate()

Tracked metrics:

  • chosen_logps: Average log-probability of chosen responses

  • rejected_logps: Average log-probability of rejected responses

  • ref_chosen_logps: Reference model log-probability of chosen responses

  • ref_rejected_logps: Reference model log-probability of rejected responses

  • rewards/chosen: Implicit reward for chosen responses

  • rewards/rejected: Implicit reward for rejected responses

  • accuracy: Fraction of pairs where the chosen reward exceeds the rejected reward

  • margin: Average difference between the chosen and rejected rewards
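
To make the definitions above concrete, here is a minimal sketch of how such statistics are commonly derived in DPO from per-pair sequence log-probabilities. It is not DPOMetric's actual implementation: the function name `dpo_stats`, the `beta` parameter, and its default value are assumptions for illustration, and the source does not say whether DPOMetric scales rewards by `beta`. The implicit reward follows the standard DPO definition, beta times the difference between the policy and reference log-probabilities.

```python
from statistics import fmean


def dpo_stats(chosen_logps, rejected_logps,
              ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Hypothetical helper: DPO-style statistics for a list of preference pairs.

    Each argument is a list of per-sequence log-probabilities, one per pair.
    `beta` is an assumed scaling factor, not a documented DPOMetric argument.
    """
    # Implicit reward: beta * (policy log-prob - reference log-prob).
    chosen_rewards = [beta * (p - r)
                      for p, r in zip(chosen_logps, ref_chosen_logps)]
    rejected_rewards = [beta * (p - r)
                        for p, r in zip(rejected_logps, ref_rejected_logps)]
    # Per-pair reward margin; a positive margin means chosen is preferred.
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    return {
        "chosen_logps": fmean(chosen_logps),
        "rejected_logps": fmean(rejected_logps),
        "ref_chosen_logps": fmean(ref_chosen_logps),
        "ref_rejected_logps": fmean(ref_rejected_logps),
        "rewards/chosen": fmean(chosen_rewards),
        "rewards/rejected": fmean(rejected_rewards),
        "accuracy": fmean([1.0 if m > 0 else 0.0 for m in margins]),
        "margin": fmean(margins),
    }


if __name__ == "__main__":
    stats = dpo_stats(
        chosen_logps=[-1.0, -2.0],
        rejected_logps=[-3.0, -1.5],
        ref_chosen_logps=[-1.5, -2.0],
        ref_rejected_logps=[-2.5, -2.0],
        beta=1.0,
    )
    for name, value in stats.items():
        print(f"{name}: {value}")
```

In healthy training, accuracy and margin usually rise over time; a margin that grows while chosen_logps falls sharply can indicate that the policy is lowering the probability of both responses.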

DPOMetric aggregates these statistics across all data-parallel (DP) ranks, so the reported values reflect the global batch rather than a single rank. It accepts both interleaved and separate chosen/rejected batch formats.
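
The two ideas in this paragraph can be illustrated with a small sketch in plain Python. Everything here is an assumption for illustration: the helper names `split_interleaved` and `aggregate_mean` are hypothetical, the interleaved layout is assumed to be `[chosen_0, rejected_0, chosen_1, rejected_1, ...]`, and ranks are simulated as plain lists. In a real distributed run, the per-rank sums and counts would typically be combined with a collective such as an all-reduce; how DPOMetric does this internally is not described in the source.

```python
def split_interleaved(values):
    """Split an assumed [chosen_0, rejected_0, chosen_1, ...] layout.

    Returns (chosen, rejected) as two lists of equal length.
    """
    if len(values) % 2 != 0:
        raise ValueError("interleaved batch must contain an even number of items")
    return values[0::2], values[1::2]


def aggregate_mean(per_rank_values):
    """Global mean over all samples, given one list of values per rank.

    Summing totals and counts separately weights every sample equally,
    even when ranks hold different numbers of samples.
    """
    total = sum(sum(values) for values in per_rank_values)
    count = sum(len(values) for values in per_rank_values)
    return total / count if count else 0.0


if __name__ == "__main__":
    chosen, rejected = split_interleaved([-1.0, -3.0, -2.0, -1.5])
    print("chosen:", chosen, "rejected:", rejected)

    # Two simulated ranks with unequal local batch sizes.
    ranks = [[1.0, 2.0, 3.0], [5.0]]
    naive = sum(sum(v) / len(v) for v in ranks) / len(ranks)
    print("mean of per-rank means:", naive)
    print("sum/count aggregation:", aggregate_mean(ranks))
```

Averaging per-rank means gives a different answer from the sum/count approach whenever local batch sizes differ, which is why aggregating sums and counts is the safer default for metrics reported over the global batch.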