GRPOMetric
The GRPOMetric tracks policy optimization diagnostics during GRPO training, including KL divergence, clipping rates, entropy, and log-probability statistics.
Usage
from twinkle.metric import GRPOMetric
metric = GRPOMetric(
device_mesh=device_mesh,
process_group=process_group,
epsilon=0.2, # PPO clip range
temperature=1.0, # Sampling temperature for logp rescaling
top_k_kl=10, # Track top-K high-KL tokens per step
)
# During training loop
metric.accumulate(inputs, outputs, old_logps=old_logps, advantages=advantages)
# At log interval
results = metric.calculate()
# results: {
# 'train/policy_confidence': 0.85,
# 'train/mean_new_logp': -1.23,
# 'train/mean_old_logp': -1.30,
# 'train/logp_diff_mean': 0.07,
# 'train/approx_kl': 0.003,
# 'train/token_kl_max': 0.15,
# 'train/entropy': 2.1,
# 'train/clip_ratio': 0.02,
# 'train/clip_ratio_low': 0.01,
# 'train/clip_ratio_high': 0.01,
# }
Reported Metrics
| Metric | Description |
|---|---|
train/policy_confidence |
exp(mean_new_logp) — higher means model is more confident |
train/mean_new_logp |
Average log-probability of generated tokens under current policy |
train/mean_old_logp |
Average log-probability under reference policy |
train/logp_diff_mean |
Mean (new - old) log-probability difference |
train/approx_kl |
Schulman K3 estimator of KL(old || new) |
train/token_kl_max |
Maximum per-token KL across all ranks |
train/token_ratio_max |
Maximum importance weight across all ranks |
train/entropy |
Average token-level entropy |
train/clip_ratio |
Fraction of tokens clipped (low + high) |
train/clip_ratio_low |
Fraction clipped below (ratio < 1-ε, negative advantage) |
train/clip_ratio_high |
Fraction clipped above (ratio > 1+ε, positive advantage) |
Variants
GSPOMetric— Computes clip rate at sequence level (geometric-mean ratio per sequence)CISPOMetric— Unconditional clip rate (not gated by advantage sign)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
epsilon |
float | 0.2 | Lower clip boundary |
epsilon_high |
float | None | Upper clip boundary (defaults to epsilon) |
temperature |
float | 1.0 | Rescale logps to T=1 before computing KL |
top_k_kl |
int | 0 | If > 0, record top-K high-KL token details |
ignore_index |
int | -100 | Label value to mask out |