# GRPOMetric

The `GRPOMetric` tracks policy optimization diagnostics during GRPO training, including KL divergence, clipping rates, entropy, and log-probability statistics.

## Usage

```python
from twinkle.metric import GRPOMetric

metric = GRPOMetric(
    device_mesh=device_mesh,
    process_group=process_group,
    epsilon=0.2,          # PPO clip range
    temperature=1.0,      # Sampling temperature for logp rescaling
    top_k_kl=10,          # Track top-K high-KL tokens per step
)

# During training loop
metric.accumulate(inputs, outputs, old_logps=old_logps, advantages=advantages)

# At log interval
results = metric.calculate()
# results (illustrative values; 'train/token_ratio_max' omitted): {
#   'train/policy_confidence': 0.85,
#   'train/mean_new_logp': -1.23,
#   'train/mean_old_logp': -1.30,
#   'train/logp_diff_mean': 0.07,
#   'train/approx_kl': 0.003,
#   'train/token_kl_max': 0.15,
#   'train/entropy': 2.1,
#   'train/clip_ratio': 0.02,
#   'train/clip_ratio_low': 0.01,
#   'train/clip_ratio_high': 0.01,
# }
```

## Reported Metrics

| Metric | Description |
|:-------|:------------|
| `train/policy_confidence` | `exp(mean_new_logp)`; higher values mean the model is more confident in its sampled tokens |
| `train/mean_new_logp` | Average log-probability of generated tokens under current policy |
| `train/mean_old_logp` | Average log-probability of generated tokens under the old (rollout) policy, as supplied in `old_logps` |
| `train/logp_diff_mean` | Mean (new - old) log-probability difference |
| `train/approx_kl` | Schulman K3 estimator of KL(old \|\| new) |
| `train/token_kl_max` | Maximum per-token KL across all ranks |
| `train/token_ratio_max` | Maximum per-token importance ratio `exp(new_logp - old_logp)` across all ranks |
| `train/entropy` | Average token-level entropy |
| `train/clip_ratio` | Fraction of tokens clipped on either side (`clip_ratio_low + clip_ratio_high`) |
| `train/clip_ratio_low` | Fraction of tokens with ratio < 1 - ε and negative advantage |
| `train/clip_ratio_high` | Fraction of tokens with ratio > 1 + `epsilon_high` and positive advantage |
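To make the definitions above concrete, here is a minimal single-process sketch that computes most of these metrics from flattened per-token values. It is not `GRPOMetric`'s implementation: the function `grpo_token_metrics` and its arguments are hypothetical, it assumes the "per-token KL" is the per-token K3 value, it skips the cross-rank max reduction, and it omits entropy, which needs the full vocabulary distribution rather than sampled-token log-probabilities.

```python
import math
from typing import Dict, List, Optional


def grpo_token_metrics(
    new_logps: List[float],
    old_logps: List[float],
    advantages: List[float],
    mask: List[bool],
    epsilon: float = 0.2,
    epsilon_high: Optional[float] = None,
) -> Dict[str, float]:
    """Hypothetical single-process sketch of the GRPOMetric definitions."""
    eps_high = epsilon if epsilon_high is None else epsilon_high

    # Keep only unmasked tokens (e.g. drop positions labelled ignore_index).
    kept = [(n, o, a) for n, o, a, m in zip(new_logps, old_logps, advantages, mask) if m]
    if not kept:
        return {}
    count = len(kept)

    log_ratios = [n - o for n, o, _ in kept]
    ratios = [math.exp(lr) for lr in log_ratios]
    # Schulman K3 estimator of KL(old || new), per token: (r - 1) - log r.
    k3 = [(r - 1.0) - lr for r, lr in zip(ratios, log_ratios)]

    # A token counts as clipped only when the PPO clip term is active:
    # ratio below 1 - eps with negative advantage, or above 1 + eps_high with positive advantage.
    low = sum(1 for r, (_, _, a) in zip(ratios, kept) if r < 1.0 - epsilon and a < 0)
    high = sum(1 for r, (_, _, a) in zip(ratios, kept) if r > 1.0 + eps_high and a > 0)

    mean_new = sum(n for n, _, _ in kept) / count
    return {
        "train/policy_confidence": math.exp(mean_new),
        "train/mean_new_logp": mean_new,
        "train/mean_old_logp": sum(o for _, o, _ in kept) / count,
        "train/logp_diff_mean": sum(log_ratios) / count,
        "train/approx_kl": sum(k3) / count,
        "train/token_kl_max": max(k3),
        "train/token_ratio_max": max(ratios),
        "train/clip_ratio": (low + high) / count,
        "train/clip_ratio_low": low / count,
        "train/clip_ratio_high": high / count,
    }
```

Because each K3 term `(r - 1) - log r` is non-negative, `approx_kl` stays non-negative even when positive and negative log-ratios cancel and `logp_diff_mean` is zero.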

## Variants

- **`GSPOMetric`**: computes the clip rate at the sequence level, using the geometric-mean importance ratio of each sequence
- **`CISPOMetric`**: computes an unconditional clip rate that is not gated by the sign of the advantage
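The two variants differ in what is compared against the clip bounds. Below is a rough sketch of both, assuming the GSPO sequence ratio is gated by the sign of the sequence advantage in the same way as the token-level GRPO rates (the text above only states that the CISPO rate is ungated). The names `sequence_clip_rate` and `unconditional_clip_rate` are hypothetical.

```python
import math
from typing import Optional, Sequence


def sequence_clip_rate(
    new_logps: Sequence[Sequence[float]],
    old_logps: Sequence[Sequence[float]],
    seq_advantages: Sequence[float],
    epsilon: float = 0.2,
    epsilon_high: Optional[float] = None,
) -> float:
    """GSPO-style rate: fraction of sequences whose geometric-mean ratio is clipped."""
    eps_high = epsilon if epsilon_high is None else epsilon_high
    clipped = 0
    for new_seq, old_seq, adv in zip(new_logps, old_logps, seq_advantages):
        # Geometric mean of token ratios == exp(mean per-token log-ratio).
        mean_log_ratio = sum(n - o for n, o in zip(new_seq, old_seq)) / len(new_seq)
        s = math.exp(mean_log_ratio)
        # Assumed: gated by advantage sign, as in the token-level GRPO rates.
        if (s < 1.0 - epsilon and adv < 0) or (s > 1.0 + eps_high and adv > 0):
            clipped += 1
    return clipped / len(seq_advantages)


def unconditional_clip_rate(
    new_logps: Sequence[float],
    old_logps: Sequence[float],
    epsilon: float = 0.2,
    epsilon_high: Optional[float] = None,
) -> float:
    """CISPO-style rate: any token outside [1 - eps, 1 + eps_high], advantage ignored."""
    eps_high = epsilon if epsilon_high is None else epsilon_high
    ratios = [math.exp(n - o) for n, o in zip(new_logps, old_logps)]
    clipped = sum(1 for r in ratios if r < 1.0 - epsilon or r > 1.0 + eps_high)
    return clipped / len(ratios)
```

A sequence whose token log-ratios are +0.5 and -0.5 has a geometric-mean ratio of exactly 1, so it is never counted as clipped at the sequence level, even though each of its tokens lies outside [0.8, 1.2] and would count toward the unconditional rate.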

## Parameters

| Parameter | Type | Default | Description |
|:----------|:-----|:--------|:------------|
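| `epsilon` | float | 0.2 | Lower clip range ε: a token counts as clipped low when its ratio is below 1 - ε |
| `epsilon_high` | float or None | None | Upper clip range: a token counts as clipped high when its ratio exceeds 1 + `epsilon_high`; falls back to `epsilon` when `None` |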
| `temperature` | float | 1.0 | Sampling temperature, used to rescale logps to T = 1 before computing KL |
| `top_k_kl` | int | 0 | If > 0, record top-K high-KL token details |
| `ignore_index` | int | -100 | Label value whose tokens are excluded from all statistics |
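`top_k_kl` and `ignore_index` interact: masked positions never enter the high-KL ranking. The sketch below assumes that "token details" means the token position and its per-token K3 value (the real metric may record more, such as token ids or decoded text); `top_k_kl_tokens` is a hypothetical helper, not part of `twinkle`.

```python
import heapq
import math
from typing import List, Tuple


def top_k_kl_tokens(
    new_logps: List[float],
    old_logps: List[float],
    labels: List[int],
    top_k_kl: int = 0,
    ignore_index: int = -100,
) -> List[Tuple[int, float]]:
    """Hypothetical: return (position, K3 KL) for the top-K highest-KL unmasked tokens."""
    if top_k_kl <= 0:
        # Tracking is disabled, mirroring the documented default of 0.
        return []
    scored = []
    for pos, (n, o, label) in enumerate(zip(new_logps, old_logps, labels)):
        if label == ignore_index:
            # Masked tokens are skipped, however large their KL would be.
            continue
        log_ratio = n - o
        kl = math.exp(log_ratio) - 1.0 - log_ratio  # per-token K3 estimate
        scored.append((pos, kl))
    # nlargest returns the K highest-KL entries in descending order.
    return heapq.nlargest(top_k_kl, scored, key=lambda item: item[1])
```

Using `heapq.nlargest` keeps the selection at O(n log K) rather than sorting every token, which matters little here but scales better for long sequences.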