# GKD Loss

The Generalized Knowledge Distillation (GKD) loss uses the Jensen-Shannon divergence (JSD) to distill knowledge from a teacher model into a student model.

```python
from twinkle.loss import GKDLoss

loss_fn = GKDLoss(
    teacher_mode='full',  # 'full', 'topk_local', 'topk_remote'
    beta=0.5,             # interpolation weight for JSD
    temperature=1.0,
)

# `model` is assumed to be an already constructed twinkle model instance
model.set_loss(loss_fn)
```

**Parameters:**
- `teacher_mode`: How teacher logits are obtained
  - `full`: Full vocabulary logits from a local teacher model
  - `topk_local`: Top-k logits from a local teacher with chunked computation for memory efficiency
  - `topk_remote`: Top-k logits from a remote API teacher
- `beta`: Interpolation weight between the student and teacher distributions in the JSD mixture (0 = pure student, 1 = pure teacher)
- `temperature`: Softmax temperature for both student and teacher distributions
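
To make the roles of `beta` and `temperature` concrete, here is a minimal pure-Python sketch of a generalized JSD for a single token position. It assumes the formulation from the GKD paper, with mixture `M = beta * teacher + (1 - beta) * student` and loss `beta * KL(teacher || M) + (1 - beta) * KL(student || M)`. The function names are hypothetical and this is not `GKDLoss` itself, which may differ in details such as reduction or the handling of `beta` at exactly 0 or 1.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def generalized_jsd(student_logits, teacher_logits, beta=0.5, temperature=1.0):
    """Generalized JSD for one token position (illustrative, assumed formulation).

    Intended for 0 < beta < 1; at the endpoints this expression is 0.
    """
    student = softmax(student_logits, temperature)
    teacher = softmax(teacher_logits, temperature)
    # Mixture distribution: beta weights the teacher, (1 - beta) the student.
    mixture = [beta * t + (1 - beta) * s for s, t in zip(student, teacher)]
    return beta * kl(teacher, mixture) + (1 - beta) * kl(student, mixture)
```

With `beta=0.5` this reduces to the standard symmetric JSD, which is zero when both distributions match and never exceeds `ln 2`. Under this formulation, raising `temperature` flattens both distributions before they are compared.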

The GKD loss implements chunked computation internally to reduce peak memory usage when working with large vocabularies.

> GKD is useful for training smaller student models that mimic the behavior of larger teacher models, and supports both local and remote teacher setups.