GRPOAdvantage
The GRPO (Group Relative Policy Optimization) advantage function computes each sample's advantage by subtracting the mean reward of the group that the sample belongs to.
Usage Example
from twinkle.advantage import GRPOAdvantage
advantage_fn = GRPOAdvantage()
# Assume 2 prompts, each generating 4 samples
rewards = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0] # 8 reward values
advantages = advantage_fn(rewards, num_generations=4, scale='group')
# Each advantage is the sample's reward minus the mean reward of its group:
# Group 1: [0.0-0.5, 1.0-0.5, 0.0-0.5, 1.0-0.5] = [-0.5, 0.5, -0.5, 0.5]
# Group 2: [1.0-0.25, 0.0-0.25, 0.0-0.25, 0.0-0.25] = [0.75, -0.25, -0.25, -0.25]
How It Works
GRPO groups samples (each group corresponds to multiple generations from one prompt), then within each group:
Calculate the group mean reward
Advantage for each sample = reward - group mean
Optionally normalize the advantage values
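The steps above can be sketched in plain Python. This is only an illustration, not the twinkle implementation: the function name `group_relative_advantages`, the `normalize` flag, and the choice of dividing by the group's population standard deviation (with a small `eps`) are all assumptions made for this example.

```python
from typing import List


def group_relative_advantages(
    rewards: List[float],
    num_generations: int,
    normalize: bool = False,
    eps: float = 1e-8,
) -> List[float]:
    """Illustrative sketch (hypothetical helper, not the library's code)."""
    if num_generations <= 0 or len(rewards) % num_generations != 0:
        raise ValueError("len(rewards) must be a positive multiple of num_generations")

    advantages: List[float] = []
    # Consecutive chunks of num_generations rewards form one group (one prompt).
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mean = sum(group) / len(group)          # step 1: group mean reward
        centered = [r - mean for r in group]    # step 2: reward - group mean
        if normalize:
            # Step 3 (assumed form): divide by the group's population standard
            # deviation; eps avoids division by zero for identical rewards.
            std = (sum(c * c for c in centered) / len(group)) ** 0.5
            centered = [c / (std + eps) for c in centered]
        advantages.extend(centered)
    return advantages


rewards = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
print(group_relative_advantages(rewards, num_generations=4))
# [-0.5, 0.5, -0.5, 0.5, 0.75, -0.25, -0.25, -0.25]
```

Without normalization, the output matches the per-group values listed in the usage example.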
This method:
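Reduces the variance of the policy gradient, because the group mean acts as a baseline, which improves training stability
Compares each sample with the others in its group, which better matches the relative nature of human preferences
Reduces sensitivity to the absolute reward level, since only differences from the group mean matter; the optional normalization also removes the dependence on reward scale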
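A small check can make the last point concrete. The sketch below uses two hypothetical helpers, `centered` and `standardized`, written only for this illustration (they are not part of twinkle), and it assumes normalization means dividing by the group's population standard deviation. Adding a constant to every reward leaves the centered advantages unchanged, while multiplying the rewards only stops mattering once the advantages are normalized.

```python
def centered(group):
    """Reward minus group mean (hypothetical helper for illustration)."""
    mean = sum(group) / len(group)
    return [r - mean for r in group]


def standardized(group):
    """Centered rewards divided by the group's population std (assumed form)."""
    c = centered(group)
    std = (sum(x * x for x in c) / len(c)) ** 0.5
    # If every reward in the group is identical, std is 0 and all advantages are 0.
    return [x / std for x in c] if std > 0 else c


base = [0.0, 1.0, 0.0, 1.0]
shifted = [r + 10.0 for r in base]   # same rewards with a constant offset
scaled = [r * 4.0 for r in base]     # same rewards on a larger scale

print(centered(base) == centered(shifted))         # True: the offset cancels
print(centered(scaled))                            # [-2.0, 2.0, -2.0, 2.0]
print(standardized(base) == standardized(scaled))  # True: the scale cancels
```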
Complete Training Example
Using the advantage function in GRPO training:
from twinkle.advantage import GRPOAdvantage
from twinkle.model import TransformersModel
from twinkle.sampler import vLLMSampler
# Create components
actor = TransformersModel(model_id='ms://Qwen/Qwen3.5-4B')
sampler = vLLMSampler(model_id='ms://Qwen/Qwen3.5-4B')
reward_fn = ...
advantage_fn = GRPOAdvantage()
# Training loop
for batch in dataloader:
    # Sample generation
    sample_response = sampler.sample(batch, num_samples=4)
    input_data = [seq.new_input_feature for response in sample_response for seq in response.sequences]
    ...
    rewards = reward_fn(...)
    # Calculate advantages
    advantages = advantage_fn(rewards, num_generations=4)
    # Policy optimization
    loss = actor.forward_backward(
        inputs=input_data,
        advantages=advantages
    )
    actor.clip_grad_and_step()
The GRPO method is simple and efficient, suitable for most RLHF training scenarios.