# GRPOAdvantage GRPO (Group Relative Policy Optimization) advantage function calculates advantages by subtracting the group mean. ## Usage Example ```python from twinkle.advantage import GRPOAdvantage advantage_fn = GRPOAdvantage() # Assume 2 prompts, each generating 4 samples rewards = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0] # 8 reward values advantages = advantage_fn(rewards, num_generations=4, scale='group') # Advantages will be each group minus the group mean: # Group 1: [0.0-0.5, 1.0-0.5, 0.0-0.5, 1.0-0.5] = [-0.5, 0.5, -0.5, 0.5] # Group 2: [1.0-0.25, 0.0-0.25, 0.0-0.25, 0.0-0.25] = [0.75, -0.25, -0.25, -0.25] ``` ## How It Works GRPO groups samples (each group corresponds to multiple generations from one prompt), then within each group: 1. Calculate the group mean reward 2. Advantage for each sample = reward - group mean 3. Optionally normalize the advantage values This method: - Reduces variance and improves training stability - Performs relative comparisons within groups, better aligned with relative nature of human preferences - Avoids the impact of reward scale ## Complete Training Example Using the advantage function in GRPO training: ```python from twinkle.advantage import GRPOAdvantage from twinkle.model import TransformersModel from twinkle.sampler import vLLMSampler # Create components actor = TransformersModel(model_id='ms://Qwen/Qwen3.5-4B') sampler = vLLMSampler(model_id='ms://Qwen/Qwen3.5-4B') reward_fn = ... advantage_fn = GRPOAdvantage() # Training loop for batch in dataloader: # Sample generation sample_response = sampler.sample(batch, num_samples=4) input_data = [seq.new_input_feature for response in sample_response for seq in response.sequences] ... rewards = reward_fn(...) # Calculate advantages advantages = advantage_fn(rewards, num_generations=4) # 4. Policy optimization loss = actor.forward_backward( inputs=input_data, advantages=advantages ) actor.clip_grad_and_step() ``` > The GRPO method is simple and efficient, suitable for most RLHF training scenarios.