# Advantage

Advantage functions are reinforcement learning components that estimate how much better an action is than a baseline, such as the average reward. In RLHF training, these advantage values guide policy optimization.

## Basic Interface

```python
from typing import List, Literal, Union


class Advantage:

    def __call__(self,
                 rewards: Union['torch.Tensor', List[float]],
                 num_generations: int = 1,
                 scale: Literal['group', 'batch', 'none'] = 'group',
                 **kwargs) -> 'torch.Tensor':
        """
        Calculate advantage values

        Args:
            rewards: List or tensor of reward values
            num_generations: Number of samples generated per prompt
            scale: Normalization method
                - 'group': Normalize per group (GRPO)
                - 'batch': Normalize across entire batch
                - 'none': No normalization

        Returns:
            Advantage tensor
        """
        ...
```

## Available Advantage Functions

Twinkle provides two advantage function implementations:

### GRPOAdvantage

The GRPO (Group Relative Policy Optimization) advantage function calculates advantages by subtracting the group mean, where a group is the set of samples generated for the same prompt.

- Simple and efficient, suitable for most scenarios
- Reduces variance and improves training stability
- Performs relative comparisons within groups

See: [GRPOAdvantage](GRPOAdvantage.md)

### RLOOAdvantage

The RLOO (REINFORCE Leave-One-Out) advantage function uses a leave-one-out baseline: the baseline for each sample is the mean reward of the other samples in its group.

- Theoretically superior, reduces bias, because a sample's baseline does not depend on that sample's own reward
- Requires more samples per prompt (8 or more recommended)
- More accurate counterfactual baseline estimation

See: [RLOOAdvantage](RLOOAdvantage.md)

## How to Choose

- **GRPO**: Suits scenarios with fewer samples per prompt (around 4); computationally efficient
- **RLOO**: Suits scenarios with more samples per prompt (8 or more); better theoretical performance

> The choice of advantage function has a significant impact on RLHF training effectiveness. Choose based on available computational resources and the number of samples per prompt.
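
To make the difference between the two baselines concrete, here is a minimal pure-Python sketch. It is not Twinkle's implementation: the function names `group_mean_advantages` and `leave_one_out_advantages` are hypothetical, rewards are assumed to be laid out so that each consecutive run of `num_generations` values belongs to one prompt, and the `scale` normalization is not modeled.

```python
from typing import List


def _split_groups(rewards: List[float], num_generations: int) -> List[List[float]]:
    """Split a flat reward list into consecutive groups, one group per prompt."""
    if num_generations < 1 or len(rewards) % num_generations != 0:
        raise ValueError("rewards must split evenly into groups of num_generations")
    return [rewards[i:i + num_generations]
            for i in range(0, len(rewards), num_generations)]


def group_mean_advantages(rewards: List[float], num_generations: int) -> List[float]:
    """Group-mean baseline: subtract the mean reward of the sample's own group."""
    advantages: List[float] = []
    for group in _split_groups(rewards, num_generations):
        mean = sum(group) / len(group)
        advantages.extend(r - mean for r in group)
    return advantages


def leave_one_out_advantages(rewards: List[float], num_generations: int) -> List[float]:
    """Leave-one-out baseline: subtract the mean reward of the *other* samples."""
    if num_generations < 2:
        raise ValueError("leave-one-out needs at least 2 samples per prompt")
    advantages: List[float] = []
    for group in _split_groups(rewards, num_generations):
        total = sum(group)
        n = len(group)
        # Baseline for sample i excludes its own reward r_i.
        advantages.extend(r - (total - r) / (n - 1) for r in group)
    return advantages


# Two prompts, four sampled responses each (illustrative reward values).
example_rewards = [1.0, 0.0, 0.0, 1.0, 2.0, 2.0, 4.0, 0.0]
grpo_style = group_mean_advantages(example_rewards, num_generations=4)
rloo_style = leave_one_out_advantages(example_rewards, num_generations=4)
```

In this unscaled form, the two results differ only by a factor of `n / (n - 1)` within each group, where `n` is `num_generations`. Any normalization applied afterwards, such as the `scale` options described above, is outside the scope of this sketch.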