Advantage

Advantage functions are components in reinforcement learning used to calculate the advantage of an action relative to the average performance. In RLHF training, advantage functions guide policy optimization.

Basic Interface

class Advantage:

    def __call__(self,
                 rewards: Union['torch.Tensor', List[float]],
                 num_generations: int = 1,
                 scale: Literal['group', 'batch', 'none'] = 'group',
                 **kwargs) -> 'torch.Tensor':
        """
        Calculate advantage values

        Args:
            rewards: List or tensor of reward values
            num_generations: Number of samples generated per prompt
            scale: Normalization method
                - 'group': Normalize per group (GRPO)
                - 'batch': Normalize across entire batch
                - 'none': No normalization

        Returns:
            Advantage tensor
        """
        ...

Available Advantage Functions

Twinkle provides two advantage function implementations:

GRPOAdvantage

GRPO (Group Relative Policy Optimization) advantage function calculates advantages by subtracting the group mean.

  • Simple and efficient, suitable for most scenarios

  • Reduces variance and improves training stability

  • Performs relative comparisons within groups

See: GRPOAdvantage

RLOOAdvantage

RLOO (Reinforcement Learning with Leave-One-Out) advantage function uses leave-one-out method to calculate baselines.

  • Theoretically superior, reduces bias

  • Requires more samples (recommend 8 or more)

  • More accurate counterfactual baseline estimation

See: RLOOAdvantage

How to Choose

  • GRPO: Suitable for scenarios with fewer samples (around 4), high computational efficiency

  • RLOO: Suitable for scenarios with more samples (8 or more), better theoretical performance

The choice of advantage function has a significant impact on RLHF training effectiveness. It’s recommended to choose based on computational resources and sample quantity.