Advantage

An advantage function estimates how much better a sampled action is than a baseline, typically the average reward of comparable samples. In RLHF training, the resulting advantages weight the policy update, so they directly guide policy optimization.

Basic Interface

from typing import List, Literal, Union

class Advantage:

    def __call__(self,
                 rewards: Union['torch.Tensor', List[float]],
                 num_generations: int = 1,
                 scale: Literal['group', 'batch', 'none'] = 'group',
                 **kwargs) -> 'torch.Tensor':
        """
        Calculate advantage values

        Args:
            rewards: List or tensor of reward values
            num_generations: Number of samples generated per prompt
            scale: Normalization method
                - 'group': Normalize per group (GRPO)
                - 'batch': Normalize across entire batch
                - 'none': No normalization

        Returns:
            Advantage tensor
        """
        ...
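
The interface does not spell out what each scale mode computes. One plausible reading, sketched below in plain Python, is that mean-centered rewards are divided by a standard deviation taken over the group, over the whole batch, or not at all. The helper name scale_advantages, the eps term, the use of the population standard deviation, and the assumption that rewards are laid out as consecutive blocks of num_generations entries are all illustrative assumptions, not Twinkle's actual implementation.

```python
from statistics import pstdev
from typing import List


def scale_advantages(centered: List[float],
                     num_generations: int,
                     scale: str = "group",
                     eps: float = 1e-8) -> List[float]:
    """Hypothetical helper: rescale already mean-centered rewards.

    Assumes `centered` holds consecutive blocks of `num_generations`
    values, one block per prompt.
    """
    if scale == "none":
        # Leave the centered values untouched.
        return list(centered)
    if scale == "batch":
        # One standard deviation shared by every sample in the batch.
        std = pstdev(centered)
        return [c / (std + eps) for c in centered]
    if scale == "group":
        # A separate standard deviation for each prompt's group.
        scaled: List[float] = []
        for start in range(0, len(centered), num_generations):
            group = centered[start:start + num_generations]
            std = pstdev(group)
            scaled.extend(c / (std + eps) for c in group)
        return scaled
    raise ValueError(f"unknown scale mode: {scale!r}")
```

Under this reading, 'group' puts every prompt's advantages on a comparable scale, while 'batch' preserves the differences in reward spread between prompts.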

Available Advantage Functions

Twinkle provides two advantage function implementations:

GRPOAdvantage

The GRPO (Group Relative Policy Optimization) advantage function computes each sample's advantage by subtracting the mean reward of its group, that is, of the samples generated for the same prompt.

Simple and efficient, suitable for most scenarios
Reduces variance and improves training stability
Performs relative comparisons within groups

See: GRPOAdvantage
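
As a rough illustration of the group-mean subtraction, here is a minimal plain-Python sketch. The function name grpo_advantages and the flat reward layout (consecutive blocks of num_generations entries per prompt) are assumptions made for this example. The actual GRPOAdvantage class works on tensors and may differ in detail.

```python
from typing import List


def grpo_advantages(rewards: List[float], num_generations: int) -> List[float]:
    """Illustrative only: subtract each group's mean reward from its members."""
    if num_generations < 1 or len(rewards) % num_generations != 0:
        raise ValueError("rewards must split evenly into groups")
    advantages: List[float] = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mean = sum(group) / len(group)
        # Positive values mark samples that beat their group's average.
        advantages.extend(r - mean for r in group)
    return advantages


# Two prompts with four generations each.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0], num_generations=4))
# -> [0.5, -0.5, -0.5, 0.5, -1.5, -0.5, 0.5, 1.5]
```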

RLOOAdvantage

The RLOO (REINFORCE Leave-One-Out) advantage function uses a leave-one-out baseline: each sample is compared against the mean reward of the other samples in its group.

Theoretically better grounded: the baseline excludes the sample's own reward, which reduces bias
Requires more samples per prompt (8 or more recommended)
More accurate counterfactual baseline estimation

See: RLOOAdvantage
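
The leave-one-out baseline can be sketched in a few lines of plain Python. The function name rloo_advantages and the flat reward layout (consecutive blocks of num_generations entries per prompt) are assumptions made for this example, and the actual RLOOAdvantage class may differ. Note that at least two samples per prompt are needed, since the baseline averages over the other samples.

```python
from typing import List


def rloo_advantages(rewards: List[float], num_generations: int) -> List[float]:
    """Illustrative only: baseline each sample with the mean of its group peers."""
    if num_generations < 2:
        raise ValueError("leave-one-out needs at least 2 samples per prompt")
    if len(rewards) % num_generations != 0:
        raise ValueError("rewards must split evenly into groups")
    advantages: List[float] = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        total = sum(group)
        for r in group:
            # Mean reward of the other K - 1 samples in the same group.
            baseline = (total - r) / (num_generations - 1)
            advantages.append(r - baseline)
    return advantages


# One prompt with four generations; each baseline averages the other three.
print(rloo_advantages([1.0, 0.0, 0.0, 1.0], num_generations=4))
```

For a group of size K, this value equals K / (K - 1) times the reward minus the full group mean, so the difference from plain group-mean subtraction is a scale factor that shrinks toward 1 as K grows.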

How to Choose

GRPO: Suitable when few samples are generated per prompt (around 4); computationally efficient
RLOO: Suitable when more samples are generated per prompt (8 or more); better theoretical properties

The choice of advantage function has a significant impact on RLHF training effectiveness. Choose based on the available computational resources and the number of samples generated per prompt.