Advantage
Advantage functions are reinforcement learning components that measure how much better a sampled action's reward is than a baseline, such as the average reward of comparable samples. In RLHF training, these advantage values weight the policy update and so guide policy optimization.
Basic Interface
from typing import List, Literal, Union


class Advantage:

    def __call__(self,
                 rewards: Union['torch.Tensor', List[float]],
                 num_generations: int = 1,
                 scale: Literal['group', 'batch', 'none'] = 'group',
                 **kwargs) -> 'torch.Tensor':
        """
        Calculate advantage values.

        Args:
            rewards: List or tensor of reward values
            num_generations: Number of samples generated per prompt
            scale: Normalization method
                - 'group': Normalize per group (GRPO)
                - 'batch': Normalize across entire batch
                - 'none': No normalization

        Returns:
            Advantage tensor
        """
        ...
Available Advantage Functions
Twinkle provides two advantage function implementations:
GRPOAdvantage
The GRPO (Group Relative Policy Optimization) advantage function computes each sample's advantage by subtracting the mean reward of its group, that is, of the samples generated for the same prompt.
Simple and efficient, suitable for most scenarios
Reduces variance and improves training stability
Performs relative comparisons within groups
See: GRPOAdvantage
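The group-mean idea can be sketched in a few lines. This is a minimal illustration, not Twinkle's GRPOAdvantage: the function name `grpo_advantages`, the `eps` term, the use of the population standard deviation, and the assumption that rewards for one prompt sit next to each other in the input are all hypothetical choices made here. It uses plain Python lists instead of tensors so that it runs without dependencies.

```python
from statistics import fmean, pstdev
from typing import List


def grpo_advantages(rewards: List[float],
                    num_generations: int,
                    scale: str = "group",
                    eps: float = 1e-8) -> List[float]:
    """Hypothetical helper: group-mean baseline with optional normalization."""
    if len(rewards) % num_generations != 0:
        raise ValueError("len(rewards) must be a multiple of num_generations")

    # Assumption: each consecutive chunk of `num_generations` rewards
    # belongs to one prompt.
    groups = [rewards[i:i + num_generations]
              for i in range(0, len(rewards), num_generations)]

    # Subtract each group's mean reward from its members.
    centered = [[r - fmean(g) for r in g] for g in groups]

    if scale == "group":
        # Divide by the (population) standard deviation of each group.
        centered = [[a / (pstdev(g) + eps) for a in c]
                    for g, c in zip(groups, centered)]
    elif scale == "batch":
        # Divide by the standard deviation of all centered values.
        batch_std = pstdev([a for c in centered for a in c])
        centered = [[a / (batch_std + eps) for a in c] for c in centered]
    elif scale != "none":
        raise ValueError(f"unknown scale: {scale!r}")

    # Flatten back to the input layout.
    return [a for c in centered for a in c]
```

With two prompts and two samples each, `grpo_advantages([1.0, 3.0, 2.0, 6.0], num_generations=2, scale="none")` centers the groups `[1.0, 3.0]` and `[2.0, 6.0]` around their means 2.0 and 4.0 separately.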
RLOOAdvantage
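The RLOO (REINFORCE Leave-One-Out) advantage function uses a leave-one-out method to calculate baselines: the baseline for each sample is the mean reward of the other samples in its group.
Theoretically better motivated: each sample's baseline excludes its own reward, which reduces bias
Requires more samples (8 or more recommended)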
More accurate counterfactual baseline estimation
See: RLOOAdvantage
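The leave-one-out baseline can be sketched as follows. This is an illustration under stated assumptions rather than Twinkle's RLOOAdvantage: the function name `rloo_advantages` is hypothetical, rewards for one prompt are assumed to be adjacent in the input, and no normalization is applied. Plain Python lists stand in for tensors.

```python
from typing import List


def rloo_advantages(rewards: List[float], num_generations: int) -> List[float]:
    """Hypothetical helper: leave-one-out baseline per group, no normalization."""
    if num_generations < 2:
        raise ValueError("leave-one-out needs at least 2 samples per prompt")
    if len(rewards) % num_generations != 0:
        raise ValueError("len(rewards) must be a multiple of num_generations")

    advantages: List[float] = []
    for start in range(0, len(rewards), num_generations):
        # Assumption: this consecutive chunk holds all samples for one prompt.
        group = rewards[start:start + num_generations]
        total = sum(group)
        for r in group:
            # Baseline is the mean reward of the other samples in the group.
            baseline = (total - r) / (num_generations - 1)
            advantages.append(r - baseline)
    return advantages
```

For a single group `[1.0, 2.0, 6.0]`, the baselines are 4.0, 3.5, and 1.5, so the sample with reward 6.0 is compared against the mean of 1.0 and 2.0 only.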
How to Choose
GRPO: Suitable when few samples are generated per prompt (around 4); computationally efficient
RLOO: Suitable when more samples are generated per prompt (8 or more); better theoretical properties
The choice of advantage function significantly affects RLHF training effectiveness. Choose based on the available computational resources and the number of samples generated per prompt.
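To make the comparison concrete, the two baselines can be related algebraically. The sketch below assumes both estimators are applied without normalization to a single group of K rewards for one prompt. The symbols K, r_i, \bar{r}, and b_i are notation introduced here for illustration and are not taken from Twinkle.

```latex
\begin{aligned}
\bar{r} &= \frac{1}{K}\sum_{j=1}^{K} r_j, &
A_i^{\mathrm{GRPO}} &= r_i - \bar{r}, \\
b_i &= \frac{1}{K-1}\sum_{j \neq i} r_j = \frac{K\bar{r} - r_i}{K-1}, &
A_i^{\mathrm{RLOO}} &= r_i - b_i = \frac{K}{K-1}\,\bigl(r_i - \bar{r}\bigr).
\end{aligned}
```

Under these assumptions the two advantages differ only by the factor K/(K-1), which is 4/3 for K = 4 and 8/7 for K = 8, so the gap narrows as the group grows. Dividing by a per-group standard deviation would cancel this constant factor.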