RLOOAdvantage
The RLOO (REINFORCE Leave-One-Out) advantage function computes each sample's baseline with the leave-one-out method: the baseline for a sample is the mean reward of the other samples generated for the same prompt.
Usage Example
from twinkle.advantage import RLOOAdvantage
advantage_fn = RLOOAdvantage()
rewards = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = advantage_fn(rewards, num_generations=4)
# For each sample, the baseline is the mean of the other samples in its group (groups of num_generations=4)
# First sample in first group: 0.0 - mean([1.0, 0.0, 1.0]) = 0.0 - 0.667 = -0.667
# ...
How It Works
For each sample, RLOO:
Calculates the mean reward of all other samples in the same group (the leave-one-out baseline)
Computes the advantage as the sample's reward minus that baseline
Optionally normalizes the resulting advantages
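The steps above can be sketched in plain Python. This is an illustrative re-implementation, not the code inside `RLOOAdvantage`: the function name `rloo_advantages` is hypothetical, it assumes rewards are laid out as consecutive groups of `num_generations` samples per prompt, and it skips the optional normalization step.

```python
from typing import List, Sequence


def rloo_advantages(rewards: Sequence[float], num_generations: int) -> List[float]:
    """Leave-one-out advantages (hypothetical helper, not the twinkle API).

    Assumes `rewards` holds consecutive groups of `num_generations` samples,
    one group per prompt.
    """
    if num_generations < 2:
        raise ValueError("leave-one-out needs at least 2 samples per group")
    if len(rewards) % num_generations != 0:
        raise ValueError("len(rewards) must be a multiple of num_generations")

    advantages: List[float] = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        total = sum(group)
        for r in group:
            # Baseline: mean reward of the *other* samples in the group.
            baseline = (total - r) / (num_generations - 1)
            advantages.append(r - baseline)
    return advantages


rewards = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
print([round(a, 3) for a in rloo_advantages(rewards, num_generations=4)])
# -> [-0.667, 0.667, -0.667, 0.667, 1.0, -0.333, -0.333, -0.333]
```

Subtracting the sample's own reward from the group total avoids recomputing a mean for every sample, so each group is processed in a single pass.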
Benefits of RLOO:
Avoids using the sample's own reward in its baseline, reducing bias
Gives a more accurate counterfactual baseline estimate
Performs better as the number of samples per prompt grows
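For readers who want the relationship to a plain group-mean baseline spelled out, here is a short derivation. The notation is introduced for illustration and is not taken from the library: $K$ is the number of samples per prompt, $r_i$ the reward of sample $i$, and $\bar{r}$ the mean reward of the group.

```latex
\begin{aligned}
b_i &= \frac{1}{K-1} \sum_{j \neq i} r_j = \frac{K\bar{r} - r_i}{K-1}, \\
A_i &= r_i - b_i = \frac{(K-1)\,r_i - K\bar{r} + r_i}{K-1} = \frac{K}{K-1}\,\bigl(r_i - \bar{r}\bigr).
\end{aligned}
```

Under these definitions the leave-one-out advantage equals the mean-centered reward scaled by $K/(K-1)$, and that factor approaches 1 as $K$ grows.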
Training Example
from twinkle.advantage import RLOOAdvantage
from twinkle.model import TransformersModel
from twinkle.sampler import vLLMSampler
from twinkle.reward import MathReward
# Create components
actor = TransformersModel(model_id='ms://Qwen/Qwen3.5-4B')
sampler = vLLMSampler(model_id='ms://Qwen/Qwen3.5-4B')
reward_fn = MathReward()
advantage_fn = RLOOAdvantage()
dataloader = ...
# Training loop
for batch in dataloader:
    # 1. Sample generation (generate more samples to improve RLOO effectiveness)
    response = sampler.sample(batch, num_samples=8)
    # 2. Calculate rewards
    rewards = reward_fn(response.trajectories, batch.ground_truths)
    # 3. Calculate advantages (num_generations matches num_samples above)
    advantages = advantage_fn(rewards, num_generations=8)
    # 4. Policy optimization
    loss = actor.forward_backward(
        inputs=response.inputs,
        advantages=advantages
    )
    actor.clip_grad_and_step()
RLOO has stronger theoretical grounding, but it needs more samples per prompt to work well; 8 or more samples per prompt are recommended.