Reward
Reward functions are components in RLHF training used to evaluate the quality of model outputs. They calculate reward scores based on model-generated trajectories to guide policy learning.
Basic Interface
class Reward:
def __call__(self, trajectories: List[Trajectory], ground_truths: List[Trajectory]):
"""
Calculate reward values
Args:
trajectories: List of model-generated trajectories
ground_truths: List of ground truth trajectories
Returns:
List of reward values
"""
...
MathReward
The math reward function evaluates the correctness of answers to mathematical problems.
from twinkle.reward import MathReward
reward_fn = MathReward()
rewards = reward_fn(generated_trajectories, ground_truth_trajectories)
# rewards: List[float], 1.0 for correct, 0.0 for incorrect
FormatReward
The format reward function checks whether the output conforms to a specified format.
from twinkle.reward import FormatReward
reward_fn = FormatReward()
rewards = reward_fn(trajectories, ground_truths)
Custom Reward Functions
You can create custom rewards by inheriting from the Reward base class or using functions:
from twinkle.reward import Reward
from twinkle.data_format import Trajectory
from typing import List
class CustomReward(Reward):
def __call__(self, trajectories: List[Trajectory], ground_truths: List[Trajectory]):
rewards = []
for traj, gt in zip(trajectories, ground_truths):
# Custom evaluation logic
score = self._evaluate(traj, gt)
rewards.append(score)
return rewards
def _evaluate(self, traj, gt):
# Implement specific evaluation logic
...
Or using a function:
def my_reward(trajectories, ground_truths):
return [1.0 if t == gt else 0.0 for t, gt in zip(trajectories, ground_truths)]
# Use in training
rewards = my_reward(generated, ground_truths)
Usage Scenarios
Typical workflow of reward functions in RLHF training:
from twinkle.sampler import vLLMSampler
from twinkle.reward import MathReward
from twinkle.advantage import GRPOAdvantage
sampler = vLLMSampler(model_id='ms://Qwen/Qwen3.5-4B')
reward_fn = MathReward()
advantage_fn = GRPOAdvantage()
for batch in dataloader:
# 1. Sample and generate multiple candidate answers
response = sampler.sample(batch, num_samples=4)
# 2. Evaluate quality using reward function
rewards = reward_fn(response.trajectories, batch.ground_truths)
# 3. Calculate advantages
advantages = advantage_fn(rewards, num_generations=4)
# 4. Update policy using advantage values
...
The design of reward functions is crucial for RLHF effectiveness. A good reward function should accurately reflect the task objectives and provide clear learning signals.