Reward

Reward functions are the components of RLHF training that evaluate the quality of model outputs. They compute reward scores from model-generated trajectories, and those scores guide policy learning.

Basic Interface

from typing import List

from twinkle.data_format import Trajectory


class Reward:

    def __call__(self, trajectories: List[Trajectory], ground_truths: List[Trajectory]) -> List[float]:
        """
        Calculate reward values.

        Args:
            trajectories: List of model-generated trajectories
            ground_truths: List of ground truth trajectories

        Returns:
            List of reward values, one per trajectory
        """
        ...

MathReward

The math reward function evaluates the correctness of answers to mathematical problems.

from twinkle.reward import MathReward

reward_fn = MathReward()
rewards = reward_fn(generated_trajectories, ground_truth_trajectories)
# rewards: List[float], 1.0 for correct, 0.0 for incorrect
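
To make the 1.0/0.0 scoring concrete, the sketch below shows one way such a correctness check could work on plain answer strings. It is not MathReward's actual implementation: the helper names (`extract_final_answer`, `answers_match`, `math_reward_sketch`), the `\boxed{...}` answer convention, and the numeric comparison via `fractions.Fraction` are all assumptions made for illustration.

```python
import re
from fractions import Fraction
from typing import List, Optional


def extract_final_answer(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in text, or None (assumed convention)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def answers_match(predicted: str, expected: str) -> bool:
    """Compare numerically when both sides parse as numbers, else as strings."""
    try:
        return Fraction(predicted) == Fraction(expected)
    except (ValueError, ZeroDivisionError):
        # Not numeric: fall back to a whitespace-insensitive string comparison
        return predicted.replace(" ", "") == expected.replace(" ", "")


def math_reward_sketch(outputs: List[str], expected_answers: List[str]) -> List[float]:
    """Hypothetical stand-in: 1.0 for a matching final answer, 0.0 otherwise."""
    rewards = []
    for output, expected in zip(outputs, expected_answers):
        predicted = extract_final_answer(output)
        correct = predicted is not None and answers_match(predicted, expected)
        rewards.append(1.0 if correct else 0.0)
    return rewards


outputs = ["The answer is \\boxed{0.5}", "so \\boxed{7}", "no answer given"]
print(math_reward_sketch(outputs, ["1/2", "8", "3"]))  # [1.0, 0.0, 0.0]
```

The numeric comparison means `0.5` and `1/2` count as the same answer, while an output with no boxed answer always scores 0.0.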

FormatReward

The format reward function checks whether the output conforms to a specified format.

from twinkle.reward import FormatReward

reward_fn = FormatReward()
rewards = reward_fn(trajectories, ground_truths)
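
As an illustration of what a format check can look like, the sketch below scores plain output strings against a regular expression. The target format (a `<think>` block followed by an `<answer>` block) and the names `FORMAT_PATTERN` and `format_reward_sketch` are assumptions; the format that FormatReward actually enforces is not specified here.

```python
import re
from typing import List

# Assumed target format: reasoning in <think> tags, then the result in <answer> tags
FORMAT_PATTERN = re.compile(
    r"\s*<think>.+?</think>\s*<answer>.+?</answer>\s*", re.DOTALL
)


def format_reward_sketch(outputs: List[str]) -> List[float]:
    """Return 1.0 for each output that matches FORMAT_PATTERN in full, else 0.0."""
    return [1.0 if FORMAT_PATTERN.fullmatch(text) else 0.0 for text in outputs]


outputs = [
    "<think>2+2</think>\n<answer>4</answer>",       # well formed
    "<answer>4</answer>",                           # missing the <think> block
    "<think>x</think><answer>4</answer> trailing",  # text after the answer
]
print(format_reward_sketch(outputs))  # [1.0, 0.0, 0.0]
```

Using `fullmatch` rather than `search` rejects outputs that contain the expected tags but surround them with extra text.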

Custom Reward Functions

You can create a custom reward either by subclassing the Reward base class or by writing a plain function with the same call signature:

from twinkle.reward import Reward
from twinkle.data_format import Trajectory
from typing import List

class CustomReward(Reward):

    def __call__(self, trajectories: List[Trajectory], ground_truths: List[Trajectory]) -> List[float]:
        rewards = []
        for traj, gt in zip(trajectories, ground_truths):
            # Custom evaluation logic
            score = self._evaluate(traj, gt)
            rewards.append(score)
        return rewards

    def _evaluate(self, traj: Trajectory, gt: Trajectory) -> float:
        # Implement the task-specific scoring logic and return a float score
        ...

Or using a function:

def my_reward(trajectories, ground_truths):
    # Exact-match reward: 1.0 when a trajectory equals its ground truth, else 0.0
    return [1.0 if t == gt else 0.0 for t, gt in zip(trajectories, ground_truths)]

# Use in training
rewards = my_reward(generated, ground_truths)
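
Because class-based and function-based rewards are both callables with the same signature, several of them can be combined into one. The sketch below shows a weighted sum; `combine_rewards`, `exact_match`, and `non_empty` are hypothetical helpers rather than part of twinkle, and plain strings stand in for trajectories.

```python
from typing import Callable, List, Sequence, Tuple

RewardFn = Callable[[Sequence, Sequence], List[float]]


def combine_rewards(weighted_fns: List[Tuple[RewardFn, float]]) -> RewardFn:
    """Build a reward callable that returns the weighted sum of several rewards."""

    def combined(trajectories: Sequence, ground_truths: Sequence) -> List[float]:
        totals = [0.0] * len(trajectories)
        for fn, weight in weighted_fns:
            scores = fn(trajectories, ground_truths)
            if len(scores) != len(trajectories):
                raise ValueError("each reward must return one score per trajectory")
            totals = [total + weight * score for total, score in zip(totals, scores)]
        return totals

    return combined


def exact_match(trajectories, ground_truths):
    return [1.0 if t == gt else 0.0 for t, gt in zip(trajectories, ground_truths)]


def non_empty(trajectories, ground_truths):
    # Ignores ground truths; only checks that something was generated
    return [1.0 if t else 0.0 for t in trajectories]


reward_fn = combine_rewards([(exact_match, 0.75), (non_empty, 0.25)])
print(reward_fn(["4", "5", ""], ["4", "4", "4"]))  # [1.0, 0.25, 0.0]
```

The weights here are arbitrary; in practice they decide how much partial credit a secondary signal receives relative to correctness.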

Usage Scenarios

A typical workflow for reward functions in RLHF training:

from twinkle.sampler import vLLMSampler
from twinkle.reward import MathReward
from twinkle.advantage import GRPOAdvantage

sampler = vLLMSampler(model_id='ms://Qwen/Qwen3.5-4B')
reward_fn = MathReward()
advantage_fn = GRPOAdvantage()

for batch in dataloader:
    # 1. Sample and generate multiple candidate answers
    response = sampler.sample(batch, num_samples=4)

    # 2. Evaluate quality using reward function
    rewards = reward_fn(response.trajectories, batch.ground_truths)

    # 3. Calculate advantages
    advantages = advantage_fn(rewards, num_generations=4)

    # 4. Update policy using advantage values
    ...
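
Step 3 above can be illustrated with a small sketch of group-relative advantages, the idea behind GRPO: each reward is compared with the other samples drawn for the same prompt. This is a generic version, not necessarily what GRPOAdvantage computes. The function name `group_relative_advantages`, the use of the population standard deviation, the `eps` term, and the assumption that the samples for each prompt are stored contiguously are all choices made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(
    rewards: List[float], num_generations: int, eps: float = 1e-6
) -> List[float]:
    """Normalize each reward against the other samples for the same prompt."""
    if num_generations <= 0 or len(rewards) % num_generations != 0:
        raise ValueError("len(rewards) must be a multiple of num_generations")
    advantages: List[float] = []
    for start in range(0, len(rewards), num_generations):
        # Assumes the num_generations samples for one prompt are contiguous
        group = rewards[start:start + num_generations]
        baseline = mean(group)
        spread = pstdev(group)
        # eps avoids division by zero when all rewards in a group are equal
        advantages.extend((r - baseline) / (spread + eps) for r in group)
    return advantages


# Two prompts with four samples each
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
advantages = group_relative_advantages(rewards, num_generations=4)
```

In the first group, correct samples receive a positive advantage and incorrect ones a negative advantage. In the second group every sample has the same reward, so all advantages are zero and that prompt contributes no learning signal.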

Reward function design is crucial to the effectiveness of RLHF. A good reward function accurately reflects the task objective and provides a clear learning signal, meaning it assigns distinguishably different scores to better and worse outputs.