GSM8K Reward

Reward functions for evaluating solutions to GSM8K math problems.

GSM8KAccuracyReward

Evaluates the correctness of GSM8K answers by extracting the final answer, written either as \boxed{...} or after a #### marker, and comparing it with the ground truth numerically or as a string.

from twinkle.reward import GSM8KAccuracyReward

reward_fn = GSM8KAccuracyReward()
rewards = reward_fn(generated_trajectories, ground_truth_trajectories)
# rewards: List[float], 1.0 for correct, 0.0 for incorrect

The reward function:

  1. Extracts the answer from \boxed{...} or #### ... format in the model’s completion

  2. Extracts the ground truth answer from the reference trajectory

  3. Performs numeric comparison (with tolerance) or exact string matching
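The three steps above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper names (extract_answer, answers_match, accuracy_reward), the regular expressions, the 1e-6 tolerance, and the choice to take the last match are all assumptions made for the example. It also assumes completions are plain strings, and it does not handle nested braces such as \boxed{\frac{1}{2}}.

```python
import re
from typing import Optional


def extract_answer(text: str) -> Optional[str]:
    """Return the last \\boxed{...} answer, else the last '#### ...' answer, else None."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", text)
    if boxed:
        return boxed[-1].strip()
    hashed = re.findall(r"####\s*([^\n]+)", text)
    if hashed:
        return hashed[-1].strip()
    return None


def _to_number(value: str) -> Optional[float]:
    """Parse a number, ignoring thousands separators and a dollar sign."""
    cleaned = value.replace(",", "").replace("$", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def answers_match(pred: Optional[str], truth: Optional[str], tol: float = 1e-6) -> bool:
    """Compare numerically when both sides parse as numbers, else by exact string."""
    if pred is None or truth is None:
        return False
    p, t = _to_number(pred), _to_number(truth)
    if p is not None and t is not None:
        return abs(p - t) <= tol  # hypothetical tolerance
    return pred.strip() == truth.strip()


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted answers agree, 0.0 otherwise."""
    matched = answers_match(extract_answer(completion), extract_answer(reference))
    return 1.0 if matched else 0.0
```

Trying the numeric path before the string path lets "1,234" and "1234" count as the same answer, while non-numeric answers still require an exact match.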

GSM8KFormatReward

Checks whether the model output contains a properly formatted answer section.

from twinkle.reward import GSM8KFormatReward

reward_fn = GSM8KFormatReward()
rewards = reward_fn(trajectories, ground_truths)
# rewards: List[float], 1.0 if format is valid, 0.0 otherwise
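The exact criteria for a "properly formatted answer section" are not spelled out here. As a rough sketch, assuming the check simply looks for a non-empty \boxed{...} or a #### marker followed by content, a format reward could look like the following. The function name format_reward and both patterns are hypothetical, not the library's own.

```python
import re

# Assumed patterns: a non-empty \boxed{...} or '####' followed by some token.
_BOXED = re.compile(r"\\boxed\{[^{}]+\}")
_HASHED = re.compile(r"####\s*\S+")


def format_reward(completion: str) -> float:
    """1.0 if the completion contains an answer in a recognized format, else 0.0."""
    has_answer = _BOXED.search(completion) or _HASHED.search(completion)
    return 1.0 if has_answer else 0.0
```

Unlike the accuracy reward, this check never looks at the ground truth: it rewards the shape of the output, not its correctness.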

Use GSM8KAccuracyReward and GSM8KFormatReward together as a composite reward for GRPO training on math problem-solving tasks.
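One simple way to form such a composite is a weighted sum of the two per-sample reward lists. The sketch below is an assumption about how the combination might be done, not a documented twinkle API: the helper combine_rewards and the default weights (1.0 for accuracy, 0.5 for format) are hypothetical and would normally be tuned.

```python
from typing import List, Sequence


def combine_rewards(
    accuracy: Sequence[float],
    fmt: Sequence[float],
    accuracy_weight: float = 1.0,  # hypothetical weight
    format_weight: float = 0.5,    # hypothetical weight
) -> List[float]:
    """Weighted per-sample sum of accuracy and format rewards."""
    if len(accuracy) != len(fmt):
        raise ValueError("reward lists must have the same length")
    return [accuracy_weight * a + format_weight * f for a, f in zip(accuracy, fmt)]


# Example: the first sample is correct and well formatted,
# the second is well formatted but wrong.
combined = combine_rewards([1.0, 0.0], [1.0, 1.0])
```

Weighting accuracy above format keeps a well-formatted wrong answer from scoring as high as a correct one, while still giving the policy a signal to emit a parseable answer.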