GSM8K Reward

Reward functions specifically designed for evaluating GSM8K math problem solutions.

GSM8KAccuracyReward

Evaluates the correctness of GSM8K answers: it extracts the final answer from a boxed (\boxed{...}) or hash-formatted (#### ...) section and compares it with the ground truth by numeric or string comparison.

from twinkle.reward import GSM8KAccuracyReward

reward_fn = GSM8KAccuracyReward()
rewards = reward_fn(generated_trajectories, ground_truth_trajectories)
# rewards: List[float], 1.0 for correct, 0.0 for incorrect

The reward function:

1. Extracts the answer from \boxed{...} or #### ... format in the model’s completion.
2. Extracts the ground truth answer from the reference trajectory.
3. Performs numeric comparison (with tolerance) or exact string matching.
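
The steps above can be sketched in plain Python. This is an illustration only, not the twinkle implementation: the helper names (extract_answer, to_number, accuracy_reward), the regular expressions, the stripping of commas and dollar signs, and the default tolerance of 1e-6 are all assumptions, and the sketch works on raw strings rather than trajectory objects.

```python
import re
from typing import Optional


def extract_answer(text: str) -> Optional[str]:
    # Hypothetical helper: prefer the last \boxed{...}, else the last "#### ..." line.
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", text)
    if boxed:
        return boxed[-1].strip()
    hashed = re.findall(r"####[ \t]*(.+)", text)
    if hashed:
        return hashed[-1].strip()
    return None


def to_number(answer: str) -> Optional[float]:
    # Assumption: thousands separators and dollar signs are ignored before parsing.
    cleaned = answer.replace(",", "").replace("$", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def accuracy_reward(completion: str, reference: str, tol: float = 1e-6) -> float:
    predicted = extract_answer(completion)
    expected = extract_answer(reference)
    if predicted is None or expected is None:
        return 0.0  # nothing to compare
    p, e = to_number(predicted), to_number(expected)
    if p is not None and e is not None:
        # Numeric comparison with an absolute tolerance (assumed value).
        return 1.0 if abs(p - e) <= tol else 0.0
    # Fall back to exact string matching when either side is not numeric.
    return 1.0 if predicted == expected else 0.0
```

Under these assumptions, a completion ending in \boxed{1,200} matches a reference ending in #### 1200, while a completion with no extractable answer scores 0.0.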

GSM8KFormatReward

Checks whether the model output contains a properly formatted answer section.

from twinkle.reward import GSM8KFormatReward

reward_fn = GSM8KFormatReward()
rewards = reward_fn(trajectories, ground_truths)
# rewards: List[float], 1.0 if format is valid, 0.0 otherwise
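
To make "properly formatted" concrete, here is a minimal sketch of one possible check. It assumes a completion counts as valid when it contains either a non-empty \boxed{...} or a line starting with #### followed by an answer; the actual criteria used by GSM8KFormatReward may differ, and format_reward is a hypothetical name.

```python
import re

# Assumed patterns for a valid answer section; not taken from the library.
ANSWER_PATTERNS = (
    re.compile(r"\\boxed\{[^{}]+\}"),
    re.compile(r"^####[ \t]*\S+", re.MULTILINE),
)


def format_reward(completion: str) -> float:
    # 1.0 if any assumed answer pattern appears in the completion, else 0.0.
    return 1.0 if any(p.search(completion) for p in ANSWER_PATTERNS) else 0.0
```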

Use GSM8KAccuracyReward and GSM8KFormatReward together as a composite reward for GRPO training on math problem-solving tasks.
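
One simple way to form such a composite is a weighted sum of the two per-sample reward lists. The sketch below assumes both reward functions return lists of equal length, as in the examples above; combine_rewards and its default weights are hypothetical and not part of twinkle.

```python
from typing import List, Sequence


def combine_rewards(
    accuracy: Sequence[float],
    fmt: Sequence[float],
    accuracy_weight: float = 1.0,  # assumed weight
    format_weight: float = 0.5,    # assumed weight
) -> List[float]:
    # Both lists must describe the same batch of samples.
    if len(accuracy) != len(fmt):
        raise ValueError("reward lists must have the same length")
    return [accuracy_weight * a + format_weight * f for a, f in zip(accuracy, fmt)]
```

With these weights, a correct and well-formatted sample scores 1.5, a well-formatted but incorrect one scores 0.5, and a sample that fails both scores 0.0. Weighting format below accuracy keeps correctness the dominant signal.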