Built-in Preprocessors

Twinkle provides a collection of built-in preprocessors for common dataset formats. Each converts raw data into standardized Trajectory objects.

LLM Preprocessors

CompetitionMathProcessor

Converts competition math datasets with problem and solution fields.

dataset.map('CompetitionMathProcessor')
# Input: {'problem': '...', 'solution': '...'}
# Output: Trajectory with user message (problem) and assistant message (solution)

CompetitionMathGRPOProcessor

Similar to CompetitionMathProcessor but stores the solution in user_data for use as ground truth in GRPO reward computation.

dataset.map('CompetitionMathGRPOProcessor')

SelfCognitionProcessor

Replaces template placeholders with model identity information for self-cognition training.

dataset.map('SelfCognitionProcessor', model_name='MyModel', model_author='MyOrg')

AlpacaProcessor

Converts Alpaca-format datasets with instruction, input, and output fields.

dataset.map('AlpacaProcessor')
# Input: {'instruction': '...', 'input': '...', 'output': '...'}

CountdownProcessor

Generates countdown arithmetic problems for reasoning training.

dataset.map('CountdownProcessor')

GSM8KProcessor

Preprocesses GSM8K math datasets, extracting ground truth answers from the #### answer format.

dataset.map('GSM8KProcessor')
# Extracts answer from '#### 42' format and stores in user_data

DPO Preprocessor

EmojiDPOProcessor

Converts emoji-based preference datasets into positive/negative trajectory pairs for DPO training.

dataset.map('EmojiDPOProcessor')
# Input: {'prompt': '...', 'chosen': '...', 'rejected': '...'}
# Output: Interleaved chosen and rejected Trajectory pairs

Multimodal Preprocessors

CLEVRProcessor

Preprocesses CLEVR visual reasoning datasets with image handling.

dataset.map('CLEVRProcessor')
# Input: {'question': '...', 'answer': '...', 'image': PIL.Image}
# Output: Trajectory with multimodal content (image + text)

OlympiadBenchProcessor

Preprocesses OlympiadBench multimodal math/physics problems with image collection and metadata storage.

dataset.map('OlympiadBenchProcessor')
# Handles multiple images per problem, stores ground truth and metadata in user_data

All preprocessors follow the same interface: __call__(rows) -> List[Trajectory]. You can register custom preprocessors following the same pattern (see Preprocessor).