Agentic Preprocessor

The agentic preprocessor module provides a pipeline-based data quality filtering framework for multi-turn conversation datasets. It is designed for cleaning and filtering training data before RLHF / agentic fine-tuning.

QualityPreprocessor

QualityPreprocessor is a thin pipeline runner that accepts a list of filter callables and runs them in sequence. Each step receives a list of rows, returns (kept, dropped), and the pipeline logs per-step statistics.

from twinkle_agentic.preprocessor import QualityPreprocessor, HardFilter, DeadLoopFilter

pipeline = [
    HardFilter(min_user_chars=10),
    DeadLoopFilter(),
]
preprocessor = QualityPreprocessor(pipeline, dropped_log_path='dropped.jsonl')

# rows is a dict of columns (Dataset.map format)
cleaned = preprocessor(rows)

Parameters

Parameter	Type	Description
`pipeline`	`List[Callable]`	Ordered list of filter steps. Each step takes `List[Dict]` and returns `(kept, dropped)`.
`dropped_log_path`	`str`	Optional JSONL file path for logging dropped rows with step name and reason.

Built-in Filters

HardFilter

Rule-based filter that removes trivially bad rows using deterministic rules. Supports multi-language detection (EN/ZH/JA/KO).

from twinkle_agentic.preprocessor import HardFilter

f = HardFilter(
    min_user_chars=10,           # Min chars for non-CJK user query
    min_user_chars_cjk=6,        # Min chars for CJK user query
    min_assistant_chars_2turn=80, # Min assistant reply length (2-turn)
    min_thinking_chars=200,      # Min thinking chain length to exempt
    system_deny_keywords=['hack', 'exploit'],
    max_chars_per_round=50000,
    max_total_chars=200000,
    max_rounds=50,
)

Drop reasons: trivial_single_turn, shallow_reply, all_empty_assistant, system_deny_keyword, round_too_long, total_too_long, too_many_rounds

DeadLoopFilter

Detects assistant messages exhibiting hesitation or dead-loop patterns — repetitive self-corrections, cascading corrections, and high n-gram repetition.

from twinkle_agentic.preprocessor import DeadLoopFilter

f = DeadLoopFilter(
    hesitation_density_threshold=7.0,   # Markers per 1000 chars (response)
    cascade_threshold=5,                 # Cascade markers in window
    cascade_window=800,                  # Window size in chars
    repetition_threshold=0.45,           # N-gram repetition ratio
    think_hesitation_density_threshold=15.0,  # Laxer for <think> blocks
    think_repetition_threshold=0.65,
)

Uses separate threshold profiles for <think> reasoning blocks (laxer, free to ramble) and visible response (stricter).

DedupFilter

Global longest-wins deduplication. The signature is derived from the first real user turn (head+tail) and the first assistant reply.

from twinkle_agentic.preprocessor import DedupFilter

f = DedupFilter(prefix_chars=100, asst_chars=100)
kept, dropped = f(all_rows)  # Must see entire dataset in one call

Note: DedupFilter requires the full dataset in a single call. Do not place it inside QualityPreprocessor (which processes per-batch). Run it separately before or after the pipeline.

RefuseFilter

Detects self-referential refusals in the first assistant reply (e.g., “I cannot help with that”). Multi-language pattern matching (EN/ZH/JA/KO).

from twinkle_agentic.preprocessor import RefuseFilter

f = RefuseFilter(check_window=600)  # Only check first N chars

TokenSoupFilter

Detects garbled / token-soup output by checking for replacement characters, control characters, private-use Unicode, leaked special tokens, single-character repetition, and script chaos.

from twinkle_agentic.preprocessor import TokenSoupFilter

f = TokenSoupFilter(
    replacement_char_ratio=0.02,
    special_token_count=20,
    script_chaos_threshold=0.55,
)

PIIPresidioFilter

Multi-language PII detection and rewriting using Microsoft Presidio + spaCy NER + Faker. Detects and replaces personal identifiable information (names, emails, phone numbers, addresses, etc.).

from twinkle_agentic.preprocessor import PIIPresidioFilter

f = PIIPresidioFilter(languages=['en', 'zh'])

IntentClassifier

Heuristic intent classifier that tags each row with detected intents. Pluggable detector pipeline.

from twinkle_agentic.preprocessor import IntentClassifier

classifier = IntentClassifier()

Intent categories: tool_call, code, math, complex_logic, reasoning, user_dissatisfaction, other

ScoreFilter

Pluggable scorer-based filter with built-in scorers for character-level metrics, semantic similarity, and code execution.

from twinkle_agentic.preprocessor import ScoreFilter

f = ScoreFilter()

Built-in scorers: ChrMinScorer, SIFDScorer, PassNScorer, ParaphraseScorer

ModelFilter

Filters rows by model ID whitelist.

from twinkle_agentic.preprocessor import ModelFilter

f = ModelFilter(allowed_models=['qwen3.5-4b', 'qwen3.5-32b'])

MessageNormalizer

Three-pass message normalization: heartbeat stripping, tool-call rewriting, and consecutive same-role message merging.

from twinkle_agentic.preprocessor import MessageNormalizer

normalizer = MessageNormalizer()

Complete Pipeline Example

from twinkle_agentic.preprocessor import (
    QualityPreprocessor,
    HardFilter,
    DeadLoopFilter,
    RefuseFilter,
    TokenSoupFilter,
    MessageNormalizer,
    DedupFilter,
)

# Step 1: Global dedup (must run on full dataset)
dedup = DedupFilter()
rows, _ = dedup(all_rows)

# Step 2: Per-batch pipeline
pipeline = [
    HardFilter(min_user_chars=10, max_rounds=30),
    DeadLoopFilter(),
    RefuseFilter(),
    TokenSoupFilter(),
    MessageNormalizer(),
]
preprocessor = QualityPreprocessor(pipeline, dropped_log_path='dropped.jsonl')
cleaned = preprocessor(rows)