# Agentic Preprocessor

The agentic preprocessor module provides a pipeline-based data quality filtering framework for multi-turn conversation datasets. It is designed for cleaning and filtering training data before RLHF / agentic fine-tuning.

## QualityPreprocessor

`QualityPreprocessor` is a thin pipeline runner that accepts a list of filter callables and runs them in sequence. Each step receives a list of rows, returns `(kept, dropped)`, and the pipeline logs per-step statistics.

```python
from twinkle_agentic.preprocessor import QualityPreprocessor, HardFilter, DeadLoopFilter

pipeline = [
    HardFilter(min_user_chars=10),
    DeadLoopFilter(),
]
preprocessor = QualityPreprocessor(pipeline, dropped_log_path='dropped.jsonl')

# rows is a dict of columns (Dataset.map format)
cleaned = preprocessor(rows)
```

### Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `pipeline` | `List[Callable]` | Ordered list of filter steps. Each step takes `List[Dict]` and returns `(kept, dropped)`. |
| `dropped_log_path` | `str` | Optional JSONL file path for logging dropped rows with step name and reason. |

## Built-in Filters

### HardFilter

Rule-based filter that removes trivially bad rows using deterministic rules. Supports multi-language detection (EN/ZH/JA/KO).

```python
from twinkle_agentic.preprocessor import HardFilter

f = HardFilter(
    min_user_chars=10,           # Min chars for non-CJK user query
    min_user_chars_cjk=6,        # Min chars for CJK user query
    min_assistant_chars_2turn=80, # Min assistant reply length (2-turn)
    min_thinking_chars=200,      # Min thinking chain length to exempt
    system_deny_keywords=['hack', 'exploit'],
    max_chars_per_round=50000,
    max_total_chars=200000,
    max_rounds=50,
)
```

**Drop reasons:** `trivial_single_turn`, `shallow_reply`, `all_empty_assistant`, `system_deny_keyword`, `round_too_long`, `total_too_long`, `too_many_rounds`

### DeadLoopFilter

Detects assistant messages exhibiting hesitation or dead-loop patterns — repetitive self-corrections, cascading corrections, and high n-gram repetition.

```python
from twinkle_agentic.preprocessor import DeadLoopFilter

f = DeadLoopFilter(
    hesitation_density_threshold=7.0,   # Markers per 1000 chars (response)
    cascade_threshold=5,                 # Cascade markers in window
    cascade_window=800,                  # Window size in chars
    repetition_threshold=0.45,           # N-gram repetition ratio
    think_hesitation_density_threshold=15.0,  # Laxer for <think> blocks
    think_repetition_threshold=0.65,
)
```

Uses separate threshold profiles for `<think>` reasoning blocks (laxer, free to ramble) and visible response (stricter).

### DedupFilter

Global longest-wins deduplication. The signature is derived from the first real user turn (head+tail) and the first assistant reply.

```python
from twinkle_agentic.preprocessor import DedupFilter

f = DedupFilter(prefix_chars=100, asst_chars=100)
kept, dropped = f(all_rows)  # Must see entire dataset in one call
```

> **Note:** `DedupFilter` requires the full dataset in a single call. Do **not** place it inside `QualityPreprocessor` (which processes per-batch). Run it separately before or after the pipeline.

### RefuseFilter

Detects self-referential refusals in the first assistant reply (e.g., "I cannot help with that"). Multi-language pattern matching (EN/ZH/JA/KO).

```python
from twinkle_agentic.preprocessor import RefuseFilter

f = RefuseFilter(check_window=600)  # Only check first N chars
```

### TokenSoupFilter

Detects garbled / token-soup output by checking for replacement characters, control characters, private-use Unicode, leaked special tokens, single-character repetition, and script chaos.

```python
from twinkle_agentic.preprocessor import TokenSoupFilter

f = TokenSoupFilter(
    replacement_char_ratio=0.02,
    special_token_count=20,
    script_chaos_threshold=0.55,
)
```

### PIIPresidioFilter

Multi-language PII detection and rewriting using Microsoft Presidio + spaCy NER + Faker. Detects and replaces personal identifiable information (names, emails, phone numbers, addresses, etc.).

```python
from twinkle_agentic.preprocessor import PIIPresidioFilter

f = PIIPresidioFilter(languages=['en', 'zh'])
```

### IntentClassifier

Heuristic intent classifier that tags each row with detected intents. Pluggable detector pipeline.

```python
from twinkle_agentic.preprocessor import IntentClassifier

classifier = IntentClassifier()
```

**Intent categories:** `tool_call`, `code`, `math`, `complex_logic`, `reasoning`, `user_dissatisfaction`, `other`

### ScoreFilter

Pluggable scorer-based filter with built-in scorers for character-level metrics, semantic similarity, and code execution.

```python
from twinkle_agentic.preprocessor import ScoreFilter

f = ScoreFilter()
```

**Built-in scorers:** `ChrMinScorer`, `SIFDScorer`, `PassNScorer`, `ParaphraseScorer`

### ModelFilter

Filters rows by model ID whitelist.

```python
from twinkle_agentic.preprocessor import ModelFilter

f = ModelFilter(allowed_models=['qwen3.5-4b', 'qwen3.5-32b'])
```

### MessageNormalizer

Three-pass message normalization: heartbeat stripping, tool-call rewriting, and consecutive same-role message merging.

```python
from twinkle_agentic.preprocessor import MessageNormalizer

normalizer = MessageNormalizer()
```

## Complete Pipeline Example

```python
from twinkle_agentic.preprocessor import (
    QualityPreprocessor,
    HardFilter,
    DeadLoopFilter,
    RefuseFilter,
    TokenSoupFilter,
    MessageNormalizer,
    DedupFilter,
)

# Step 1: Global dedup (must run on full dataset)
dedup = DedupFilter()
rows, _ = dedup(all_rows)

# Step 2: Per-batch pipeline
pipeline = [
    HardFilter(min_user_chars=10, max_rounds=30),
    DeadLoopFilter(),
    RefuseFilter(),
    TokenSoupFilter(),
    MessageNormalizer(),
]
preprocessor = QualityPreprocessor(pipeline, dropped_log_path='dropped.jsonl')
cleaned = preprocessor(rows)
```