Auto-Research

Twinkle Auto is a terminal-based intelligent training assistant that lets you control, monitor, and debug ML training through natural language. It combines a chat-driven AI agent with an automated health monitor that can detect and fix training failures autonomously.

Architecture Overview

┌──────────────────────────────────────────────────────────┐
│ TwinkleAuto (asyncio chat loop)                          │
│                                                          │
│ Components:                                              │
│   AgentLoop  ─── LLM tool-calling loop                   │
│   TrainingMonitor ─── periodic health check & auto-fix   │
│   LocalConnection ─── file-system based communication   │
│   SkillManager ─── async plugin loading                 │
└──────────────────────────────────────────────────────────┘

Installation & Launch

Auto is part of the twinkle-client package:

pip install twinkle-client

Command-Line Usage

# Basic launch (uses default local Ollama endpoint)
twinkle-auto

# Specify LLM backend
twinkle-auto --llm-base-url http://localhost:11434/v1 --llm-model qwen3.5

# Attach to an existing training run
twinkle-auto --run-id my-grpo-run

# Use a remote API (e.g., OpenAI-compatible)
twinkle-auto --llm-base-url https://api.example.com/v1 --llm-api-key sk-xxx --llm-model gpt-4o

# Enable debug logging
twinkle-auto --verbose

Or run as a Python module:

python -m twinkle_client.auto

CLI Options

Option	Env Var	Default	Description
`--run-id`, `-r`	`TWINKLE_AUTO_RUN_ID`	None	Attach to an existing training run
`--llm-base-url`	`TWINKLE_LLM_BASE_URL`	`http://localhost:11434/v1`	LLM API base URL
`--llm-model`	`TWINKLE_LLM_MODEL`	`qwen3.5`	LLM model name
`--llm-api-key`	`TWINKLE_LLM_API_KEY`	`not-needed`	LLM API key
`--verbose`, `-v`	`TWINKLE_AUTO_VERBOSE`	`False`	Enable DEBUG logging
`--version`, `-V`	—	—	Show version and exit

Chat Agent

The core of Auto is an LLM-powered tool-calling agent (AgentLoop) that processes natural language commands through an OpenAI-compatible API. The agent maintains conversation history with automatic pruning (last 50 messages) and supports up to 10 tool-calling rounds per interaction.

What You Can Say

Training lifecycle:

“List my training runs”
“Start a new GRPO training with Qwen3.5-4B on gsm8k”
“Pause the current run”
“Resume training”
“Stop training”

Server management:

“Start the server with Qwen3.5-4B and a Qwen3.5-72B sampler on 2 GPUs”
“Shut down the server”
“How many GPUs are available?”

Monitoring & analysis:

“How is the training going?”
“Show me the reward-related metrics”
“Zoom into steps 100-200”
“Reset the chart view”

Search:

“Search for math datasets”
“Find Qwen models on ModelScope”

Available Tools

The agent has access to 13 built-in tools:

Tool	Description
`list_training_runs`	List all training runs
`get_training_status`	Get detailed status and recent metrics
`start_server`	Start Ray cluster + Twinkle Server (idempotent)
`shutdown_server`	Shut down server and release GPU resources
`start_training`	Create and launch a new training run
`select_run`	Switch monitoring to a different run
`pause_training`	Pause training (SIGKILL, server retains state)
`resume_training`	Resume by re-launching the client script
`stop_training`	Stop training (SIGTERM, saves checkpoint)
`update_script`	Update training script with version archiving
`list_supported_models`	Query server for available models
`search_datasets`	Search ModelScope for datasets
`search_models`	Search ModelScope for models
`zoom_metrics`	Adjust metrics chart view range
`select_metrics`	Choose which metrics to display (max 4)
`get_cluster_info`	Get GPU/cluster resource info

Server Startup

The start_server tool automates a multi-step pipeline:

GPU detection — nvidia-smi hardware scan
GPU allocation — partition GPUs between training model and samplers
Config generation — auto-create server_config.yaml
Ray cluster startup — multi-node GPU partitioning with isolated CUDA_VISIBLE_DEVICES
Server launch — start Twinkle Server as background process
Health check — poll /api/v1/healthz until ready

Multi-model topology is supported: 1 training model + N sampler/teacher models.

Skills System

Auto supports extensible skill plugins loaded from three sources:

Bundled skills — shipped inside twinkle_client/skills/bundled/
User-local skills — ~/.cache/twinkle/auto/skills/local/
Community skills — fetched from ModelScope (best-effort, 10s timeout)

Skills are loaded asynchronously after startup and injected into the agent’s system prompt. The agent is usable immediately even before skills finish loading.

Training Monitor (Auto-Fix)

The TrainingMonitor is a background service that runs every 30 seconds, collecting all available signals about the current training run and feeding them to the LLM for analysis.

Collected Signals

Process status: alive / dead / unknown
output.log tail: last 1500 chars (prioritizes tracebacks)
Metrics: recent entries + first-half vs second-half trend analysis
Stall duration: seconds since last metric was produced
Current train.py: full script source (for accurate fixes)

Decision Framework

The LLM classifies each check into one of three actions:

Decision	When	Action
LGTM	Training progressing normally	No action
WARNING	Loss plateau, reward hacking, KL explosion, etc.	Relay observation to user
FIX	Script crashed, process dead with traceback	Auto-fix and restart

Auto-Fix Pipeline

When a FIX is needed:

LLM outputs diagnosis + complete fixed script
Monitor archives the old train.py as train_v{N}.py
Writes the fixed script as the new train.py
Re-launches training via resume_training
Resets stall tracking for the new attempt

Safety guardrails:

Max 3 auto-fix attempts per run (prevents infinite retry loops)
Fix attempts are tracked per run_id
Snapshot deduplication avoids re-analyzing unchanged states

File-Based Connection

Auto communicates with training processes through the local filesystem:

~/.cache/twinkle/{run_id}/
├── meta.json       — run metadata (model_id, config, status, pid)
├── metrics.jsonl   — one JSON object per step (incremental)
├── output.log      — combined stdout+stderr from training
├── train.py        — current active training script
└── train_v{N}.py   — archived previous script versions

Training Control Model

In Server Mode, the Twinkle Server retains all model/optimizer state in GPU memory:

Pause = kill client process (SIGKILL) — server state preserved
Resume = re-launch client script — seamlessly continues training
Stop = SIGTERM — triggers checkpoint saving then exits
Shut down server = releases GPU resources, destroys model state

TrainingRuntime (Script Integration)

Training scripts use TrainingRuntime to integrate with Auto:

from twinkle_client.auto.runtime import TrainingRuntime

rt = TrainingRuntime(run_id='my-grpo-run')
rt.start(model_id='Qwen/Qwen3.5-4B', config={'lr': 1e-5})
rt.register_graceful_shutdown(model, dataloader)

for step, batch in enumerate(dataloader):
    # ... training logic ...
    rt.log_metrics(step=step, loss=loss, reward=reward, grad_norm=gn, lr=lr)
    rt.log(f'Completed step {step}, loss={loss:.4f}')

rt.finish()

Key Methods

Method	Description
`start(model_id, config, script_path)`	Initialize run directory and metadata
`log_metrics(**kwargs)`	Write metrics entry to `metrics.jsonl`
`log(message)`	Print log message (captured as `output.log`)
`get_resume_info()`	Get `last_step` for resuming from checkpoint
`finish(status)`	Mark training as finished, close files
`register_graceful_shutdown(model, dataloader)`	Register SIGTERM handler that saves checkpoint

Resume Support

TrainingRuntime automatically saves training progress to meta.json (throttled to every 5 seconds). Scripts can use get_resume_info() to resume from the last saved step:

rt = TrainingRuntime(run_id='my-run')
resume = rt.get_resume_info()
global_step = resume['last_step']

if global_step > 0:
    dataloader.skip_consumed_samples(global_step * BATCH_SIZE)
    print(f'Resuming from step {global_step}')

Graceful Shutdown

When register_graceful_shutdown() is called, a SIGTERM handler is installed that:

Saves model checkpoint (LoRA weights + optimizer state)
Saves dataloader position (consumed_train_samples)
Logs the checkpoint path
Marks training as stopped and exits

Logging

All logs are written to ./auto.log (current working directory):

Rotated at 5MB with 3 backups
No console output — all output goes to the log file
Use --verbose for DEBUG level logging