# Auto-Research

Twinkle Auto is a terminal-based intelligent training assistant that lets you **control, monitor, and debug ML training through natural language**. It combines a chat-driven AI agent with an automated health monitor that can detect and fix training failures autonomously.

## Architecture Overview

```
┌──────────────────────────────────────────────────────────┐
│  TwinkleAuto (asyncio chat loop)                         │
│                                                          │
│  Components:                                             │
│    AgentLoop       ─── LLM tool-calling loop             │
│    TrainingMonitor ─── periodic health check & auto-fix  │
│    LocalConnection ─── file-system based communication   │
│    SkillManager    ─── async plugin loading              │
└──────────────────────────────────────────────────────────┘
```

## Installation & Launch

Auto is part of the `twinkle-client` package:

```bash
pip install twinkle-client
```

### Command-Line Usage

```bash
# Basic launch (uses default local Ollama endpoint)
twinkle-auto

# Specify LLM backend
twinkle-auto --llm-base-url http://localhost:11434/v1 --llm-model qwen3.5

# Attach to an existing training run
twinkle-auto --run-id my-grpo-run

# Use a remote API (e.g., OpenAI-compatible)
twinkle-auto --llm-base-url https://api.example.com/v1 --llm-api-key sk-xxx --llm-model gpt-4o

# Enable debug logging
twinkle-auto --verbose
```

Or run as a Python module:

```bash
python -m twinkle_client.auto
```

### CLI Options

| Option | Env Var | Default | Description |
|--------|---------|---------|-------------|
| `--run-id`, `-r` | `TWINKLE_AUTO_RUN_ID` | None | Attach to an existing training run |
| `--llm-base-url` | `TWINKLE_LLM_BASE_URL` | `http://localhost:11434/v1` | LLM API base URL |
| `--llm-model` | `TWINKLE_LLM_MODEL` | `qwen3.5` | LLM model name |
| `--llm-api-key` | `TWINKLE_LLM_API_KEY` | `not-needed` | LLM API key |
| `--verbose`, `-v` | `TWINKLE_AUTO_VERBOSE` | `False` | Enable DEBUG logging |
| `--version`, `-V` | - | - | Show version and exit |

## Chat Agent

The core of Auto is an **LLM-powered tool-calling agent** (`AgentLoop`) that processes natural language commands through an OpenAI-compatible API. The agent prunes its conversation history automatically, keeping the last 50 messages, and supports up to 10 tool-calling rounds per interaction.

### What You Can Say

**Training lifecycle:**

- *"List my training runs"*
- *"Start a new GRPO training with Qwen3.5-4B on gsm8k"*
- *"Pause the current run"*
- *"Resume training"*
- *"Stop training"*

**Server management:**

- *"Start the server with Qwen3.5-4B and a Qwen3.5-72B sampler on 2 GPUs"*
- *"Shut down the server"*
- *"How many GPUs are available?"*

**Monitoring & analysis:**

- *"How is the training going?"*
- *"Show me the reward-related metrics"*
- *"Zoom into steps 100-200"*
- *"Reset the chart view"*

**Search:**

- *"Search for math datasets"*
- *"Find Qwen models on ModelScope"*

### Available Tools

The agent has access to 16 built-in tools:

| Tool | Description |
|------|-------------|
| `list_training_runs` | List all training runs |
| `get_training_status` | Get detailed status and recent metrics |
| `start_server` | Start Ray cluster + Twinkle Server (idempotent) |
| `shutdown_server` | Shut down server and release GPU resources |
| `start_training` | Create and launch a new training run |
| `select_run` | Switch monitoring to a different run |
| `pause_training` | Pause training (SIGKILL, server retains state) |
| `resume_training` | Resume by re-launching the client script |
| `stop_training` | Stop training (SIGTERM, saves checkpoint) |
| `update_script` | Update training script with version archiving |
| `list_supported_models` | Query server for available models |
| `search_datasets` | Search ModelScope for datasets |
| `search_models` | Search ModelScope for models |
| `zoom_metrics` | Adjust metrics chart view range |
| `select_metrics` | Choose which metrics to display (max 4) |
| `get_cluster_info` | Get GPU/cluster resource info |

### Server Startup

The `start_server` tool automates a multi-step pipeline:

1. **GPU detection**: `nvidia-smi` hardware scan
2. **GPU allocation**: partition GPUs between training model and samplers
3. **Config generation**: auto-create `server_config.yaml`
4. **Ray cluster startup**: multi-node GPU partitioning with isolated `CUDA_VISIBLE_DEVICES`
5. **Server launch**: start Twinkle Server as a background process
6. **Health check**: poll `/api/v1/healthz` until ready

Multi-model topology is supported: one training model plus N sampler/teacher models.

### Skills System

Auto supports extensible skill plugins loaded from three sources:

1. **Bundled skills**: shipped inside `twinkle_client/skills/bundled/`
2. **User-local skills**: `~/.cache/twinkle/auto/skills/local/`
3. **Community skills**: fetched from ModelScope (best-effort, 10s timeout)

Skills are loaded asynchronously after startup and injected into the agent's system prompt. The agent is usable immediately, even before skills finish loading.

## Training Monitor (Auto-Fix)

The `TrainingMonitor` is a background service that runs every **30 seconds**, collecting all available signals about the current training run and feeding them to the LLM for analysis.

### Collected Signals

- **Process status**: alive / dead / unknown
- **output.log tail**: last 1500 chars (prioritizes tracebacks)
- **Metrics**: recent entries + first-half vs second-half trend analysis
- **Stall duration**: seconds since the last metric was produced
- **Current train.py**: full script source (for accurate fixes)

### Decision Framework

The LLM classifies each check into one of three decisions:

| Decision | When | Action |
|----------|------|--------|
| **LGTM** | Training progressing normally | No action |
| **WARNING** | Loss plateau, reward hacking, KL explosion, etc. | Relay observation to user |
| **FIX** | Script crashed, process dead with traceback | Auto-fix and restart |

### Auto-Fix Pipeline

When a FIX is needed:

1. The LLM outputs a diagnosis and a complete fixed script
2. The monitor archives the old `train.py` as `train_v{N}.py`
3. The monitor writes the fixed script as the new `train.py`
4. Training is re-launched via `resume_training`
5. Stall tracking is reset for the new attempt

Safety guardrails:

- Max **3 auto-fix attempts** per run (prevents infinite retry loops)
- Fix attempts are tracked per `run_id`
- Snapshot deduplication avoids re-analyzing unchanged states

## File-Based Connection

Auto communicates with training processes through the local filesystem:

```
~/.cache/twinkle/{run_id}/
├── meta.json       - run metadata (model_id, config, status, pid)
├── metrics.jsonl   - one JSON object per step (incremental)
├── output.log      - combined stdout+stderr from training
├── train.py        - current active training script
└── train_v{N}.py   - archived previous script versions
```

### Training Control Model

In Server Mode, the Twinkle Server retains all model/optimizer state in GPU memory:

- **Pause** = kill the client process (SIGKILL); server state is preserved
- **Resume** = re-launch the client script; training continues seamlessly
- **Stop** = SIGTERM; triggers checkpoint saving, then exits
- **Shut down server** = releases GPU resources and **destroys** model state

## TrainingRuntime (Script Integration)

Training scripts use `TrainingRuntime` to integrate with Auto:

```python
from twinkle_client.auto.runtime import TrainingRuntime

# `model`, `dataloader`, and the metric values below come from your own training code.
rt = TrainingRuntime(run_id='my-grpo-run')
rt.start(model_id='Qwen/Qwen3.5-4B', config={'lr': 1e-5})
rt.register_graceful_shutdown(model, dataloader)

for step, batch in enumerate(dataloader):
    # ... training logic ...
    rt.log_metrics(step=step, loss=loss, reward=reward, grad_norm=gn, lr=lr)
    rt.log(f'Completed step {step}, loss={loss:.4f}')

rt.finish()
```

### Key Methods

| Method | Description |
|--------|-------------|
| `start(model_id, config, script_path)` | Initialize run directory and metadata |
| `log_metrics(**kwargs)` | Write a metrics entry to `metrics.jsonl` |
| `log(message)` | Print a log message (captured in `output.log`) |
| `get_resume_info()` | Get `last_step` for resuming from checkpoint |
| `finish(status)` | Mark training as finished, close files |
| `register_graceful_shutdown(model, dataloader)` | Register SIGTERM handler that saves a checkpoint |

### Resume Support

`TrainingRuntime` automatically saves training progress to `meta.json` (throttled to at most once every 5 seconds). Scripts can use `get_resume_info()` to resume from the last saved step:

```python
from twinkle_client.auto.runtime import TrainingRuntime

# `dataloader` and `BATCH_SIZE` are defined elsewhere in the training script.
rt = TrainingRuntime(run_id='my-run')
resume = rt.get_resume_info()
global_step = resume['last_step']

if global_step > 0:
    # Skip the samples already consumed before the interruption.
    dataloader.skip_consumed_samples(global_step * BATCH_SIZE)
    print(f'Resuming from step {global_step}')
```

### Graceful Shutdown

When `register_graceful_shutdown()` is called, a SIGTERM handler is installed that:

1. Saves the model checkpoint (LoRA weights + optimizer state)
2. Saves the dataloader position (`consumed_train_samples`)
3. Logs the checkpoint path
4. Marks training as `stopped` and exits

## Logging

All logs are written to `./auto.log` (current working directory):

- Rotated at 5MB with 3 backups
- **No console output**: all output goes to the log file
- Use `--verbose` for DEBUG-level logging
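
The rotation policy described above maps naturally onto Python's standard `logging.handlers.RotatingFileHandler`. The following is a minimal sketch of such a setup, not Auto's actual implementation: the helper name `configure_file_logging`, the logger name `twinkle_auto_example`, and the format string are assumptions made for illustration.

```python
import logging
from logging.handlers import RotatingFileHandler


def configure_file_logging(path="auto.log", verbose=False):
    """Hypothetical helper: file-only logging with size-based rotation."""
    logger = logging.getLogger("twinkle_auto_example")  # assumed logger name
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    # Do not pass records up to the root logger, so nothing reaches the console.
    logger.propagate = False

    # Remove handlers from earlier calls to avoid duplicate log lines.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()

    handler = RotatingFileHandler(
        path,
        maxBytes=5 * 1024 * 1024,  # roll over before the file exceeds 5 MB
        backupCount=3,             # keep auto.log.1 through auto.log.3
        encoding="utf-8",
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

Under these assumptions, calling `configure_file_logging(verbose=True)` would correspond to launching with `--verbose`. Setting `propagate = False` is one way to keep the terminal free for the chat interface.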