Supported Models
Twinkle supports any model compatible with HuggingFace Transformers or Megatron-LM. Below is a curated list of models tested with Twinkle.
Language Models
| Model Family | Model IDs | Parameters | Features |
|---|---|---|---|
| Qwen 3.5 | Qwen/Qwen3.5-0.6B ~ Qwen/Qwen3.5-235B-A22B |
0.6B–235B | MoE, Thinking mode |
| Qwen 2.5 | Qwen/Qwen2.5-0.5B ~ Qwen/Qwen2.5-72B |
0.5B–72B | Dense |
| DeepSeek V4 | deepseek-ai/DeepSeek-V4 |
685B MoE | Custom DSML encoding |
| DeepSeek R1 | deepseek-ai/DeepSeek-R1 |
685B MoE | Reasoning |
| LLaMA 3 | meta-llama/Llama-3.3-70B-Instruct |
8B–70B | Dense |
| Mistral | mistralai/Mistral-7B-v0.3 |
7B | Dense |
| Yi | 01-ai/Yi-1.5-34B |
6B–34B | Dense |
| GLM-4 | THUDM/glm-4-9b-chat |
9B | Dense |
| InternLM 2.5 | internlm/internlm2_5-7b-chat |
7B–20B | Dense |
Vision-Language Models
| Model Family | Model IDs | Features |
|---|---|---|
| Qwen 3.5 VL | Qwen/Qwen3.5-VL-3B ~ Qwen/Qwen3.5-VL-72B |
Image, Video |
| Qwen 2.5 VL | Qwen/Qwen2.5-VL-7B-Instruct |
Image, Video |
| InternVL 2.5 | OpenGVLab/InternVL2_5-8B |
Image |
Embedding Models
| Model Family | Model IDs | Training Method |
|---|---|---|
| Qwen3 Embedding | Qwen/Qwen3-Embedding-0.6B |
InfoNCE contrastive |
| GTE | thenlper/gte-large-zh |
InfoNCE contrastive |
Model Loading
Models can be loaded from ModelScope or HuggingFace:
from twinkle.model import TransformersModel
# From ModelScope (ms:// prefix)
model = TransformersModel(model_id='ms://Qwen/Qwen3.5-4B')
# From HuggingFace (hf:// prefix)
model = TransformersModel(model_id='hf://meta-llama/Llama-3.3-70B-Instruct')
# Local path
model = TransformersModel(model_id='/path/to/model')
Framework Support
| Framework | Class | Use Case |
|---|---|---|
| Transformers | TransformersModel |
General training (SFT, RLHF, DPO) |
| Transformers + Multi-LoRA | MultiLoraTransformersModel |
Multi-tenant training |
| Megatron-LM | MegatronModel |
Large-scale distributed pre-training |
| Megatron + Multi-LoRA | MultiLoraMegatronModel |
Large-scale multi-tenant |
Precision Support
| Mode | Description |
|---|---|
bf16 |
BFloat16 mixed precision (recommended for A100/H100) |
fp16 |
Float16 mixed precision (for older GPUs) |
fp8 |
FP8 precision (H100 with Transformer Engine) |
no |
Full precision (debugging only) |
Parallelism Strategies
| Strategy | Config Key | Description |
|---|---|---|
| FSDP | strategy=accelerate |
Accelerate-managed FSDP (default) |
| Native FSDP | strategy=native_fsdp |
PyTorch native FSDP |
| Tensor Parallel | tp_size |
Split layers across GPUs |
| Pipeline Parallel | pp_size |
Split model stages |
| Data Parallel | dp_size |
Replicate model, split data |
| Sequence Parallel | sequence_parallel |
Split long sequences |
| Expert Parallel | ep_size |
MoE expert distribution |