Qwen3.5 Training Best Practices
Using Qwen3.5-4B as an example, this guide demonstrates the core capability of the Twinkle framework: one set of component-based code that runs unchanged from single-GPU training through to Client-Server mode.
1. What is Twinkle
Twinkle is a production-oriented large model training framework. Its core design is straightforward: training logic is expressed in Python code, and the runtime mode is switched via initialization parameters.
This means:
A training script written in the lab can be moved to Ray or server-based training by changing a single line
Training algorithms are open to customization
No need to maintain separate codebases to support different modes like torchrun, Ray, or HTTP
Algorithm engineers focus on training logic; the framework handles distributed communication automatically
Twinkle supports both Transformers and Megatron backends, as well as multi-tenant LoRA training — multiple users share a single base model while each trains their own adapter.
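As a rough illustration of switching runtime mode through initialization parameters, the sketch below collects the keyword arguments of the two `twinkle.initialize` calls that appear later in this guide. `build_init_kwargs` is a hypothetical helper, not part of Twinkle. It does not import Twinkle and only builds dictionaries, and it assumes nothing about the `twinkle.initialize` signature beyond those two calls.

```python
from typing import Any, Dict, Optional


def build_init_kwargs(mode: str, device_mesh: Any = None,
                      groups: Optional[list] = None,
                      nproc_per_node: int = 8) -> Dict[str, Any]:
    """Hypothetical helper: choose the arguments for twinkle.initialize by mode."""
    if mode == 'local':
        # torchrun launch: the script only passes the global device mesh.
        return {'mode': 'local', 'global_device_mesh': device_mesh}
    if mode == 'ray':
        # Ray launch: the script describes its device groups instead.
        return {'mode': 'ray', 'nproc_per_node': nproc_per_node,
                'groups': groups or [], 'lazy_collect': False}
    raise ValueError(f'unsupported mode: {mode!r}')


# Placeholders stand in for real DeviceMesh / DeviceGroup objects.
local_kwargs = build_init_kwargs('local', device_mesh='mesh-placeholder')
ray_kwargs = build_init_kwargs('ray', groups=['model', 'sampler'])
print(local_kwargs)
print(ray_kwargs)
```

With Twinkle installed, the selected dictionary would be passed as `twinkle.initialize(**kwargs)`, and the training loop would stay the same in both cases.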
2. Local Multi-GPU Training
Overview
Training on 1–8 local GPUs or NPUs. Twinkle is built on PyTorch native interfaces and supports parallel strategies such as FSDP2 and DDP.
Full Code
from peft import LoraConfig
from tqdm import tqdm
import twinkle
from twinkle import DeviceMesh, get_device_placement, get_logger
from twinkle.dataloader import DataLoader
from twinkle.dataset import Dataset, DatasetMeta
from twinkle.model import TransformersModel
from twinkle.preprocessor import SelfCognitionProcessor
# Build device_mesh: fsdp=4, dp=2, using 8 GPUs in total
device_mesh = DeviceMesh.from_sizes(fsdp_size=4, dp_size=2)
# Use torchrun mode
twinkle.initialize(mode='local', global_device_mesh=device_mesh)
logger = get_logger()


def eval(model):
    # Validation set: 100 samples
    dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(100)))
    dataset.set_template('Qwen3_5Template', model_id='ms://Qwen/Qwen3.5-4B')
    dataset.map(SelfCognitionProcessor('twinkle LLM', 'ModelScope Community'))
    dataset.encode()
    dataloader = DataLoader(dataset=dataset, batch_size=8)
    for step, batch in tqdm(enumerate(dataloader)):
        model.forward_only(inputs=batch)
        model.calculate_loss()
    metrics = model.calculate_metric(is_training=False)
    return metrics


def train():
    # Training set: 1000 samples
    dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(1000)))
    # Set template to prepare encoding
    dataset.set_template('Qwen3_5Template', model_id='ms://Qwen/Qwen3.5-4B')
    # Preprocess: replace placeholders in self-cognition data
    dataset.map(SelfCognitionProcessor('twinkle LLM', 'ModelScope Community'))
    # Encode dataset
    dataset.encode()
    # Global batch size = 8; each of the 8 GPUs processes 1 sample
    dataloader = DataLoader(dataset=dataset, batch_size=8)
    # Load model
    model = TransformersModel(model_id='ms://Qwen/Qwen3.5-4B')
    model.model._no_split_modules = {'Qwen3_5DecoderLayer'}
    lora_config = LoraConfig(r=8, lora_alpha=32, target_modules='all-linear')
    # Add LoRA adapter named 'default'
    # Comment this out to switch to full-parameter training
    model.add_adapter_to_model('default', lora_config, gradient_accumulation_steps=2)
    # Configure optimizer for LoRA
    model.set_optimizer(optimizer_cls='AdamW', lr=1e-4)
    # Configure learning rate scheduler
    model.set_lr_scheduler(
        scheduler_cls='CosineWarmupScheduler', num_warmup_steps=5, num_training_steps=len(dataloader))
    logger.info(get_device_placement())
    # Print training config
    logger.info(model.get_train_configs())
    logger.info(f'Total steps: {len(dataloader)}')
    loss_metric = 99.0
    # LoRA training: ~8G * 8 GPU memory
    # Full-parameter training: ~18G * 8 GPU memory
    for step, batch in enumerate(dataloader):
        # Forward + backward pass
        model.forward_backward(inputs=batch)
        # Gradient clipping + optimizer step
        model.clip_grad_and_step()
        if step % 20 == 0:
            # Print training metrics
            metric = model.calculate_metric(is_training=True)
            logger.info(f'Current is step {step} of {len(dataloader)}, metric: {metric}')
        if step > 0 and step % 40 == 0:
            # Periodic evaluation
            metrics = eval(model)
            logger.info(f'Eval metric: {metrics}')
            metrics['step'] = step
            # Save best checkpoint
            if loss_metric > float(metrics['loss']):
                model.save(f'checkpoint-{step}')
                loss_metric = float(metrics['loss'])
    model.save('last-checkpoint')


if __name__ == '__main__':
    train()
Launch Command
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 fsdp2.py
Key Design Notes
DeviceMesh Parallelism Strategy
device_mesh = DeviceMesh.from_sizes(fsdp_size=4, dp_size=2)
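A hybrid parallel strategy with 4-way FSDP sharding and 2-way data parallelism, 8 GPUs in total. Qwen3.5-4B weights occupy about 8 GB in bf16 precision. According to the comments in the training loop above, LoRA training uses roughly 8 GB per GPU and full-parameter training roughly 18 GB per GPU, so 8× A100/H100 handles either mode comfortably.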
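The weight figure above can be checked with a short back-of-the-envelope calculation. This sketch assumes a nominal 4e9 parameters (the exact count is not given in this guide) and covers weights only. Activations, gradients, optimizer state, and temporary buffers account for the rest of the measured footprint.

```python
# Rough weight-memory arithmetic for the mesh fsdp_size=4, dp_size=2.
# Assumption: a nominal 4e9 parameters; the exact count is not stated here.
NUM_PARAMS = 4e9
BYTES_PER_PARAM_BF16 = 2
FSDP_SIZE = 4
DP_SIZE = 2

world_size = FSDP_SIZE * DP_SIZE                       # GPUs required by the mesh
weights_gb = NUM_PARAMS * BYTES_PER_PARAM_BF16 / 1e9   # one full bf16 copy
shard_gb = weights_gb / FSDP_SIZE                      # weights held per GPU once sharded

print(f'GPUs in mesh: {world_size}')
print(f'Full bf16 weights: {weights_gb:.1f} GB')
print(f'Sharded weights per GPU: {shard_gb:.1f} GB')
```

Each of the 2 data-parallel replicas holds one full copy of the weights, split across its 4 FSDP ranks.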
Gradient Accumulation
model.add_adapter_to_model('default', lora_config, gradient_accumulation_steps=2)
gradient_accumulation_steps=2 updates parameters every 2 micro-batches, effectively doubling the batch size. Useful when GPU memory is constrained but a larger effective batch is desired.
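To see why accumulating over 2 micro-batches behaves like one batch of twice the size, the sketch below compares the two on a tiny one-parameter model in plain Python. It assumes each micro-batch gradient is averaged (divided by the number of accumulation steps). Whether Twinkle averages or sums internally is not stated in this guide.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# One large batch of 4 samples.
full_grad = grad_mse(w, xs, ys)

# Two micro-batches of 2 samples, accumulated before a single update.
accumulation_steps = 2
accumulated = 0.0
for i in range(accumulation_steps):
    micro_x = xs[2 * i:2 * i + 2]
    micro_y = ys[2 * i:2 * i + 2]
    # Scale each micro-batch gradient so the sum matches the large-batch mean.
    accumulated += grad_mse(w, micro_x, micro_y) / accumulation_steps

print(full_grad, accumulated)
```

The two gradients agree, so the optimizer sees the same update while only one micro-batch of activations has to fit in memory at a time.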
Algorithm Transparency
All key training steps — forward pass, backward pass, gradient clipping, checkpoint saving — are written directly in the main loop. Developers retain full control over the training process. The underlying distributed communication is handled by Twinkle’s infra layer; switching between Ray and torchrun has no impact on the main loop.
For complex algorithms, this transparency is especially important.
RL Training: Reinforcement Learning with Ray
Twinkle supports multiple RL algorithms, including GRPO, RLOO, GSPO, and more. Here we use GRPO (Group Relative Policy Optimization) as an example — the core RL algorithm used in DeepSeek-R1 — to show how RL training works in Ray mode.
Unlike PPO, GRPO does not require training a separate value model. Instead, it estimates the advantage function using relative rewards within a sampled group, simplifying the training pipeline and reducing memory overhead. Twinkle’s Ray mode is particularly well-suited for RL algorithms that require model and sampler to run on separate devices. In the example below, 4 GPUs run model training while another 4 run vLLM sampling, coordinated through a Ray cluster:
from typing import List, Dict, Any
from peft import LoraConfig
import twinkle
from twinkle import DeviceMesh, DeviceGroup, get_device_placement, get_logger
from twinkle.advantage import GRPOAdvantage
from twinkle.checkpoint_engine import CheckpointEngineManager
from twinkle.data_format import SamplingParams
from twinkle.dataloader import DataLoader
from twinkle.dataset import Dataset, DatasetMeta
from twinkle.model import TransformersModel
from twinkle.processor import InputProcessor
from twinkle.reward import GSM8KAccuracyReward, GSM8KFormatReward
from twinkle.sampler import vLLMSampler
from twinkle.template import Template
from twinkle.metric import CompletionRewardMetric
from twinkle.preprocessor.llm import GSM8KProcessor
logger = get_logger()
MODEL_ID = 'ms://Qwen/Qwen3.5-4B'
MODEL_GPUS = 4 # 4 GPUs for model training
SAMPLER_GPUS = 4 # 4 GPUs for vLLM sampling
NUM_GPUS = MODEL_GPUS + SAMPLER_GPUS
NUM_GENERATIONS = 8 # 8 samples per group
MAX_NEW_TOKENS = 4096
LEARNING_RATE = 1e-5
MAX_STEPS = 200
BATCH_SIZE = 16
MINI_BATCH_SIZE = 16
MICRO_BATCH_SIZE = 2
ADAPTER_NAME = 'default'


def create_gsm8k_dataset():
    dataset = Dataset(DatasetMeta('ms://modelscope/gsm8k', subset_name='main', split='train'))
    dataset.set_template('Qwen3_5Template', model_id=MODEL_ID, max_length=2048)
    dataset.map(GSM8KProcessor())
    dataset.encode(add_generation_prompt=True)
    return dataset


def compute_rewards(trajectories: List[Dict[str, Any]]):
    accuracy_reward_fn = GSM8KAccuracyReward()
    format_reward_fn = GSM8KFormatReward()
    accuracy_rewards = accuracy_reward_fn(trajectories)
    format_rewards = format_reward_fn(trajectories)
    total_rewards = [a + f for a, f in zip(accuracy_rewards, format_rewards)]
    return total_rewards, format_rewards, accuracy_rewards


def main():
    # Assign model and sampler to separate GPU groups
    device_groups = [
        DeviceGroup(name='model', ranks=list(range(MODEL_GPUS)), device_type='GPU'),
        DeviceGroup(name='sampler', ranks=list(range(MODEL_GPUS, NUM_GPUS)), device_type='GPU'),
    ]
    model_mesh = DeviceMesh.from_sizes(world_size=MODEL_GPUS, dp_size=MODEL_GPUS)
    sampler_mesh = DeviceMesh.from_sizes(world_size=SAMPLER_GPUS, dp_size=SAMPLER_GPUS)
    # Initialize in Ray mode
    twinkle.initialize(mode='ray', nproc_per_node=NUM_GPUS, groups=device_groups, lazy_collect=False)
    lora_config = LoraConfig(target_modules='all-linear', r=32, lora_alpha=64, lora_dropout=0.05)
    # Model deployed in the 'model' group
    model = TransformersModel(model_id=MODEL_ID, device_mesh=model_mesh, remote_group='model')
    model.add_adapter_to_model(ADAPTER_NAME, lora_config, gradient_accumulation_steps=1)
    model.set_optimizer('AdamW', lr=LEARNING_RATE)
    model.set_lr_scheduler('CosineAnnealingLR', T_max=MAX_STEPS, eta_min=0)
    model.set_loss('GRPOLoss', epsilon=0.2)
    model.set_processor(InputProcessor)
    model.set_template('Qwen3_5Template', model_id=MODEL_ID)
    # Sampler deployed in the 'sampler' group
    sampler = vLLMSampler(
        model_id=MODEL_ID,
        engine_args={
            'gpu_memory_utilization': 0.8,
            'max_model_len': 4096,
            'max_lora_rank': 32,
            'enable_lora': False,
        },
        device_mesh=sampler_mesh,
        remote_group='sampler',
    )
    sampler.set_template('Qwen3_5Template', model_id=MODEL_ID)
    ckpt_manager = CheckpointEngineManager(model=model, sampler=sampler)
    dataloader = DataLoader(
        dataset=create_gsm8k_dataset,
        batch_size=BATCH_SIZE,
        min_batch_size=BATCH_SIZE,
        device_mesh=model_mesh,
        remote_group='model',
    )
    advantage_fn = GRPOAdvantage()
    metrics = CompletionRewardMetric()
    sampling_params = SamplingParams(max_tokens=MAX_NEW_TOKENS, num_samples=1, logprobs=1)
    optim_step = 0
    logger.info(get_device_placement())
    for batch in dataloader:
        if optim_step >= MAX_STEPS:
            break
        metrics.reset()
        global_prompts = batch if isinstance(batch, list) else [batch]
        # Sync weights to sampler
        ckpt_manager.sync_weights(merge_and_sync=True)
        sampler.reset_prefix_cache()
        # Group sampling: sample NUM_GENERATIONS completions per prompt
        sample_responses = sampler.sample(
            global_prompts * NUM_GENERATIONS,
            sampling_params,
        )
        all_input_data = []
        all_old_logps = []
        all_completion_lengths = []
        for sample_response in sample_responses:
            for sequence in sample_response.sequences:
                all_input_data.append(sequence.new_input_feature)
                all_old_logps.append([logprob[0][1] for logprob in sequence.logprobs])
                all_completion_lengths.append(len(sequence.tokens))
        # Compute rewards
        total_rewards, format_rewards, accuracy_rewards = compute_rewards(all_input_data)
        metrics.accumulate(
            completion_lengths=all_completion_lengths,
            rewards={
                'total': total_rewards,
                'format': format_rewards,
                'accuracy': accuracy_rewards,
            },
        )
        # GRPO advantage estimation: group-level normalization
        advantages = advantage_fn(total_rewards, num_generations=NUM_GENERATIONS, scale='group').tolist()
        # Mini-batch training
        total_completions = len(all_input_data)
        for mb_start in range(0, total_completions, MINI_BATCH_SIZE):
            mb_end = min(mb_start + MINI_BATCH_SIZE, total_completions)
            mb_inputs = all_input_data[mb_start:mb_end]
            mb_old_logps = all_old_logps[mb_start:mb_end]
            mb_advantages = advantages[mb_start:mb_end]
            model.forward_backward(
                inputs=mb_inputs,
                old_logps=mb_old_logps,
                advantages=mb_advantages,
                micro_batch_size=MICRO_BATCH_SIZE,
            )
            model.clip_grad_and_step()
            optim_step += 1
            if optim_step >= MAX_STEPS:
                break
            log_dict = metrics.calculate()
            log_dict.update(model.calculate_metric(is_training=True))
            metrics.reset()
            logger.info(f'[Step {optim_step}/{MAX_STEPS}] {log_dict}')
    logger.info(f'Training completed. optim_steps={optim_step}')
    model.save('grpo-gsm8k-checkpoint')


if __name__ == '__main__':
    main()
Since this runs on a Ray cluster, launching is simply:
python train.py
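The constants at the top of the script determine how much data the run consumes before it stops. The arithmetic below is a sketch that assumes one optimizer step per mini-batch, which follows from `gradient_accumulation_steps=1` and `clip_grad_and_step()` being called once per mini-batch.

```python
import math

BATCH_SIZE = 16        # prompts per dataloader batch
NUM_GENERATIONS = 8    # completions sampled per prompt
MINI_BATCH_SIZE = 16   # completions per optimizer step
MAX_STEPS = 200        # optimizer steps before training stops

completions_per_batch = BATCH_SIZE * NUM_GENERATIONS
steps_per_batch = math.ceil(completions_per_batch / MINI_BATCH_SIZE)
batches_needed = math.ceil(MAX_STEPS / steps_per_batch)
prompts_consumed = batches_needed * BATCH_SIZE

print(f'completions per batch: {completions_per_batch}')
print(f'optimizer steps per batch: {steps_per_batch}')
print(f'batches until MAX_STEPS: {batches_needed}')
print(f'prompts consumed: {prompts_consumed}')
```

Under these settings each sampling round feeds several optimizer steps, so weights are synced to the sampler once per round rather than once per step.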
Key Design Points for GRPO Training:
Model-sampler separation: DeviceGroup splits 8 GPUs into two groups. Training and sampling run independently, allowing the sampling pipeline to fully leverage vLLM’s high throughput.
Group sampling strategy: global_prompts * NUM_GENERATIONS produces multiple completions per prompt, enabling advantage estimation via intra-group relative rewards, with no separate value model needed.
Weight synchronization: ckpt_manager.sync_weights() syncs the training model weights to vLLM before each sampling step, ensuring the sampler always uses the latest policy.
Algorithm components exposed: GRPOLoss is registered on the model via set_loss and GRPOAdvantage is called directly in the main loop; either can be swapped for another RL algorithm component without modifying any other code.
The core value of this pattern: the entire RL training loop — sampling, reward computation, advantage estimation, gradient update — is laid out in a visible Python main loop with no hidden magic. Differences between RL algorithms typically amount to swapping a few components.
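For readers new to GRPO, the two algorithm-specific pieces can be sketched in plain Python. This is an illustration of the general technique, not Twinkle's `GRPOAdvantage` or `GRPOLoss` implementation. It assumes that the completions of one prompt sit next to each other in the reward list, that the population standard deviation is used, and that the loss is the usual PPO-style clipped surrogate with `epsilon=0.2`.

```python
import math
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], num_generations: int,
                     eps: float = 1e-6) -> List[float]:
    """Normalize rewards within each group of `num_generations` completions.

    Assumes the completions of one prompt are contiguous in `rewards`.
    """
    if len(rewards) % num_generations != 0:
        raise ValueError('rewards must contain whole groups')
    advantages = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mu = mean(group)
        sigma = pstdev(group)
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages


def clipped_objective(logp: float, old_logp: float, advantage: float,
                      epsilon: float = 0.2) -> float:
    """PPO-style clipped surrogate for one token (a quantity to maximize)."""
    ratio = math.exp(logp - old_logp)
    clipped = min(max(ratio, 1 - epsilon), 1 + epsilon)
    return min(ratio * advantage, clipped * advantage)


rewards = [1.0, 0.0, 0.0, 1.0,   # prompt A: mixed outcomes
           1.0, 1.0, 1.0, 1.0]   # prompt B: identical rewards, no learning signal
adv = group_advantages(rewards, num_generations=4)
print([round(a, 3) for a in adv])
print(clipped_objective(logp=-0.5, old_logp=-1.0, advantage=1.0))
```

A group whose completions all receive the same reward yields zero advantages, which is why reward functions that separate good and bad completions within a group matter for this method.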
3. Remote Training: Client-Server Architecture
When compute resources and service consumers are separated — enterprise training platforms, cloud Serverless training services — training capabilities need to be exposed as an API.
Twinkle supports two client integration modes:
Twinkle Client: API nearly identical to local training, suitable for scenarios requiring fine-grained control
Tinker Client: Compatible with the Tinker ecosystem, with a simpler calling style
The server maintains a single base model; multiple clients can train their own LoRA adapters in parallel.
3.1 Twinkle Client: Fine-Grained Control
Twinkle Client provides an API nearly identical to local training, ideal for scenarios that require fine-grained control over the training process.
import dotenv
dotenv.load_dotenv('.env')
from peft import LoraConfig
from twinkle import get_logger
from twinkle.dataset import DatasetMeta
from twinkle_client import init_twinkle_client
from twinkle_client.dataloader import DataLoader
from twinkle_client.dataset import Dataset
from twinkle_client.model import MultiLoraTransformersModel
logger = get_logger()
# Initialize the Twinkle client
client = init_twinkle_client(base_url='http://127.0.0.1:8000', api_key='EMPTY_TOKEN')
# Query existing training runs and checkpoints
runs = client.list_training_runs()
resume_path = None
for run in runs:
    logger.info(run.model_dump_json(indent=2))
    checkpoints = client.list_checkpoints(run.training_run_id)
    for checkpoint in checkpoints:
        logger.info(checkpoint.model_dump_json(indent=2))
        # Uncomment to resume from a specific checkpoint:
        # resume_path = checkpoint.twinkle_path


def train():
    # Prepare dataset
    dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(500)))
    dataset.set_template('Qwen3_5Template', model_id='ms://Qwen/Qwen3.5-4B', max_length=512)
    dataset.map('SelfCognitionProcessor', init_args={'model_name': 'twinkle model', 'model_author': 'ModelScope Community'})
    dataset.encode(batched=True)
    dataloader = DataLoader(dataset=dataset, batch_size=4)
    # Configure model
    model = MultiLoraTransformersModel(model_id='ms://Qwen/Qwen3.5-4B')
    lora_config = LoraConfig(target_modules='all-linear')
    model.add_adapter_to_model('default', lora_config, gradient_accumulation_steps=2)
    model.set_template('Qwen3_5Template')
    model.set_processor('InputProcessor', padding_side='right')
    model.set_loss('CrossEntropyLoss')
    model.set_optimizer('AdamW', lr=1e-4)
    model.set_lr_scheduler('LinearLR')
    # Resume from checkpoint if available
    if resume_path:
        logger.info(f'Resuming training from {resume_path}')
        model.load(resume_path, load_optimizer=True)
    logger.info(model.get_train_configs())
    for epoch in range(3):
        logger.info(f'Starting epoch {epoch}')
        for step, batch in enumerate(dataloader):
            # Forward + backward
            output = model.forward_backward(inputs=batch)
            if step % 2 == 0:
                logger.info(f'Current is step {step // 2}, loss: {output}')
            model.clip_grad_norm(1.0)
            model.step()
            model.zero_grad()
            model.lr_step()
        # Save checkpoint
        twinkle_path = model.save(name=f'twinkle-epoch-{epoch}', save_optimizer=True)
        logger.info(f'Saved checkpoint: {twinkle_path}')


if __name__ == '__main__':
    train()
Twinkle Client highlights:
API nearly identical to local training, so there is little extra to learn
Supports checkpoint management and resuming from a checkpoint
Dynamically swap LoRA adapters, loss functions, and optimizer components
3.2 Tinker Client: Simple and Ready to Use
Tinker is a lightweight training API. Twinkle provides full support for the Tinker client — a few lines of code is all it takes to start training. Existing Tinker-based projects can be migrated directly to a Twinkle server.
import os
from tinker import types
from tqdm import tqdm
from twinkle import init_tinker_client
from twinkle.dataloader import DataLoader
from twinkle.dataset import Dataset, DatasetMeta
from twinkle.preprocessor import SelfCognitionProcessor
from twinkle.server.common import input_feature_to_datum
# Initialize Tinker client (must be called before importing ServiceClient)
init_tinker_client()
from tinker import ServiceClient
# Base model
base_model = 'Qwen/Qwen3.5-4B'
base_url = 'https://www.modelscope.cn/twinkle'


def train():
    # Prepare dataset
    dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(500)))
    dataset.set_template('Qwen3_5Template', model_id=f'ms://{base_model}', max_length=256)
    dataset.map(SelfCognitionProcessor('Twinkle Model', 'ModelScope Team'), load_from_cache_file=False)
    dataset.encode(batched=True, load_from_cache_file=False)
    dataloader = DataLoader(dataset=dataset, batch_size=8)
    # Initialize training client
    service_client = ServiceClient(
        base_url=base_url,
        api_key=os.environ.get('MODELSCOPE_TOKEN')
    )
    training_client = service_client.create_lora_training_client(base_model=base_model, rank=16)
    # Training loop
    for epoch in range(3):
        print(f'Epoch {epoch}')
        for step, batch in tqdm(enumerate(dataloader)):
            # Convert input format
            input_datum = [input_feature_to_datum(input_feature) for input_feature in batch]
            # Remote forward + backward
            fwdbwd_future = training_client.forward_backward(input_datum, 'cross_entropy')
            # Remote optimizer step
            optim_future = training_client.optim_step(types.AdamParams(learning_rate=1e-4))
            # Wait for results
            fwdbwd_result = fwdbwd_future.result()
            optim_result = optim_future.result()
            print(f'Training Metrics: {optim_result}')
        # Save checkpoint
        save_future = training_client.save_state(f'twinkle-lora-{epoch}')
        save_result = save_future.result()
        print(f'Saved checkpoint to {save_result.path}')


if __name__ == '__main__':
    train()
Tinker Client highlights:
Minimal API surface, easy to get started
Fully compatible with the Tinker ecosystem — existing code migrates seamlessly
Supports ModelScope’s official training environment (see below)
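The Tinker example submits `forward_backward` and `optim_step` before waiting on either result. The sketch below imitates that pattern with the standard library only. The single worker thread stands in for a server that runs requests in submission order, which is an assumption about the service rather than something this guide states. The `fake_*` functions are placeholders, not Tinker APIs.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# One worker models a server that executes queued requests in order.
executor = ThreadPoolExecutor(max_workers=1)
log = []


def fake_forward_backward(step):
    time.sleep(0.01)  # pretend the remote pass takes a while
    log.append(f'fwdbwd-{step}')
    return {'loss': 1.0 / (step + 1)}


def fake_optim_step(step):
    log.append(f'optim-{step}')
    return {'step': step}


for step in range(2):
    # Queue both requests first, then block on the results.
    fwdbwd_future = executor.submit(fake_forward_backward, step)
    optim_future = executor.submit(fake_optim_step, step)
    fwdbwd_result = fwdbwd_future.result()
    optim_result = optim_future.result()

executor.shutdown()
print(log)
```

Queuing the optimizer step right away means the client does not spend a network round trip waiting between the two requests.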
3.3 ModelScope Official Training Environment
Alongside the open-source release of Twinkle, ModelScope provides a hosted model training service (Training as a Service, TaaS) powered by its own compute infrastructure. Developers can access Twinkle’s training capabilities for free via API, without provisioning any GPUs.
How to use:
Register a ModelScope account at modelscope.cn
Obtain your API Key on the Token Management page
Use the Tinker Client code above with the following endpoint:
base_url = 'https://www.modelscope.cn/twinkle'
base_model = 'Qwen/Qwen3.5-4B' # Model currently deployed in the official environment
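Because the hosted endpoint needs an API key, it can help to fail early when the key is missing instead of discovering it on the first remote call. The helper below is a hypothetical convenience, not part of Twinkle or Tinker. It reads the same `MODELSCOPE_TOKEN` environment variable used in the Tinker example.

```python
import os


def require_token(env_var: str = 'MODELSCOPE_TOKEN') -> str:
    """Hypothetical helper: return the API key or fail with a clear message."""
    token = os.environ.get(env_var, '').strip()
    if not token:
        raise RuntimeError(f'Set {env_var} to your ModelScope API key before training.')
    return token


# Demonstration with a throwaway variable so the sketch runs anywhere.
os.environ['DEMO_TOKEN'] = 'demo-key'
print(require_token('DEMO_TOKEN'))
```

In the training script the returned value would be passed as `api_key` when constructing `ServiceClient`.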
4. Choosing the Right Training Mode
| Scenario | Recommended Approach | Key Advantage |
|---|---|---|
| Local experimentation | Single GPU / torchrun | Code-as-config, high debugging efficiency |
| Large-scale distributed training | torchrun + FSDP2 / Ray | Native parallel performance, production-ready |
| Enterprise training platform | Twinkle Client + self-hosted server | Multi-tenant isolation, fine-grained control |
| Rapid prototyping | Tinker Client + ModelScope TaaS | Zero resource setup, instant access |
| Existing Tinker codebase | Tinker Client | Seamless migration, ecosystem compatibility |
Recommendations:
If you are an algorithm researcher who frequently iterates on the training pipeline, start with torchrun mode and consider moving to a service-based setup once experiments are validated.
If you are a platform engineer building an internal training service, deploy Twinkle Server and offer both Twinkle Client and Tinker Client based on your users’ preferences.
If you just want to try Twinkle quickly, use the ModelScope official environment — get your first training run done in 5 minutes.
Twinkle’s design philosophy is to give you the building blocks, not make the decisions for you. Whether you need maximum performance at scale or maximum convenience via API, there’s a solution that fits.