NCCLCheckpointEngine
A checkpoint engine that uses NCCL for high-speed weight transfer between GPUs.
Usage Example
from twinkle.checkpoint_engine import NCCLCheckpointEngine
# In training process (rank 0)
engine = NCCLCheckpointEngine(bucket_size=512<<20) # 512MB bucket
engine.is_master = True
engine.prepare()
engine.init_process_group(rank=0, world_size=5)
# Send weights
await engine.send_weights(model.named_parameters())
engine.finalize()
# In inference process (rank 1-4)
engine = NCCLCheckpointEngine(bucket_size=512<<20)
engine.prepare()
engine.init_process_group(rank=1, world_size=5, master_metadata=metadata)
# Receive weights
async for name, tensor in engine.receive_weights():
model.load_state_dict({name: tensor}, strict=False)
engine.finalize()
Features
High-Speed Transfer: Uses NCCL for GPU-to-GPU point-to-point high-speed transfer
Zero-Copy: Direct transfer between GPU memories without going through CPU
Bucketed Transfer: Supports bucketed transfer for large models
Configuration Parameters
bucket_size: Weight bucket size, controls the amount of data transferred each time. Larger buckets can improve transfer efficiency but consume more memory
timeout: Transfer timeout duration
NCCLCheckpointEngine is the recommended choice for GPU training, providing the highest transfer performance.