NCCLCheckpointEngine

A checkpoint engine that uses NCCL for high-speed weight transfer between GPUs.

Usage Example

from twinkle.checkpoint_engine import NCCLCheckpointEngine

# NOTE: `await` and `async for` below must run inside a coroutine (async def).

# In the training process (rank 0)
engine = NCCLCheckpointEngine(bucket_size=512 << 20)  # 512 MiB bucket
engine.is_master = True
engine.prepare()
engine.init_process_group(rank=0, world_size=5)

# Send weights
await engine.send_weights(model.named_parameters())
engine.finalize()

# In each inference process (ranks 1-4; rank 1 shown)
engine = NCCLCheckpointEngine(bucket_size=512 << 20)
engine.prepare()
# `metadata` is the master's connection metadata; obtaining it is not shown here
engine.init_process_group(rank=1, world_size=5, master_metadata=metadata)

# Receive weights one tensor at a time and load them into the model
async for name, tensor in engine.receive_weights():
    model.load_state_dict({name: tensor}, strict=False)
engine.finalize()
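
Because the example uses `await` and `async for`, both sides have to run inside coroutines. The following is a minimal sketch of that control flow using only the standard library. `StubEngine` is a hypothetical stand-in that passes weights through an in-process queue; it is not the real NCCLCheckpointEngine and only mirrors the `send_weights` / `receive_weights` method names shown above.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for illustration; NOT the twinkle implementation."""

    def __init__(self, queue):
        self._queue = queue  # an in-process queue replaces the NCCL transport

    async def send_weights(self, named_weights):
        for name, value in named_weights:
            await self._queue.put((name, value))
        await self._queue.put(None)  # sentinel: no more weights

    async def receive_weights(self):
        # Async generator: yields (name, value) pairs until the sentinel arrives
        while (item := await self._queue.get()) is not None:
            yield item


async def trainer(engine, weights):
    await engine.send_weights(weights.items())


async def inference(engine):
    received = {}
    async for name, value in engine.receive_weights():
        received[name] = value
    return received


async def main():
    queue = asyncio.Queue()
    weights = {"layer.weight": [1.0, 2.0], "layer.bias": [0.5]}
    # Run sender and receiver concurrently, as separate processes would
    _, received = await asyncio.gather(
        trainer(StubEngine(queue), weights),
        inference(StubEngine(queue)),
    )
    return received


result = asyncio.run(main())
```

In a real deployment the two coroutines would live in different processes, and the queue would be replaced by the engine's own transport.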

Features

  • High-Speed Transfer: Uses NCCL for point-to-point GPU-to-GPU transfer

  • Zero-Copy: Transfers data directly between GPU memories without staging through the CPU

  • Bucketed Transfer: Sends weights in buckets of at most bucket_size, so large models do not have to be transferred in a single piece
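
To make the idea of bucketed transfer concrete, here is a minimal sketch of greedy bucketing over (name, size in bytes) pairs. The function `pack_into_buckets` and the sample sizes are hypothetical and chosen for illustration; the engine's actual packing strategy is not documented here and may differ.

```python
def pack_into_buckets(named_sizes, bucket_size):
    """Greedily group (name, nbytes) pairs into buckets of at most bucket_size bytes.

    A tensor larger than bucket_size ends up in a bucket of its own.
    """
    buckets, current, used = [], [], 0
    for name, nbytes in named_sizes:
        # Flush the current bucket if this tensor would overflow it
        if current and used + nbytes > bucket_size:
            buckets.append(current)
            current, used = [], 0
        current.append(name)
        used += nbytes
    if current:
        buckets.append(current)
    return buckets


# Toy sizes in bytes (hypothetical), with a 512-byte bucket
sizes = [("embed", 300), ("attn.q", 200), ("attn.k", 200), ("mlp", 600), ("head", 100)]
buckets = pack_into_buckets(sizes, bucket_size=512)
# buckets == [["embed", "attn.q"], ["attn.k"], ["mlp"], ["head"]]
```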

Configuration Parameters

  • bucket_size: Weight bucket size, apparently in bytes (the usage example passes 512<<20, i.e. 512 MiB). It controls how much data is transferred at a time. Larger buckets can improve transfer efficiency but consume more memory

  • timeout: Transfer timeout duration
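
As a rough guide to choosing bucket_size, the sketch below estimates how many buckets a full weight sync needs. It assumes weights pack perfectly into buckets, so the result is a lower bound; the helper `num_buckets` and the 7-billion-parameter, 2-bytes-per-parameter model are hypothetical examples, not part of the library.

```python
import math


def num_buckets(num_params, bytes_per_param, bucket_size):
    """Lower bound on the number of buckets, assuming perfect packing."""
    total_bytes = num_params * bytes_per_param
    return math.ceil(total_bytes / bucket_size)


# Hypothetical model: 7e9 parameters stored in a 2-byte dtype (14 GB of weights)
large = num_buckets(7_000_000_000, 2, 512 << 20)  # 512 MiB buckets -> 27
small = num_buckets(7_000_000_000, 2, 64 << 20)   # 64 MiB buckets  -> 209
```

Under these assumptions, shrinking the bucket from 512 MiB to 64 MiB cuts buffer memory by a factor of eight but raises the number of transfer rounds from 27 to 209.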

NCCLCheckpointEngine is the recommended choice for GPU training, providing the highest transfer performance.