CheckpointEngine

CheckpointEngine synchronizes model weights between trainer and inference processes. Its main use is in RLHF training, where it keeps the weights of Actor models and Rollout samplers in sync.

Basic Interface

from abc import ABC, abstractmethod
from typing import Any, AsyncGenerator

class CheckpointEngine(ABC):
    """Checkpoint engine base class

    The checkpoint engine handles weight synchronization between trainer and inference processes.
    """

    @abstractmethod
    def prepare(self) -> dict[str, Any]:
        """Prepare for weight synchronization"""
        ...

    @abstractmethod
    def init_process_group(self, rank: int, world_size: int, **kwargs):
        """Initialize process group"""
        ...

    @abstractmethod
    async def send_weights(self, weight_generator):
        """Send weights (called in trainer process)"""
        ...

    @abstractmethod
    def receive_weights(self) -> AsyncGenerator:
        """Receive weights (called in inference process)"""
        ...

    @abstractmethod
    def finalize(self):
        """Clean up resources"""
        ...
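
To make the call order of the interface concrete, here is a minimal sketch of a toy engine that mirrors the five method names but moves weights through an asyncio.Queue inside a single process. InMemoryCheckpointEngine and sync_once are hypothetical names invented for this illustration; they are not part of Twinkle, and the real engines transfer tensors between devices rather than Python objects through a queue.

```python
import asyncio
from typing import Any


class InMemoryCheckpointEngine:
    """Hypothetical toy engine: same method names as CheckpointEngine,
    but weights travel over an in-process asyncio.Queue."""

    _DONE = object()  # sentinel marking the end of one synchronization round

    def __init__(self) -> None:
        self._queue = None
        self.rank = None
        self.world_size = None

    def prepare(self) -> dict[str, Any]:
        # Allocate the transport and return metadata describing it.
        self._queue = asyncio.Queue()
        return {"transport": "in-memory"}

    def init_process_group(self, rank: int, world_size: int, **kwargs):
        # A real engine would set up a communicator here; the toy only records the values.
        self.rank, self.world_size = rank, world_size

    async def send_weights(self, weight_generator):
        # Trainer side: push every (name, weight) pair, then the sentinel.
        for name, weight in weight_generator:
            await self._queue.put((name, weight))
        await self._queue.put(self._DONE)

    async def receive_weights(self):
        # Inference side: yield pairs until the sentinel arrives.
        while True:
            item = await self._queue.get()
            if item is self._DONE:
                return
            yield item

    def finalize(self):
        self._queue = None


async def sync_once(weights: dict) -> dict:
    engine = InMemoryCheckpointEngine()
    engine.prepare()
    engine.init_process_group(rank=0, world_size=1)

    async def consume():
        return {name: w async for name, w in engine.receive_weights()}

    # Sender and receiver run concurrently, as trainer and sampler would.
    _, received = await asyncio.gather(
        engine.send_weights(iter(weights.items())), consume()
    )
    engine.finalize()
    return received
```

Calling asyncio.run(sync_once({...})) returns the same mapping that was sent, which shows the intended order: prepare, init_process_group, concurrent send_weights and receive_weights, then finalize.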

Available Checkpoint Engines

Twinkle provides two checkpoint engine implementations:

NCCLCheckpointEngine

A checkpoint engine that uses NCCL for high-speed weight transfer between GPUs.

  • High-Speed Transfer: Uses NCCL point-to-point communication to transfer weights from GPU to GPU

  • Zero-Copy: Direct transfer between GPU memories without going through CPU

  • Bucketed Transfer: Supports bucketed transfer for large models

See: NCCLCheckpointEngine
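
The idea behind bucketed transfer can be sketched without any GPU code: named weights are grouped into buckets that stay under a byte budget, so a large model is sent as a sequence of bounded chunks. The helper bucket_weights and its bucket_bytes parameter are hypothetical, and bytes objects stand in for tensors; this is an illustration of the general technique, not NCCLCheckpointEngine's actual bucketing logic.

```python
from typing import Iterable, Iterator, List, Tuple

NamedWeight = Tuple[str, bytes]  # bytes stand in for a tensor's storage


def bucket_weights(
    weights: Iterable[NamedWeight], bucket_bytes: int
) -> Iterator[List[NamedWeight]]:
    """Group weights into buckets of at most bucket_bytes each.

    A single weight larger than the budget gets a bucket of its own.
    """
    bucket: List[NamedWeight] = []
    used = 0
    for name, payload in weights:
        # Flush the current bucket if this weight would overflow it.
        if bucket and used + len(payload) > bucket_bytes:
            yield bucket
            bucket, used = [], 0
        bucket.append((name, payload))
        used += len(payload)
    if bucket:
        yield bucket
```

Because the function is a generator, only one bucket needs to be materialized at a time, which is the property that matters when the full set of weights does not fit in a single transfer buffer.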

HCCLCheckpointEngine

A checkpoint engine that uses HCCL for weight transfer between Ascend NPUs.

  • NPU Optimized: Weight transfer optimized specifically for Ascend NPUs

  • Efficient Communication: Uses HCCL for high-speed communication between NPUs

  • Compatible Interface: Maintains consistent interface with NCCLCheckpointEngine

See: HCCLCheckpointEngine

How to Choose

  • NCCLCheckpointEngine: Suitable for GPU environments, provides the highest transfer performance

  • HCCLCheckpointEngine: Suitable for Ascend NPU environments
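
The selection rule above can be written as a small lookup. This is a sketch only: choose_checkpoint_engine is a hypothetical helper, and the device-type strings "cuda" and "npu" are assumptions chosen for the example rather than values taken from Twinkle.

```python
def choose_checkpoint_engine(device_type: str) -> str:
    """Return the name of the checkpoint engine suited to a device type.

    The keys "cuda" and "npu" are assumed labels for GPU and Ascend NPU
    environments; adapt them to however your setup reports its device.
    """
    engines = {
        "cuda": "NCCLCheckpointEngine",  # GPU environments
        "npu": "HCCLCheckpointEngine",   # Ascend NPU environments
    }
    try:
        return engines[device_type.lower()]
    except KeyError:
        raise ValueError(f"no checkpoint engine for device type {device_type!r}") from None
```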

The checkpoint engine is a key component of RLHF training infrastructure: it ensures that trainers and samplers use consistent model weights.

Synchronization currently takes one of two forms, selected by merge_and_sync. When merge_and_sync=True, the LoRA weights are merged into the base model and the merged weights are synchronized. When merge_and_sync=False, only the LoRA weights are synchronized. In multi-tenant scenarios, LoRA files are attached directly to vLLM.

The vLLM startup parameter enable_lora must match the mode: set enable_lora=True when merge_and_sync=False or in multi-tenant mode, and set enable_lora=False when merge_and_sync=True or when using full parameters.
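
The relationship between the synchronization mode and enable_lora can be summarized as a small decision function. This is a sketch: vllm_enable_lora and the flags multi_tenant and full_params are hypothetical names introduced for the example, and the precedence between the flags (full parameters first, then multi-tenant, then merge_and_sync) is an assumption, since the text does not say how they combine.

```python
def vllm_enable_lora(
    merge_and_sync: bool,
    multi_tenant: bool = False,
    full_params: bool = False,
) -> bool:
    """Return the value to pass as vLLM's enable_lora startup parameter."""
    if full_params:
        # Full-parameter training has no LoRA adapter for vLLM to load.
        return False
    if multi_tenant:
        # LoRA files are attached directly to vLLM (assumed to take precedence
        # over merge_and_sync).
        return True
    # merge_and_sync=True sends merged weights, so vLLM needs no LoRA support;
    # merge_and_sync=False sends only LoRA weights, so it does.
    return not merge_and_sync
```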