Expert Parallel (EP)

Expert Parallel (EP) distributes the experts of a Mixture-of-Experts (MoE) model across multiple GPUs so that each rank holds only a subset of them. This reduces per-GPU memory use and enables training of large MoE models.

Overview

Concept                   Description
ExpertParallelConfig      Configuration dataclass controlling EP behavior
apply_expert_parallel()   Entry point that shards experts and patches forward
shard_experts()           Evenly splits experts across EP ranks
patch_forward()           Replaces MoE block forward with EP-aware all-to-all communication
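
To make "evenly splits experts across EP ranks" concrete, here is a minimal sketch of how a rank could work out which experts it owns. The helper name local_expert_ids is hypothetical and is not part of the library. The contiguous block layout is also an assumption; the description above only states that the split is even.

```python
def local_expert_ids(num_experts: int, ep_size: int, ep_rank: int) -> range:
    """Return the global expert ids owned by one EP rank.

    Illustrative only: assumes a contiguous block layout, which the
    library may or may not use.
    """
    if ep_size <= 0:
        raise ValueError("ep_size must be positive")
    if not 0 <= ep_rank < ep_size:
        raise ValueError("ep_rank must be in [0, ep_size)")
    if num_experts % ep_size != 0:
        # An even split is only possible when ep_size divides num_experts.
        raise ValueError("num_experts must be divisible by ep_size")
    per_rank = num_experts // ep_size
    start = ep_rank * per_rank
    return range(start, start + per_rank)


if __name__ == "__main__":
    # Show the ownership of 8 experts under 2-way EP.
    for rank in range(2):
        print(rank, list(local_expert_ids(8, 2, rank)))
```

Under this layout, the rank that owns a given expert can be recovered as expert_id // (num_experts // ep_size), which is what token routing across ranks would rely on.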

Configuration

from twinkle.model.transformers.moe.expert_parallel import ExpertParallelConfig

config = ExpertParallelConfig(
    enabled=True,              # Enable expert parallel
    router_dtype='fp32',       # Router computation dtype: 'fp32', 'bf16', 'fp16'
    keep_router_logits=True,   # Return router logits alongside hidden states
    ignore_shared_experts=False, # Skip shared expert computation (e.g. DeepSeek)
    ep_size=None,              # EP world size (consumed by TransformersModel)
)

Usage with DeviceMesh

EP is activated by setting ep_size in DeviceMesh.from_sizes(). The framework automatically calls apply_expert_parallel() during model initialization.

from twinkle.utils import DeviceMesh

# 8 GPUs: 2-way EP × 4-way data parallel
device_mesh = DeviceMesh.from_sizes(
    world_size=8,
    dp_size=4,
    ep_size=2,
)

For combined EP + FSDP sharding on the expert parameters:

# 8 GPUs: 2-way EP with FSDP within each EP group
device_mesh = DeviceMesh.from_sizes(
    world_size=8,
    dp_size=2,
    ep_size=2,
    ep_fsdp_size=2,
)

Communication Pattern

The EP forward pass follows a 4-stage pipeline:

  1. Preprocess — compute per-expert token counts and split sizes

  2. Token Pre-All2All — permute tokens by expert assignment, then all-to-all exchange across EP ranks

  3. Expert Compute — each rank runs its local experts on received tokens

  4. Token Post-All2All — all-to-all exchange results back, unpermute and apply routing weights

Input tokens → Router → [preprocess] → [pre_all2all] → [local experts] → [post_all2all] → Output
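
The four stages can be traced in a single process with plain Python lists standing in for tensors and for the all-to-all collective. This is a sketch under simplifying assumptions, not the library's implementation: routing is top-1, tokens are scalars, experts are assigned to ranks in contiguous blocks, and each token carries its expert id through the exchange. The function simulate_ep_forward and all of its parameters are hypothetical names.

```python
def simulate_ep_forward(tokens, experts_of, weights, num_experts, expert_fns):
    """Single-process simulation of the EP forward pass (top-1 routing).

    tokens[r][i]     : token i held by rank r (a float in this sketch)
    experts_of[r][i] : global expert id chosen by the router for that token
    weights[r][i]    : routing weight for that token
    expert_fns[e]    : callable implementing expert e
    Returns (split_sizes, outputs) where outputs[r][i] matches tokens[r][i].
    """
    ep_size = len(tokens)
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    per_rank = num_experts // ep_size

    # 1. Preprocess: per-expert token counts, then tokens per destination rank.
    split_sizes = []
    for r in range(ep_size):
        counts = [0] * num_experts
        for e in experts_of[r]:
            counts[e] += 1
        split_sizes.append([sum(counts[d * per_rank:(d + 1) * per_rank])
                            for d in range(ep_size)])

    # 2. Pre-all2all: permute tokens by expert id, slice into send buffers,
    #    then exchange (rank d receives send[r][d] from every rank r).
    order, send = [], []
    for r in range(ep_size):
        perm = sorted(range(len(tokens[r])), key=lambda i: experts_of[r][i])
        bufs, pos = [], 0
        for d in range(ep_size):
            idx = perm[pos:pos + split_sizes[r][d]]
            bufs.append([(experts_of[r][i], tokens[r][i]) for i in idx])
            pos += split_sizes[r][d]
        order.append(perm)
        send.append(bufs)
    recv = [[send[r][d] for r in range(ep_size)] for d in range(ep_size)]

    # 3. Expert compute: each rank applies its local experts to received tokens.
    computed = [[[expert_fns[e](x) for e, x in recv[d][r]]
                 for r in range(ep_size)] for d in range(ep_size)]

    # 4. Post-all2all: return results to the source rank, undo the
    #    permutation, and scale by the routing weight.
    outputs = []
    for r in range(ep_size):
        returned = [y for d in range(ep_size) for y in computed[d][r]]
        result = [0.0] * len(tokens[r])
        for y, i in zip(returned, order[r]):
            result[i] = weights[r][i] * y
        outputs.append(result)
    return split_sizes, outputs
```

Because the permutation sorts tokens by expert id and experts are laid out in rank order, the results coming back in stage 4 line up with the same permutation, so inverting it restores the original token order. A real implementation would exchange tensors with a collective such as torch.distributed.all_to_all_single and would typically communicate per-expert counts instead of tagging each token.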

Requirements

  • num_experts must be divisible by ep_size

  • torch.distributed must be initialized

  • MoE blocks must define a gate/router module and experts. Both nn.ModuleList-style experts and tensor-style (fused) experts with gate_up_proj/down_proj are supported

  • Shared experts (e.g. DeepSeek MoE) are handled automatically unless ignore_shared_experts=True