Expert Parallel (EP)
Expert Parallel (EP) distributes the experts of a Mixture-of-Experts (MoE) model across multiple GPUs, so each rank holds only a subset of the experts. This reduces per-GPU memory and enables training of larger MoE models.
Overview
| Concept | Description |
|---|---|
| ExpertParallelConfig | Configuration dataclass controlling EP behavior |
| apply_expert_parallel() | Entry point that shards experts and patches forward |
| shard_experts() | Evenly splits experts across EP ranks |
| patch_forward() | Replaces MoE block forward with EP-aware all-to-all communication |
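As a rough illustration of what "evenly splits experts across EP ranks" could look like, the sketch below assigns each rank a contiguous block of expert indices. The contiguous layout and the helper name `local_expert_range` are assumptions made for illustration. They are not taken from the `shard_experts()` implementation.

```python
def local_expert_range(num_experts: int, ep_size: int, ep_rank: int) -> range:
    """Return the global expert indices one EP rank would own.

    Hypothetical helper: assumes an even split into contiguous blocks.
    """
    if ep_size <= 0:
        raise ValueError("ep_size must be positive")
    if not 0 <= ep_rank < ep_size:
        raise ValueError(f"ep_rank {ep_rank} is outside [0, {ep_size})")
    if num_experts % ep_size != 0:
        raise ValueError(
            f"num_experts ({num_experts}) must be divisible by ep_size ({ep_size})"
        )
    per_rank = num_experts // ep_size
    start = ep_rank * per_rank
    return range(start, start + per_rank)


# 8 experts over 4 EP ranks: each rank owns 2 experts.
layout = [list(local_expert_range(8, 4, rank)) for rank in range(4)]
# layout == [[0, 1], [2, 3], [4, 5], [6, 7]]
```

With this layout, the rank that owns expert `e` is simply `e // (num_experts // ep_size)`.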
Configuration
from twinkle.model.transformers.moe.expert_parallel import ExpertParallelConfig

config = ExpertParallelConfig(
    enabled=True,                 # Enable expert parallel
    router_dtype='fp32',          # Router computation dtype: 'fp32', 'bf16', 'fp16'
    keep_router_logits=True,      # Return router logits alongside hidden states
    ignore_shared_experts=False,  # Skip shared expert computation (e.g. DeepSeek)
    ep_size=None,                 # EP world size (consumed by TransformersModel)
)
Usage with DeviceMesh
EP is activated by setting ep_size in DeviceMesh.from_sizes(). The framework automatically calls apply_expert_parallel() during model initialization.
from twinkle.utils import DeviceMesh

# 8 GPUs: 2-way EP × 4-way data parallel
device_mesh = DeviceMesh.from_sizes(
    world_size=8,
    dp_size=4,
    ep_size=2,
)
For combined EP + FSDP sharding on the expert parameters:
# 8 GPUs: 2-way EP with FSDP within each EP group
device_mesh = DeviceMesh.from_sizes(
    world_size=8,
    dp_size=2,
    ep_size=2,
    ep_fsdp_size=2,
)
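In both meshes above, the sizes passed to `DeviceMesh.from_sizes()` multiply to `world_size` (4 × 2 = 8 and 2 × 2 × 2 = 8). The source does not state this as a rule, so the helper below is only a sanity check under that assumption. `mesh_sizes_match` is a hypothetical name and is not part of twinkle.

```python
import math
from typing import Optional


def mesh_sizes_match(
    world_size: int,
    dp_size: int,
    ep_size: Optional[int] = None,
    ep_fsdp_size: Optional[int] = None,
) -> bool:
    """Check that the given sizes multiply to world_size.

    Assumption (not confirmed by the docs): unset sizes count as 1 and
    the mesh dimensions are multiplicative.
    """
    sizes = [dp_size, ep_size or 1, ep_fsdp_size or 1]
    return math.prod(sizes) == world_size


first_ok = mesh_sizes_match(world_size=8, dp_size=4, ep_size=2)
second_ok = mesh_sizes_match(world_size=8, dp_size=2, ep_size=2, ep_fsdp_size=2)
```

Running a check like this before launching a job can surface a mismatched configuration early, but the framework's own validation remains authoritative.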
Communication Pattern
The EP forward pass follows a 4-stage pipeline:
Preprocess — compute per-expert token counts and split sizes
Token Pre-All2All — permute tokens by expert assignment, then all-to-all exchange across EP ranks
Expert Compute — each rank runs its local experts on received tokens
Token Post-All2All — all-to-all exchange results back, unpermute and apply routing weights
Input tokens → Router → [preprocess] → [pre_all2all] → [local experts] → [post_all2all] → Output
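The four stages can be traced in a single-process simulation. The sketch below is not the twinkle implementation: it assumes top-1 routing, scalar "tokens", contiguous expert ownership per rank, and replaces the real collective with a list transpose. All names (`all_to_all`, `ep_forward`) are hypothetical.

```python
def all_to_all(send):
    """Simulated all-to-all: recv[dst][src] is the chunk rank src sent to rank dst."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]


def ep_forward(tokens, expert_ids, weights, experts):
    """tokens[r], expert_ids[r], weights[r] are per-rank lists (top-1 routing).

    experts holds one callable per global expert; rank r is assumed to own
    experts [r * per_rank, (r + 1) * per_rank).
    """
    ep_size = len(tokens)
    num_experts = len(experts)
    per_rank = num_experts // ep_size  # assumes num_experts % ep_size == 0

    send, perms = [], []
    for r in range(ep_size):
        # Stage 1 (preprocess): per-expert token counts, then per-rank split sizes.
        counts = [expert_ids[r].count(e) for e in range(num_experts)]
        splits = [sum(counts[d * per_rank:(d + 1) * per_rank]) for d in range(ep_size)]

        # Stage 2a: permute tokens so they are grouped by expert (stable sort).
        perm = sorted(range(len(tokens[r])), key=lambda i: expert_ids[r][i])
        permuted = [(expert_ids[r][i], tokens[r][i]) for i in perm]
        perms.append(perm)

        # Cut the permuted tokens into one chunk per destination rank.
        chunks, start = [], 0
        for size in splits:
            chunks.append(permuted[start:start + size])
            start += size
        send.append(chunks)

    # Stage 2b: exchange chunks across EP ranks.
    recv = all_to_all(send)

    # Stage 3: each rank runs its local experts on the tokens it received.
    results = [
        [[experts[e](x) for e, x in chunk] for chunk in recv[r]]
        for r in range(ep_size)
    ]

    # Stage 4: send results back, undo the permutation, apply routing weights.
    returned = all_to_all(results)
    outputs = []
    for r in range(ep_size):
        flat = [y for chunk in returned[r] for y in chunk]  # still in permuted order
        out = [0.0] * len(tokens[r])
        for pos, i in enumerate(perms[r]):
            out[i] = weights[r][i] * flat[pos]
        outputs.append(out)
    return outputs


# Toy experts: expert e multiplies its input by (e + 1).
experts = [lambda x, scale=e + 1: x * scale for e in range(4)]
outputs = ep_forward(
    tokens=[[1.0, 2.0, 3.0], [4.0, 5.0]],
    expert_ids=[[3, 0, 2], [1, 3]],
    weights=[[0.5, 1.0, 1.0], [1.0, 0.5]],
    experts=experts,
)
# outputs == [[2.0, 2.0, 9.0], [8.0, 10.0]]
```

Because tokens are sorted by expert index and experts are owned in contiguous blocks, the permuted order is already grouped by destination rank, which is why the results can be unpermuted with the same index list after the return exchange.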
Requirements
num_experts must be divisible by ep_size
torch.distributed must be initialized
MoE blocks must define a gate/router module and experts (either nn.ModuleList or tensor-style gate_up_proj/down_proj)
Both ModuleList-style and tensor-style (fused) experts are supported
Shared experts (e.g. DeepSeek MoE) are handled automatically unless ignore_shared_experts=True
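The structural requirements above can be checked before enabling EP. The following is a minimal sketch that uses duck typing only. The function name `ep_requirement_problems` is hypothetical, and the attribute names `gate`, `router`, and `experts` are assumed from the wording of the requirements rather than from the library's own checks.

```python
from types import SimpleNamespace
from typing import List


def ep_requirement_problems(block: object, num_experts: int, ep_size: int) -> List[str]:
    """Return a list of human-readable problems; an empty list means none were found.

    Hypothetical pre-flight check, not part of twinkle.
    """
    problems = []
    if ep_size <= 0:
        problems.append("ep_size must be positive")
    elif num_experts % ep_size != 0:
        problems.append(
            f"num_experts ({num_experts}) is not divisible by ep_size ({ep_size})"
        )
    # Assumed attribute names: a router under `gate` or `router`, experts under `experts`.
    if not (hasattr(block, "gate") or hasattr(block, "router")):
        problems.append("MoE block defines neither a gate nor a router module")
    if not hasattr(block, "experts"):
        problems.append("MoE block defines no experts attribute")
    return problems


# Stand-in object for a MoE block with a gate and 8 experts.
fake_block = SimpleNamespace(gate=object(), experts=[object() for _ in range(8)])
issues = ep_requirement_problems(fake_block, num_experts=8, ep_size=2)
# issues == []
```

The process-group requirement is not covered here. In a real job it can be checked with `torch.distributed.is_initialized()`.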