MegatronModel
This class wraps a Megatron LLM and can launch the model with any combination of TP/DP/CP/PP/EP parallelism.
Note: VPP support currently has known issues. Do not configure or use it for now.
class MegatronModel:

    def __init__(
        self,
        model_id: str,
        config: Optional[PretrainedConfig] = None,
        device_mesh: Optional[DeviceMesh] = None,
        mixed_precision: Literal['no', 'fp16', 'bf16'] = 'bf16',
        **kwargs,
    ):
        ...

    ...
model_id: Model id
config: Configuration for starting the model
device_mesh: DeviceMesh information
mixed_precision: Mixed precision mode, default bf16. Keeping the default is recommended if your GPUs are 30 series or newer.
kwargs: All Megatron initialization parameters, i.e., TransformersConfig configurations, can be passed in through kwargs.
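The mixed_precision recommendation above can be sketched as a small selection rule. This is only an illustration: pick_mixed_precision is a hypothetical helper, not part of twinkle, and it assumes bf16 is usable from CUDA compute capability 8.x onward (the Ampere generation, which includes the RTX 30 series), with fp16 as the fallback.

```python
def pick_mixed_precision(compute_capability_major: int) -> str:
    """Hypothetical helper: choose a value for the `mixed_precision` argument.

    Assumption: GPUs with compute capability 8.x or higher (Ampere and
    newer, including the RTX 30 series) support bf16; older GPUs use fp16.
    """
    return 'bf16' if compute_capability_major >= 8 else 'fp16'


print(pick_mixed_precision(8))  # Ampere-class GPU
print(pick_mixed_precision(7))  # older GPU, e.g. Volta or Turing
```

With PyTorch installed, the major version could be read from torch.cuda.get_device_capability(), which returns a (major, minor) tuple.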
MegatronModel supports the @remote_class annotation and supports device_mesh, which means it can run in Ray workers.
Usage example:
from twinkle.model import MegatronModel
from twinkle import DeviceMesh
from twinkle.dataloader import DataLoader

dataloader = DataLoader(...)
model = MegatronModel(
    model_id='ms://Qwen/Qwen3.5-4B',
    device_mesh=DeviceMesh.from_sizes(dp_size=2, tp_size=2, pp_size=2),
    remote_group='actor',
)
model.add_adapter_to_model(...)
model.set_optimizer('default', adapter_name='...')
for data in dataloader:
    model.forward_backward(...)
    model.clip_grad_and_step(..., gradient_accumulation_steps=16)
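To get a feel for what gradient_accumulation_steps=16 means with the mesh in the example, the following sketch estimates how many samples contribute to one optimizer update. It assumes the optimizer steps once every 16 forward_backward calls and that only data-parallel replicas see different samples (TP and PP ranks cooperate on the same batch). The helper name and the micro-batch size of 4 are hypothetical and not taken from twinkle.

```python
def samples_per_optimizer_step(
    micro_batch_size: int,
    dp_size: int,
    gradient_accumulation_steps: int,
) -> int:
    """Hypothetical helper: samples that contribute to one optimizer update.

    Each data-parallel replica processes `micro_batch_size` samples per
    forward/backward pass, and gradients are accumulated over
    `gradient_accumulation_steps` passes before the optimizer steps.
    """
    return micro_batch_size * dp_size * gradient_accumulation_steps


# dp_size=2 and 16 accumulation steps as in the usage example;
# the micro-batch size of 4 is an assumed value for illustration.
print(samples_per_optimizer_step(4, dp_size=2, gradient_accumulation_steps=16))  # 128
```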
Note:
Megatron models do not support the original AdamW optimizer. Only MegatronDistributedOptimizer is supported; pass MegatronDistributedOptimizer or default to use it.
Megatron models do not support other lr_schedulers. Only OptimizerParamScheduler is supported; pass OptimizerParamScheduler or default to use it.
You need to pass the tp/cp/dp/ep/pp/sequence_parallel configuration through the device_mesh parameter so that twinkle can manage data distribution. device_mesh forwards these parameters to the Megatron initialization process.
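Because the parallel sizes all travel through device_mesh, it can help to check up front how many processes a given layout needs. The sketch below multiplies the dp/tp/pp/cp sizes, which is the usual rule for Megatron-style layouts. required_world_size is a hypothetical helper, not a twinkle API, and EP is left out on the assumption that expert parallelism reuses ranks from the other dimensions rather than adding a factor of its own.

```python
from math import prod


def required_world_size(
    dp_size: int = 1,
    tp_size: int = 1,
    pp_size: int = 1,
    cp_size: int = 1,
) -> int:
    """Hypothetical helper: number of processes a parallel layout needs.

    Assumption: the world size is the product of the data, tensor,
    pipeline and context parallel sizes.
    """
    sizes = {'dp': dp_size, 'tp': tp_size, 'pp': pp_size, 'cp': cp_size}
    for name, size in sizes.items():
        if size < 1:
            raise ValueError(f'{name}_size must be >= 1, got {size}')
    return prod(sizes.values())


# The mesh from the usage example: dp_size=2, tp_size=2, pp_size=2.
print(required_world_size(dp_size=2, tp_size=2, pp_size=2))  # 8
```

Under these assumptions, the layout in the usage example needs 8 processes, typically one per GPU.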