Megatron-LM Backend
Last updated: 12/01/2025.
We support Megatron Backend by implementing various workers for actor,
critic, reference, rollout and reward models. We also implement the
3DHybridEngine using Megatron-LM and vLLM/SGLang in
megatron_vllm.py
and megatron_sglang.py.
Pros
Support 5D parallelism (TP, EP, CP, DP, PP) and sequence parallelism for the best scalability and throughput.
3D HybridEngine can significantly reduce peak memory usage and reduce weight synchronize overhead between actor and rollout.
Cons
Hugging Face models and Megatron checkpoints require conversion tools.
Development Progress
Note that [Deprecated] means the feature is not supported in the latest version of verl. [To-Optimize] means the feature is implemented but not yet optimized. [WIP] means the feature is under development. [In-Release] means the feature is ready, under review, and may be released at any time.
Utils of Megatron Workers
MegatronWorker
MegatronWorker is the base class of the different Megatron worker
classes. In this class, the get_megatron_global_info and
get_megatron_rank_info functions retrieve the 3D parallel world
size and rank of each Worker running on a specific GPU. This information
is used in the transfer protocols of the Megatron Backend.
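The exact structure of this rank info lives in verl's Megatron worker code; as a rough, illustrative sketch, the same 3D parallel information can be read from Megatron-Core's parallel_state module (commonly aliased as mpu), assuming the model-parallel state has already been initialized:
# Illustrative sketch only: reading the 3D parallel layout a worker would
# report for its GPU. Requires Megatron-Core's parallel state to be initialized.
from megatron.core import parallel_state as mpu

def get_3d_parallel_info():
    return {
        # world sizes of each parallel dimension
        'tp_size': mpu.get_tensor_model_parallel_world_size(),
        'pp_size': mpu.get_pipeline_model_parallel_world_size(),
        'dp_size': mpu.get_data_parallel_world_size(),
        # this worker's rank inside each dimension
        'tp_rank': mpu.get_tensor_model_parallel_rank(),
        'pp_rank': mpu.get_pipeline_model_parallel_rank(),
        'dp_rank': mpu.get_data_parallel_rank(),
    }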
The following Worker classes for the different models are used to
construct the WorkerGroup.
We implement various APIs for each Worker class, decorated with
@register(dispatch_mode=). These APIs can be called by the Ray
driver process. The data is correctly collected and dispatched according
to the dispatch_mode of each function. The supported dispatch_mode values
(i.e., transfer protocols) can be found in decorator.py.
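As an illustrative sketch (import paths and names assumed from verl's layout, not copied from its source), a registered worker API and its driver-side call look roughly like this:
# Sketch of a registered Worker API. In verl, the class would subclass
# MegatronWorker; here it is a plain class to keep the example minimal.
from verl import DataProto
from verl.single_controller.base.decorator import register, Dispatch

class MyWorker:
    @register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
    def compute_something(self, data: DataProto) -> DataProto:
        # executed on every GPU of the worker group; the chunk each rank
        # receives, and how outputs are collected, follow dispatch_mode
        return data

# On the Ray driver process, given a constructed worker group:
# output = worker_group.compute_something(data)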
ActorRolloutRefWorker
This class is implemented for Actor/Rollout HybridEngine or for the reference model to initialize their model and perform computation.
Actor/Rollout HybridEngine
HybridEngine, Actor and Rollout initialization API.
@register(dispatch_mode=Dispatch.ONE_TO_ALL)
def init_model(self):
ONE_TO_ALL: when the init_model function is called from the driver
process, each worker (on a GPU) executes the following model
initialization process.
The initialization details of HybridEngine, Actor and Rollout are highlighted below:
MegatronPPOActor: implements the PPO computation logic when the model is built with Megatron, including computing the log prob and performing model updates.
vLLMRollout: supports generation with vLLM. We modify the vLLM Engine and make it execute under SPMD to fit into our WorkerGroup design.
See source code for more information.
# build actor model
self.actor = MegatronPPOActor(config=self.config.actor,
                              model_config=self.actor_model_config,
                              megatron_config=megatron_config,
                              actor_module=self.actor_module,
                              actor_optimizer=self.actor_optimizer,
                              actor_optimizer_config=self.actor_optim_config)

# build rollout
# rollout initialization
rollout = vLLMRollout(actor_module=params,
                      config=self.config.rollout,
                      tokenizer=self.tokenizer,
                      model_hf_config=self.actor_model_config,
                      train_tp=mpu.get_tensor_model_parallel_world_size())
...
Generate sequence and recompute log prob
@register(dispatch_mode=Dispatch.MEGATRON_PP_AS_DP_PROTO)
def generate_sequences(self, prompts: DataProto):
Dispatch.MEGATRON_PP_AS_DP_PROTO: the PP dimension of the actor model is regarded as the DP dimension, and the driver process dispatches and collects data according to this reorganization. This is because, in the HybridEngine, the actor weights, which usually use larger 3D parallel sizes, are gathered along the PP and TP dimensions. Therefore, the corresponding data should be dispatched and collected through the 3D parallel group of the rollout model, rather than that of the actor model. However, the world_size and rank information can only be retrieved from get_megatron_global_info and get_megatron_rank_info, which record the 3D information of the actor model. Moreover, the data resharding inside the TP dimension is handled within the HybridEngine.
In this function, the rollout model performs auto-regressive generation and the actor model recomputes the old log prob for the generated response.
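From the driver's point of view, the call might look like the sketch below (worker-group name, DataProto field names, and tensor shapes are illustrative, not verl's exact interface):
# Illustrative driver-side sketch: build a DataProto of padded prompts and
# dispatch it to the rollout for generation.
import torch
from verl import DataProto

batch_size, prompt_len = 4, 16
input_ids = torch.randint(0, 32000, (batch_size, prompt_len))
prompt_batch = DataProto.from_dict({
    'input_ids': input_ids,
    'attention_mask': torch.ones_like(input_ids),
    'position_ids': torch.arange(prompt_len).unsqueeze(0).repeat(batch_size, 1),
})

# The worker group splits prompt_batch across the rollout's (PP-as-DP)
# data-parallel groups, each worker generates, and the responses plus the
# recomputed old log probs are collected back on the driver:
# gen_output = actor_rollout_wg.generate_sequences(prompt_batch)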
Update actor model
@register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
def update_actor(self, data: DataProto):
Dispatch.MEGATRON_COMPUTE_PROTO: the user passes data partitioned by the DP dimension. The data is dispatched to all tp/pp ranks within the same dp group, and the output is ultimately collected only from tp=0 and the last pp stage.
This function updates the actor model weights using the PPO and entropy losses.
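For reference, a minimal self-contained sketch of a PPO clipped-surrogate loss with an entropy bonus (the standard formulation, not verl's exact implementation inside MegatronPPOActor):
# Minimal sketch of the PPO clipped policy loss with an entropy bonus.
import torch

def ppo_policy_loss(log_prob, old_log_prob, advantages, entropy,
                    clip_ratio=0.2, entropy_coeff=0.01):
    # ratio of new to old policy probabilities, per token
    ratio = torch.exp(log_prob - old_log_prob)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages
    # maximize the clipped surrogate => minimize its negation
    policy_loss = -torch.min(surr1, surr2).mean()
    # subtract the entropy bonus to encourage exploration
    return policy_loss - entropy_coeff * entropy.mean()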
Note: currently, the training Tensor Parallel Size can be different from the inference Tensor Parallel Size.
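For example, with Hydra-style overrides the two sizes could be set independently; the key names below are assumed from verl's Megatron config layout and should be checked against the config files of your verl version:
# Assumed keys: actor (training) TP vs. rollout (inference) TP
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \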
ReferenceModel
Reference model initialization
The reference model is initialized using the same function as the actor
model, without initializing the HybridEngine and Optimizer. The
reference model is then also wrapped by MegatronPPOActor.
Compute reference log prob
@register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
def compute_ref_log_prob(self, data: DataProto):
In this function, the reference model calls the compute-log-prob function in
MegatronPPOActor to compute the reference log prob.
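Downstream, these reference log probs are typically used to keep the actor close to the reference policy, e.g. via a per-token KL-style penalty on the reward; the following is a hedged, self-contained sketch rather than verl's exact implementation:
# Sketch: using reference log probs to form a per-token KL penalty on rewards.
import torch

def apply_kl_penalty(token_rewards, log_prob, ref_log_prob, kl_coeff=0.05):
    # simple per-token estimate of KL(actor || ref)
    kl = log_prob - ref_log_prob
    return token_rewards - kl_coeff * kl

# Example shapes: (batch, response_len)
rewards = torch.zeros(2, 8)
logp = torch.randn(2, 8)
ref_logp = torch.randn(2, 8)
penalized = apply_kl_penalty(rewards, logp, ref_logp)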
CriticWorker and RewardWorker
Model initialization
Quite similar to the reference model. The CriticWorker additionally performs initialization for the Optimizer.
Compute Values for CriticWorker
@register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
def compute_values(self, data: DataProto):
Update Critic
@register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
def update_critic(self, data: DataProto):
Compute Reward
@register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
def compute_rm_score(self, data: DataProto):
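Taken together, the driver side of one training iteration invokes these critic and reward APIs roughly as sketched below (function and worker-group names are illustrative; the real loop lives in verl's PPO trainer):
# Illustrative driver-side sketch of the critic/reward calls in one PPO step.
# batch is a DataProto carrying prompts, responses and old log probs;
# the worker groups are passed in rather than constructed here.
def critic_reward_step(batch, critic_wg, rm_wg, actor_rollout_wg):
    values = critic_wg.compute_values(batch)      # values for advantage estimation
    rm_scores = rm_wg.compute_rm_score(batch)     # reward-model scores
    # ... advantages/returns would be computed and merged into batch here ...
    critic_metrics = critic_wg.update_critic(batch)
    actor_metrics = actor_rollout_wg.update_actor(batch)
    return values, rm_scores, critic_metrics, actor_metrics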
Utils of Train Optimization
Offload
When resources are tight, offloading can lower GPU memory usage, helping the training and inference frameworks work well under verl. It moves parameters, gradients, and optimizer states to CPU memory and only loads them back to the GPU when needed.
If you want to use offloading, add the following parameters for the actor and ref separately.
# For the actor
actor_rollout_ref.actor.megatron.param_offload=True \
actor_rollout_ref.actor.megatron.grad_offload=True \
actor_rollout_ref.actor.megatron.optimizer_offload=True \
# For the ref w/o grad and optimizer
actor_rollout_ref.ref.megatron.param_offload=True \
For the critic, you can include these parameters.
# For the critic
critic.megatron.param_offload=True \
critic.megatron.grad_offload=True \
critic.megatron.optimizer_offload=True \
Related MCore Document
There is also a detailed document on training different kinds of models with MCore; please refer to the MCore Document.