Megatron-LM Backend
===================

Last updated: 12/01/2025.

We support the Megatron backend by implementing various workers for the actor, critic, reference, rollout and reward models. We also implement the ``3DHybridEngine`` using Megatron-LM and vLLM/SGLang in `megatron_vllm.py `_ and `megatron_sglang.py `_.

**Pros**

- Supports 5D parallelism (TP, EP, CP, DP, PP) plus sequence parallelism for the best scalability and throughput.
- The 3D HybridEngine significantly reduces peak memory usage and the weight-synchronization overhead between the actor and the rollout.

**Cons**

- Hugging Face models and Megatron checkpoints require conversion tools.

Development Progress
--------------------

Note that [Deprecated] means the feature is not supported in the latest version of verl, [To-Optimize] means the feature is implemented but not yet optimized, [WIP] means the feature is under development, and [In-Release] means the feature is ready, under review, and may be released at any time.

+----------------+--------------------------------------------------------------+
| [Deprecated]   | Megatron 3D parallelism with custom models                   |
+----------------+--------------------------------------------------------------+
| [Done]         | Megatron 0.11.0 ``GPTModel`` support                         |
+----------------+--------------------------------------------------------------+
| [Done]         | Megatron GRPO support                                        |
+----------------+--------------------------------------------------------------+
| [Done]         | Megatron with vLLM 0.8.2, per-tensor weight loading          |
+----------------+--------------------------------------------------------------+
| [Done]         | Megatron with context parallelism                            |
+----------------+--------------------------------------------------------------+
| [Done]         | Qwen2MoE model support                                       |
+----------------+--------------------------------------------------------------+
| [To-Optimize]  | Megatron distributed checkpointing                           |
+----------------+--------------------------------------------------------------+
| [To-Optimize]  | Hugging Face and Megatron checkpoint converter               |
+----------------+--------------------------------------------------------------+
| [To-Optimize]  | Efficient fused linear, entropy and cross-entropy            |
+----------------+--------------------------------------------------------------+
| [Done]         | Megatron offload (parameters, gradients, optimizer)          |
+----------------+--------------------------------------------------------------+
| [Done]         | Megatron profiling                                           |
+----------------+--------------------------------------------------------------+
| [In-Release]   | Megatron 0.12.0, TE 2.2 with vLLM 0.8.3 and fused attention  |
+----------------+--------------------------------------------------------------+
| [WIP]          | Moonlight/DeepSeek-V3 model support                          |
+----------------+--------------------------------------------------------------+
| [WIP]          | Expert parallelism support                                   |
+----------------+--------------------------------------------------------------+
| [WIP]          | Megatron support for dynamic batch size                      |
+----------------+--------------------------------------------------------------+
| [To-Do]        | Performance tuning                                           |
+----------------+--------------------------------------------------------------+
| [Milestone]    | Runnable with DeepSeek-V3 671B post-training                 |
+----------------+--------------------------------------------------------------+
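The Megatron backend is selected through trainer configuration overrides. The snippet below is a minimal sketch based on verl's example Megatron scripts; the option names and values here are assumptions for illustration and should be checked against the config of your verl version.

.. code:: python

   # Select the Megatron strategy for the actor/ref and the critic
   # (assumed example values; verify against your verl config)
   actor_rollout_ref.actor.strategy=megatron \
   critic.strategy=megatron \

   # Example 3D parallel sizes for the actor
   actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
   actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2 \

   # The rollout (vLLM/SGLang) tensor parallel size may differ from the training TP
   actor_rollout_ref.rollout.tensor_model_parallel_size=2 \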
Utils of Megatron Workers
-------------------------

MegatronWorker
^^^^^^^^^^^^^^

``MegatronWorker`` is the base class of the different Megatron worker classes. In this class, the ``get_megatron_global_info`` and ``get_megatron_rank_info`` functions retrieve the 3D parallel world size and rank of each ``Worker`` running on a specific GPU. This information is used by the transfer protocols of the Megatron backend.

The following ``Worker`` classes for the different models are used to construct the ``WorkerGroup``.

We implement various APIs for each ``Worker`` class, decorated with ``@register(dispatch_mode=)``. These APIs can be called by the Ray driver process. The data is collected and dispatched correctly according to the ``dispatch_mode`` of each function. The supported dispatch modes (i.e., transfer protocols) can be found in `decorator.py `_.

ActorRolloutRefWorker
^^^^^^^^^^^^^^^^^^^^^

This class is implemented for the Actor/Rollout HybridEngine or for the reference model to initialize their models and perform computation.

Actor/Rollout HybridEngine
''''''''''''''''''''''''''

1. HybridEngine, Actor and Rollout initialization API.

.. code:: python

   @register(dispatch_mode=Dispatch.ONE_TO_ALL)
   def init_model(self):

``ONE_TO_ALL``: when calling the ``init_model`` function from the driver process, each worker (on a GPU) will execute the following model initialization process.

The initialization details of the HybridEngine, Actor and Rollout are highlighted below:

1. ``MegatronPPOActor`` implements the simple PPO computation logic when the model is built with Megatron, including computing the log prob and updating the model.
2. ``vLLMRollout`` supports generation with vLLM. We modify the vLLM engine so that it executes under SPMD to fit into our ``WorkerGroup`` design.

See the `source code `_ for more information.

.. code:: python

   # build actor model
   self.actor = MegatronPPOActor(config=self.config.actor,
                                 model_config=self.actor_model_config,
                                 megatron_config=megatron_config,
                                 actor_module=self.actor_module,
                                 actor_optimizer=self.actor_optimizer,
                                 actor_optimizer_config=self.actor_optim_config)

   # build rollout
   # rollout initialization
   rollout = vLLMRollout(actor_module=params,
                         config=self.config.rollout,
                         tokenizer=self.tokenizer,
                         model_hf_config=self.actor_model_config,
                         train_tp=mpu.get_tensor_model_parallel_world_size())

   ...

2. Generate sequences and recompute log prob

.. code:: python

   @register(dispatch_mode=Dispatch.MEGATRON_PP_AS_DP_PROTO)
   def generate_sequences(self, prompts: DataProto):

- ``Dispatch.MEGATRON_PP_AS_DP_PROTO``: The PP dimension of the actor model is regarded as the DP dimension. The driver process then dispatches and collects the data according to this reorganization (a toy illustration follows this list). This is because, in the HybridEngine, the actor weights, which usually use larger 3D parallel sizes, are gathered along the PP and TP dimensions. Therefore, the corresponding data should be dispatched and collected through the 3D parallel group of the rollout model, rather than that of the actor model. However, the world_size and rank information can only be retrieved from ``get_megatron_global_info`` and ``get_megatron_rank_info``, which record the 3D information of the actor model. Moreover, the data resharding inside the TP dimension is handled within the HybridEngine.
- In this function, the rollout model performs auto-regressive generation and the actor model recomputes the old log prob for the generated responses.
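To make the data reorganization concrete, here is a toy, framework-free sketch of the idea behind ``MEGATRON_PP_AS_DP_PROTO``: on dispatch, PP ranks are treated as extra DP ranks when splitting the batch, and on collection the outputs are concatenated back in the same order. This only mirrors the splitting/collection logic; it is not verl's actual dispatch implementation (see `decorator.py `_ for that).

.. code:: python

   import torch


   def dispatch_pp_as_dp(batch: torch.Tensor, dp_size: int, pp_size: int):
       """Toy version of the dispatch side of MEGATRON_PP_AS_DP_PROTO.

       The driver sees dp_size * pp_size data-parallel shards, one per
       (dp, pp) pair; TP ranks inside each pair would receive the same shard.
       """
       return list(torch.chunk(batch, dp_size * pp_size, dim=0))


   def collect_pp_as_dp(shards):
       """Toy version of the collect side: concatenate per-(dp, pp) outputs."""
       return torch.cat(shards, dim=0)


   prompts = torch.arange(16).reshape(8, 2)               # 8 prompts, 2 toy features
   shards = dispatch_pp_as_dp(prompts, dp_size=2, pp_size=2)
   assert len(shards) == 4                                 # PP treated as DP: 2 * 2 shards
   assert torch.equal(collect_pp_as_dp(shards), prompts)   # collection restores the batch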
3. Update actor model

.. code:: python

   @register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
   def update_actor(self, data: DataProto):

- ``Dispatch.MEGATRON_COMPUTE_PROTO``: The user passes data partitioned along the DP dimension. The data is dispatched to all tp/pp ranks within the same DP group, and output data is ultimately collected only from tp=0 and the last pp stage.
- Update the actor model weights using the PPO and entropy losses.

.. note::

   Currently, the training tensor parallel size can be different from the inference tensor parallel size.

ReferenceModel
''''''''''''''

1. Reference model initialization

The reference model is initialized using the same function as the actor model, but without initializing the HybridEngine and the optimizer. The reference model is then also wrapped by ``MegatronPPOActor``.

2. Compute reference log prob

.. code:: python

   @register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
   def compute_ref_log_prob(self, data: DataProto):

- In this function, the reference model calls the compute-log-prob function in ``MegatronPPOActor`` to compute the reference log prob.
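As with the PP-as-DP protocol above, the behaviour of ``MEGATRON_COMPUTE_PROTO`` (used by ``update_actor`` and ``compute_ref_log_prob``) can be pictured with a small framework-free sketch: each DP shard is replicated to every tp/pp rank of its DP group on dispatch, and only the output of the (tp=0, last pp) rank of each group is kept on collection. This is an illustration of the logic described above, not verl's actual implementation.

.. code:: python

   import torch


   def dispatch_megatron_compute(dp_shards, tp_size: int, pp_size: int):
       """Toy dispatch: replicate each DP shard to all tp/pp ranks of its DP group.

       Returns a dict mapping (dp, tp, pp) -> shard, mimicking the fact that
       every rank of a DP group receives the same DP-partitioned data.
       """
       return {
           (dp, tp, pp): shard
           for dp, shard in enumerate(dp_shards)
           for tp in range(tp_size)
           for pp in range(pp_size)
       }


   def collect_megatron_compute(outputs, dp_size: int, pp_size: int):
       """Toy collect: keep only the output of tp=0 and the last pp stage per DP group."""
       return [outputs[(dp, 0, pp_size - 1)] for dp in range(dp_size)]


   dp_shards = list(torch.chunk(torch.arange(8), 2))   # the user passes DP-partitioned data
   per_rank = dispatch_megatron_compute(dp_shards, tp_size=2, pp_size=2)
   assert len(per_rank) == 2 * 2 * 2                   # every tp/pp rank of each DP group
   collected = collect_megatron_compute(per_rank, dp_size=2, pp_size=2)
   assert all(torch.equal(a, b) for a, b in zip(collected, dp_shards))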
CriticWorker and RewardWorker
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. Model initialization

Quite similar to the reference model. The ``CriticWorker`` additionally initializes the optimizer.

2. Compute Values for CriticWorker

.. code:: python

   @register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
   def compute_values(self, data: DataProto):

3. Update Critic

.. code:: python

   @register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
   def update_critic(self, data: DataProto):

4. Compute Reward

.. code:: python

   @register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
   def compute_rm_score(self, data: DataProto):

Utils of Train Optimization
---------------------------

Offload
^^^^^^^

When resources are tight, offloading can lower GPU memory usage, helping the training and inference frameworks work well under verl. It moves parameters, gradients and optimizer states to CPU memory and only loads them back onto the GPU when needed.

If you want to use offloading, you can add the following parameters for the actor and the ref separately.

.. code:: python

   # For the actor
   actor_rollout_ref.actor.megatron.param_offload=True \
   actor_rollout_ref.actor.megatron.grad_offload=True \
   actor_rollout_ref.actor.megatron.optimizer_offload=True \

   # For the ref w/o grad and optimizer
   actor_rollout_ref.ref.megatron.param_offload=True \

For the critic, you can include these parameters.

.. code:: python

   # For the critic
   critic.megatron.param_offload=True \
   critic.megatron.grad_offload=True \
   critic.megatron.optimizer_offload=True \

Related MCore Document
----------------------

There is also a detailed document on using MCore to train different kinds of models; please refer to the `MCore Document `_.