.. _checkpoint-page:

Using Checkpoints for Fault-Tolerant Training
==============================================

Last updated: 06/25/2025.

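Training errors or machine failures can occur at any point during RLHF (Reinforcement Learning from Human Feedback) training, so we recommend enabling checkpoints to minimize lost work.

The API is listed in :ref:`config-explain-page` and is not repeated here. Some technical details still need clarification.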

.. note:: 

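    Note that the ``checkpoint.contents`` field has no effect on FSDP checkpoints, except for ``hf_model``; the other 3 fields are bound together for saving and loading. We recommend including ``model``, ``optimizer``, and ``extra`` together.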

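Checkpoint Saving Directory Structure
-------------------------------------

We typically use the ``default_local_dir`` declared in ``ppo_trainer.yaml`` or ``ppo_megatron_trainer.yaml`` as the prefix when saving checkpoints, which is ``checkpoints/${trainer.project_name}/${trainer.experiment_name}``.

The inner checkpoint structure of **FSDP** is therefore: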

.. code::

    checkpoints/${trainer.project_name}/${trainer.experiment_name}
    ├── global_step_${i}
    │   ├── actor
    │   │   ├── huggingface      # config and tokenizer are saved by default; the HuggingFace model is also saved if ``hf_model`` is in checkpoint.contents
    │   │   ├── fsdp_config.json # FSDP config file, including world_size and FSDP version
    │   │   ├── model_world_size_{self.world_size}_rank_{self.rank}.pt
    │   │   ├── optim_world_size_{self.world_size}_rank_{self.rank}.pt
    │   │   └── extra_state_world_size_{self.world_size}_rank_{self.rank}.pt
    │   └── critic
    │       ├── huggingface
    │       ├── fsdp_config.json
    │       ├── model_world_size_{self.world_size}_rank_{self.rank}.pt
    │       ├── optim_world_size_{self.world_size}_rank_{self.rank}.pt
    │       └── extra_state_world_size_{self.world_size}_rank_{self.rank}.pt
    └── latest_checkpointed_iteration.txt

All model shards, optimizer states, and extra states are stored together in a sharded, distributed manner.
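
As a minimal sketch, assuming ``model``, ``optimizer``, and ``extra`` are all saved, the file names in the tree above can be used to check whether a role directory (``actor`` or ``critic``) holds a shard for every rank. The helper names ``expected_fsdp_shards`` and ``missing_fsdp_shards`` are hypothetical and are not part of verl:

.. code:: python

    from pathlib import Path


    def expected_fsdp_shards(world_size: int) -> list[str]:
        """List the shard file names one role directory should contain."""
        names = []
        for rank in range(world_size):
            # One model, optimizer, and extra-state file per rank.
            for prefix in ("model", "optim", "extra_state"):
                names.append(f"{prefix}_world_size_{world_size}_rank_{rank}.pt")
        return names


    def missing_fsdp_shards(role_dir, world_size: int) -> list[str]:
        """Return the expected shard files that are absent from role_dir."""
        role_dir = Path(role_dir)
        return [
            name
            for name in expected_fsdp_shards(world_size)
            if not (role_dir / name).is_file()
        ]

An empty list from ``missing_fsdp_shards`` suggests that every rank finished writing its files. The check looks only at file presence, not at file contents.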

The current checkpoint structure of **Megatron** is:

.. code::

    checkpoints/${trainer.project_name}/${trainer.experiment_name}
    ├── global_step_${i}
    │   ├── actor
    │   │   ├── huggingface     # config and tokenizer are saved by default; the HuggingFace model is also saved if ``hf_model`` is in checkpoint.contents
    │   │   └── dist_ckpt       # sharded model/optimizer/rng_states, named the same way as in Megatron
    │   └── critic
    │       ├── huggingface
    │       └── dist_ckpt
    └── latest_checkpointed_iteration.txt
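
Both layouts keep a ``latest_checkpointed_iteration.txt`` file next to the step directories. The following sketch resolves the most recent step directory from it. It assumes that the file contains a bare integer step number and that step directories are named ``global_step_<N>``; the function name ``latest_checkpoint_dir`` is hypothetical:

.. code:: python

    from pathlib import Path


    def latest_checkpoint_dir(root):
        """Return the newest step directory under root, or None if unavailable."""
        root = Path(root)
        tracker = root / "latest_checkpointed_iteration.txt"
        if not tracker.is_file():
            return None
        # Assumed format: the file holds only the step number, e.g. "20".
        step = int(tracker.read_text().strip())
        step_dir = root / f"global_step_{step}"
        return step_dir if step_dir.is_dir() else None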

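Convert FSDP and Megatron Checkpoints to HuggingFace Format Model
------------------------------------------------------------------

We provide a tool that converts FSDP and Megatron checkpoints to HuggingFace format models. The tool is located in ``verl/model_merger``. For older versions of verl whose checkpoints do not contain ``fsdp_config.json``, use the legacy model merger at ``verl/scripts/legacy_model_merger.py``.

The script supports two main subcommands: ``merge``, which converts and saves checkpoints, and ``test``, which validates the merged checkpoint against a reference model.

The arguments of the ``merge`` subcommand are: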

.. code:: bash

    usage: python -m verl.model_merger merge [-h] --backend {fsdp,megatron} [--local_dir LOCAL_DIR] [--tie-word-embedding] [--is-value-model] [--use_cpu_initialization] [--target_dir TARGET_DIR]
                         [--hf_upload_path HF_UPLOAD_PATH] [--private]

    options:
    -h, --help            show this help message and exit
    --backend {fsdp,megatron}
                            The backend of the model
    --local_dir LOCAL_DIR
                            Path to the saved model checkpoints
    --tie-word-embedding  Whether to tie word embedding weights (currently only Megatron supported)
    --is-value-model      Whether the model is a value model (currently only Megatron supported)
    --use_cpu_initialization
                            Whether to use CPU initialization for the model. This is useful for large models that cannot fit into GPU memory during initialization.
    --target_dir TARGET_DIR
                            Directory to save the merged huggingface model
    --hf_upload_path HF_UPLOAD_PATH
                            Hugging Face repository ID to upload the model
    --private             Whether to upload the model to a private Hugging Face repository

Example usage for merging Megatron checkpoints:

.. code:: bash

    python -m verl.model_merger merge \
        --backend megatron \
        --tie-word-embedding \
        --local_dir checkpoints/verl_megatron_gsm8k_examples/qwen2_5_0b5_megatron_saveload/global_step_1/actor \
        --target_dir /path/to/merged_hf_model

Example usage for distributed merging of Megatron checkpoints:

.. code:: bash

    torchrun --nproc_per_node 1 --nnodes 8 --node_rank ${RANK} -m verl.model_merger merge \
        --backend megatron \
        --tie-word-embedding \
        --local_dir checkpoints/verl_megatron_gsm8k_examples/qwen2_5_0b5_megatron_saveload/global_step_1/actor \
        --target_dir /path/to/merged_hf_model

Example usage for merging FSDP checkpoints:

.. code:: bash

    python -m verl.model_merger merge \
        --backend fsdp \
        --local_dir checkpoints/verl_fsdp_gsm8k_examples/qwen2_5_0b5_fsdp_saveload/global_step_1/actor \
        --target_dir /path/to/merged_hf_model
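
When merging is scripted, for example once per saved step, the same command can be assembled in Python. This is a sketch only: the flags are taken from the usage text above, and the wrapper ``build_merge_command`` is a hypothetical helper, not part of verl:

.. code:: python

    def build_merge_command(backend, local_dir, target_dir, tie_word_embedding=False):
        """Build the argument list for `python -m verl.model_merger merge`."""
        if backend not in ("fsdp", "megatron"):
            raise ValueError(f"unsupported backend: {backend!r}")
        cmd = [
            "python", "-m", "verl.model_merger", "merge",
            "--backend", backend,
            "--local_dir", str(local_dir),
            "--target_dir", str(target_dir),
        ]
        if tie_word_embedding:
            # Per the usage text, this flag is currently Megatron-only.
            cmd.append("--tie-word-embedding")
        return cmd

The returned list can be passed to ``subprocess.run(cmd, check=True)``, which avoids shell quoting problems with paths that contain spaces.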


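Megatron Merger Details
-----------------------

The current decoder layer implementation stores layers in an ``nn.ModuleList``, so the model layer indices on every PP (Pipeline Parallel) rank and VPP (Virtual Pipeline Parallel) rank start from 0.

There are 3 ways to correct this behavior:

1. Modify the decoder layer's state_dict by adding an ``offset`` to each layer's index, which means rewriting the ``nn.ModuleList`` implementation.
2. Modify the layer indices when saving a checkpoint and recover them when loading a checkpoint.
3. Have the checkpoint merger calculate the actual ``offset`` from the ``state_dict`` alone, which is somewhat complex.

The current implementation uses solution 2.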
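
To illustrate solution 2, the following sketch shifts local layer indices to global ones before saving and back again after loading. It assumes that layers divide evenly across PP ranks and virtual chunks and that virtual chunks are laid out in an interleaved fashion. The ``layers.<idx>.`` key pattern and both function names are hypothetical; this is not verl's actual implementation:

.. code:: python

    import re

    # Matches the first "layers.<idx>." segment of a parameter name (assumed pattern).
    _LAYER_KEY = re.compile(r"\blayers\.(\d+)\.")


    def layer_offset(num_layers, pp_size, pp_rank, vpp_size=1, vpp_rank=0):
        """Global index of the first layer held by one (PP rank, VPP rank) chunk."""
        layers_per_pp_rank = num_layers // pp_size
        layers_per_chunk = layers_per_pp_rank // vpp_size
        # Each virtual chunk index spans num_layers // vpp_size consecutive
        # layers, split evenly among the PP ranks.
        return vpp_rank * (num_layers // vpp_size) + pp_rank * layers_per_chunk


    def shift_layer_keys(state_dict, offset):
        """Return a copy of state_dict with every layer index shifted by offset."""

        def rename(key):
            return _LAYER_KEY.sub(
                lambda m: f"layers.{int(m.group(1)) + offset}.", key, count=1
            )

        return {rename(key): value for key, value in state_dict.items()}

Calling ``shift_layer_keys(sd, offset)`` before saving and ``shift_layer_keys(sd, -offset)`` after loading restores the original local indices, so the ``nn.ModuleList`` itself needs no change.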


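HuggingFace to Megatron DistCheckpoint Details
----------------------------------------------

If your model is very large, we recommend loading it with Megatron dist-checkpoint. Megatron dist-checkpoint supports loading under different model parallelism configurations and is much faster than the original checkpoint loading.

To convert an original HuggingFace model to a Megatron dist-checkpoint, use the ``scripts/converter_hf_to_mcore.py`` script. Large MoE models are temporarily supported through CPU initialization, which is a little slower. We are working on a better solution for large models.

An example command to convert a model: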

.. code:: bash

    python scripts/converter_hf_to_mcore.py \
        --hf_model_path Qwen/Qwen1.5-MoE-A2.7B-Chat \
        --output_path /mnt/disk/Qwen/Qwen1.5-MoE-A2.7B-Chat \
        --use_cpu_initialization    # Only works for MoE models


An example command to convert a large model such as DeepSeek-V3 671B in a distributed fashion:

.. code:: bash

    torchrun --nproc_per_node 1 --nnodes 8 --node_rank ${RANK} scripts/converter_hf_to_mcore.py \
        --hf_model_path deepseek-ai/DeepSeek-V3 \
        --output_path /mnt/disk/deepseek-ai/DeepSeek-V3 \
        --use_cpu_initialization    # Only works for MoE models

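Original Checkpoint Utils
-------------------------

Original Checkpoint Utils refers to the original checkpoint implementation in ``verl/models/[model]/megatron/checkpoint_utils``.

We now only need ``[model]_loader.py`` from the original checkpoint utils, because we no longer store ``hf_model`` every time (doing so is not recommended for large model training; save only sharded models if possible).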

.. note:: 

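    Note that ``[model]_loader`` only supports environments where **the storage cluster can connect to every compute node**, because it uses a **sharded loading approach to minimize checkpoint loading overhead**. Every rank loads its own data from a ``state_dict`` that all ranks can access. In addition, there is no need to broadcast among DP (Data Parallel) ranks, since the saved state_dict is produced only by DP rank 0.

    For users who **can only place the HuggingFace model on a single device**, we keep the original, more costly implementation in ``[model]_loader_deprecated``. In this implementation, rank 0 broadcasts all weights to each TP (Tensor Parallel) and PP rank, and then DP rank 0 broadcasts to all DP ranks. This carries a risk of running out of memory.

    To use the deprecated loader, change the import package of ``load_state_dict_to_megatron_llama``.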