Using Checkpoints to Support Fault-Tolerant Training

Last updated: 06/25/2025.

Training errors or machine failures can occur at any point during RLHF (Reinforcement Learning from Human Feedback) training, so we recommend enabling checkpoints to minimize the loss of progress.

The API interface is already listed on the config-explain-page, so we do not repeat it here. A few technical details still need clarification.

Note

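Note that the checkpoint.contents field has no effect on FSDP checkpoints except for hf_model; the other 3 fields are bound together for saving and loading. We recommend including all of model, optimizer, and extra.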

Checkpoint Saving Directory Structure

Normally, we use the default_local_dir declared in ppo_trainer.yaml or ppo_megatron_trainer.yml as the prefix when saving checkpoints, namely checkpoints/${trainer.project_name}/${trainer.experiment_name}.

The internal checkpoint structure for FSDP is therefore:

checkpoints/${trainer.project_name}/${trainer.experiment_name}
├── global_step_${i}
│   ├── actor
│   │   ├── huggingface      # config and tokenizer are saved by default; the HuggingFace model is also saved if ``hf_model`` is included in checkpoint.contents
│   │   ├── fsdp_config.json # FSDP config file, including world_size and fsdp version
│   │   ├── model_world_size_{self.world_size}_rank_{self.rank}.pt
│   │   ├── optim_world_size_{self.world_size}_rank_{self.rank}.pt
│   │   └── extra_state_world_size_{self.world_size}_rank_{self.rank}.pt
│   ├── critic
│   │   ├── huggingface
│   │   ├── fsdp_config.json
│   │   ├── model_world_size_{self.world_size}_rank_{self.rank}.pt
│   │   ├── optim_world_size_{self.world_size}_rank_{self.rank}.pt
│   │   └── extra_state_world_size_{self.world_size}_rank_{self.rank}.pt
└── latest_checkpointed_iteration.txt

All model shards, optimizer states, and extra states are stored together in a sharded, distributed manner.
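As a rough illustration of the per-rank naming shown in the tree, the sketch below lists the shard files one would expect in a single role directory (such as actor) and reports any that are absent. The helper names expected_fsdp_shards and missing_fsdp_shards are hypothetical and not part of verl; the file-name pattern is taken from the tree above.

```python
from pathlib import Path


def expected_fsdp_shards(world_size: int) -> list[str]:
    """List the per-rank file names from the tree above for one role directory."""
    prefixes = ("model", "optim", "extra_state")
    return [
        f"{prefix}_world_size_{world_size}_rank_{rank}.pt"
        for rank in range(world_size)
        for prefix in prefixes
    ]


def missing_fsdp_shards(role_dir: str, world_size: int) -> list[str]:
    """Return the expected shard files that are absent from role_dir."""
    root = Path(role_dir)
    return [
        name
        for name in expected_fsdp_shards(world_size)
        if not (root / name).exists()
    ]
```

A check like this can be run before resuming or merging, since every rank contributes its own three files and a single missing shard makes the checkpoint incomplete.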

The current checkpoint structure for Megatron is:

checkpoints/${trainer.project_name}/${trainer.experiment_name}
├── global_step_${i}
│   ├── actor
│   │   ├── huggingface     # config and tokenizer are saved by default; the HuggingFace model is also saved if ``hf_model`` is included in checkpoint.contents
│   │   └── dist_ckpt       # sharded model/optimizer/rng_states, named the same as in Megatron
│   └── critic
│       ├── huggingface
│       └── dist_ckpt
└── latest_checkpointed_iteration.txt
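Both layouts keep a latest_checkpointed_iteration.txt next to the step directories. The sketch below shows one way a script might use it to locate the most recent checkpoint. It assumes the file holds a bare integer step number, which this document does not state, and it takes the step directory prefix as a parameter so it can match whatever is on disk (the example commands in this document use global_step_1). The function name latest_checkpoint_dir is hypothetical.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint_dir(
    run_dir: str, step_prefix: str = "global_step_"
) -> Optional[Path]:
    """Resolve the step directory named by latest_checkpointed_iteration.txt.

    Assumption: the tracker file contains only an integer step number.
    Returns None if the tracker or the step directory is missing.
    """
    tracker = Path(run_dir) / "latest_checkpointed_iteration.txt"
    if not tracker.is_file():
        return None
    text = tracker.read_text().strip()
    if not text.isdigit():
        return None
    candidate = Path(run_dir) / f"{step_prefix}{int(text)}"
    return candidate if candidate.is_dir() else None
```

Returning None instead of raising lets a caller fall back to starting from scratch when no usable checkpoint exists.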

Converting FSDP and Megatron Checkpoints to HuggingFace Format Models

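We provide a tool to convert FSDP and Megatron checkpoints into HuggingFace-format models. The tool is located in verl/model_merger. For older versions of verl, whose checkpoints do not contain fsdp_config.json, you can use the legacy model merger at verl/scripts/legacy_model_merger.py.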

The script supports two main subcommands: ``merge`` (to convert and save checkpoints) and ``test`` (to verify that a merged checkpoint matches a reference model).

The arguments of the merge subcommand are as follows:

usage: python -m verl.model_merger merge [-h] --backend {fsdp,megatron} [--local_dir LOCAL_DIR] [--tie-word-embedding] [--is-value-model] [--use_cpu_initialization] [--target_dir TARGET_DIR]
                     [--hf_upload_path HF_UPLOAD_PATH] [--private]

options:
-h, --help            show this help message and exit
--backend {fsdp,megatron}
                        The backend of the model
--local_dir LOCAL_DIR
                        Path to the saved model checkpoints
--tie-word-embedding  Whether to tie word embedding weights (currently only Megatron supported)
--is-value-model      Whether the model is a value model (currently only Megatron supported)
--use_cpu_initialization
                        Whether to use CPU initialization for the model. This is useful for large models that cannot fit into GPU memory during initialization.
--target_dir TARGET_DIR
                        Directory to save the merged huggingface model
--hf_upload_path HF_UPLOAD_PATH
                        Hugging Face repository ID to upload the model
--private             Whether to upload the model to a private Hugging Face repository

Example usage for merging Megatron checkpoints:

python -m verl.model_merger merge \
    --backend megatron \
    --tie-word-embedding \
    --local_dir checkpoints/verl_megatron_gsm8k_examples/qwen2_5_0b5_megatron_saveload/global_step_1/actor \
    --target_dir /path/to/merged_hf_model

Example usage for merging Megatron checkpoints in a distributed fashion:

torchrun --nproc_per_node 1 --nnodes 8 --node_rank ${RANK} -m verl.model_merger merge \
    --backend megatron \
    --tie-word-embedding \
    --local_dir checkpoints/verl_megatron_gsm8k_examples/qwen2_5_0b5_megatron_saveload/global_step_1/actor \
    --target_dir /path/to/merged_hf_model

Example usage for merging FSDP checkpoints:

python -m verl.model_merger merge \
    --backend fsdp \
    --local_dir checkpoints/verl_fsdp_gsm8k_examples/qwen2_5_0b5_fsdp_saveload/global_step_1/actor \
    --target_dir /path/to/merged_hf_model
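When several checkpoints need merging, it can be convenient to assemble the command from Python. The following is a minimal sketch that only builds the argument list, using flags that appear in the usage text above; the helper name build_merge_command is hypothetical and not part of verl.

```python
import shlex


def build_merge_command(
    backend: str,
    local_dir: str,
    target_dir: str,
    tie_word_embedding: bool = False,
    is_value_model: bool = False,
) -> list[str]:
    """Assemble a `python -m verl.model_merger merge` argument list."""
    if backend not in ("fsdp", "megatron"):
        raise ValueError(f"unsupported backend: {backend!r}")
    cmd = ["python", "-m", "verl.model_merger", "merge", "--backend", backend]
    if tie_word_embedding:
        cmd.append("--tie-word-embedding")
    if is_value_model:
        cmd.append("--is-value-model")
    cmd += ["--local_dir", local_dir, "--target_dir", target_dir]
    return cmd


if __name__ == "__main__":
    cmd = build_merge_command("fsdp", "/path/to/actor", "/path/to/merged_hf_model")
    # Print a shell-safe rendering; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Keeping the command as a list avoids shell-quoting problems when a checkpoint path contains spaces.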

Megatron Merger Details

The current decoder layer implementation stores layers in an nn.ModuleList, so the layer indices on every PP (Pipeline Parallel) rank and VPP (Virtual Pipeline Parallel) rank start from 0.

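There are 3 ways to correct this behavior:

1. Modify the decoder layer's state_dict by adding an offset to each layer's index, which means rewriting the nn.ModuleList implementation.
2. Modify the layer indices when saving the checkpoint and restore them when loading it.
3. Have the checkpoint merger compute the actual offset from the state_dict alone, which is somewhat complex.

The current implementation uses solution 2.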
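The idea of shifting layer indices on save and shifting them back on load can be sketched as below. This is an illustration under stated assumptions, not verl's actual code: it assumes keys of the form "...layers.<index>.<rest>", an even split of layers across PP ranks, and no VPP. The names shift_layer_indices and stage_offset are hypothetical.

```python
import re
from typing import Dict, TypeVar

T = TypeVar("T")

# Assumed key layout: "<prefix>layers.<index>.<rest>"; real key names may differ.
_LAYER_KEY = re.compile(r"^(.*\blayers\.)(\d+)(\..*)$")


def stage_offset(pp_rank: int, layers_per_stage: int) -> int:
    """Global index of the first layer on a PP rank (even split, no VPP)."""
    return pp_rank * layers_per_stage


def shift_layer_indices(state_dict: Dict[str, T], offset: int) -> Dict[str, T]:
    """Return a copy of state_dict with every layer index moved by offset.

    Use a positive offset when saving (local -> global index) and the
    negated offset when loading (global -> local index).
    """
    shifted: Dict[str, T] = {}
    for key, value in state_dict.items():
        match = _LAYER_KEY.match(key)
        if match:
            prefix, index, rest = match.groups()
            new_index = int(index) + offset
            if new_index < 0:
                raise ValueError(f"negative layer index for key {key!r}")
            key = f"{prefix}{new_index}{rest}"
        shifted[key] = value
    return shifted
```

For example, with 4 layers per stage, PP rank 2 would save its local layer 0 under global index 8 and map it back to 0 on load; keys without a layer index pass through unchanged.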

HuggingFace to Megatron DistCheckpoint Details

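If your model is very large, we recommend loading it with Megatron dist-checkpoint. Megatron dist-checkpoint supports loading under different kinds of model parallelism and is much faster than the original checkpoint loading.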

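To convert an original HuggingFace model to a Megatron dist-checkpoint, you can use the scripts/converter_hf_to_mcore.py script. Large MoE models are temporarily supported through CPU initialization, which is somewhat slower. We are working on a better solution for large models.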

An example command for converting a model:

python scripts/converter_hf_to_mcore.py \
    --hf_model_path Qwen/Qwen1.5-MoE-A2.7B-Chat \
    --output_path /mnt/disk/Qwen/Qwen1.5-MoE-A2.7B-Chat \
    --use_cpu_initialization    # Only works for MoE models

An example command for distributed conversion of a large model such as DeepSeek-V3 671B:

torchrun --nproc_per_node 1 --nnodes 8 --node_rank ${RANK} scripts/converter_hf_to_mcore.py \
    --hf_model_path deepseek-ai/DeepSeek-V3 \
    --output_path /mnt/disk/deepseek-ai/DeepSeek-V3 \
    --use_cpu_initialization    # Only works for MoE models

Original Checkpoint Utils

The original checkpoint utils refer to the original checkpoint implementation in verl/models/[model]/megatron/checkpoint_utils.

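We now only need [model]_loader.py from the original checkpoint utils, because we no longer store ``hf_model`` at every save (doing so is not recommended for large-model training; save only sharded models if possible).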

Note

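Note that [model]_loader only supports environments where **the storage cluster is reachable from every compute node**, because it uses a **sharded loading approach to minimize checkpoint loading overhead**. Each rank loads its own data from a state_dict that all ranks can access. In addition, because the saved state_dict is produced only by DP (Data Parallel) rank 0, no broadcast across DP ranks is needed.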

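For users who **can only place the HuggingFace model on a single device**, we keep the original, more costly implementation in [model]_loader_deprecated. In this implementation, rank 0 broadcasts all weights to each TP (Tensor Parallel) and PP rank, and then DP rank 0 broadcasts to all DP ranks. This carries a risk of running out of memory.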

To use the deprecated loader, change the import package of load_state_dict_to_megatron_llama.