Multi-turn Rollout Support

Last updated: 06/27/2025.

Basic Configuration

To enable multi-turn rollout, set the following fields in your rollout configuration:

actor_rollout_ref:
    rollout:
        multi_turn: True
        name: "sglang"

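This configuration activates the sglang engine for multi-turn interaction during rollout.

Custom Tool Configuration

For custom environment interaction tools, you can implement your own tool based on verl.tools.base_tool.BaseTool. Then specify the tool configuration in a YAML file: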

tools:
  - class_name: ""
    config:
        type: native
    tool_schema:

You can use GSM8KTool_example_configuration as an example tool configuration; its implementation can be found in gsm8k_tool.py.

Finally, set tools_config_file in your rollout configuration:

actor_rollout_ref:
    rollout:
        tool_kwargs:
            tools_config_file: <path_to_tool_yaml_file>

This allows custom tool behavior to be integrated during the actor rollout step.

If you want rollouts with simulated interaction, set interaction_config_file in your rollout configuration:

interaction:
  - class_name: ""
    config: {}

actor_rollout_ref:
    rollout:
        interaction_config_file: <path_to_interaction_yaml_file>

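If your tool creates multi-modal inputs, return them as a list from your tool's execute() implementation.

Images and videos should be processed before use. For example, if you are using Qwen2.5-VL, you can use the following code to obtain the processed representations: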

async def create(self, ...) -> tuple[str, ToolResponse]:
    ...
    from verl.utils.dataset.vision_utils import process_image, process_video

    img1 = process_image(img1)
    video1 = process_video(video1)

    # vLLM keys multi-modal data by "image" / "video" (singular), not "images" / "videos", so pass the list of images/videos under the singular key
    # link: https://github.com/vllm-project/vllm/blob/3c545c0c3b98ee642373a308197d750d0e449403/vllm/multimodal/parse.py#L205
    return instance_id, ToolResponse(image=[img1, ...], video=[video1, ...], text="...")

async def execute(self, ...) -> Tuple[str | Dict[str, Any], float, dict]:
    ...
    from verl.utils.dataset.vision_utils import process_image, process_video

    img1 = process_image(img1)
    video1 = process_video(video1)

    # vLLM keys multi-modal data by "image" / "video" (singular), not "images" / "videos", so pass the list of images/videos under the singular key
    # link: https://github.com/vllm-project/vllm/blob/3c545c0c3b98ee642373a308197d750d0e449403/vllm/multimodal/parse.py#L205
    return ToolResponse(image=[img1, ...], video=[video1, ...], text="..."), 0, {}

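Remember to set return_multi_modal_inputs: False in your dataset configuration so that multi-modal inputs are processed correctly during rollout. See the "Handling Multi-Modal Inputs in Datasets" section for more details.

MCP Tool Configuration

For MCP interaction tools, you can configure them flexibly with a YAML file. A typical setup looks like this: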

tools:
  - class_name: ""
    config:
        type: mcp
    mcp:
        mcp_servers_config_path: ./mcp_server.json
        tool_selected_list: {}

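The tool_selected_list field is optional and specifies which tools to use from the servers. To enable all available tools, simply omit this attribute. The mcp_servers_config_path field points to a JSON file containing the MCP server configurations. For example: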

{
    "mcpServers": {
        "SSE Server": {
            "url": "your_server_url",
            "auth_token": "your_server_api_token"
        },
        "STDIO Server": {
            "command": "npx",
            "args": ["-y", "server-mcp@0.2.1"],
            "env": {
              "SERVER_API_KEY": "your_server_api_token"
            }
        }
    }
}
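As a quick check before launching a run, the server file can be loaded and inspected with a few lines of Python. This is only a sketch: the helper name classify_mcp_servers is hypothetical, and the rule that each entry defines exactly one of "url" or "command" is an assumption inferred from the example above, not a documented requirement.

```python
import json

# Example server file content, mirroring the structure shown above.
CONFIG_TEXT = """
{
    "mcpServers": {
        "SSE Server": {"url": "your_server_url", "auth_token": "your_server_api_token"},
        "STDIO Server": {
            "command": "npx",
            "args": ["-y", "server-mcp@0.2.1"],
            "env": {"SERVER_API_KEY": "your_server_api_token"}
        }
    }
}
"""


def classify_mcp_servers(config_text):
    """Map each server name to "url" or "command" (hypothetical helper).

    Assumption: every entry defines exactly one of the two keys.
    """
    servers = json.loads(config_text)["mcpServers"]
    kinds = {}
    for name, spec in servers.items():
        has_url, has_command = "url" in spec, "command" in spec
        if has_url == has_command:
            raise ValueError(f"{name!r} must define exactly one of 'url' or 'command'")
        kinds[name] = "url" if has_url else "command"
    return kinds


server_kinds = classify_mcp_servers(CONFIG_TEXT)
print(server_kinds)
```

Failing early on a malformed entry is cheaper than discovering it in the middle of a rollout.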

Because the content returned by MCP servers can vary in format, you can inherit from MCPBaseTool and override the _parse_tool_result method to implement custom parsing logic.

class MCPYourTool(MCPBaseTool):
    def __init__(self, config: dict, tool_schema: OpenAIFunctionToolSchema):
        super().__init__(config, tool_schema)

    def _parse_tool_result(self, content: list) -> Tuple[str, dict]:
        ...
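The parsing logic itself might look like the standalone function below. It is a sketch under stated assumptions, not verl's implementation: it assumes each item in content exposes a type attribute and, for text items, a text attribute (as MCP text content does), and the skipped_parts metadata key is purely hypothetical. SimpleNamespace objects stand in for real MCP content items so the example runs without a server.

```python
from types import SimpleNamespace


def parse_tool_result(content: list) -> tuple[str, dict]:
    """Join all text parts and report how many non-text parts were skipped.

    Assumption: items expose `.type` and, for text items, `.text`.
    """
    texts = [part.text for part in content if getattr(part, "type", None) == "text"]
    metadata = {"skipped_parts": len(content) - len(texts)}  # hypothetical key
    return "\n".join(texts), metadata


# Stand-ins for the content items an MCP server might return.
content = [
    SimpleNamespace(type="text", text="result A"),
    SimpleNamespace(type="image", data="..."),
    SimpleNamespace(type="text", text="result B"),
]
text, metadata = parse_tool_result(content)
print(text, metadata)
```

In a real subclass, the same body would live inside _parse_tool_result.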

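Overall, you can refer to mcp_search_tool.py and mcp_tool_config.yaml for a custom implementation and configuration.

Multi-turn Tokenization

Tokenizing multi-turn rollouts is challenging: after applying the chat template and tokenizing the full message list, it is hard to identify which tokens belong to assistant messages. Because the token list is flat, it has no direct alignment with message roles.

To address this, we adopt a delta-based tokenization strategy. Each time the LLM generates a new message, we:

Apply the chat template to all previous messages (messages[:i]).
Apply the chat template again to the messages including the latest one (messages[:i+1]).
Tokenize only the delta between the two serialized message strings.

This ensures that only assistant-generated tokens are included in the loss mask.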

# When using tokenizer
# Exclude the assistant prompt (e.g., "<|im_start|>assistant") from the loss by setting add_generation_prompt=True
prev = tokenizer.apply_chat_template(messages[:i], add_generation_prompt=True, tokenize=False)
curr = tokenizer.apply_chat_template(messages[:i+1], add_generation_prompt=False, tokenize=False)
new_token_ids = tokenizer.encode(curr[len(prev):], add_special_tokens=False)
token_ids += new_token_ids
loss_mask += [1] * len(new_token_ids)  # Mask only the new assistant tokens

# When using processor
# Exclude the assistant prompt (e.g., "<|im_start|>assistant") from the loss by setting add_generation_prompt=True
prev = processor.apply_chat_template(messages[:i], add_generation_prompt=True, tokenize=False)
prev_input_ids = processor(text=prev, images=images, videos=videos, return_tensors="pt")["input_ids"][0].tolist()
curr = processor.apply_chat_template(messages[:i+1], add_generation_prompt=False, tokenize=False)
curr_input_ids = processor(text=curr, images=images, videos=videos, return_tensors="pt")["input_ids"][0].tolist()
new_token_ids = curr_input_ids[len(prev_input_ids):]
token_ids += new_token_ids
loss_mask += [1] * len(new_token_ids)  # Mask only the new assistant tokens
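To make the bookkeeping concrete, here is a minimal, self-contained sketch of the delta-based strategy applied to a whole conversation. It does not use a real tokenizer: render_chat is a hypothetical stand-in for apply_chat_template with a ChatML-like format, and encode/decode form a toy character-level tokenizer. Both are assumptions for illustration only.

```python
def render_chat(messages, add_generation_prompt=False):
    """Toy ChatML-like stand-in for apply_chat_template (hypothetical)."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"
    return text


def encode(text):
    """Toy tokenizer: one token id per character (illustration only)."""
    return [ord(ch) for ch in text]


def decode(token_ids):
    return "".join(chr(t) for t in token_ids)


def tokenize_with_loss_mask(messages):
    token_ids, loss_mask = [], []
    for i, message in enumerate(messages):
        is_assistant = message["role"] == "assistant"
        base = render_chat(messages[:i])
        prev = render_chat(messages[:i], add_generation_prompt=is_assistant)
        curr = render_chat(messages[: i + 1])
        # The generation prompt (assistant turns only) is part of the
        # sequence but is excluded from the loss.
        prompt_ids = encode(prev[len(base):])
        # Only the delta between the two renderings is tokenized.
        new_ids = encode(curr[len(prev):])
        token_ids += prompt_ids + new_ids
        loss_mask += [0] * len(prompt_ids) + [int(is_assistant)] * len(new_ids)
    return token_ids, loss_mask


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
    {"role": "user", "content": "Explain why."},
    {"role": "assistant", "content": "It is basic arithmetic."},
]
token_ids, loss_mask = tokenize_with_loss_mask(messages)
trained_text = decode([t for t, m in zip(token_ids, loss_mask) if m])
print(repr(trained_text))
```

In this toy setup the concatenated deltas reproduce the full rendering exactly, and the mask covers only the assistant content plus its end-of-turn marker. With a real subword tokenizer, that equality is exactly what the sanity check described next is meant to verify.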

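Although we have verified that this produces results consistent with full-message tokenization, the chat templates of future models may break compatibility. To guard against silent inconsistencies, by default we compare the delta-based tokenization against the full tokenization result at the end of each rollout.

If you see the following warning, check the log for the mismatched substrings: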

Inconsistent training and inference tokenization detected. This may lead to unexpected behavior during training. Please review your chat template to determine if this is intentional. For more information, refer to the multiturn README.md.

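The tokenization sanity check mode is configured through the actor_rollout_ref.rollout.multi_turn.tokenization_sanity_check_mode parameter, which accepts the following values:

strict (default): Performs a strict comparison between the delta-based and full tokenization results, and raises a warning on any difference.
ignore_strippable: Ignores differences in whitespace characters (\n, \t, \r, spaces) while still checking for mismatches in meaningful text. This is useful when debugging chat template issues where whitespace variations are expected and acceptable.
disable: Disables the tokenization sanity check entirely. Use this only if you have thoroughly verified that the tokenization discrepancies are expected and will not affect training.

Example configuration: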

actor_rollout_ref:
    rollout:
        multi_turn:
            tokenization_sanity_check_mode: "ignore_strippable"  # Choose from: "disable", "ignore_strippable", "strict"
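The difference between the three modes can be illustrated with a small comparison routine. This is a sketch of the idea rather than verl's actual check: the function name find_mismatches is hypothetical, and it compares decoded text with Python's difflib instead of token ids.

```python
import difflib

STRIPPABLE = set("\n\t\r ")


def find_mismatches(full_text, delta_text, mode="strict"):
    """Return (full_chunk, delta_chunk) pairs that differ under `mode`."""
    if mode == "disable":
        return []
    if mode not in ("strict", "ignore_strippable"):
        raise ValueError(f"unknown mode: {mode!r}")
    matcher = difflib.SequenceMatcher(None, full_text, delta_text, autojunk=False)
    mismatches = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        full_chunk, delta_chunk = full_text[i1:i2], delta_text[j1:j2]
        # In ignore_strippable mode, skip differences made up only of whitespace.
        if mode == "ignore_strippable" and all(
            ch in STRIPPABLE for ch in full_chunk + delta_chunk
        ):
            continue
        mismatches.append((full_chunk, delta_chunk))
    return mismatches


full = "<|im_start|>assistant\nHello<|im_end|>\n"
delta = "<|im_start|>assistant\nHello<|im_end|>"  # trailing newline missing

strict_result = find_mismatches(full, delta, mode="strict")
relaxed_result = find_mismatches(full, delta, mode="ignore_strippable")
disabled_result = find_mismatches(full, "something else", mode="disable")
print(strict_result, relaxed_result, disabled_result)
```

A whitespace-only difference is reported in strict mode and ignored in ignore_strippable mode, while a difference in meaningful text is reported in both.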

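Handling Multi-Modal Inputs in Datasets

If your dataset includes multi-modal inputs (such as images or videos), you can control whether they are pre-processed and included in each sample by setting the return_multi_modal_inputs flag in your dataset config (used by RLHFDataset).

return_multi_modal_inputs: True (default): The dataset pre-processes and includes a multi_modal_inputs dictionary for each sample. This dictionary contains the model-ready representations (e.g., image tensors, video tensors) produced by your processor. This is useful for single-turn or SFT-style training, where the model expects all modalities to be present in the batch.
return_multi_modal_inputs: False: The dataset does not include the multi_modal_inputs field. This is recommended for multi-turn RL or tool-augmented rollouts, where the model may generate new multi-modal inputs dynamically during rollout and you want to avoid conflicts or redundant data in the batch.

Special Cases

Some models (e.g., Qwen/QwQ-32B and the Qwen3 series) remove internal reasoning content when the chat template is rendered. As a result, the rendered message content can change across turns, which makes delta-based tokenization inaccurate.

For example, for the following conversation: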

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<think>user asked about a simple math question.</think> 2 + 2 = 4."},
    {"role": "user", "content": "Explain why."},
    {"role": "assistant", "content": "<think>user wants to know the reasoning behind the answer. Search for a good explanation</think>",
     "tool_calls": [{"id": "tool1", "type": "search", "arguments": {"query": "Why is 2 + 2 = 4?"}}]},
    {"role": "tool", "content": "The sum of two and two is four because it is a basic arithmetic operation."},
    {"role": "assistant", "content": "<think>The tool provided a good explanation.</think>The sum of two and two is four because it is a basic arithmetic operation."}
]

After the chat template is applied, Qwen/QwQ-32B removes all reasoning content except that of the last assistant message:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 2 + 2?<|im_end|>
<|im_start|>assistant
 2 + 2 = 4.<|im_end|>
<|im_start|>user
Explain why.<|im_end|>
<|im_start|>assistant
<tool_call>
{"name": "", "arguments": {"query": "Why is 2 + 2 = 4?"}}
</tool_call><|im_end|>
<|im_start|>user
<tool_response>
The sum of two and two is four because it is a basic arithmetic operation.
</tool_response><|im_end|>
<|im_start|>assistant
<think>The tool provided a good explanation.</think> The sum of two and two is four because it is a basic arithmetic operation.<|im_end|>

The Qwen3 series removes the reasoning content that precedes the last user message:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 2 + 2?<|im_end|>
<|im_start|>assistant
 2 + 2 = 4.<|im_end|>
<|im_start|>user
Explain why.<|im_end|>
<|im_start|>assistant
<think>
user wants to know the reasoning behind the answer. Search for a good explanation
</think>

<tool_call>
{"name": "", "arguments": {"query": "Why is 2 + 2 = 4?"}}
</tool_call><|im_end|>
<|im_start|>user
<tool_response>
The sum of two and two is four because it is a basic arithmetic operation.
</tool_response><|im_end|>
<|im_start|>assistant
<think>
The tool provided a good explanation.
</think>

The sum of two and two is four because it is a basic arithmetic operation.<|im_end|>

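To handle this, we fall back to a fixed base conversation containing only a single system message and a single user message. Because this base contains no assistant messages or reasoning content, it stays consistent across turns.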

BASE_CHAT_HISTORY = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "I am a user."}
]
prev = tokenizer.apply_chat_template(BASE_CHAT_HISTORY, add_generation_prompt=True, tokenize=False)
curr = tokenizer.apply_chat_template([*BASE_CHAT_HISTORY, messages[i]], add_generation_prompt=False, tokenize=False)
new_token_ids = tokenizer.encode(curr[len(prev):], add_special_tokens=False)
token_ids += new_token_ids
loss_mask += [1] * len(new_token_ids)
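The following toy example illustrates both the failure and the fallback. The render_chat function is a hypothetical stand-in for a reasoning-stripping chat template (it drops <think> blocks from assistant messages that precede the last user message), not the real Qwen template, and base_delta is a hypothetical helper name.

```python
import re

BASE_CHAT_HISTORY = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "I am a user."},
]


def render_chat(messages, add_generation_prompt=False):
    """Toy template: strips <think>...</think> from assistant messages
    that come before the last user message (hypothetical stand-in)."""
    user_positions = [i for i, m in enumerate(messages) if m["role"] == "user"]
    last_user = user_positions[-1] if user_positions else -1
    parts = []
    for i, m in enumerate(messages):
        content = m["content"]
        if m["role"] == "assistant" and i < last_user:
            content = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL)
        parts.append(f"<|im_start|>{m['role']}\n{content}<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


def base_delta(message):
    """Text contributed by one message, measured against the fixed base."""
    is_assistant = message["role"] == "assistant"
    prev = render_chat(BASE_CHAT_HISTORY, add_generation_prompt=is_assistant)
    curr = render_chat([*BASE_CHAT_HISTORY, message])
    return curr[len(prev):]


messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<think>simple math</think>2 + 2 = 4."},
    {"role": "user", "content": "Explain why."},
]

# Naive delta breaks: once the second user turn arrives, the earlier assistant
# turn is re-rendered without its reasoning, so the previous rendering is no
# longer a prefix of the current one.
naive_prefix_ok = render_chat(messages).startswith(render_chat(messages[:2]))

# Against the fixed base, the assistant delta keeps its reasoning on every turn.
assistant_delta = base_delta(messages[1])
print(naive_prefix_ok, repr(assistant_delta))
```

Because each message is rendered against the same base, its delta no longer depends on how many turns follow it.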

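This approach works well for the Qwen3 series. However, Qwen/QwQ-32B currently has a bug in its chat template. A fix has been proposed but not yet adopted. Until then, use the following commands to download the fixed model revision: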

pip install huggingface_hub
huggingface-cli download Qwen/QwQ-32B --revision refs/pr/81

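Discrepancy Between Training and Inference Templates

Although the approach above fixes the delta mismatch, the removal of reasoning content by the inference-time chat template introduces a new discrepancy: training uses the full reasoning content, while inference does not.

This mismatch can affect model performance in unpredictable ways. To avoid it, by default we use the full response (including reasoning) for both training and rollout.

However, this approach comes with trade-offs:

Long reasoning content can easily exceed the model's context window, especially in multi-turn rollouts.
There is now a mismatch between rollout and the production environment: if you use the default chat template in production, the model will not see reasoning content from past turns.

We are still evaluating the impact of these issues. If you run into context-length problems, or prefer rollouts that match production (i.e., exclude reasoning), you can enable: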

actor_rollout_ref.rollout.multi_turn.use_inference_chat_template = True

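GSM8K Multi-turn Training Performance

See the training performance of multi-turn rollout on the GSM8K task HERE.

Interaction System

For dynamic conversational feedback during RL training, see: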

Interaction System for Multi-turn RL Training

Search Tool Integration

Search Tool Integration

Code Walkthrough

If you want to learn more about the code execution flow, see https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/tree/main/rlhf/verl/multi-turn/code-walk-through