Agent Loop
==========

Last updated: 07/17/2025.

.. versionadded:: 0.4.2
   [status: alpha]

.. warning::
   Agent Loop is ready for use, but the API may change in future releases.

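Agent Loop is designed as a general interface for multi-turn rollout and agentic reinforcement learning.

**Design goals**:

- Pluggable user-defined agent loops
- A standard request generation API that works across different inference frameworks
- Request-level load balancing across multiple inference servers

**Non-goals**:

- How tools are defined and how they are called

At a high level, an agent loop takes a prompt and runs a user-defined loop (calling the LLM generate API, calling tools, and so on) and finally returns an output. A reward is then computed for that output, and the output is used as a trajectory for RL training.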

.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/agent_loop_overview.svg?raw=true


API Design
----------

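The ``AgentLoopBase`` class is the abstraction of an agent loop, and ``run`` is the only method users need to implement. ``run`` receives the prompt messages (in the format ``[{"role": "user", "content": "..."}]``) together with additional sampling params, and can do whatever the user wants, for example:

- Call the LLM generate API
- Call tools: web search, database query, code sandbox, etc.
- Interact with an environment
- Reflect
- ...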

.. code:: python

   class AgentLoopBase(ABC):
       @abstractmethod
       async def run(self, sampling_params: dict[str, Any], **kwargs) -> AgentLoopOutput:
           """Run agent loop to interact with LLM server and environment.

           Args:
               sampling_params (Dict[str, Any]): LLM sampling params.
               **kwargs: dataset fields from `verl.utils.dataset.RLHFDataset`.

           Returns:
               AgentLoopOutput: Agent loop output.
           """
           raise NotImplementedError
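
As a minimal sketch of a user-defined loop, the example below subclasses a stand-in ``AgentLoopBase`` and performs a single generation turn. The ``AgentLoopOutput`` dataclass, the ``fake_generate`` coroutine, the ``SingleTurnAgentLoop`` class, and the ``prompt_ids`` keyword argument are illustrative assumptions rather than verl's actual classes; a real loop would send the generation request to an LLM server.

.. code:: python

   import asyncio
   from abc import ABC, abstractmethod
   from dataclasses import dataclass
   from typing import Any


   @dataclass
   class AgentLoopOutput:
       # Stand-in output type (assumption); field names follow this document.
       prompt_ids: list[int]
       response_ids: list[int]
       response_mask: list[int]


   class AgentLoopBase(ABC):
       # Stand-in base class with the same ``run`` signature as above.
       @abstractmethod
       async def run(self, sampling_params: dict[str, Any], **kwargs) -> AgentLoopOutput:
           raise NotImplementedError


   async def fake_generate(prompt_ids: list[int], sampling_params: dict[str, Any]) -> list[int]:
       # Hypothetical LLM call: returns each prompt id plus one, cut to max_tokens.
       max_tokens = sampling_params.get("max_tokens", len(prompt_ids))
       return [i + 1 for i in prompt_ids][:max_tokens]


   class SingleTurnAgentLoop(AgentLoopBase):
       async def run(self, sampling_params: dict[str, Any], **kwargs) -> AgentLoopOutput:
           # ``prompt_ids`` is an assumed keyword; a real loop would tokenize the prompt messages.
           prompt_ids = kwargs["prompt_ids"]
           response_ids = await fake_generate(prompt_ids, sampling_params)
           # Every response token came from the LLM, so the mask is all ones.
           return AgentLoopOutput(prompt_ids, response_ids, [1] * len(response_ids))


   output = asyncio.run(SingleTurnAgentLoop().run({"max_tokens": 2}, prompt_ids=[10, 11, 12]))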

After running the user-defined loop, ``run`` should return an ``AgentLoopOutput``, which contains the prompt token ids, the response token ids, and the response mask.

.. code:: python

   class AgentLoopOutput(BaseModel):
       """Agent loop output."""

       prompt_ids: list[int]
       """Prompt token ids."""
       response_ids: list[int]
       """Response token ids including LLM generated token, tool response token."""
       response_mask: list[int]
       """Response mask, 1 for LLM generated token, 0 for tool response token."""

.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/agent_loop_output.svg?raw=true

.. note:: ``AgentLoopOutput`` produces only one trajectory for a given prompt; support for multiple trajectories is still under discussion.
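
The following sketch illustrates how the response ids and the response mask could be assembled over several turns: tokens generated by the LLM get mask 1, and tool response tokens get mask 0. The helper ``append_turn`` and the token ids are made up for illustration.

.. code:: python

   def append_turn(response_ids, response_mask, token_ids, from_llm):
       """Append one turn's tokens and mark them 1 (LLM) or 0 (tool)."""
       response_ids.extend(token_ids)
       response_mask.extend([1 if from_llm else 0] * len(token_ids))


   response_ids: list[int] = []
   response_mask: list[int] = []

   append_turn(response_ids, response_mask, [101, 102, 103], from_llm=True)  # LLM emits a tool call
   append_turn(response_ids, response_mask, [201, 202], from_llm=False)      # tool response tokens
   append_turn(response_ids, response_mask, [104, 105], from_llm=True)       # LLM final answer

Keeping the two lists the same length lets the mask be applied position by position, so that only LLM-generated tokens are marked for training.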

Architecture Design
-------------------

.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/agent_loop_architecture.png?raw=true

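A single PPO step consists of two phases: rollout and train. In the rollout phase:

1. PPOTrainer samples a batch from the dataset and calls ``AgentLoopManager.generate_sequences``.
2. AgentLoopManager calls ``wake_up`` on all async LLM server instances, which syncs the weights between the inference engine (vLLM/SGLang) and the training engine (FSDP/Megatron-LM).
3. AgentLoopManager splits the batch into chunks and sends each chunk to an ``AgentLoopWorker``.
4. AgentLoopWorker receives its chunk and, for each prompt, instantiates a user-defined ``AgentLoopBase``, runs its ``run`` coroutine to completion, and collects the ``AgentLoopOutput``.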

.. tip::
   AgentLoopWorker schedules multiple coroutines concurrently. If the number of AgentLoopWorkers equals batch_size, each worker handles a single prompt.
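
Steps 3 and 4 can be sketched with plain ``asyncio``. This is only an illustration under simplifying assumptions: coroutines stand in for the worker processes, and ``split_into_chunks``, ``run_agent_loop``, ``worker``, and the local ``generate_sequences`` are hypothetical helpers, not verl functions.

.. code:: python

   import asyncio


   def split_into_chunks(batch, num_workers):
       """Split a batch into at most num_workers nearly equal contiguous chunks."""
       size, extra = divmod(len(batch), num_workers)
       chunks, start = [], 0
       for i in range(num_workers):
           end = start + size + (1 if i < extra else 0)
           if end > start:
               chunks.append(batch[start:end])
           start = end
       return chunks


   async def run_agent_loop(prompt):
       # Placeholder for one user-defined ``run`` coroutine.
       await asyncio.sleep(0)
       return f"output:{prompt}"


   async def worker(chunk):
       # One worker schedules a coroutine per prompt and awaits them together.
       return await asyncio.gather(*(run_agent_loop(p) for p in chunk))


   async def generate_sequences(batch, num_workers):
       chunks = split_into_chunks(batch, num_workers)
       per_worker = await asyncio.gather(*(worker(c) for c in chunks))
       # Flatten in chunk order so outputs line up with the input batch.
       return [out for outs in per_worker for out in outs]


   batch = ["p0", "p1", "p2", "p3", "p4"]
   chunks = split_into_chunks(batch, 2)
   outputs = asyncio.run(generate_sequences(batch, 2))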

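Inside the agent loop, when the user needs the LLM to generate a response:

5. Call ``AsyncLLMServerManager.generate`` with prompt_ids.
6. AsyncLLMServerManager selects the server instance with the fewest requests in the first turn and sends the request to it. (In subsequent turns, the request is sent to the same server instance.)
7. AsyncLLMServer receives the request, talks to the model_runner over IPC/RPC, and generates the response. (vLLM and SGLang differ slightly here; see below.)

Once all prompts in all AgentLoopWorkers have finished, AgentLoopManager gathers the results and returns them to PPOTrainer.

8. AgentLoopManager calls ``sleep`` on all server instances, which frees the KV cache and offloads the weights to CPU memory.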
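
The wake-up and sleep bracket around a rollout can be sketched as follows. ``FakeServer``, ``rollout``, and ``fake_workers`` are hypothetical stand-ins, and the ``try``/``finally`` is a design choice of this sketch (so that servers are put back to sleep even if a worker fails), not a statement about how verl implements it.

.. code:: python

   import asyncio


   class FakeServer:
       """Hypothetical stand-in for a handle to one async LLM server instance."""

       def __init__(self):
           self.awake = False
           self.events = []

       async def wake_up(self):
           # A real server would sync weights here.
           self.awake = True
           self.events.append("wake_up")

       async def sleep(self):
           # A real server would free the KV cache and offload weights here.
           self.awake = False
           self.events.append("sleep")


   async def rollout(servers, run_workers):
       await asyncio.gather(*(s.wake_up() for s in servers))
       try:
           return await run_workers()
       finally:
           await asyncio.gather(*(s.sleep() for s in servers))


   async def fake_workers():
       # Placeholder for dispatching chunks to workers and gathering results.
       return ["trajectory-0", "trajectory-1"]


   servers = [FakeServer(), FakeServer()]
   results = asyncio.run(rollout(servers, fake_workers))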

AsyncLLMServer
~~~~~~~~~~~~~~

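AsyncLLMServer is the abstraction of an LLM server and provides two types of generation API:

- `OpenAI chat completion <https://platform.openai.com/docs/api-reference/chat>`_: generates a response for a given chat conversation.
- Token in, token out: generates response ids for given token ids.

We officially support vLLM and SGLang AsyncLLMServer; both implement the two APIs and are well tested. Other inference engines can be integrated easily by implementing the ``AsyncServerBase`` class.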

.. code:: python

   class AsyncServerBase(ABC):
       @abstractmethod
       async def chat_completion(self, raw_request: Request) -> JSONResponse:
           """OpenAI chat completion API.

           Args:
               raw_request (Request): raw json request
           
           Returns:
               JSONResponse: json response

           API reference: https://platform.openai.com/docs/api-reference/chat/create
           """
           raise NotImplementedError

       @abstractmethod
       async def generate(self, prompt_ids: list[int], sampling_params: dict[str, Any], request_id: str) -> list[int]:
           """Generate response ids given prompt ids.

           Args:
               prompt_ids (List[int]): prompt ids
               sampling_params (Dict[str, Any]): sampling params
               request_id (str): request id

           Returns:
               List[int]: response ids
           """
           raise NotImplementedError


Chat completion vs Token in token out
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. warning::
   The conclusions below are based on our recent experience and remain open to further investigation and discussion.

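Almost all agent frameworks (LangGraph, CrewAI, LlamaIndex, etc.) call the LLM through the OpenAI chat completion API and keep the chat history as messages. Users may therefore expect us to use the chat completion API in multi-turn rollout as well.

However, based on our recent experience with single-turn training on DAPO and multi-turn training on `retool <https://github.com/volcengine/verl/tree/main/recipe/retool>`_, we found that the token_ids obtained by applying the chat template to the final messages may not equal the concatenation of prompt_ids and response_ids from each turn.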

.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/multi_turn.png?raw=true

**Where does this inconsistency come from?**

First, the tool parser may alter the content. For example:

.. code:: json

   {"role": "assistant", "content": "Let me call a <tool_call>...</tool_call> and get the result"}

After tool call extraction, the message becomes:

.. code:: json

   {"role": "assistant", "content": "Let me call a and get the result", "tool_calls": [{"name": "foo", "arguments": "{}"}]}

Re-encoding the extracted message may not reproduce the response_ids originally generated by the LLM.

Second, the ``decode-encode`` round trip can also introduce inconsistency: `Agent-R1 issue#30 <https://github.com/0russwest0/Agent-R1/issues/30#issuecomment-2826155367>`_.
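
To make the round-trip problem concrete, here is a toy illustration with a made-up three-token vocabulary and a greedy longest-match encoder. It is not a real tokenizer; it only shows that the ids a model sampled need not be the ids an encoder produces for the same text.

.. code:: python

   # Toy vocabulary (assumption): "ab" exists both as one token and as "a" + "b".
   VOCAB = {"a": 0, "b": 1, "ab": 2}
   ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}


   def decode(ids):
       return "".join(ID_TO_TOKEN[i] for i in ids)


   def encode(text):
       """Greedy longest-match tokenization over the toy vocabulary."""
       ids, pos = [], 0
       while pos < len(text):
           for length in range(len(text) - pos, 0, -1):
               piece = text[pos:pos + length]
               if piece in VOCAB:
                   ids.append(VOCAB[piece])
                   pos += length
                   break
           else:
               raise ValueError(f"cannot tokenize {text[pos:]!r}")
       return ids


   generated_ids = [0, 1]          # the model sampled "a" then "b"
   text = decode(generated_ids)    # the text reads "ab"
   reencoded_ids = encode(text)    # the encoder prefers the single token "ab"

Both id sequences decode to the same text, yet they are different token sequences.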

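**What is the impact of this inconsistency?**

This inconsistency is not a big problem for serving or agent systems, but it is critical for RL training: it makes the trajectory deviate from the policy model's distribution. We have observed that applying apply_chat_template to the final chat history messages can prevent even single-turn PPO training from converging.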

vLLM
^^^^

.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/async_vllm.png?raw=true

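For vLLM, the Async LLM Engine runs in the same process as the server, while the ModelRunner runs in the same process as the FSDP/Megatron-LM worker. The Async LLM Engine communicates with the ModelRunner through ZeroMQ. When the server receives a request, it calls the engine directly to generate response_ids.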

SGLang
^^^^^^

.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/async_sglang.png?raw=true

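For SGLang, the Async LLM Engine runs in the same process as FSDP/Megatron-LM worker-0 and spawns multiple subprocesses as ModelRunners. The Async LLM Engine likewise communicates with the ModelRunners through ZeroMQ. When the server receives a request, it makes a remote call to worker-0 and gets the response_ids back.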

AsyncLLMServerManager
~~~~~~~~~~~~~~~~~~~~~

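AsyncLLMServerManager acts as a proxy for multiple AsyncLLMServer instances and provides:

- Load balancing: in the first turn, select the server instance with the fewest requests and send the request to it.
- Sticky session: bind the request_id to a server instance so that the same request_id is sent to the same server instance in subsequent turns.

AsyncLLMServerManager is passed to ``AgentLoopBase.__init__``. Whenever users want to interact with the LLM inside the agent loop, they can call ``AsyncLLMServerManager.generate`` to generate response_ids.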

.. code:: python

   class AsyncLLMServerManager:
       async def generate(
           self,
           request_id,
           *,
           prompt_ids: list[int],
           sampling_params: dict[str, Any],
       ) -> list[int]:
           """Generate tokens from prompt ids.

           Args:
               request_id (str): request id for sticky session.
               prompt_ids (List[int]): List of prompt token ids.
               sampling_params (Dict[str, Any]): Sampling parameters for the chat completion.

           Returns:
               List[int]: List of generated token ids.
           """
           ...
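
The two routing behaviors, least-requests selection and sticky sessions, can be sketched as below. ``StickyLeastRequestsRouter`` is a hypothetical class written for this illustration; it counts the requests sent to each server and keeps an unbounded session map, whereas a production implementation would presumably bound that map.

.. code:: python

   class StickyLeastRequestsRouter:
       """Illustrative router (assumption), not verl's implementation."""

       def __init__(self, num_servers):
           self.request_counts = [0] * num_servers
           self.sessions = {}  # request_id -> server index

       def choose_server(self, request_id):
           server = self.sessions.get(request_id)
           if server is None:
               # First turn: pick the server that has received the fewest requests.
               server = min(
                   range(len(self.request_counts)),
                   key=self.request_counts.__getitem__,
               )
               self.sessions[request_id] = server
           # Later turns reuse the bound server (sticky session).
           self.request_counts[server] += 1
           return server


   router = StickyLeastRequestsRouter(num_servers=2)
   first = router.choose_server("req-a")   # tie, so the lowest index wins
   second = router.choose_server("req-b")  # goes to the other server
   again = router.choose_server("req-a")   # second turn of req-a, same server as before
   third = router.choose_server("req-c")   # new request, least-loaded server

Routing later turns of a request to the same server keeps the whole conversation on one instance.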

Next Steps
----------

- :doc:`Agentic RL Training<../start/agentic_rl>`: quick start for agentic RL training with the gsm8k dataset.
- `LangGraph MathExpression <https://github.com/volcengine/verl/tree/main/recipe/langgraph_agent/example>`_: demonstrates how to build an agent loop with LangGraph.
- `Retool <https://github.com/volcengine/verl/tree/main/recipe/retool>`_: end-to-end reproduction of the Retool paper using a tool agent.