Installation
============

Requirements
------------

- **Python**: Version >= 3.10
- **CUDA**: Version >= 12.8
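
The two requirements above can be checked with a small shell snippet. This is only a sketch: the helper name ``version_ge`` is hypothetical, it relies on GNU ``sort -V``, and it assumes ``nvcc`` is on ``PATH`` whenever the CUDA toolkit is installed.

.. code:: bash

    # version_ge A B: succeed when version A >= version B (uses GNU sort -V).
    version_ge() {
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
    }

    # Python: compare the interpreter's major.minor against 3.10.
    py_ver=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
    if version_ge "$py_ver" 3.10; then
        echo "Python $py_ver: OK"
    else
        echo "Python $py_ver: too old, need >= 3.10"
    fi

    # CUDA: parse "release X.Y" from nvcc, if nvcc is available.
    if command -v nvcc >/dev/null 2>&1; then
        cuda_ver=$(nvcc --version | sed -n 's/.*release \([0-9][0-9.]*\).*/\1/p')
        if version_ge "$cuda_ver" 12.8; then
            echo "CUDA $cuda_ver: OK"
        else
            echo "CUDA $cuda_ver: too old, need >= 12.8"
        fi
    else
        echo "nvcc not found; CUDA toolkit version not checked"
    fi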

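verl supports various backends. Currently, the following configurations are available:

- **FSDP** and **Megatron-LM** (optional) for training.
- **SGLang**, **vLLM** and **TGI** for rollout generation (here, rollout means the generation/inference stage of an RL algorithm).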

Choices of Backend Engines
----------------------------

1. Training:

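We recommend the **FSDP** backend for investigating, researching and prototyping different models, datasets and RL algorithms. The guide for this backend can be found in :doc:`FSDP Workers<../workers/fsdp_workers>`.

For users who need better scalability, we recommend the **Megatron-LM** backend. Currently, we support `Megatron-LM v0.13.1 <https://github.com/NVIDIA/Megatron-LM/tree/core_v0.13.1>`_. The guide for this backend can be found in :doc:`Megatron-LM Workers<../workers/megatron_workers>`.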


2. Inference:

For inference, vLLM 0.8.3 and later have been tested for stability. We recommend setting the environment variable ``VLLM_USE_V1=1`` for the best performance.
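
As an illustration, the variable can be exported in the shell that launches training so that processes started from that shell inherit it. This is a minimal sketch; where exactly to set it depends on how you launch your jobs.

.. code:: bash

    # Enable the vLLM V1 engine for every process started from this shell.
    export VLLM_USE_V1=1

    # Confirm the value that child processes will see.
    echo "VLLM_USE_V1=${VLLM_USE_V1}"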

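For SGLang, refer to :doc:`SGLang Backend<../workers/sglang_worker>` for detailed installation and usage instructions. SGLang rollout is under active development and offers many advanced features and optimizations. We encourage users to report any issues or provide feedback via the `SGLang Issue Tracker <https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/issues/106>`_.

The Hugging Face TGI integration is typically used for debugging and single-GPU exploration.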

Install from docker image
-------------------------

Starting from v0.6.0, we use the vLLM and SGLang release images as our base images.

Base Image
::::::::::

- vLLM: https://hub.docker.com/r/vllm/vllm-openai
- SGLang: https://hub.docker.com/r/lmsysorg/sglang

Application Image
:::::::::::::::::

On top of the base image, the following packages are added:

- flash_attn
- Megatron-LM
- Apex
- TransformerEngine
- DeepEP

Latest Dockerfiles:

- `Dockerfile.stable.vllm <https://github.com/volcengine/verl/blob/main/docker/Dockerfile.stable.vllm>`_
- `Dockerfile.stable.sglang <https://github.com/volcengine/verl/blob/main/docker/Dockerfile.stable.sglang>`_

All pre-built images are available on Docker Hub: `verlai/verl <https://hub.docker.com/r/verlai/verl>`_. For example, ``verlai/verl:sgl055.latest`` and ``verlai/verl:vllm011.latest``.
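
A pre-built image can be fetched with ``docker pull``. The helper below is a hypothetical convenience wrapper (the name ``pull_verl_image`` is not part of verl); it only prepends the ``verlai/verl`` repository name to a tag such as the ones listed above.

.. code:: bash

    # pull_verl_image TAG: pull verlai/verl:TAG from Docker Hub (hypothetical helper).
    pull_verl_image() {
        local image="verlai/verl:$1"
        if ! command -v docker >/dev/null 2>&1; then
            echo "docker not found; cannot pull $image" >&2
            return 1
        fi
        docker pull "$image"
    }

For example, ``pull_verl_image vllm011.latest`` pulls the vLLM-based image mentioned above.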

You can find the latest images used for development and CI in our GitHub workflows:

- `.github/workflows/vllm.yml <https://github.com/volcengine/verl/blob/main/.github/workflows/vllm.yml>`_
- `.github/workflows/sgl.yml <https://github.com/volcengine/verl/blob/main/.github/workflows/sgl.yml>`_


Installation from Docker
::::::::::::::::::::::::

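After pulling the desired Docker image and installing the desired inference and training frameworks, you can run it with the following steps:

1. Launch the desired Docker image and attach to it: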

.. code:: bash

    docker create --runtime=nvidia --gpus all --net=host --shm-size="10g" --cap-add=SYS_ADMIN -v .:/workspace/verl --name verl <image:tag> sleep infinity
    docker start verl
    docker exec -it verl bash


2. If you use the images we provide, you only need to install verl itself, without its dependencies:

.. code:: bash

    # install the nightly version (recommended)
    git clone https://github.com/volcengine/verl && cd verl
    pip3 install --no-deps -e .

[Optional] If you want to switch between different frameworks, you can install verl with the following commands:

.. code:: bash

    # install the nightly version (recommended)
    git clone https://github.com/volcengine/verl && cd verl
    pip3 install -e .[vllm]
    pip3 install -e .[sglang]
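
After either install path, one way to confirm the result is to ask pip for the package metadata. This is a small sketch with a hypothetical helper name (``check_verl_install``); note that ``pip3 show`` only reports what pip has recorded and does not try to import verl.

.. code:: bash

    # check_verl_install: report whether pip can see verl (hypothetical helper).
    check_verl_install() {
        if pip3 show verl >/dev/null 2>&1; then
            # Keep only the fields that identify the install.
            pip3 show verl 2>/dev/null | grep -E '^(Name|Version|Location):'
        else
            echo "verl is not installed in this environment"
        fi
    }

    check_verl_install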


Install from custom environment
---------------------------------------------

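We recommend using the Docker images for convenience. However, if your environment is not compatible with the Docker images, you can also install verl in a Python environment.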

.. note::

    - The Dockerfiles provide more details than this installation guide. You can find examples in each Dockerfile, for example `verl0.6-cu128-torch2.8.0-fa2.7.4 Dockerfile.base <https://github.com/volcengine/verl/blob/v0.6.0/docker/verl0.6-cu128-torch2.8.0-fa2.7.4/Dockerfile.base>`_.


Pre-requisites
::::::::::::::

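Training and inference engines need CUDA/cuDNN and other dependencies to make full use of hardware acceleration.
Some of these dependencies are easily overwritten when other packages are installed,
so they are revisited in the :ref:`Post-installation` step.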

.. note::

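    - The installation steps below are the recommended configuration for the latest version of verl.

    If you are customizing your own environment, you can ignore the strict version constraints.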

The following prerequisites must be installed:

- **CUDA**: Version >= 12.8
- **cuDNN**: Version >= 9.10.0
- **Apex**

CUDA 12.8 or above is recommended, matching the Docker images.
Refer to `NVIDIA's official website <https://developer.nvidia.com/cuda-toolkit-archive>`_ for other CUDA versions.

.. code:: bash

    # change directory to anywhere you like; the verl source code directory is not recommended
    wget https://developer.download.nvidia.com/compute/cuda/12.8.1/local_installers/cuda-repo-ubuntu2204-12-8-local_12.8.1-570.124.06-1_amd64.deb
    dpkg -i cuda-repo-ubuntu2204-12-8-local_12.8.1-570.124.06-1_amd64.deb
    cp /var/cuda-repo-ubuntu2204-12-8-local/cuda-*-keyring.gpg /usr/share/keyrings/
    apt-get update
    apt-get -y install cuda-toolkit-12-8
    update-alternatives --set cuda /usr/local/cuda-12.8
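
After the toolkit is installed, the shell usually needs to be told where to find it. The sketch below follows the usual CUDA post-installation convention of prepending the toolkit's ``bin`` and ``lib64`` directories to the search paths; the install prefix ``/usr/local/cuda-12.8`` is an assumption and should be adjusted to your system.

.. code:: bash

    # Assumed install prefix; adjust if your toolkit lives elsewhere.
    CUDA_HOME=/usr/local/cuda-12.8

    # Prepend the toolkit's binaries and libraries to the search paths.
    export CUDA_HOME
    export PATH="${CUDA_HOME}/bin${PATH:+:${PATH}}"
    export LD_LIBRARY_PATH="${CUDA_HOME}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

Adding these lines to your shell profile makes the setting persistent across sessions.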


cuDNN can be installed with the following commands.
Refer to `NVIDIA's official website <https://developer.nvidia.com/rdp/cudnn-archive>`_ for other cuDNN versions.

.. code:: bash

    # change directory to anywhere you like; the verl source code directory is not recommended
    wget https://developer.download.nvidia.com/compute/cudnn/9.10.2/local_installers/cudnn-local-repo-ubuntu2204-9.10.2_1.0-1_amd64.deb
    dpkg -i cudnn-local-repo-ubuntu2204-9.10.2_1.0-1_amd64.deb
    cp /var/cudnn-local-repo-ubuntu2204-9.10.2/cudnn-*-keyring.gpg /usr/share/keyrings/
    apt-get update
    apt-get -y install cudnn-cuda-12

Install dependencies
::::::::::::::::::::

.. note::

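    We recommend using a fresh conda environment to install verl and its dependencies.

    **Note that inference frameworks often pin your PyTorch version strictly and, if you are not careful, will directly overwrite your installed PyTorch.**

    As a countermeasure, we recommend installing the inference framework first, together with the PyTorch version it requires. For vLLM, if you want to use your existing PyTorch,
    follow the official instructions
    `Use an existing PyTorch installation <https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#build-wheel-from-source>`_.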


1. First, to manage the environment, we recommend using conda:

.. code:: bash

   conda create -n verl python==3.12
   conda activate verl


2. Then, run the installation script that we provide in verl (``scripts/install_vllm_sglang_mcore.sh``):

.. code:: bash

    # Make sure you have activated verl conda env
    # If you need to run with megatron
    bash scripts/install_vllm_sglang_mcore.sh
    # Or if you simply need to run with FSDP
    USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh


If you encounter errors in this step, inspect the script and follow its steps manually.

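[Optional] NVIDIA Apex is recommended for Megatron-LM training, but it is not needed if you only use the FSDP backend.
You can install it with the following command, but note that this step can take a very long time.
Setting the ``MAX_JOBS`` environment variable is recommended to speed up the build,
but do not set it too large, otherwise memory will be exhausted and your machine may hang.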

.. code:: bash

    # change directory to anywhere you like; the verl source code directory is not recommended
    git clone https://github.com/NVIDIA/apex.git && \
    cd apex && \
    MAX_JOBS=32 pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
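
Instead of a fixed number, ``MAX_JOBS`` can be derived from the machine's resources. The sketch below is Linux-only (it reads ``/proc/meminfo``), and the budget of one compile job per 4 GiB of RAM is an illustrative assumption, not an Apex requirement; tune it to your machine.

.. code:: bash

    # Rough heuristic (an assumption): one compile job per 4 GiB of RAM,
    # capped at the number of CPU cores and never below 1.
    cores=$(nproc)
    mem_gib=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo 2>/dev/null)
    jobs=$(( ${mem_gib:-0} / 4 ))
    if [ "$jobs" -gt "$cores" ]; then jobs=$cores; fi
    if [ "$jobs" -lt 1 ]; then jobs=1; fi

    export MAX_JOBS=$jobs
    echo "Building with MAX_JOBS=${MAX_JOBS}"

With ``MAX_JOBS`` exported this way, the ``pip install`` command above can be run without the inline ``MAX_JOBS=...`` prefix.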

Install verl
::::::::::::

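To install the latest version of verl, the best way is to clone and install it from source. You can then modify our code to customize your own post-training jobs.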

.. code:: bash

   git clone https://github.com/volcengine/verl.git
   cd verl
   pip install --no-deps -e .


Post-installation
:::::::::::::::::

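Make sure that installing other packages has not overwritten the packages installed earlier.

Packages worth checking are:

- **torch** and the torch family of packages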
- **vLLM**
- **SGLang**
- **pyarrow**
- **tensordict**
- **nvidia-cudnn-cu12**: for the Megatron backend

If you encounter package version issues when running verl, update the outdated packages.
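
The packages listed above can be inspected in one pass. The sketch below uses Python's standard ``importlib.metadata`` to look up each distribution; the helper name ``report_versions`` is hypothetical, and the package names are the pip distribution names assumed to correspond to the list above.

.. code:: bash

    # Print the installed version of each package worth double-checking.
    report_versions() {
        local pkg version
        for pkg in torch vllm sglang pyarrow tensordict nvidia-cudnn-cu12; do
            version=$(python3 -c 'import sys, importlib.metadata as m; print(m.version(sys.argv[1]))' "$pkg" 2>/dev/null) \
                || version="not installed"
            printf '%-20s %s\n' "$pkg" "$version"
        done
    }

    report_versions

Running it once before and once after installing an extra package makes an unintended overwrite easy to spot.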


Install with AMD GPUs - ROCM kernel support
------------------------------------------------------------------

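If you run on a system with AMD GPUs (MI300) and the ROCm platform, you cannot use the previous quickstart to run verl. Instead, follow the steps below to build a Docker image and run it.
If you encounter any issues when using verl with AMD GPUs, feel free to contact me - `Yusheng Su <https://yushengsu-thu.github.io/>`_.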

Find the Dockerfile for AMD ROCm: `docker/Dockerfile.rocm <https://github.com/volcengine/verl/blob/main/docker/Dockerfile.rocm>`_
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

.. code-block:: dockerfile

    #  Build the docker in the repo dir:
    # docker build -f docker/Dockerfile.rocm -t verl-rocm:03.04.2015 .
    # docker images # you can find your built docker
    FROM rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4

    # Set working directory
    # WORKDIR $PWD/app

    # Set environment variables
    ENV PYTORCH_ROCM_ARCH="gfx90a;gfx942"

    # Install vllm
    RUN pip uninstall -y vllm && \
        rm -rf vllm && \
        git clone -b v0.6.3 https://github.com/vllm-project/vllm.git && \
        cd vllm && \
        MAX_JOBS=$(nproc) python3 setup.py install && \
        cd .. && \
        rm -rf vllm

    # Copy the entire project directory
    COPY . .

    # Install dependencies
    RUN pip install "tensordict<0.6" --no-deps && \
        pip install accelerate \
        codetiming \
        datasets \
        dill \
        hydra-core \
        liger-kernel \
        numpy \
        pandas \
        peft \
        "pyarrow>=15.0.0" \
        pylatexenc \
        "ray[data,train,tune,serve]" \
        torchdata \
        transformers \
        wandb \
        orjson \
        pybind11 && \
        pip install -e . --no-deps

Build the image
::::::::::::::::::::::::

.. code-block:: bash

    docker build -f docker/Dockerfile.rocm -t verl-rocm .

Launch the container
::::::::::::::::::::::::::::

.. code-block:: bash

    docker run --rm -it \
      --device /dev/dri \
      --device /dev/kfd \
      -p 8265:8265 \
      --group-add video \
      --cap-add SYS_PTRACE \
      --security-opt seccomp=unconfined \
      --privileged \
      -v $HOME/.ssh:/root/.ssh \
      -v $HOME:$HOME \
      --shm-size 128G \
      -w $PWD \
      verl-rocm \
      /bin/bash

If you do not want to run as root and need to run as your own user,
add ``-e HOST_UID=$(id -u)`` and ``-e HOST_GID=$(id -g)`` to the Docker launch script above.

verl with AMD GPUs currently supports FSDP as the training engine and vLLM and SGLang as the inference engines. We will support Megatron in the future.
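
Once inside the container, it can be worth confirming that the GPUs are visible before starting a job. The sketch below assumes ``rocm-smi`` is shipped in the image and relies on the fact that ROCm builds of PyTorch expose devices through the ``torch.cuda`` API; the helper name ``check_rocm_gpus`` is hypothetical.

.. code:: bash

    # check_rocm_gpus: show the AMD GPUs seen by ROCm and by PyTorch (hypothetical helper).
    check_rocm_gpus() {
        # List devices with rocm-smi when the tool is present.
        if command -v rocm-smi >/dev/null 2>&1; then
            rocm-smi
        else
            echo "rocm-smi not found"
        fi

        # ROCm builds of PyTorch report GPUs through torch.cuda.
        python3 -c 'import torch; print("GPUs visible to PyTorch:", torch.cuda.device_count())' 2>/dev/null \
            || echo "PyTorch is not importable in this environment"
    }

    check_rocm_gpus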