Model Engine
============

.. _vermouth: https://github.com/vermouth1992

Author: `Chi Zhang <https://github.com/vermouth1992>`_

Last updated: 09/25/2025.

Current Support Matrix
----------------------

+----------+-----------+--------------+-------------+--------------------------+
| Backends | Model     | Scalability  | Model       | Pain points              |
|          | Supported |              | Definition  |                          |
|          |           |              |             |                          |
+==========+===========+==============+=============+==========================+
| FSDP     | Day 1     | - Dense is OK| Huggingface | Monkey patch can be      |
| +        | support   |              | + monkey    | easily impacted by       |
| ulysses  | HF model  | - MoE is bad | patch       | transformers version     |
+----------+-----------+--------------+-------------+--------------------------+
| MCore    | Limited   | Best         | GPTModel    | Supporting new models is |
|          |           |              | (One model  | difficult                |
|          |           |              | for all)    |                          |
+----------+-----------+--------------+-------------+--------------------------+

-  We use monkey patches to modify the attention function to support ulysses
-  We use monkey patches to modify VLM models so that FSDP can handle mixed data (samples with and without images)

Class Hierarchy
---------------

Note that all workers and trainers run in **SPMD** mode. The SFT/DPO/RM
trainers are invoked directly by ``torchrun``. The Actor/Critic workers can
also be invoked by a RayWorkerGroup, in which case they provide APIs to a
single controller.

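-  Base engine level: implements model initialization, optimizer initialization, LR scheduler initialization, sharding, and the checkpoint manager
-  Full engine level: inherits from the base engine and implements ``forward_step``
-  Worker/SPMD trainer level: **engine agnostic**; implements the training logic using the abstract engine APIs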
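
The three levels can be pictured with a minimal sketch. Only ``forward_step`` comes from the description above; the class names ``BaseEngine``, ``ToyEngine``, and ``ToyTrainer``, the ``initialize`` method, and the placeholder loss are hypothetical and do not reflect the actual verl classes.

```python
from abc import ABC, abstractmethod


class BaseEngine(ABC):
    """Base engine level (hypothetical): owns initialization and checkpointing."""

    def __init__(self):
        self.initialized = False

    def initialize(self):
        # A real backend would build the model, optimizer, and LR scheduler,
        # apply sharding, and create the checkpoint manager here.
        self.initialized = True

    @abstractmethod
    def forward_step(self, batch):
        """Run the model on one batch; provided by the full engine level."""


class ToyEngine(BaseEngine):
    """Full engine level: inherits the base engine and implements forward_step."""

    def forward_step(self, batch):
        # Placeholder "loss": the mean of the batch values.
        return sum(batch) / len(batch)


class ToyTrainer:
    """Worker/SPMD trainer level: written only against the abstract engine API."""

    def __init__(self, engine: BaseEngine):
        self.engine = engine
        self.engine.initialize()

    def train(self, batches):
        return [self.engine.forward_step(batch) for batch in batches]


losses = ToyTrainer(ToyEngine()).train([[1.0, 3.0], [2.0, 2.0, 5.0]])
print(losses)
```

Because ``ToyTrainer`` depends only on ``BaseEngine``, swapping in a different backend would not require changes to the training loop.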

The RL trainer uses workers to construct a HybridFlow program, which is
outside the scope of the model engine.

Existing Model Types
--------------------

========== ====================== ======================
Model type Language model         Value model
========== ====================== ======================
Input      text/image/video/audio text/image/video/audio
Output     logits for next token  logits as value
========== ====================== ======================

Currently, we have two model types: language model and value model. We
expect to extend these categories to include the Qwen-Omni family (which
outputs both text and audio) and VLA models.

Data Format
-----------

Currently, verl adopts a left-right padding data format in the RL trainer.
This creates massive padding when response lengths vary widely. We will
start implementing a no-padding format throughout the whole system.
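
To get a feel for the overhead, the sketch below counts token slots under both formats. It assumes that "left-right padding" means every prompt is padded to the longest prompt and every response to the longest response in the batch; the function names and the sequence lengths are made up for illustration.

```python
def left_right_padded_slots(prompt_lens, response_lens):
    # Assumption: each sample occupies max(prompt) + max(response) slots.
    return len(prompt_lens) * (max(prompt_lens) + max(response_lens))


def no_padding_slots(prompt_lens, response_lens):
    # Without padding, only the real tokens are stored.
    return sum(prompt_lens) + sum(response_lens)


# Illustrative lengths: one long response dominates the batch.
prompt_lens = [12, 30, 8, 25]
response_lens = [40, 900, 64, 15]

padded = left_right_padded_slots(prompt_lens, response_lens)
packed = no_padding_slots(prompt_lens, response_lens)
print(f"padded slots: {padded}, real tokens: {packed}")
print(f"fraction wasted on padding: {1 - packed / padded:.1%}")
```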

.. image:: https://github.com/vermouth1992/verl-data/blob/master/images/data_format.png?raw=true
   :alt: Data Format

Here is the migration plan:

-  Implement the no-padding format in the engine.
-  Add a transformation layer in the Actor/Critic worker.
-  Replace the Actor/Critic worker in the RL trainer.
-  Implement the no-padding format throughout the system.
Checkpoint System
-----------------

.. image:: https://github.com/vermouth1992/verl-data/blob/master/images/verl-ckpt.png?raw=true
   :alt: Model Engine Checkpoint System

The engine constructs the model from a Hugging Face config and then loads
weights from a Hugging Face checkpoint. If the engine directly uses the
Hugging Face model definition, it can use the functions provided by
``transformers``. Otherwise, each engine has to write its own
checkpoint loading logic (e.g.,
`mbridge <https://github.com/ISEEKYAN/mbridge>`__). During model
training, each engine has to implement ``save_checkpoint`` and
``load_checkpoint``, which save and load intermediate sharded checkpoints
including the model, optimizer, and LR scheduler states. Each engine also
has to implement a checkpoint merge script that merges the intermediate
sharded checkpoint back into the Hugging Face format.
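
The save, load, and merge responsibilities can be illustrated with a toy sketch. The names ``save_checkpoint`` and ``load_checkpoint`` come from the text above, but their signatures here, the ``merge_checkpoint`` helper, the JSON file layout, and the concatenation-based sharding are all assumptions made for illustration; real engines use framework-specific sharded formats.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(ckpt_dir, rank, model_shard, optimizer_state, lr_scheduler_state):
    """Each rank writes its own shard of model, optimizer, and LR scheduler state."""
    state = {
        "model": model_shard,
        "optimizer": optimizer_state,
        "lr_scheduler": lr_scheduler_state,
    }
    (Path(ckpt_dir) / f"rank_{rank}.json").write_text(json.dumps(state))


def load_checkpoint(ckpt_dir, rank):
    """Each rank reads back only its own shard to resume training."""
    return json.loads((Path(ckpt_dir) / f"rank_{rank}.json").read_text())


def merge_checkpoint(ckpt_dir, world_size):
    """Concatenate per-rank model shards into full weights; training state is dropped."""
    merged = {}
    for rank in range(world_size):
        for name, values in load_checkpoint(ckpt_dir, rank)["model"].items():
            merged.setdefault(name, []).extend(values)
    return merged


with tempfile.TemporaryDirectory() as ckpt_dir:
    full = {"linear.weight": [0.1, 0.2, 0.3, 0.4]}
    for rank in range(2):
        # Toy sharding: each of the two ranks holds half of every parameter.
        shard = {k: v[rank * 2:(rank + 1) * 2] for k, v in full.items()}
        save_checkpoint(ckpt_dir, rank, shard, {"step": 10}, {"last_lr": 1e-5})
    resumed = load_checkpoint(ckpt_dir, rank=1)
    merged = merge_checkpoint(ckpt_dir, world_size=2)

print(resumed["model"], merged)
```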

API
---

A tentative model engine API can be found at:
https://github.com/volcengine/verl/blob/main/verl/workers/engine/base.py#L24

Extension
---------

Add a new backend
~~~~~~~~~~~~~~~~~

-  Start a new folder under ``verl/workers/engine``. Then, implement
   ``transformer_impl.py``. If you want to implement a non-transformer
   model, please contact us in advance.
-  Add the engine config to the GSM8k SFT trainer script:
   https://github.com/volcengine/verl/blob/main/tests/special_e2e/sft/run_sft_engine_gsm8k.sh
-  Invoke the tests with your backend:
   https://github.com/volcengine/verl/blob/main/tests/special_e2e/sft/test_sft_engine_all.sh.
   This test script will run various backends and various
   configurations, and compare the loss and grad norm of the first step
   to make sure they are close.
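
The kind of check that test performs can be sketched as follows. This is not the test script itself: the ``find_mismatches`` helper, the 1% relative tolerance, the backend names, and the metric values are all made up for illustration.

```python
import math


def find_mismatches(first_step_metrics, rel_tol=1e-2):
    """Compare every backend's first-step loss and grad norm to the first entry."""
    (ref_name, ref), *others = first_step_metrics.items()
    mismatches = []
    for name, metrics in others:
        for key in ("loss", "grad_norm"):
            if not math.isclose(metrics[key], ref[key], rel_tol=rel_tol):
                mismatches.append((name, key))
    return mismatches


# Illustrative numbers only; real values come from running each backend.
first_step_metrics = {
    "fsdp": {"loss": 1.2345, "grad_norm": 3.210},
    "mcore": {"loss": 1.2351, "grad_norm": 3.198},
    "my_backend": {"loss": 1.3100, "grad_norm": 3.205},
}

mismatches = find_mismatches(first_step_metrics)
print(mismatches)
```

A non-empty result points at the backend and metric that diverged, which is usually the first thing to inspect when bringing up a new engine.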

Add a new model type
~~~~~~~~~~~~~~~~~~~~

-  This is mainly reserved for models whose output is not just text
   (e.g., Qwen3-Omni). Please discuss with us before you proceed.