# Recipe: Self-Play Preference Optimization (SPPO)

Last updated: 05/28/2025.

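verl provides a community recipe implementation of the paper [Self-Play Preference Optimization for Language Model Alignment](https://arxiv.org/abs/2405.00675). SPPO (Self-Play Preference Optimization) can significantly improve the performance of a large language model (LLM) without strong external signals such as responses or preferences from GPT-4. It can even outperform models trained with methods such as iterative Direct Preference Optimization (DPO). SPPO is theoretically grounded: it ensures that the LLM converges to the von Neumann winner (i.e., the Nash equilibrium) under general, potentially intransitive preferences, and it has been empirically validated through extensive evaluations on multiple datasets.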
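For orientation, the per-iteration update described in the SPPO paper can be written roughly as follows. This is a sketch based on our reading of the paper, not a description of the code in this recipe. Here $\pi_t$ is the policy at iteration $t$, $\eta$ is a step-size hyperparameter, and $\widehat{P}(y \succ \pi_t \mid x)$ is the estimated probability that response $y$ is preferred over a response sampled from $\pi_t$; consult the paper for the exact statement and assumptions.

$$
\pi_{t+1} = \arg\min_{\pi} \; \mathbb{E}_{x,\; y \sim \pi_t(\cdot \mid x)} \left[ \left( \log \frac{\pi(y \mid x)}{\pi_t(y \mid x)} - \eta \left( \widehat{P}(y \succ \pi_t \mid x) - \frac{1}{2} \right) \right)^{2} \right]
$$

Intuitively, responses that beat the current policy more than half the time have their log-probability pushed up, and responses that lose more often have it pushed down.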

Paper authors: [Yue Wu](https://yuewu.us/)\*, [Zhiqing Sun](https://www.cs.cmu.edu/~zhiqings/)\*, [Huizhuo Yuan](https://scholar.google.com/citations?user=8foZzX4AAAAJ)\*, [Kaixuan Ji](https://scholar.google.com/citations?user=FOoKDukAAAAJ), [Yiming Yang](https://www.cs.cmu.edu/~yiming/), [Quanquan Gu](https://web.cs.ucla.edu/~qgu/)

verl implementation authors: [Yuhao Yang](https://github.com/yhyang201), [Chenyang Zhao](https://github.com/zhaochenyang20)

[[Webpage](https://uclaml.github.io/SPPO/)] [[Huggingface](https://huggingface.co/papers/2405.00675)] [[Paper](https://arxiv.org/abs/2405.00675)] [[Original Implementation](https://github.com/uclaml/SPPO)]

## Reproduce the Experiment

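We evaluate SPPO on the MATH dataset. Starting from an initial score of 46.6 with Qwen2.5-7B-Instruct, we reach a score of 65.6 after 20 epochs of training, which places our model approximately in the top 20 on the [MATH leaderboard](https://paperswithcode.com/sota/math-word-problem-solving-on-math). Note that verl's internal evaluation metrics may not exactly match the official evaluation methodology for Qwen2.5-7B-Instruct. For consistency and fair comparison, we therefore report only results obtained with verl's evaluation framework.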

```bash
git clone git@github.com:volcengine/verl.git
cd verl
python3 -m uv pip install -e ".[sglang]"

export WANDB_API_KEY=<YOUR_WANDB_API_KEY>

python3 examples/data_preprocess/math_dataset.py --local_dir ~/data/math
huggingface-cli download Qwen/Qwen2.5-7B-Instruct --local-dir $HOME/models/Qwen2.5-7B-Instruct

export CUDA_VISIBLE_DEVICES=0,1,2,3
bash recipe/sppo/run_qwen2.5-7b_rm.sh
```
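The commands above expose four GPUs through `CUDA_VISIBLE_DEVICES` before launching the run script. As a quick sanity check before launching, the number of listed devices can be counted with a small helper. `count_visible_gpus` is a hypothetical function written for this illustration and is not part of verl; whether the run script requires exactly this many GPUs is an assumption you should confirm against the script itself.

```bash
# Hypothetical helper (not part of verl): count the GPU ids in a
# comma-separated list, defaulting to $CUDA_VISIBLE_DEVICES.
count_visible_gpus() {
  devices="${1:-${CUDA_VISIBLE_DEVICES:-}}"
  if [ -z "$devices" ]; then
    # Nothing listed: report zero devices.
    echo 0
    return
  fi
  # Number of ids = number of commas + 1.
  commas=$(printf '%s' "$devices" | tr -cd ',' | wc -c)
  echo $((commas + 1))
}

count_visible_gpus "0,1,2,3"   # prints 4
```

Compare the printed count with the GPU count configured in the run script before starting a long job.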

Note that the installation may occasionally fail to install flash-attn. If this happens, you can install it manually by running:

```bash
python3 -m uv pip install wheel
python3 -m uv pip install packaging
python3 -m uv pip install flash-attn --no-build-isolation --no-deps
```
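After the manual installation, it can help to confirm that Python can locate the package before starting training. The following is a minimal sketch; `py_module_available` is a hypothetical helper written for this illustration, not part of verl or flash-attn. It only checks that the `flash_attn` module can be found, so a build that is present but broken (for example, compiled against a mismatched CUDA or PyTorch version) would still need an actual import to detect.

```bash
# Hypothetical helper (not part of verl): succeed if Python can locate
# the given top-level module, without importing it.
py_module_available() {
  python3 - "$1" <<'PY'
import importlib.util
import sys

# find_spec returns None when a top-level module is not installed.
sys.exit(0 if importlib.util.find_spec(sys.argv[1]) is not None else 1)
PY
}

if py_module_available flash_attn; then
  echo "flash_attn is installed"
else
  echo "flash_attn not found; try the manual installation steps above"
fi
```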

## Acknowledgement

We sincerely thank the following people for their contributions and guidance:

- [Yue Wu](https://yuewu.us/)
- [Chendong Wang](https://cdwang96.github.io/)
- [Yifan Zhang](https://github.com/yifanzhang-pro)
- [Yongan Xiang](https://github.com/BearBiscuit05)
- [Junrong Lin](https://github.com/ocss884)
- [Yuxuan Tong](https://github.com/tongyx361)
- [Guangming Shen](https://github.com/PeterSH6)
- [Biao He](https://www.linkedin.com/in/biao-he/)
- [Qingquan Song](https://qingquansong.github.io/)
- [Quanquan Gu](https://web.cs.ucla.edu/~qgu/)