Reference links

Introductory article: https://mp.weixin.qq.com/s/lPhi76_Aa0Ky4-qu5HTmrQ
Repo: https://github.com/vllm-project/vllm

Migration

Natively supported model architectures:


Architecture	Models	Example HuggingFace Models
GPT2LMHeadModel	GPT-2	gpt2, gpt2-xl, etc.
GPTBigCodeForCausalLM	StarCoder, SantaCoder, WizardCoder	bigcode/starcoder, bigcode/gpt_bigcode-santacoder, WizardLM/WizardCoder-15B-V1.0, etc.
GPTNeoXForCausalLM	GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM	EleutherAI/gpt-neox-20b, EleutherAI/pythia-12b, OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.
LlamaForCausalLM	LLaMA, Vicuna, Alpaca, Koala, Guanaco	openlm-research/open_llama_13b, lmsys/vicuna-13b-v1.3, young-geng/koala, JosephusCheung/Guanaco, etc.
OPTForCausalLM	OPT, OPT-IML	facebook/opt-66b, facebook/opt-iml-max-30b, etc.

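If your model uses one of the architectures above, you can use vLLM with it directly. Otherwise you need to add support yourself, following: https://vllm.readthedocs.io/en/latest/models/adding_model.html#adding-a-new-model

Implementing an unsupported architecture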

0 Fork the vLLM repository

Build and install from source (pip install -e .) rather than installing the released package with pip.

1 Bring your model code

Place your model file in the vllm/model_executor/models directory.

2 Rewrite the forward methods

Rewrite the forward methods (of several classes):

Remove unnecessary code, such as code used only during training.
Change the parameters of the forward method.
Update the code by considering that input_ids and positions are now flattened tensors.
Replace the attention operation with either GPTPagedAttention or GPTNeoXPagedAttention, depending on the model’s architecture.

def forward(
    self,
    input_ids: torch.Tensor,
-    attention_mask: Optional[torch.Tensor] = None,
-    position_ids: Optional[torch.LongTensor] = None,
-    past_key_values: Optional[List[torch.FloatTensor]] = None,
-    inputs_embeds: Optional[torch.FloatTensor] = None,
-    labels: Optional[torch.LongTensor] = None,
-    use_cache: Optional[bool] = None,
-    output_attentions: Optional[bool] = None,
-    output_hidden_states: Optional[bool] = None,
-    return_dict: Optional[bool] = None,
-) -> Union[Tuple, CausalLMOutputWithPast]:
+    positions: torch.Tensor,
+    kv_caches: List[KVCache],
+    input_metadata: InputMetadata,
+    cache_events: Optional[List[torch.cuda.Event]],
+) -> Dict[int, SequenceOutputs]:
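
To make the point about flattened input_ids and positions concrete, here is a minimal sketch in plain Python lists rather than tensors. It assumes, for illustration, that flattening means concatenating the token IDs of all sequences in a batch without padding and restarting the position counter for each sequence. The helper name flatten_batch is hypothetical and not part of vLLM.

```python
from typing import List, Tuple


def flatten_batch(sequences: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Concatenate per-sequence token IDs into one flat list.

    Hypothetical helper for illustration only: positions restart at 0
    for every sequence, and no padding tokens are inserted.
    """
    input_ids: List[int] = []
    positions: List[int] = []
    for token_ids in sequences:
        input_ids.extend(token_ids)
        positions.extend(range(len(token_ids)))
    return input_ids, positions


# Two prompts of different lengths become one flat batch.
ids, pos = flatten_batch([[11, 12, 13], [21, 22]])
print(ids)  # [11, 12, 13, 21, 22]
print(pos)  # [0, 1, 2, 0, 1]
```

Because there is no batch dimension any more, code that indexes input_ids as [batch, seq_len] has to be rewritten to work on the single flat dimension.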

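Note: vLLM natively supports the basic multi-head attention mechanism and its variant with rotary positional embeddings. If your model uses a different attention mechanism, you need to implement a new attention layer in vLLM yourself (vllm/model_executor/layers/attention.py).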

4 Implement the weight loading logic

Implement the load_weights method in your *ForCausalLM class. This method should load the weights from the HuggingFace checkpoint file and assign them to the corresponding layers in your model. While the process is straightforward for most layers, the tensor-parallel layers need extra care, because their weights must be partitioned across multiple GPUs.
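
The extra care needed for tensor-parallel layers can be sketched as follows. This is a simplified illustration that uses nested lists instead of tensors and assumes a layer whose weight is split evenly along its first dimension. The names shard_rows, tp_rank and tp_size are illustrative and are not vLLM's API.

```python
from typing import List

Matrix = List[List[float]]


def shard_rows(weight: Matrix, tp_rank: int, tp_size: int) -> Matrix:
    """Return the slice of `weight` owned by one tensor-parallel rank."""
    num_rows = len(weight)
    if num_rows % tp_size != 0:
        raise ValueError(
            f"{num_rows} rows cannot be split evenly across {tp_size} ranks"
        )
    shard_size = num_rows // tp_size
    start = tp_rank * shard_size
    return weight[start:start + shard_size]


# A 4x2 "checkpoint" weight split across 2 GPUs.
full_weight = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
print(shard_rows(full_weight, tp_rank=0, tp_size=2))  # [[0.0, 0.1], [1.0, 1.1]]
print(shard_rows(full_weight, tp_rank=1, tp_size=2))  # [[2.0, 2.1], [3.0, 3.1]]
```

Each rank copies only its own slice of the checkpoint tensor into its local parameter, which is why an uneven split has to be rejected or padded.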

5 Register your model

Import your *ForCausalLM class in vllm/model_executor/models/__init__.py.

Add your *ForCausalLM class to _MODEL_REGISTRY in vllm/model_executor/model_loader.py.
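
The registration step can be pictured with a small stand-alone sketch. It assumes the registry is a plain dict keyed by the architecture string found in the architectures field of the HuggingFace config. The class MyModelForCausalLM and the helper get_model_class are hypothetical names used only for illustration; this is not a copy of vLLM's loader code.

```python
from typing import Dict, List, Type


class MyModelForCausalLM:
    """Placeholder for the model class added in the previous steps."""


# Simplified stand-in for the registry: architecture name -> model class.
_MODEL_REGISTRY: Dict[str, Type] = {
    "MyModelForCausalLM": MyModelForCausalLM,
}


def get_model_class(architectures: List[str]) -> Type:
    """Pick the first architecture listed in the config that is registered."""
    for arch in architectures:
        if arch in _MODEL_REGISTRY:
            return _MODEL_REGISTRY[arch]
    raise ValueError(f"Model architectures {architectures} are not supported")


print(get_model_class(["MyModelForCausalLM"]).__name__)  # MyModelForCausalLM
```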

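Adapting FinGPT

Natively supported architecture

FinGPT uses the LlamaForCausalLM architecture, which vLLM supports natively, so in theory it should work without changes.

lora_weight cannot be used

Does applying it effectively change the architecture? vLLM would probably need to add support for PeftModel.

The developers already have this on their roadmap.

References: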

https://github.com/vllm-project/vllm/issues/182

https://github.com/vllm-project/vllm/issues/244

int8 quantization

vLLM does not support it yet, but the developers already have it on their roadmap.

References:

https://github.com/vllm-project/vllm/issues/214

https://github.com/vllm-project/vllm/issues/244

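Using fp16

When instantiating LLM, the optional dtype parameter defaults to auto, which resolves to fp16 here (the engine log further down shows dtype=torch.float16).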
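
As a rough sketch of how dtype can be passed explicitly, the function below instantiates LLM in fp16 and generates completions. The model path is a placeholder you must supply, the two-GPU setting and max_tokens=128 mirror values used elsewhere in these notes, and the helper names format_result and run_offline_inference are made up for this example. The vllm import is done lazily inside the function so the file can be imported on a machine without a GPU.

```python
from typing import List


def format_result(prompt: str, generated_text: str) -> str:
    """Format one result in the same style as the log lines in these notes."""
    return f"Prompt: {prompt!r}, Generated text: {generated_text!r}"


def run_offline_inference(model_path: str, prompts: List[str]) -> List[str]:
    """Generate completions with vLLM in fp16 on two GPUs (assumed setup)."""
    # Imported lazily: requires vLLM to be installed and GPUs to be available.
    from vllm import LLM, SamplingParams

    sampling_params = SamplingParams(max_tokens=128)
    llm = LLM(model=model_path, dtype="float16", tensor_parallel_size=2)
    outputs = llm.generate(prompts, sampling_params)
    return [format_result(o.prompt, o.outputs[0].text) for o in outputs]
```

Calling run_offline_inference("path/to/your/model", ["The capital of France is"]) would return one formatted line per prompt.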

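Problems and errors

Out-of-disk-space error during the build

pip install -e . fails with ERROR: Could not install packages due to an OSError: [Errno 28] No space left on device

Use TMPDIR=your_tmp_dir pip install --cache-dir=your_cache_dir -e . to set the temporary directory and the cache directory explicitly. For some reason, setting these through environment variables beforehand did not take effect.

If your pip version is older than 20.3, pip install -b $BUILD_PATH $PACKAGE_NAME should also work for setting the build directory explicitly.

Multi-GPU inference

Command: NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES='0,1,2,3' python3 offline_inference.py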

`Cuda failure 'peer access is not supported between these two devices'`

Set the environment variable NCCL_P2P_DISABLE=1.

Reference: https://discuss.pytorch.org/t/torch-lib-c10d-processgroupnccl-cpp-825-unhandled-cuda-error-nccl-version-2-7-8/130405

`ValueError: The number of GPUs per node is not divisible by the number of tensor parallelism.`

The number of visible GPUs must be divisible by tensor_parallel_size.

`AssertionError: 49953 is not divisible by 2`

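vLLM's parallelism strategy splits the vocabulary evenly across the GPUs, so the vocabulary size must be divisible by tensor_parallel_size.

“If your vocabulary size cannot be divided by the number of GPUs, you will see this error.”

The fix is to change vocab_size in the model config and to adjust the loaded parameters accordingly.

Concretely, the original vocabulary size is 49953. After the config is loaded in get_model, we change it to 49954, so each GPU is assigned 24977 vocabulary entries. The loaded weights still have the original size, so one of the two shards ends up with only 24976 rows and has to be padded.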
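
The arithmetic above can be checked with a short sketch. It uses nested lists in place of weight tensors, and filling the extra row with zeros is an assumption made here; the notes only say that the short shard must be extended. The function names padded_vocab_size and pad_embedding_rows are hypothetical.

```python
from typing import List


def padded_vocab_size(vocab_size: int, tp_size: int) -> int:
    """Round vocab_size up to the next multiple of tp_size."""
    return ((vocab_size + tp_size - 1) // tp_size) * tp_size


def pad_embedding_rows(rows: List[List[float]], target_rows: int) -> List[List[float]]:
    """Append all-zero rows (an assumption) until there are target_rows rows."""
    hidden_size = len(rows[0])
    missing = target_rows - len(rows)
    return rows + [[0.0] * hidden_size for _ in range(missing)]


new_size = padded_vocab_size(49953, tp_size=2)
print(new_size)       # 49954
print(new_size // 2)  # 24977 entries per GPU

# Toy example: 3 embedding rows padded to 4 so they split evenly in two.
toy = pad_embedding_rows([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]], target_rows=4)
print(toy[-1])        # [0.0, 0.0]
```

The padded row corresponds to a token ID the tokenizer never produces, so its values should not affect generation as long as that ID is never sampled.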

Reference: https://github.com/vllm-project/vllm/issues/250

`ValueError: Total number of attention heads (40) must be divisible by tensor parallel size (6).`

num_attention_heads must be divisible by tensor_parallel_size.
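
The three divisibility constraints above can be checked before launching a run. The following is a minimal pre-flight sketch; the function name and the message wording are made up here and do not reproduce vLLM's own checks or error text.

```python
from typing import List


def check_tensor_parallel_config(
    num_gpus: int,
    vocab_size: int,
    num_attention_heads: int,
    tensor_parallel_size: int,
) -> List[str]:
    """Return human-readable problems; an empty list means the config passes."""
    problems: List[str] = []
    checks = {
        "number of visible GPUs": num_gpus,
        "vocab_size": vocab_size,
        "num_attention_heads": num_attention_heads,
    }
    for name, value in checks.items():
        if value % tensor_parallel_size != 0:
            problems.append(
                f"{name} ({value}) is not divisible by "
                f"tensor_parallel_size ({tensor_parallel_size})"
            )
    return problems


# Values taken from the errors in these notes.
print(check_tensor_parallel_config(2, 49953, 40, 2))  # vocab_size fails
print(check_tensor_parallel_config(6, 49954, 40, 6))  # vocab_size and heads fail
print(check_tensor_parallel_config(2, 49954, 40, 2))  # []
```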

`ModuleNotFoundError: No module named 'xxx'`

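Suppose you load the model in test.py. In that case, copy all the module files from the original training project, unchanged, into the same directory as test.py.

Saving the state_dict rather than the model object itself avoids this problem.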
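
The reason is that saving a whole model object relies on pickle, which records only a reference to the class's module, not the class code. The sketch below reproduces the effect with the standard library alone: pickle stands in for torch.save, and the module train_project_model and class TinyModel are throwaway names created just for this demonstration.

```python
import importlib
import os
import pickle
import sys
import tempfile

# Create a throwaway module that plays the role of the training project's code.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "train_project_model.py"), "w") as f:
    f.write(
        "class TinyModel:\n"
        "    def __init__(self):\n"
        "        self.weights = {'w': [1.0, 2.0]}\n"
    )

sys.path.insert(0, module_dir)
importlib.invalidate_caches()
train_project_model = importlib.import_module("train_project_model")
model = train_project_model.TinyModel()

whole_model_bytes = pickle.dumps(model)         # stores a reference to the class
state_only_bytes = pickle.dumps(model.weights)  # stores plain data only

# Simulate the inference environment, where the training code is absent.
sys.path.remove(module_dir)
del sys.modules["train_project_model"]

try:
    pickle.loads(whole_model_bytes)
    error_name = None
except ModuleNotFoundError as exc:
    error_name = type(exc).__name__

print(error_name)                      # ModuleNotFoundError
print(pickle.loads(state_only_bytes))  # {'w': [1.0, 2.0]}
```

With PyTorch, the counterpart is to save with torch.save(model.state_dict(), path) and, at load time, construct the model class yourself before calling model.load_state_dict(torch.load(path)).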

References:

https://blog.csdn.net/weixin_42815846/article/details/115289861

https://github.com/pytorch/pytorch/issues/18325

Adaptation result

Success: inference works.

2023-06-27 20:10:02,659 INFO worker.py:1636 -- Started a local Ray instance.
INFO 06-27 20:10:03 llm_engine.py:59] Initializing an LLM engine with config: model='****', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=2, seed=0)
INFO 06-27 20:12:44 llm_engine.py:128] # GPU blocks: 1302, # CPU blocks: 655
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  5.64it/s]
Prompt: 'The capital of France is', Generated text: ' Paris\n中文翻译：法国的首都是巴黎</s>'
Prompt: 'Hello, my name is', Generated text: "Jack. I'm the proprietor of this store.\n作为一家商店"
Prompt: 'The president of the United States is', Generated text: 'the leader of the free world.\n中文翻译：美国总统是自由世界的领导人'
Prompt: 'The future of AI is', Generated text: 'in a sense the future of computing,\n中文翻译：人工智能的未来，'

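Measuring the throughput gain from vLLM

Data processing

Since only throughput matters here, only the prompts are kept from the data, and they are processed in batches.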

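# Wrap each column's list of values in an outer list, so that every batch
# of `batch_size` rows becomes a single row holding the whole batch.
def group_batch(batch):
    return {k: [v] for k, v in batch.items()}

# `dataset` and `batch_size` are defined elsewhere in the test script.
dataset_batched = dataset.map(group_batch, batched=True, batch_size=batch_size)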

This uses map to batch the dataset: the first argument is the processing function, which here returns the batched data.

Reference: https://zhuanlan.zhihu.com/p/557032513
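
To see what group_batch does to the data without installing the datasets library, the sketch below imitates a batched map on a plain dict of columns. simulate_batched_map is a rough, hypothetical stand-in for dataset.map(..., batched=True, batch_size=...) and ignores everything else the real method does.

```python
from typing import Dict, List


def group_batch(batch: Dict[str, List]) -> Dict[str, List[List]]:
    """Same function as above: wrap each column's values in an outer list."""
    return {k: [v] for k, v in batch.items()}


def simulate_batched_map(columns: Dict[str, List], batch_size: int) -> Dict[str, List]:
    """Feed slices of batch_size rows to group_batch and collect the results."""
    num_rows = len(next(iter(columns.values())))
    result: Dict[str, List] = {k: [] for k in columns}
    for start in range(0, num_rows, batch_size):
        batch = {k: v[start:start + batch_size] for k, v in columns.items()}
        for k, v in group_batch(batch).items():
            result[k].extend(v)
    return result


prompts = {"prompt": ["p0", "p1", "p2", "p3", "p4"]}
print(simulate_batched_map(prompts, batch_size=2))
# {'prompt': [['p0', 'p1'], ['p2', 'p3'], ['p4']]}
```

Each row of the result is one batch, and the last batch is shorter when the row count is not a multiple of batch_size.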

without vllm

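Tested on the test split of alespalla/chatbot_instruction_prompts:

6 GPUs in total: the vocabulary embeddings and the like on GPU:0; the 40 layers spread evenly across GPU:1, 2, 3, 4, 5; input_ids and attention_mask on GPU:0.
The maximum number of generated tokens is limited to 128. To keep the total run time under control, batch_size is set to 20 and the total number of prompts is 10000.
GPU:0 uses about 3.2 GB of memory; GPU:1 to 5 use about 15 GB each.
Generating for all 10000 prompts took 80 min.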

with vllm

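Tested on the test split of alespalla/chatbot_instruction_prompts:

2 GPUs, maximum number of generated tokens limited to 128, 10000 prompts in total, no manual batching.
Both GPUs use 23.3 GB of memory during generation (loading the model takes 13.5 GB on each).
Generating for all 10000 prompts took 15 min.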
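
The two runs can be compared with a few lines of arithmetic. The figures are the ones reported above; since the runs differ in GPU count and batching, the GPU-minutes ratio is only a rough normalization, not a controlled benchmark.

```python
def prompts_per_minute(num_prompts: int, minutes: float) -> float:
    """Throughput of a run in prompts per minute."""
    return num_prompts / minutes


def gpu_minutes(num_gpus: int, minutes: float) -> float:
    """Total GPU time consumed by a run."""
    return num_gpus * minutes


baseline = prompts_per_minute(10000, 80)   # without vLLM, 6 GPUs
with_vllm = prompts_per_minute(10000, 15)  # with vLLM, 2 GPUs

print(baseline)                                 # 125.0
print(round(with_vllm, 1))                      # 666.7
print(round(with_vllm / baseline, 2))           # 5.33
print(gpu_minutes(6, 80) / gpu_minutes(2, 15))  # 16.0
```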

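Problems

tqdm does not stay on a single line

If your code prints anything while the bar is active, the tqdm bar's display position gets disrupted.

The fix is to use tqdm.write instead of print: it temporarily clears the progress bar, prints the message, and then redraws the bar.

Also note that tqdm.write writes to sys.stdout by default, while the tqdm progress bar writes to sys.stderr by default, which can prevent the bar from being cleared. Pass file=sys.stdout to tqdm so that the bar goes to sys.stdout as well. Alternatively, you can direct tqdm.write to sys.stderr.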
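
A minimal sketch of this approach, assuming the tqdm package is installed: the bar and the messages are sent to the same stream so the bar can be cleared and redrawn cleanly. The function name run_with_progress is made up for this example.

```python
import sys
from typing import TextIO

from tqdm import tqdm


def run_with_progress(num_steps: int, stream: TextIO = sys.stdout) -> None:
    """Show a progress bar and log messages on the same stream."""
    for step in tqdm(range(num_steps), file=stream):
        # tqdm.write clears the bar, prints the message, then redraws the bar.
        tqdm.write(f"finished step {step}", file=stream)


run_with_progress(3)
```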

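Version number conventions

When setting up a package's __init__.py file, the value of the __version__ variable should usually be a string in the style of Semantic Versioning.

The Semantic Versioning format is MAJOR.MINOR.PATCH, where:
- MAJOR is the major version, incremented when you make incompatible API changes.
- MINOR is the minor version, incremented when you add or change functionality in a backward-compatible way.
- PATCH is the patch version, incremented when you make backward-compatible bug fixes.

Besides these three numbers, you can also use a pre-release version and build metadata. A pre-release version is appended after a hyphen (-) as an identifier for a development-stage version (such as alpha, beta, or rc). Build metadata is appended after a plus sign (+) as an identifier for a specific build.

Back to your question: "0.0.1" is a valid version number, so it raises no error. "0.0.1-fingpt" is likely to fail, because Python packaging tools validate the version against PEP 440 rather than Semantic Versioning, and under PEP 440 a suffix after the patch number must be a recognized pre-release, post-release, or dev marker (such as a, b, or rc); "fingpt" is none of these. Under Semantic Versioning alone, "-fingpt" would parse as a pre-release identifier. A custom label can instead be attached as a PEP 440 local version, e.g. "0.0.1+fingpt".

If you want a version number with a pre-release version and build metadata, you can use the following format:

MAJOR.MINOR.PATCH-PRERELEASE+BUILD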
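
As an illustration of the format above, here is a small parser built on a simplified regular expression. It is a sketch only: the pattern does not enforce every rule of the Semantic Versioning specification (for example, it accepts leading zeros), and the names ParsedVersion and parse_version are made up for this example.

```python
import re
from typing import NamedTuple, Optional

# Simplified pattern: MAJOR.MINOR.PATCH, optional -PRERELEASE, optional +BUILD.
_VERSION_RE = re.compile(
    r"(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)"
    r"(?:-(?P<prerelease>[0-9A-Za-z.-]+))?"
    r"(?:\+(?P<build>[0-9A-Za-z.-]+))?"
)


class ParsedVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    prerelease: Optional[str]
    build: Optional[str]


def parse_version(text: str) -> Optional[ParsedVersion]:
    """Return the parsed fields, or None if text does not match the format."""
    match = _VERSION_RE.fullmatch(text)
    if match is None:
        return None
    return ParsedVersion(
        int(match["major"]), int(match["minor"]), int(match["patch"]),
        match["prerelease"], match["build"],
    )


print(parse_version("0.0.1"))
print(parse_version("1.2.3-rc.1+build.5"))
print(parse_version("1.2"))  # None
```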

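Make sure your version number follows the convention and fits your actual needs. This makes the package easier to read and lets other developers interpret and handle your version number correctly.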

Berkeley's open-source LLM inference and serving library: vLLM