Reference video: https://www.bilibili.com/video/BV1yu411L7JN/

Repository: https://github.com/EvilPsyCHo/train_custom_LLM

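Input construction

Single-turn dialogue

Concatenate the input and output from the dataset directly. Drawback: the model cannot easily tell when to end the dialogue.
Append eos_token at the end. Drawback: the model still cannot easily tell which part is the input and which is the output.
Mark the input and output explicitly, for example by adding Instruction: and Output:.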
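The three options above can be combined into a single template. The following is a minimal sketch, not the repository's code: the function name `build_single_turn_text` and the default `</s>` EOS string are assumptions for illustration, and in practice the EOS string would come from `tokenizer.eos_token`.

```python
def build_single_turn_text(instruction: str, output: str, eos_token: str = "</s>") -> str:
    """Join one training pair with explicit role markers and a trailing EOS token.

    `eos_token` defaults to "</s>" only for illustration (assumption);
    use the tokenizer's own EOS string in real code.
    """
    # The markers tell the model which part is the input and which is the output;
    # the EOS token tells it where the answer ends.
    return f"Instruction: {instruction}\nOutput: {output}{eos_token}"


text = build_single_turn_text("Translate 'hello' into French.", "bonjour")
print(text)
```

At inference time the same template would be used up to `Output: `, leaving the model to produce the answer and the EOS token.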

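Multi-turn dialogue

Add round information to the input. Each round of the dialogue history is prefixed with [Round x]\n\n, and so is the current round, for example: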

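```
from typing import List, Optional, Tuple

def build_inputs(self, tokenizer, query: str, history: Optional[List[Tuple[str, str]]] = None):
    # Treat a missing history as an empty conversation.
    history = history or []
    prompt = ""
    # Replay every past round, each prefixed with its round number ("问" = question, "答" = answer).
    for i, (old_query, response) in enumerate(history):
        prompt += "[Round {}]\n\n问：{}\n\n答：{}\n\n".format(i + 1, old_query, response)
    # The current round ends at "答：" so that the model continues with the answer.
    prompt += "[Round {}]\n\n问：{}\n\n答：".format(len(history) + 1, query)
    inputs = tokenizer([prompt], return_tensors="pt")
    inputs = inputs.to(self.device)
    return inputs
```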

Label Mask

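We want the model to learn how to answer, not to predict the extra information we added, such as the input/output markers and the round information.

We can therefore use a Label Mask to exclude those positions from the loss.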
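A label mask can be sketched on plain token-id lists. This is an illustrative sketch under assumptions: the helper name `build_labels` and the token ids are made up, the whole prompt (markers included) is masked, and `ignore_id` stands for whatever value is later passed to the loss as `ignore_index`. The example uses -100, the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
from typing import List, Tuple


def build_labels(
    prompt_ids: List[int],
    answer_ids: List[int],
    eos_id: int,
    ignore_id: int,
) -> Tuple[List[int], List[int]]:
    """Return (input_ids, labels) where only the answer and EOS contribute to the loss."""
    input_ids = prompt_ids + answer_ids + [eos_id]
    # Prompt positions (markers, round info, the question) are replaced by
    # ignore_id so that the loss skips them.
    labels = [ignore_id] * len(prompt_ids) + answer_ids + [eos_id]
    return input_ids, labels


# Hypothetical token ids: three prompt tokens, two answer tokens, EOS id 2.
input_ids, labels = build_labels([11, 12, 13], [21, 22], eos_id=2, ignore_id=-100)
print(input_ids)
print(labels)
```

The two lists have the same length, so they can be padded and batched together.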

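Model

Model loading (quantized or non-quantized)

GPU memory usage of baichuan-7B: float16 (13 GB), 8-bit (7.1 GB), 4-bit (4.1 GB).
The class of the linear layers changes with the quantization level: torch.nn.Linear => bitsandbytes.nn.Linear8bitLt => bitsandbytes.nn.Linear4bit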
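The float16 figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes a round 7 billion parameters (an approximation, not the exact parameter count) and counts weights only, with no activations or optimizer state.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory taken by the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 7e9  # assumption: roughly 7 billion parameters

for name, bits in [("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

The float16 estimate lands close to the 13 GB quoted above. The 8-bit and 4-bit estimates come out lower than the quoted 7.1 GB and 4.1 GB, presumably because not every layer is quantized and quantization adds some bookkeeping overhead.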

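Model preparation

Freeze the parameters: LoRA training does not change the original model's parameters, so they need to be frozen.
Raise the precision of the output layer (for example, cast it to float32), which makes the loss computation more numerically stable.
If gradient checkpointing is used, intermediate activations are discarded during the forward pass and recomputed during the backward pass to save GPU memory. Because the base model's parameters (including the embedding layer) are frozen, the inputs to the checkpointed blocks do not require gradients, so the gradient would not reach the LoRA weights; enable_input_require_grads must therefore be called as well.

All of the above can be done conveniently with peft.prepare_model_for_kbit_training.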

LoRA configuration

Find the names of the layer modules that should be trained with LoRA. There are two ways:
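1. Iterate over all modules of the model together with their names; whenever a module is an instance of the target class, record its name and return the collected names. Note that if lm_head (the output layer) is among them, it must be removed. Example:
```
import bitsandbytes as bnb

def find_all_linear_names(model):
    # cls = bnb.nn.Linear8bitLt  # use this class for 8-bit models
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            # Keep only the last component of the dotted module name.
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names:  # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)
```
2. Pass the target layers in directly as a parameter (LoraConfig.target_modules), based on what is known about the model.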
With the LoRA configuration in place, call get_peft_model to obtain the model with the LoRA adapters attached.

Label shift and Loss

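CrossEntropyLoss is the usual choice. Pass ignore_index=tokenizer.pad_token_id to exclude padding positions from the loss.
During training, the input and the labels are the same sequence, but the model predicts the next token from the previous one (in fact, from the whole preceding sequence), so its output is offset by one position relative to the labels and the two must be aligned. For example, if the input and the labels are both [1, 2, 3, 4], the predictions will be [2, 3, 4, 5]; dropping the first label and the last prediction leaves [2, 3, 4] on both sides, and the loss can be computed.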
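The shift and the masked loss can be worked through in plain Python. This is a sketch of the idea, not the code used in training (which would operate on tensors with CrossEntropyLoss); the function name `shifted_cross_entropy` and the toy vocabulary of three tokens, with id 0 acting as the pad token, are assumptions for illustration.

```python
import math
from typing import List


def shifted_cross_entropy(logits: List[List[float]], labels: List[int], ignore_index: int) -> float:
    """Mean next-token cross-entropy over the positions whose label is not ignore_index."""
    shift_logits = logits[:-1]  # drop the prediction made at the last position
    shift_labels = labels[1:]   # drop the first label
    total, count = 0.0, 0
    for row, target in zip(shift_logits, shift_labels):
        if target == ignore_index:
            continue  # padded or masked position: no loss
        # Negative log-softmax of the target token.
        log_norm = math.log(sum(math.exp(x) for x in row))
        total += log_norm - row[target]
        count += 1
    # Returns 0.0 when every position is ignored (a choice made for this sketch).
    return total / count if count else 0.0


# Four positions, vocabulary of three tokens, uniform logits everywhere.
logits = [[0.0, 0.0, 0.0] for _ in range(4)]
labels = [1, 1, 2, 0]  # the trailing 0 is padding
loss = shifted_cross_entropy(logits, labels, ignore_index=0)
print(loss)
```

After the shift the targets are [1, 2, 0]; the padded position is skipped, so the loss is averaged over the two remaining positions.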

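Training

The Trainer class built into transformers is sufficient; in practice, usually only its compute_loss method needs to be overridden.
TrainingArguments is used together with Trainer and defines the vast majority of the parameters needed for training.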

Fine-tuning an LLM: Baichuan-7B as an Example
