1. The Core Idea of LoRA

LoRA, short for Low-Rank Adaptation of Large Language Models, is a parameter-efficient fine-tuning method for large models. The goal is to adapt the model to a new task by training only a tiny fraction of its parameters, avoiding a full retrain, so that you can quickly fine-tune a large model on your own dataset even without abundant GPU memory.

In models such as Transformer, ViT, and GPT, much of the computation goes through linear layers: $$y = W x$$
$$ W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}} $$
LoRA's approach: instead of updating the large model's weight $W$ directly, insert a low-rank product $BA$ alongside it as a trainable residual term: $$y = W x + BAx$$
where: $$ A \in \mathbb{R}^{r \times d_{\text{in}}} $$ $$ B \in \mathbb{R}^{d_{\text{out}} \times r} $$ $$ r \ll d_{\text{in}}, d_{\text{out}} $$
Full fine-tuning updates all of $W$, with parameter count $\text{Param}(W) = d_{\text{out}} \times d_{\text{in}}$. With LoRA, $BA$ contains only $\text{Param}_{\text{LoRA}} = r \times d_{\text{in}} + d_{\text{out}} \times r = r (d_{\text{in}} + d_{\text{out}})$ parameters.
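To make the savings concrete, here is the arithmetic for a hypothetical square layer with $d_{\text{in}} = d_{\text{out}} = 1024$ and rank $r = 16$ (the same sizes used in the toy examples later):

```python
# Parameter count: full fine-tuning vs. LoRA for one linear layer
# (hypothetical sizes: d_in = d_out = 1024, rank r = 16)
d_in, d_out, r = 1024, 1024, 16

full = d_out * d_in            # updating W directly
lora = r * d_in + d_out * r    # updating A and B only

print(full)                    # -> 1048576
print(lora)                    # -> 32768
print(f"{lora / full:.1%}")    # -> 3.1%
```

So for this layer LoRA trains roughly 3% of the original parameter count, and the ratio shrinks further as $d_{\text{in}}, d_{\text{out}}$ grow while $r$ stays fixed.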
A PyTorch implementation of the LoRA class:
```python
import torch
from torch import nn

class LoRA(nn.Module):
    def __init__(self, in_features, out_features, rank):
        super().__init__()
        self.rank = rank
        self.A = nn.Linear(in_features, rank, bias=False)
        self.B = nn.Linear(rank, out_features, bias=False)
        # A: small random init; B: zeros, so BA starts as a no-op
        self.A.weight.data.normal_(mean=0.0, std=0.02)
        self.B.weight.data.zero_()

    def forward(self, x):
        return self.B(self.A(x))
```
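One consequence of this initialization (random $A$, zero $B$) is that the LoRA branch outputs exactly zero before any training, so adding it cannot change the base model's behavior at step 0. A quick self-contained check (repeating the class above so it runs standalone):

```python
import torch
from torch import nn

class LoRA(nn.Module):
    def __init__(self, in_features, out_features, rank):
        super().__init__()
        self.rank = rank
        self.A = nn.Linear(in_features, rank, bias=False)
        self.B = nn.Linear(rank, out_features, bias=False)
        self.A.weight.data.normal_(mean=0.0, std=0.02)
        self.B.weight.data.zero_()

    def forward(self, x):
        return self.B(self.A(x))

lora = LoRA(1024, 1024, rank=16)
x = torch.randn(4, 1024)
# B.weight is all zeros, so B(A(x)) == 0 for any x
print(lora(x).abs().max().item())  # -> 0.0
```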
2. How to Inject LoRA into an Existing LLM

The following code implements this:
```python
def apply_lora(model, rank=16):
    for name, module in model.named_modules():
        # Only target square nn.Linear layers (e.g. attention projections)
        if isinstance(module, nn.Linear) and module.weight.shape[0] == module.weight.shape[1]:
            lora = LoRA(module.weight.shape[0], module.weight.shape[1], rank=rank).to(model.device)
            # Register the LoRA module as a child of this Linear layer
            setattr(module, "lora", lora)
            original_forward = module.forward

            # Capture the original forward and the lora module via default args
            def forward_with_lora(x, layer1=original_forward, layer2=lora):
                return layer1(x) + layer2(x)

            module.forward = forward_with_lora
```
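Because $B$ is zero-initialized, injection must leave the model's outputs unchanged at first; only the set of trainable parameters changes. A minimal end-to-end sanity check of the monkey-patched forward, re-declaring the pieces above (and a small hypothetical `TinyModel`) so it runs standalone:

```python
import torch
from torch import nn

class LoRA(nn.Module):
    def __init__(self, in_features, out_features, rank):
        super().__init__()
        self.A = nn.Linear(in_features, rank, bias=False)
        self.B = nn.Linear(rank, out_features, bias=False)
        self.A.weight.data.normal_(mean=0.0, std=0.02)
        self.B.weight.data.zero_()

    def forward(self, x):
        return self.B(self.A(x))

def apply_lora(model, rank=16):
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and module.weight.shape[0] == module.weight.shape[1]:
            lora = LoRA(module.weight.shape[0], module.weight.shape[1], rank=rank).to(model.device)
            setattr(module, "lora", lora)
            original_forward = module.forward

            def forward_with_lora(x, layer1=original_forward, layer2=lora):
                return layer1(x) + layer2(x)

            module.forward = forward_with_lora

class TinyModel(nn.Module):  # stand-in for any model with square Linear layers
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(64, 64)

    @property
    def device(self):
        return next(self.parameters()).device

    def forward(self, x):
        return self.linear(x)

model = TinyModel()
x = torch.randn(2, 64)
before = model(x)
apply_lora(model, rank=4)
after = model(x)
# The zero-initialized B means the injected branch contributes nothing yet
print(torch.allclose(before, after))  # -> True
```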
A simple example model:
```python
class TestModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1024, 1024)

    @property
    def device(self):
        return next(self.parameters()).device

    def forward(self, x):
        return self.linear(x)

model = TestModel()
print(model)
```
Printing the original model's structure:
```
TestModel(
  (linear): Linear(in_features=1024, out_features=1024, bias=True)
)
```
This shows that TestModel has a single member, linear, which is a standard nn.Linear layer.
Inject LoRA:
```python
apply_lora(model)
print(model)
```
Printing the model after LoRA injection:
```
TestModel(
  (linear): Linear(
    in_features=1024, out_features=1024, bias=True
    (lora): LoRA(
      (A): Linear(in_features=1024, out_features=16, bias=False)
      (B): Linear(in_features=16, out_features=1024, bias=False)
    )
  )
)
```
As you can see, the lora layer has been injected successfully: the lora module now lives inside the nn.Linear, as a member variable of that module.
We can print every layer of the model:
```python
for name, module in model.named_modules():
    print(f"{name}: {module.__class__.__name__}")
```
```
: TestModel
linear: Linear
linear.lora: LoRA
linear.lora.A: Linear
linear.lora.B: Linear
```
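This registration is what makes the later training and saving code work: because `setattr` stored the LoRA module as a child of the `nn.Linear`, its weights automatically appear in `named_parameters()` under names containing `lora`. A standalone illustration of the mechanism (using a plain `nn.Sequential` of two linears as a stand-in for the LoRA class):

```python
import torch
from torch import nn

linear = nn.Linear(1024, 1024)
# Plain attribute assignment registers a submodule on an nn.Module
linear.lora = nn.Sequential(
    nn.Linear(1024, 16, bias=False),  # plays the role of A
    nn.Linear(16, 1024, bias=False),  # plays the role of B
)

for name, p in linear.named_parameters():
    print(name, tuple(p.shape))
# -> weight (1024, 1024)
# -> bias (1024,)
# -> lora.0.weight (16, 1024)
# -> lora.1.weight (1024, 16)
```

Filtering parameter names for the substring `'lora'` is therefore enough to select exactly the trainable LoRA weights.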
3. Saving and Loading LoRA Weights

Since training updates only the LoRA parameters, saving and loading model weights only needs to handle this updated subset.
First, the model with LoRA injected is:
```
TestModel(
  (linear): Linear(
    in_features=1024, out_features=1024, bias=True
    (lora): LoRA(
      (A): Linear(in_features=1024, out_features=16, bias=False)
      (B): Linear(in_features=16, out_features=1024, bias=False)
    )
  )
)
```
Recursively traverse and print all modules in the model:
```python
for name, module in model.named_modules():
    print(name, ':', module)
```
Output:
```
: TestModel(
  (linear): Linear(
    in_features=1024, out_features=1024, bias=True
    (lora): LoRA(
      (A): Linear(in_features=1024, out_features=16, bias=False)
      (B): Linear(in_features=16, out_features=1024, bias=False)
    )
  )
)
linear : Linear(
  in_features=1024, out_features=1024, bias=True
  (lora): LoRA(
    (A): Linear(in_features=1024, out_features=16, bias=False)
    (B): Linear(in_features=16, out_features=1024, bias=False)
  )
)
linear.lora : LoRA(
  (A): Linear(in_features=1024, out_features=16, bias=False)
  (B): Linear(in_features=16, out_features=1024, bias=False)
)
linear.lora.A : Linear(in_features=1024, out_features=16, bias=False)
linear.lora.B : Linear(in_features=16, out_features=1024, bias=False)
```
As you can see, there are five modules in total (including the root).
We only care about modules that have a lora attribute:
```python
for name, module in model.named_modules():
    if hasattr(module, 'lora'):
        print(name, "------", module)
```
Output:
```
linear ------ Linear(
  in_features=1024, out_features=1024, bias=True
  (lora): LoRA(
    (A): Linear(in_features=1024, out_features=16, bias=False)
    (B): Linear(in_features=16, out_features=1024, bias=False)
  )
)
```
Only the linear submodule has a lora attribute, and during training only this layer's LoRA parameters are updated.
So we only need to save the weights of the linear.lora layer:
```python
def save_lora(model, path):
    state_dict = {}
    for name, module in model.named_modules():
        if hasattr(module, 'lora'):
            # Prefix each key with the owning module's name
            for k, v in module.lora.state_dict().items():
                state_dict[f"{name}.lora.{k}"] = v
    torch.save(state_dict, path)
    print(f"[LoRA] Saved {len(state_dict)} params to: {path}")

save_lora(model, "lora.pth")
```
Load the saved "lora.pth" and inspect its structure:
```python
lora = torch.load("lora.pth")
for k, v in lora.items():
    print(k, v.shape)
```
```
linear.lora.A.weight torch.Size([16, 1024])
linear.lora.B.weight torch.Size([1024, 16])
```
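Note that these shapes match the math exactly: $A \in \mathbb{R}^{r \times d_{\text{in}}}$ and $B \in \mathbb{R}^{d_{\text{out}} \times r}$. This is because `nn.Linear(in_features, out_features)` stores its weight as `(out_features, in_features)`:

```python
import torch
from torch import nn

# nn.Linear stores weight transposed, as (out_features, in_features),
# so A's weight is exactly the mathematical A with shape (r, d_in),
# and B's weight is the mathematical B with shape (d_out, r).
A = nn.Linear(1024, 16, bias=False)   # d_in=1024 -> r=16
B = nn.Linear(16, 1024, bias=False)   # r=16 -> d_out=1024
print(A.weight.shape)  # -> torch.Size([16, 1024])
print(B.weight.shape)  # -> torch.Size([1024, 16])
```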
Correspondingly, when loading the trained weights, we also only load the lora layers' weights:
```python
def load_lora(model, path):
    state_dict = torch.load(path, map_location=model.device)
    for name, module in model.named_modules():
        if hasattr(module, 'lora'):
            # Strip the "<name>.lora." prefix to match the submodule's own keys
            lora_state = {k.replace(f'{name}.lora.', ''): v
                          for k, v in state_dict.items() if f'{name}.lora.' in k}
            for k, v in lora_state.items():
                print(k, '----', v.shape)
            print(module.lora)
            module.lora.load_state_dict(lora_state)
```
```python
load_lora(model, "lora.pth")
```
Debug output during loading:
```
A.weight ---- torch.Size([16, 1024])
B.weight ---- torch.Size([1024, 16])
LoRA(
  (A): Linear(in_features=1024, out_features=16, bias=False)
  (B): Linear(in_features=16, out_features=1024, bias=False)
)
```
4. Training LoRA

Here LoRA is injected into the MiniMind model, and the SFT data loader and training function are reused directly; the corresponding code is the same as for SFT.
Let's look at how the trainable parameter count changes after injecting LoRA:
```python
model, tokenizer = init_model(lm_config)
apply_lora(model)

total_params = sum(p.numel() for p in model.parameters())
lora_params_count = sum(p.numel() for name, p in model.named_parameters() if 'lora' in name)
if not ddp or dist.get_rank() == 0:
    print(f"Total LLM parameters: {total_params}")
    print(f"LoRA parameters: {lora_params_count}")
    print(f"LoRA parameter ratio: {lora_params_count / total_params * 100:.2f}%")

# Freeze everything except the LoRA parameters
for name, param in model.named_parameters():
    if 'lora' not in name:
        param.requires_grad = False

# Collect only the LoRA parameters for the optimizer
lora_params = []
for name, param in model.named_parameters():
    if 'lora' in name:
        lora_params.append(param)

optimizer = optim.AdamW(lora_params, lr=args.learning_rate)

train_ds = SFTDataset(args.data_path, tokenizer, max_length=args.max_seq_len)
train_sampler = DistributedSampler(train_ds) if ddp else None
train_loader = DataLoader(
    train_ds,
    batch_size=args.batch_size,
    pin_memory=True,
    drop_last=False,
    shuffle=False,
    num_workers=args.num_workers,
    sampler=train_sampler
)

scaler = torch.cuda.amp.GradScaler(enabled=(args.dtype in ['float16', 'bfloat16']))
iter_per_epoch = len(train_loader)
for epoch in range(args.epochs):
    train_epoch(epoch, wandb)
```
Output:
```
Total LLM parameters: 26092032
LoRA parameters: 262144
LoRA parameter ratio: 1.00%
```
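As a quick sanity check on the printed ratio, the numbers are consistent:

```python
# Verifying the reported LoRA parameter ratio from the run above
total, lora = 26_092_032, 262_144
print(f"{lora / total:.2%}")  # -> 1.00%
```

So for this MiniMind configuration, training touches only about 1% of the model's parameters.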