
LLM Fine-tuning: MOELoRA


张小明

Front-end Development Engineer


Contents

      • Core Components of MOELoRA
      • The Role of MOE in Multi-task Learning
      • LoRA's Contribution to Parameter-Efficient Fine-tuning
      • How MOELoRA's Components Work Together

https://arxiv.org/pdf/2310.18339
When MOE Meets LLMs: Parameter Efficient Fine-tuning for Multi-task Medical Applications


Core Components of MOELoRA

MOELoRA's core idea rests on two key techniques: Mixture of Experts (MOE) and Low-Rank Adaptation (LoRA). MOE handles task allocation and expert collaboration in multi-task learning, while LoRA provides parameter-efficient fine-tuning.

The Role of MOE in Multi-task Learning

The MOE architecture uses a dynamic routing mechanism to dispatch inputs to different expert modules, each of which specializes in a particular task or data subset. This design lets the model handle multi-task scenarios flexibly without a significant increase in parameter count. MOE's advantage is that it can automatically adjust how expert capacity is allocated according to task complexity, improving performance under limited data and compute.
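To make the routing idea concrete, here is a minimal soft-routing MOE layer in PyTorch (an illustrative sketch only; the class name ToyMoE, the gate design, and all dimensions are our choices, not the paper's):

import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    """Every expert sees the input; a softmax gate mixes their outputs."""
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_experts)])
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(x), dim=-1)             # (batch, n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, n_experts, d_model)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)          # gate-weighted mixture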

LoRA's Contribution to Parameter-Efficient Fine-tuning

LoRA uses low-rank matrix factorization to add a small number of trainable parameters on top of the pretrained model, sharply cutting the resource cost of fine-tuning. Concretely, LoRA factorizes the weight update ΔW into a product of two low-rank matrices (ΔW = BA, where the rank of B and A is far smaller than the dimensions of the original weight matrix). This preserves the pretrained model's knowledge while enabling efficient task adaptation.
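A minimal sketch of this decomposition (illustrative; the wrapper name LoRALinear and the default hyperparameters are our choices, not a library API):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W0·x + (alpha/r)·B(A(x)), with W0 frozen and only A, B trained."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad = False          # freeze the pretrained weight
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.lora_A.weight, std=0.01)   # small random init for A
        nn.init.zeros_(self.lora_B.weight)              # zero init for B, so ΔW starts at 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.lora_B(self.lora_A(x)) * self.scaling

The trainable parameter count drops from d² to 2·r·d: with r = 8 on a 768×768 projection, that is 12,288 trained parameters instead of 589,824.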

How MOELoRA's Components Work Together

MOELoRA combines MOE's task-allocation ability with LoRA's parameter efficiency in a layered design: the MOE layer identifies the task type and activates the corresponding expert modules, and each expert is itself a LoRA-style low-rank adapter. This avoids interference between tasks while sharing the base model's parameters to reduce redundancy.
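In the ΔW = BA notation above, the update applied to a single linear layer for task t is

ΔW(t) = (α / r) · Σ_i ω_i(t) · B_i A_i,  where ω(t) = softmax(Gate(TaskEmbedding(t)))

and i ranges over the experts. All tasks share the frozen pretrained weights and the expert matrices {A_i, B_i}; only the task-conditioned gate weights ω(t) change how strongly each expert contributes. This is the combination computed by the reference implementation below.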


The reference implementation, excerpted from the MOELoRA-peft repository:
https://github.com/liuqidong07/MOELoRA-peft/blob/master/src/MLoRA/peft/tuners/mmoelora.py

# Excerpted from mmoelora.py; imports reconstructed to match the repository's package layout.
import warnings

import torch
import torch.nn as nn
import torch.nn.functional as F

from ..utils import transpose   # peft helper: transposes the weight if fan_in_fan_out
from .lora import LoraLayer


class MMOELoraLayer(LoraLayer):

    def __init__(self, in_features: int, out_features: int, expert_num: int):
        super().__init__(in_features, out_features)
        self.expert_num = expert_num

    def update_layer(self, adapter_name, r, lora_alpha, lora_dropout, init_lora_weights):
        self.r[adapter_name] = r
        self.lora_alpha[adapter_name] = lora_alpha
        if lora_dropout > 0.0:
            lora_dropout_layer = nn.Dropout(p=lora_dropout)
        else:
            lora_dropout_layer = nn.Identity()
        self.lora_dropout.update(nn.ModuleDict({adapter_name: lora_dropout_layer}))
        # Actual trainable parameters
        if r > 0:
            self.lora_A.update(nn.ModuleDict(
                {adapter_name: MMOELinearA(self.in_features, r, self.expert_num)}))
            self.lora_B.update(nn.ModuleDict(
                {adapter_name: MMOELinearB(r, self.out_features, self.expert_num)}))
            self.scaling[adapter_name] = lora_alpha / r
        if init_lora_weights:
            self.reset_lora_parameters(adapter_name)
        self.to(self.weight.device)

    def reset_lora_parameters(self, adapter_name):
        if adapter_name in self.lora_A.keys():
            # initialize A the same way as the default for nn.Linear and B to zero
            for i in range(self.expert_num):
                nn.init.normal_(self.lora_A[adapter_name].loraA[i].mlp.weight,
                                mean=0.0, std=0.01)
                nn.init.zeros_(self.lora_B[adapter_name].loraB[i].mlp.weight)


class MMOELoraLinear(nn.Linear, MMOELoraLayer):
    # Lora implemented in a dense layer
    # nn.Linear holds the pretrained weights of the LLM; MMOELoraLayer holds the trainable LoRA experts
    def __init__(
        self,
        adapter_name: str,
        in_features: int,
        out_features: int,
        r: int = 0,
        lora_alpha: int = 1,
        lora_dropout: float = 0.0,
        fan_in_fan_out: bool = False,  # Set this to True if the layer to replace stores weight like (fan_in, fan_out)
        **kwargs,
    ):
        init_lora_weights = kwargs.pop("init_lora_weights", True)
        self.expert_num = kwargs.pop("expert_num", True)
        self.task_num = kwargs.pop("task_num", True)
        self.te_dim = kwargs.pop("task_embedding_dim", True)

        nn.Linear.__init__(self, in_features, out_features, **kwargs)
        MMOELoraLayer.__init__(self, in_features=in_features,
                               out_features=out_features, expert_num=self.expert_num)

        # init the Gate network: one task-embedding table and one gate per adapter
        self.lora_task_embedding = nn.ModuleDict({})
        self.lora_gate = nn.ModuleDict({})
        self.lora_task_embedding.update(nn.ModuleDict(
            {adapter_name: nn.Embedding(self.task_num + 1, self.te_dim)}))
        self.lora_gate.update(nn.ModuleDict(
            {adapter_name: Gate(self.te_dim, self.expert_num)}))

        # Freezing the pre-trained weight matrix
        self.weight.requires_grad = False

        self.fan_in_fan_out = fan_in_fan_out
        if fan_in_fan_out:
            self.weight.data = self.weight.data.T

        nn.Linear.reset_parameters(self)
        self.update_layer(adapter_name, r, lora_alpha, lora_dropout, init_lora_weights)
        self.active_adapter = adapter_name

    def merge(self, task_id):
        # Fold the task-weighted expert updates into the frozen weight
        if self.active_adapter not in self.lora_A.keys():
            return
        if self.merged:
            warnings.warn("Already merged. Nothing to do.")
            return
        if self.r[self.active_adapter] > 0:
            expert_weight = self.lora_gate[self.active_adapter](
                self.lora_task_embedding[self.active_adapter](task_id))
            for i in range(self.expert_num):
                lora_A_weights = self.lora_A[self.active_adapter].loraA[i].mlp.weight
                lora_B_weights = self.lora_B[self.active_adapter].loraB[i].mlp.weight
                self.weight.data += (
                    transpose(lora_B_weights @ lora_A_weights, self.fan_in_fan_out)
                    * self.scaling[self.active_adapter]
                    * expert_weight[..., i]
                )
            self.merged = True

    def unmerge(self, task_id):
        # Reverse merge(): subtract the task-weighted expert updates
        if self.active_adapter not in self.lora_A.keys():
            return
        if not self.merged:
            warnings.warn("Already unmerged. Nothing to do.")
            return
        if self.r[self.active_adapter] > 0:
            expert_weight = self.lora_gate[self.active_adapter](
                self.lora_task_embedding[self.active_adapter](task_id))
            for i in range(self.expert_num):
                lora_A_weights = self.lora_A[self.active_adapter].loraA[i].mlp.weight
                lora_B_weights = self.lora_B[self.active_adapter].loraB[i].mlp.weight
                self.weight.data -= (
                    transpose(lora_B_weights @ lora_A_weights, self.fan_in_fan_out)
                    * self.scaling[self.active_adapter]
                    * expert_weight[..., i]
                )
            self.merged = False

    def forward(self, x: torch.Tensor, **kwargs):
        task_id = kwargs["task_id"]
        previous_dtype = x.dtype

        if self.active_adapter not in self.lora_A.keys():
            # No adapter, directly use linear
            return F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
        if self.disable_adapters:
            # Adapters disabled: undo any merged update, then plain linear
            if self.r[self.active_adapter] > 0 and self.merged:
                self.unmerge(task_id)
            result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
        elif self.r[self.active_adapter] > 0 and not self.merged:
            # general lora process
            result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
            x = x.to(self.lora_A[self.active_adapter].loraA[0].weight.dtype)
            expert_weight = self.lora_gate[self.active_adapter](
                self.lora_task_embedding[self.active_adapter](task_id))
            for i in range(self.expert_num):
                result += (
                    self.lora_B[self.active_adapter].loraB[i](
                        self.lora_A[self.active_adapter].loraA[i](
                            self.lora_dropout[self.active_adapter](x)))
                    * self.scaling[self.active_adapter]
                    * expert_weight[..., i].unsqueeze(-1).unsqueeze(0)
                )
        else:
            result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)

        result = result.to(previous_dtype)
        return result


class MMOELinearA(nn.Module):
    '''MMOE based LoRA block'''

    def __init__(self, in_features, out_features, expert_num) -> None:
        super().__init__()
        self.expert_num = expert_num
        self.in_features, self.out_features = in_features, out_features
        self.loraA = nn.ModuleList([])

        assert self.out_features % self.expert_num == 0  # lora rank must be divisible by expert number
        self.r = self.out_features // self.expert_num
        for _ in range(self.expert_num):
            self.loraA.append(Expert(self.in_features, self.r))

    def forward(self, x):
        '''input x is a tensor; the output is a list with one entry per expert'''
        outputs = []
        for i in range(self.expert_num):
            outputs.append(self.loraA[i](x))
        return outputs


class MMOELinearB(nn.Module):
    '''MMOE based LoRA block'''

    def __init__(self, in_features, out_features, expert_num) -> None:
        super().__init__()
        self.expert_num = expert_num
        self.in_features, self.out_features = in_features, out_features
        self.loraB = nn.ModuleList([])

        assert self.in_features % self.expert_num == 0
        self.r = self.in_features // self.expert_num
        for _ in range(self.expert_num):
            self.loraB.append(Expert(self.r, self.out_features))

    def forward(self, x):
        '''input x is a list; the output is also a list'''
        outputs = []
        for i in range(self.expert_num):
            outputs.append(self.loraB[i](x[i]))
        return outputs


class Expert(nn.Module):

    def __init__(self, in_features, out_features):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        self.mlp = nn.Linear(self.in_features, self.out_features, bias=False)
        self.weight = self.mlp.weight

    def forward(self, x):
        # LoRA A or B block
        y = self.mlp(x)
        return y


class Gate(nn.Module):

    def __init__(self, input_size, expert_num):
        super().__init__()
        # an embedding is used upstream in place of a linear layer
        self.GateL = nn.Linear(input_size, expert_num, bias=False)
        self.act = nn.Softmax(dim=1)  # dim 0 is the batch size

    def forward(self, x):
        y = self.GateL(x)
        y = self.act(y)
        return y