
LLM Fundamentals: Transformer and the Attention Mechanism

张小明 · Front-end Engineer


1. Technical Analysis

1.1 LLM Architecture Overview

An LLM (Large Language Model) is built on the Transformer architecture:

Overall data flow: Input layer → Embedding → Transformer Blocks → Output layer. Each Transformer Block contains Multi-Head Attention, a Feed-Forward Network, Layer Normalization, and Residual Connections.
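In equation form, each block applies the post-LayerNorm residual pattern of the original Transformer paper; the code in section 2.2 implements exactly this:

\[
\begin{aligned}
h' &= \mathrm{LayerNorm}\bigl(h + \mathrm{MultiHeadAttention}(h)\bigr) \\
h'' &= \mathrm{LayerNorm}\bigl(h' + \mathrm{FFN}(h')\bigr)
\end{aligned}
\]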

1.2 Core Transformer Components

| Component | Role | Complexity |
| --- | --- | --- |
| Multi-Head Attention | Captures relationships between positions | O(n²d) |
| Feed Forward | Non-linear transformation | O(nd²) |
| LayerNorm | Stabilizes training | O(nd) |
| Residual | Eases gradient flow | O(nd) |
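A back-of-the-envelope check (my own arithmetic, not from the article) shows when attention becomes the bottleneck. Ignoring constant factors, dividing the two dominant per-layer costs gives

\[
\frac{n^2 d}{n d^2} = \frac{n}{d}, \qquad \text{e.g. } n = 2048,\ d = 768 \;\Rightarrow\; \frac{n}{d} \approx 2.7,
\]

so attention outweighs the feed-forward cost once the sequence length n exceeds the model width d, which is why long-context models turn to the cheaper attention variants compared in section 3.2.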

1.3 LLM Model Comparison

| Model | Parameters | Architecture | Strengths |
| --- | --- | --- | --- |
| GPT-3 | 175B | Decoder-only | Strong general-purpose capability |
| PaLM | 540B | Decoder-only | Strong reasoning |
| Llama | 65B | Decoder-only | Open source |
| T5 | 11B | Encoder-Decoder | Multi-task |

2. Core Implementation

2.1 Multi-Head Attention

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Separate projections for queries, keys, values, and the output
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
        scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(
            torch.tensor(self.d_k, dtype=torch.float32))
        if mask is not None:
            # Disallowed positions get a large negative score before softmax
            scores = scores.masked_fill(mask == 0, -1e9)
        attn_weights = F.softmax(scores, dim=-1)
        output = torch.matmul(attn_weights, V)
        return output, attn_weights

    def split_heads(self, x, batch_size):
        # (batch, seq, d_model) -> (batch, heads, seq, d_k)
        return x.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)
        Q = self.split_heads(self.W_q(Q), batch_size)
        K = self.split_heads(self.W_k(K), batch_size)
        V = self.split_heads(self.W_v(V), batch_size)
        output, attn_weights = self.scaled_dot_product_attention(Q, K, V, mask)
        # Merge heads back: (batch, heads, seq, d_k) -> (batch, seq, d_model)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.W_o(output)
        return output, attn_weights
```
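A quick shape check (my own toy example, not from the article) confirms the expected tensor dimensions:

```python
x = torch.randn(2, 10, 512)                  # (batch=2, seq_len=10, d_model=512)
mha = MultiHeadAttention(d_model=512, num_heads=8)
out, weights = mha(x, x, x)                  # self-attention: Q = K = V = x
print(out.shape)                             # torch.Size([2, 10, 512])
print(weights.shape)                         # torch.Size([2, 8, 10, 10]): one map per head
```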

2.2 Transformer Block

```python
import math

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Post-LN residual blocks: LayerNorm(x + Dropout(Sublayer(x)))
        attn_output, _ = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x


class GPTModel(nn.Module):
    def __init__(self, vocab_size, d_model=768, num_heads=12, d_ff=3072, num_layers=12):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Register as a buffer so it moves with the model across devices
        self.register_buffer('positional_encoding',
                             self._create_positional_encoding(d_model))
        self.layers = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff) for _ in range(num_layers)
        ])
        self.fc = nn.Linear(d_model, vocab_size)

    def _create_positional_encoding(self, d_model, max_len=5000):
        # Sinusoidal encoding from "Attention Is All You Need"
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(1, max_len, d_model)  # leading 1 broadcasts over the batch
        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)
        return pe

    def forward(self, x):
        seq_len = x.size(1)
        x = self.embedding(x) + self.positional_encoding[:, :seq_len]
        # Causal mask: each position attends only to itself and earlier positions
        mask = torch.tril(torch.ones(seq_len, seq_len, device=x.device)).bool()
        for layer in self.layers:
            x = layer(x, mask)
        return self.fc(x)
```
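A minimal smoke test (an illustrative addition, with random weights and random token ids):

```python
model = GPTModel(vocab_size=50257, num_layers=2)   # small config for a quick check
tokens = torch.randint(0, 50257, (1, 16))          # (batch=1, seq_len=16)
logits = model(tokens)
print(logits.shape)                                # torch.Size([1, 16, 50257])
```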

2.3 LLM Inference

```python
class LLMInference:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.model.eval()

    def generate(self, prompt, max_len=100, temperature=1.0, top_k=50):
        input_ids = self.tokenizer.encode(prompt, return_tensors='pt')
        with torch.no_grad():
            for _ in range(max_len):
                outputs = self.model(input_ids)
                # Only the logits at the last position matter for the next token
                logits = outputs[:, -1, :] / temperature
                if top_k > 0:
                    # Keep the top-k logits, push everything else to -inf
                    v, _ = torch.topk(logits, top_k)
                    logits[logits < v[:, -1]] = float('-inf')
                probs = F.softmax(logits, dim=-1)
                next_token = torch.multinomial(probs, num_samples=1)
                input_ids = torch.cat([input_ids, next_token], dim=1)
                if next_token.item() == self.tokenizer.eos_token_id:
                    break
        return self.tokenizer.decode(input_ids[0], skip_special_tokens=True)


class GPTDecoder:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def beam_search(self, prompt, max_len=100, beam_size=5):
        input_ids = self.tokenizer.encode(prompt, return_tensors='pt')
        beams = [(input_ids, 0.0)]  # (sequence, cumulative log-probability)
        for _ in range(max_len):
            new_beams = []
            for beam, score in beams:
                outputs = self.model(beam)
                logits = outputs[:, -1, :]
                probs = F.log_softmax(logits, dim=-1)
                top_probs, top_indices = torch.topk(probs, beam_size)
                # Expand each beam with its beam_size best continuations
                for i in range(beam_size):
                    new_beam = torch.cat([beam, top_indices[:, i].unsqueeze(1)], dim=1)
                    new_score = score + top_probs[:, i].item()
                    new_beams.append((new_beam, new_score))
            # Keep only the beam_size highest-scoring candidates
            new_beams.sort(key=lambda x: x[1], reverse=True)
            beams = new_beams[:beam_size]
            if beams[0][0][0, -1].item() == self.tokenizer.eos_token_id:
                break
        best_beam = beams[0][0]
        return self.tokenizer.decode(best_beam[0], skip_special_tokens=True)
```
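A usage sketch for the sampler above (my own wiring: it pairs the toy GPTModel from 2.2 with a GPT-2 tokenizer, so the vocabulary sizes match; with untrained weights the output is gibberish, but it exercises the full loop):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPTModel(vocab_size=tokenizer.vocab_size, num_layers=2)

llm = LLMInference(model, tokenizer)
print(llm.generate("The Transformer architecture", max_len=20,
                   temperature=0.8, top_k=40))
```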

3. Performance Comparison

3.1 LLM Model Comparison

| Model | Parameters (B) | Training data (TB) | Inference speed (tokens/s) |
| --- | --- | --- | --- |
| GPT-3 | 175 | 45 | 200 |
| PaLM | 540 | 780 | 100 |
| Llama-2 | 70 | 2 | 500 |
| Mistral | 7 | 0.8 | 1000 |

3.2 Attention Mechanism Comparison

| Type | Complexity | Quality | Typical use case |
| --- | --- | --- | --- |
| Full Attention | O(n²) | Best | Short sequences |
| Sparse Attention | O(n log n) | | Long sequences |
| Linear Attention | O(n) | Good | Very long sequences |
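As a concrete illustration of the sparse idea, here is a minimal sketch (my own, not from the article) of a causal sliding-window mask: each token attends only to itself and the previous window-1 tokens, shrinking the score matrix from n² to roughly n·w useful entries. It plugs directly into the MultiHeadAttention class from section 2.1:

```python
import torch

def sliding_window_mask(seq_len, window):
    # True where attention is allowed: causal AND within `window` positions back
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=8, window=3)
# Pass as `mask` to MultiHeadAttention; disallowed scores are filled with -1e9.
```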

3.3 Generation Strategy Comparison

The comparison rates four decoding strategies on quality, diversity, and speed: Greedy, Beam Search, Top-K sampling, and Top-P (nucleus) sampling. (The individual rating cells did not survive conversion; a Top-P sketch follows below.)
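The code in section 2.3 covers temperature/top-k sampling and beam search; Top-P appears only in the table above, so here is a hedged sketch of nucleus filtering (mirroring the pattern common in open-source samplers, not the article's own code). It keeps the smallest probability-sorted set of tokens whose cumulative probability exceeds p:

```python
import torch
import torch.nn.functional as F

def top_p_filter(logits, p=0.9):
    # Sort tokens by probability, then mask out the tail beyond cumulative p
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
    remove = cum_probs > p
    remove[..., 1:] = remove[..., :-1].clone()  # shift right: always keep the top token
    remove[..., 0] = False
    # Map the mask from sorted order back to vocabulary order
    remove = remove.scatter(-1, sorted_idx, remove)
    return logits.masked_fill(remove, float('-inf'))
```

This drops into the generate loop of section 2.3 in place of the top-k branch: filter the last-position logits, then softmax and sample as before.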

4. Best Practices

4.1 Choosing an LLM

```python
def select_llm(task_type, constraints):
    # task_type is reserved for finer-grained routing; selection here is
    # driven purely by the hard constraints
    if constraints.get('open_source', False):
        return 'Llama-2'
    elif constraints.get('speed', False):
        return 'Mistral'
    else:
        return 'GPT-4'


class LLMFactory:
    @staticmethod
    def create(config):
        # Lazily import only the backend that the config asks for
        if config['type'] == 'gpt':
            from transformers import GPT2LMHeadModel
            return GPT2LMHeadModel.from_pretrained(config['model_name'])
        elif config['type'] == 'llama':
            from transformers import LlamaForCausalLM
            return LlamaForCausalLM.from_pretrained(config['model_name'])
```
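Hypothetical usage (the checkpoint name 'gpt2' is an example, not a recommendation):

```python
from transformers import GPT2Tokenizer

model = LLMFactory.create({'type': 'gpt', 'model_name': 'gpt2'})
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
print(select_llm('chat', {'open_source': True}))  # -> 'Llama-2'
```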

4.2 LLM Deployment

```python
class LLMDeployer:
    def __init__(self, model, tokenizer, config):
        self.model = model
        self.tokenizer = tokenizer
        self.config = config

    def optimize(self):
        if self.config.get('quantize', False):
            self.model = self._quantize_model()
        if self.config.get('compile', False):
            self.model = torch.compile(self.model)

    def _quantize_model(self):
        # Dynamic int8 quantization of the Linear layers (CPU inference)
        from torch.ao.quantization import quantize_dynamic
        return quantize_dynamic(self.model, {torch.nn.Linear})

    def serve(self):
        from fastapi import FastAPI
        app = FastAPI()

        @app.post('/generate')
        def generate(prompt: str):
            # Tokenize first: model.generate works on token ids, not raw text
            inputs = self.tokenizer(prompt, return_tensors='pt')
            output_ids = self.model.generate(**inputs)
            text = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
            return {'response': text}

        return app
```
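Wiring it together (a sketch reusing the model/tokenizer pair from 4.1; assumes fastapi and uvicorn are installed):

```python
import uvicorn

deployer = LLMDeployer(model, tokenizer, {'quantize': True, 'compile': False})
deployer.optimize()
app = deployer.serve()
uvicorn.run(app, host='0.0.0.0', port=8000)  # blocking; serves POST /generate
```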

5. Summary

LLMs are the core technology of today's NLP field:

  1. Transformer: the foundational architecture of LLMs
  2. Attention mechanism: captures relationships within text
  3. Generation strategy: shapes output quality and diversity
  4. Model selection: choose a model that fits the task

Key points from the comparisons above:

  • Llama-2 is one of the strongest open-source models
  • GPT-4 leads in overall capability
  • Quantization can significantly speed up inference
  • Select a model according to the task's requirements