
LLM Fundamentals: Transformer and the Attention Mechanism

张小明 · Front-end Engineer


1. Technical Analysis

1.1 LLM Architecture Overview

An LLM (Large Language Model) is built on the Transformer architecture:

Overall data flow: Input layer → Embedding → Transformer Blocks → Output layer. Each Transformer Block contains Multi-Head Attention, a Feed-Forward Network, Layer Normalization, and Residual Connections.
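In equation form, each block applies the post-LayerNorm residual pattern of the original Transformer paper; the code in section 2.2 implements exactly this:

\[
\begin{aligned}
h' &= \mathrm{LayerNorm}\bigl(h + \mathrm{MultiHeadAttention}(h)\bigr) \\
h'' &= \mathrm{LayerNorm}\bigl(h' + \mathrm{FFN}(h')\bigr)
\end{aligned}
\]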

1.2 Core Transformer Components

| Component | Role | Complexity |
| --- | --- | --- |
| Multi-Head Attention | Captures relationships between positions | O(n²d) |
| Feed Forward | Non-linear transformation | O(nd²) |
| LayerNorm | Stabilizes training | O(nd) |
| Residual | Eases gradient flow | O(nd) |
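A back-of-the-envelope check (my own arithmetic, not from the article) shows when attention becomes the bottleneck. Ignoring constant factors, dividing the two dominant per-layer costs gives

\[
\frac{n^2 d}{n d^2} = \frac{n}{d}, \qquad \text{e.g. } n = 2048,\ d = 768 \;\Rightarrow\; \frac{n}{d} \approx 2.7,
\]

so attention outweighs the feed-forward cost once the sequence length n exceeds the model width d, which is why long-context models turn to the cheaper attention variants compared in section 3.2.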

1.3 LLM Model Comparison

| Model | Parameters | Architecture | Strengths |
| --- | --- | --- | --- |
| GPT-3 | 175B | Decoder-only | Strong general-purpose capability |
| PaLM | 540B | Decoder-only | Strong reasoning |
| Llama | 65B | Decoder-only | Open source |
| T5 | 11B | Encoder-Decoder | Multi-task |

2. Core Implementation

2.1 Multi-Head Attention

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Separate projections for queries, keys, values, and the output
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
        scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(
            torch.tensor(self.d_k, dtype=torch.float32))
        if mask is not None:
            # Disallowed positions get a large negative score before softmax
            scores = scores.masked_fill(mask == 0, -1e9)
        attn_weights = F.softmax(scores, dim=-1)
        output = torch.matmul(attn_weights, V)
        return output, attn_weights

    def split_heads(self, x, batch_size):
        # (batch, seq, d_model) -> (batch, heads, seq, d_k)
        return x.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)
        Q = self.split_heads(self.W_q(Q), batch_size)
        K = self.split_heads(self.W_k(K), batch_size)
        V = self.split_heads(self.W_v(V), batch_size)
        output, attn_weights = self.scaled_dot_product_attention(Q, K, V, mask)
        # Merge heads back: (batch, heads, seq, d_k) -> (batch, seq, d_model)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.W_o(output)
        return output, attn_weights
```
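A quick shape check (my own toy example, not from the article) confirms the expected tensor dimensions:

```python
x = torch.randn(2, 10, 512)                  # (batch=2, seq_len=10, d_model=512)
mha = MultiHeadAttention(d_model=512, num_heads=8)
out, weights = mha(x, x, x)                  # self-attention: Q = K = V = x
print(out.shape)                             # torch.Size([2, 10, 512])
print(weights.shape)                         # torch.Size([2, 8, 10, 10]): one map per head
```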

2.2 Transformer Block

```python
import math

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Post-LN residual blocks: LayerNorm(x + Dropout(Sublayer(x)))
        attn_output, _ = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x


class GPTModel(nn.Module):
    def __init__(self, vocab_size, d_model=768, num_heads=12, d_ff=3072, num_layers=12):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Register as a buffer so it moves with the model across devices
        self.register_buffer('positional_encoding',
                             self._create_positional_encoding(d_model))
        self.layers = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff) for _ in range(num_layers)
        ])
        self.fc = nn.Linear(d_model, vocab_size)

    def _create_positional_encoding(self, d_model, max_len=5000):
        # Sinusoidal encoding from "Attention Is All You Need"
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(1, max_len, d_model)  # leading 1 broadcasts over the batch
        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)
        return pe

    def forward(self, x):
        seq_len = x.size(1)
        x = self.embedding(x) + self.positional_encoding[:, :seq_len]
        # Causal mask: each position attends only to itself and earlier positions
        mask = torch.tril(torch.ones(seq_len, seq_len, device=x.device)).bool()
        for layer in self.layers:
            x = layer(x, mask)
        return self.fc(x)
```
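A minimal smoke test (an illustrative addition, with random weights and random token ids):

```python
model = GPTModel(vocab_size=50257, num_layers=2)   # small config for a quick check
tokens = torch.randint(0, 50257, (1, 16))          # (batch=1, seq_len=16)
logits = model(tokens)
print(logits.shape)                                # torch.Size([1, 16, 50257])
```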

2.3 LLM Inference

```python
class LLMInference:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.model.eval()

    def generate(self, prompt, max_len=100, temperature=1.0, top_k=50):
        input_ids = self.tokenizer.encode(prompt, return_tensors='pt')
        with torch.no_grad():
            for _ in range(max_len):
                outputs = self.model(input_ids)
                # Only the logits at the last position matter for the next token
                logits = outputs[:, -1, :] / temperature
                if top_k > 0:
                    # Keep the top-k logits, push everything else to -inf
                    v, _ = torch.topk(logits, top_k)
                    logits[logits < v[:, -1]] = float('-inf')
                probs = F.softmax(logits, dim=-1)
                next_token = torch.multinomial(probs, num_samples=1)
                input_ids = torch.cat([input_ids, next_token], dim=1)
                if next_token.item() == self.tokenizer.eos_token_id:
                    break
        return self.tokenizer.decode(input_ids[0], skip_special_tokens=True)


class GPTDecoder:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def beam_search(self, prompt, max_len=100, beam_size=5):
        input_ids = self.tokenizer.encode(prompt, return_tensors='pt')
        beams = [(input_ids, 0.0)]  # (sequence, cumulative log-probability)
        for _ in range(max_len):
            new_beams = []
            for beam, score in beams:
                outputs = self.model(beam)
                logits = outputs[:, -1, :]
                probs = F.log_softmax(logits, dim=-1)
                top_probs, top_indices = torch.topk(probs, beam_size)
                # Expand each beam with its beam_size best continuations
                for i in range(beam_size):
                    new_beam = torch.cat([beam, top_indices[:, i].unsqueeze(1)], dim=1)
                    new_score = score + top_probs[:, i].item()
                    new_beams.append((new_beam, new_score))
            # Keep only the beam_size highest-scoring candidates
            new_beams.sort(key=lambda x: x[1], reverse=True)
            beams = new_beams[:beam_size]
            if beams[0][0][0, -1].item() == self.tokenizer.eos_token_id:
                break
        best_beam = beams[0][0]
        return self.tokenizer.decode(best_beam[0], skip_special_tokens=True)
```
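A usage sketch for the sampler above (my own wiring: it pairs the toy GPTModel from 2.2 with a GPT-2 tokenizer, so the vocabulary sizes match; with untrained weights the output is gibberish, but it exercises the full loop):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPTModel(vocab_size=tokenizer.vocab_size, num_layers=2)

llm = LLMInference(model, tokenizer)
print(llm.generate("The Transformer architecture", max_len=20,
                   temperature=0.8, top_k=40))
```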

3. Performance Comparison

3.1 LLM Model Comparison

| Model | Parameters (B) | Training data (TB) | Inference speed (tokens/s) |
| --- | --- | --- | --- |
| GPT-3 | 175 | 45 | 200 |
| PaLM | 540 | 780 | 100 |
| Llama-2 | 70 | 2 | 500 |
| Mistral | 7 | 0.8 | 1000 |

3.2 Attention Mechanism Comparison

| Type | Complexity | Quality | Typical use case |
| --- | --- | --- | --- |
| Full Attention | O(n²) | Best | Short sequences |
| Sparse Attention | O(n log n) | | Long sequences |
| Linear Attention | O(n) | Good | Very long sequences |
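As a concrete illustration of the sparse idea, here is a minimal sketch (my own, not from the article) of a causal sliding-window mask: each token attends only to itself and the previous window-1 tokens, shrinking the score matrix from n² to roughly n·w useful entries. It plugs directly into the MultiHeadAttention class from section 2.1:

```python
import torch

def sliding_window_mask(seq_len, window):
    # True where attention is allowed: causal AND within `window` positions back
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=8, window=3)
# Pass as `mask` to MultiHeadAttention; disallowed scores are filled with -1e9.
```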

3.3 Generation Strategy Comparison

The comparison rates four decoding strategies on quality, diversity, and speed: Greedy, Beam Search, Top-K sampling, and Top-P (nucleus) sampling. (The individual rating cells did not survive conversion; a Top-P sketch follows below.)
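The code in section 2.3 covers temperature/top-k sampling and beam search; Top-P appears only in the table above, so here is a hedged sketch of nucleus filtering (mirroring the pattern common in open-source samplers, not the article's own code). It keeps the smallest probability-sorted set of tokens whose cumulative probability exceeds p:

```python
import torch
import torch.nn.functional as F

def top_p_filter(logits, p=0.9):
    # Sort tokens by probability, then mask out the tail beyond cumulative p
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
    remove = cum_probs > p
    remove[..., 1:] = remove[..., :-1].clone()  # shift right: always keep the top token
    remove[..., 0] = False
    # Map the mask from sorted order back to vocabulary order
    remove = remove.scatter(-1, sorted_idx, remove)
    return logits.masked_fill(remove, float('-inf'))
```

This drops into the generate loop of section 2.3 in place of the top-k branch: filter the last-position logits, then softmax and sample as before.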

4. Best Practices

4.1 Choosing an LLM

```python
def select_llm(task_type, constraints):
    # task_type is reserved for finer-grained routing; selection here is
    # driven purely by the hard constraints
    if constraints.get('open_source', False):
        return 'Llama-2'
    elif constraints.get('speed', False):
        return 'Mistral'
    else:
        return 'GPT-4'


class LLMFactory:
    @staticmethod
    def create(config):
        # Lazily import only the backend that the config asks for
        if config['type'] == 'gpt':
            from transformers import GPT2LMHeadModel
            return GPT2LMHeadModel.from_pretrained(config['model_name'])
        elif config['type'] == 'llama':
            from transformers import LlamaForCausalLM
            return LlamaForCausalLM.from_pretrained(config['model_name'])
```
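Hypothetical usage (the checkpoint name 'gpt2' is an example, not a recommendation):

```python
from transformers import GPT2Tokenizer

model = LLMFactory.create({'type': 'gpt', 'model_name': 'gpt2'})
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
print(select_llm('chat', {'open_source': True}))  # -> 'Llama-2'
```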

4.2 LLM Deployment

```python
class LLMDeployer:
    def __init__(self, model, tokenizer, config):
        self.model = model
        self.tokenizer = tokenizer
        self.config = config

    def optimize(self):
        if self.config.get('quantize', False):
            self.model = self._quantize_model()
        if self.config.get('compile', False):
            self.model = torch.compile(self.model)

    def _quantize_model(self):
        # Dynamic int8 quantization of the Linear layers (CPU inference)
        from torch.ao.quantization import quantize_dynamic
        return quantize_dynamic(self.model, {torch.nn.Linear})

    def serve(self):
        from fastapi import FastAPI
        app = FastAPI()

        @app.post('/generate')
        def generate(prompt: str):
            # Tokenize first: model.generate works on token ids, not raw text
            inputs = self.tokenizer(prompt, return_tensors='pt')
            output_ids = self.model.generate(**inputs)
            text = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
            return {'response': text}

        return app
```
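Wiring it together (a sketch reusing the model/tokenizer pair from 4.1; assumes fastapi and uvicorn are installed):

```python
import uvicorn

deployer = LLMDeployer(model, tokenizer, {'quantize': True, 'compile': False})
deployer.optimize()
app = deployer.serve()
uvicorn.run(app, host='0.0.0.0', port=8000)  # blocking; serves POST /generate
```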

5. Summary

LLMs are the core technology of today's NLP field:

  1. Transformer: the foundational architecture of LLMs
  2. Attention mechanism: captures relationships within text
  3. Generation strategy: shapes output quality and diversity
  4. Model selection: choose a model that fits the task

Key points from the comparisons above:

  • Llama-2 is one of the strongest open-source models
  • GPT-4 leads in overall capability
  • Quantization can significantly speed up inference
  • Select a model according to the task's requirements