
QA Systems: From Retrieval to Generative Models

Author: 张小明, Front-end Developer


1. Technical Analysis

1.1 Types of QA Systems

QA systems fall into several types:

  • Retrieval-based: retrieve answers from a knowledge base
  • Extractive: extract answer spans from a passage
  • Generative: generate answers directly
  • Multimodal: combine text and vision

1.2 QA System Architecture Comparison

| Type | Architecture | Characteristics | Representative model |
|------|--------------|-----------------|----------------------|
| Retrieval-based | TF-IDF/BM25 | Simple and fast | Elasticsearch |
| Extractive | BERT | Accurate | BERT-QA |
| Generative | T5/GPT | Flexible | T5-QA |
| Multimodal | ViLT | Multimodal | ViLT-QA |

1.3 QA Task Types

  • SQuAD: extractive QA
  • HotpotQA: multi-hop QA
  • TriviaQA: open-domain QA
  • VQA: visual QA
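To make the extractive setting concrete, here is a minimal sketch of a SQuAD-style example; the question and context strings are hypothetical, not taken from the real dataset, but the field layout (`answers.text` plus `answers.answer_start` character offsets) follows the SQuAD format:

```python
# Hypothetical SQuAD-style example: the answer is a character span
# inside the context, marked by its start offset.
example = {
    "question": "What does BM25 rank?",
    "context": "BM25 ranks documents by term relevance to a query.",
    "answers": {"text": ["documents"], "answer_start": [11]},
}

def extract_answer(ex):
    """Recover the answer string from its character offset in the context."""
    start = ex["answers"]["answer_start"][0]
    text = ex["answers"]["text"][0]
    return ex["context"][start:start + len(text)]
```

Multi-hop datasets such as HotpotQA use richer structures (multiple supporting paragraphs), but the span-offset idea is the same.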

2. Core Implementation

2.1 Retrieval-Based QA

```python
import numpy as np
import torch
import torch.nn as nn
from rank_bm25 import BM25Okapi


class RetrievalQA:
    """Sparse retrieval QA based on BM25 scoring."""

    def __init__(self, documents):
        self.documents = documents
        self.tokenized_docs = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(self.tokenized_docs)

    def retrieve(self, query, top_k=5):
        tokenized_query = query.lower().split()
        scores = self.bm25.get_scores(tokenized_query)
        top_indices = np.argsort(scores)[::-1][:top_k]
        return [(self.documents[i], scores[i]) for i in top_indices]

    def answer(self, query, top_k=1):
        results = self.retrieve(query, top_k)
        return results[0][0] if results else None


class DenseRetrieval(nn.Module):
    """Dense retrieval using BERT [CLS] embeddings."""

    def __init__(self, model_name='bert-base-uncased'):
        super().__init__()
        from transformers import BertModel, BertTokenizer
        self.model = BertModel.from_pretrained(model_name)
        self.tokenizer = BertTokenizer.from_pretrained(model_name)

    def encode(self, texts):
        inputs = self.tokenizer(
            texts, padding=True, truncation=True,
            max_length=512, return_tensors='pt'
        )
        with torch.no_grad():  # inference only; no gradients needed
            outputs = self.model(**inputs)
        return outputs.last_hidden_state[:, 0, :]  # [CLS] token embedding

    def retrieve(self, query, documents, top_k=5):
        query_embedding = self.encode([query])
        doc_embeddings = self.encode(documents)
        # Dot-product similarity between query and each document.
        scores = torch.matmul(query_embedding, doc_embeddings.T).squeeze(0)
        top_indices = torch.argsort(scores, descending=True)[:top_k]
        return [(documents[i], scores[i].item()) for i in top_indices]
```

2.2 Extractive QA

```python
class ExtractiveQA(nn.Module):
    """Span-extraction QA with a BERT question-answering head."""

    def __init__(self, model_name='bert-base-uncased'):
        super().__init__()
        from transformers import BertForQuestionAnswering, BertTokenizer
        self.model = BertForQuestionAnswering.from_pretrained(model_name)
        # Load the tokenizer once here instead of on every predict() call.
        self.tokenizer = BertTokenizer.from_pretrained(model_name)

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        return outputs.start_logits, outputs.end_logits

    def predict(self, question, context):
        inputs = self.tokenizer(
            question, context, padding=True, truncation=True,
            max_length=512, return_tensors='pt'
        )
        with torch.no_grad():
            start_logits, end_logits = self.forward(
                inputs['input_ids'], inputs['attention_mask'])
        start_idx = torch.argmax(start_logits).item()
        # Constrain the end index to lie at or after the start index;
        # unconstrained argmax can otherwise produce an inverted span.
        end_idx = start_idx + torch.argmax(end_logits[0, start_idx:]).item()
        tokens = self.tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
        return self.tokenizer.convert_tokens_to_string(
            tokens[start_idx:end_idx + 1])


class QAWithRetrieval:
    """Retrieve candidate passages, then extract a span from each."""

    def __init__(self, documents):
        self.retriever = DenseRetrieval()
        self.extractor = ExtractiveQA()
        self.documents = documents

    def answer(self, question):
        candidates = self.retriever.retrieve(question, self.documents, top_k=3)
        for doc, _ in candidates:
            answer = self.extractor.predict(question, doc)
            if answer.strip():
                return answer
        return "No answer found"
```

2.3 Generative QA

```python
class GenerativeQA(nn.Module):
    """Sequence-to-sequence QA with T5."""

    def __init__(self, model_name='t5-base'):
        super().__init__()
        from transformers import T5ForConditionalGeneration, T5Tokenizer
        self.model = T5ForConditionalGeneration.from_pretrained(model_name)
        self.tokenizer = T5Tokenizer.from_pretrained(model_name)

    def generate(self, question, context=None):
        # T5 expects a task-prefixed input string.
        if context:
            input_text = f"question: {question} context: {context}"
        else:
            input_text = f"question: {question}"
        inputs = self.tokenizer(
            input_text, padding=True, truncation=True,
            max_length=512, return_tensors='pt'
        )
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs, max_length=100, num_beams=5, early_stopping=True
            )
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)


class OpenDomainQA:
    """Retrieve-then-generate pipeline for open-domain QA."""

    def __init__(self, retriever, generator):
        self.retriever = retriever
        self.generator = generator

    def answer(self, question, documents):
        candidates = self.retriever.retrieve(question, documents, top_k=3)
        # Concatenate the top passages into a single context.
        context = "\n".join(doc for doc, _ in candidates)
        return self.generator.generate(question, context)
```

3. Performance Comparison

3.1 Comparison of QA System Types

Across the four system types, retrieval-based QA is by far the fastest at inference, while generative QA rates highest on flexibility; extractive QA trades some of that flexibility for accuracy within a given passage.

3.2 Results on Different QA Datasets

| Dataset | Extractive | Generative | Retrieval + Generation |
|---------|------------|------------|------------------------|
| SQuAD v1 | 92% | 88% | 90% |
| SQuAD v2 | 83% | 79% | 81% |
| HotpotQA | 75% | 72% | 78% |
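Scores like those above are typically reported as exact match (EM) or token-level F1. A minimal sketch of the standard SQuAD-style answer metrics (simplified: real evaluation scripts also strip punctuation and articles before comparing):

```python
from collections import Counter

def exact_match(pred, gold):
    # 1 if the answers are identical after whitespace/case normalization.
    return int(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    # Token-overlap F1 between predicted and gold answer strings.
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the cat", "the black cat")` gives 0.8: precision 1.0 against recall 2/3, so partial answers still earn partial credit, which is why F1 is the headline SQuAD metric rather than EM alone.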

3.3 Effect of Model Size

| Model | Parameters | F1 | Inference time (ms) |
|-------|------------|----|---------------------|
| BERT-base | 110M | 89% | 50 |
| BERT-large | 340M | 93% | 150 |
| T5-base | 220M | 87% | 100 |
| T5-large | 770M | 91% | 300 |
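Latency numbers like these depend heavily on hardware, batch size, and sequence length, so they are best treated as relative. A rough sketch of how such per-query latencies can be measured (the function and parameter names here are my own, not from any benchmark suite):

```python
import time

def measure_latency(fn, *args, warmup=2, runs=10):
    """Average wall-clock latency of fn(*args) in milliseconds.

    Runs a few warmup calls first so one-time costs (caching, JIT,
    lazy initialization) do not skew the average. For GPU models you
    would additionally need to synchronize the device around the timer.
    """
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(runs):
        fn(*args)
    return (time.perf_counter() - start) / runs * 1000.0
```

Usage would look like `measure_latency(model.predict, question, context)` for one of the QA classes above.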

4. Best Practices

4.1 Choosing a QA System

```python
def select_qa_system(task_type, data_size):
    # Pick a QA system by task type; data_size is reserved for
    # finer-grained choices (e.g. sparse vs dense retrieval).
    if task_type == 'retrieval':
        return RetrievalQA([])
    elif task_type == 'extractive':
        return ExtractiveQA()
    elif task_type == 'generative':
        return GenerativeQA()
    else:
        return QAWithRetrieval([])


class QASystemFactory:
    @staticmethod
    def create(config):
        if config['type'] == 'retrieval':
            return RetrievalQA(config['documents'])
        elif config['type'] == 'extractive':
            return ExtractiveQA(config['model_name'])
        elif config['type'] == 'generative':
            return GenerativeQA(config['model_name'])
        elif config['type'] == 'hybrid':
            return OpenDomainQA(
                DenseRetrieval(config['retriever_model']),
                GenerativeQA(config['generator_model'])
            )
        raise ValueError(f"Unknown QA system type: {config['type']}")
```

4.2 QA Training Pipeline

```python
class QATrainer:
    def __init__(self, model, optimizer, scheduler, loss_fn):
        self.model = model
        self.optimizer = optimizer
        self.scheduler = scheduler
        self.loss_fn = loss_fn

    def train_step(self, batch):
        self.optimizer.zero_grad()
        start_logits, end_logits = self.model(
            batch['input_ids'], batch['attention_mask'])
        # Span loss: average the cross-entropy of start and end positions.
        loss = (self.loss_fn(start_logits, batch['start_positions']) +
                self.loss_fn(end_logits, batch['end_positions'])) / 2
        loss.backward()
        self.optimizer.step()
        self.scheduler.step()
        return loss.item()

    def evaluate(self, dataloader):
        self.model.eval()
        total_f1, count = 0.0, 0
        with torch.no_grad():
            for batch in dataloader:
                start_logits, end_logits = self.model(
                    batch['input_ids'], batch['attention_mask'])
                start_pred = torch.argmax(start_logits, dim=1)
                end_pred = torch.argmax(end_logits, dim=1)
                for i in range(len(start_pred)):
                    # Token-overlap F1 between predicted and gold spans.
                    pred_span = set(range(start_pred[i].item(),
                                          end_pred[i].item() + 1))
                    gold_span = set(range(batch['start_positions'][i].item(),
                                          batch['end_positions'][i].item() + 1))
                    overlap = len(pred_span & gold_span)
                    if overlap:
                        precision = overlap / len(pred_span)
                        recall = overlap / len(gold_span)
                        total_f1 += 2 * precision * recall / (precision + recall)
                    count += 1
        # Average over examples, not over batches.
        return total_f1 / max(count, 1)
```

5. Summary

QA systems are a core NLP application:

  1. Retrieval-based: simple and fast; suited to small knowledge bases
  2. Extractive: accurate; suited to settings where a context passage is given
  3. Generative: flexible; produces natural-language answers
  4. Hybrid: combines retrieval and generation for the best overall results

The comparisons above suggest:

  • Generative models perform better on open-domain QA
  • Extractive models are more accurate on closed-domain QA
  • A hybrid retrieve-then-generate architecture is recommended in practice
  • Pretrained models are the foundation of modern QA systems