A Step-by-Step Tutorial: Building a Chat Application with Qwen3-14B (Tongyi Qianwen) and LangChain - 深圳市維司達科技有限公司

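1. Introduction

1.1 Learning Objectives

This tutorial walks you through building, from scratch, a locally runnable chat application with the Qwen3-14B model and the LangChain framework. You will learn:

How to deploy Qwen3-14B and use its two inference modes (Thinking / Non-thinking)
How to combine Ollama with Ollama WebUI for visual interaction
How to use LangChain to build a chat system with memory and tool calling
The complete environment setup, dependency installation, implementation, and tuning advice

The end result is a chatbot that supports long-context understanding, function calling, and multilingual translation.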

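1.2 Prerequisites

You should have the following background:

Python programming experience (familiarity with pip and virtual environments)
A basic understanding of the LLM inference workflow (prompt, token, context length)
Comfort with the command line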

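1.3 What This Tutorial Offers

This tutorial combines a practical open-source stack: Qwen3-14B (Apache 2.0, commercial use permitted) + Ollama (lightweight deployment) + LangChain (flexible orchestration). Together they give a low-cost, extensible path suited to enterprise prototyping or personal projects.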

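2. Environment Setup

2.1 Hardware Requirements

Component	Minimum	Recommended
GPU	RTX 3090 (24GB)	RTX 4090 (24GB) or A100 (40/80GB)
VRAM	≥24GB FP16	≥24GB with FP8 quantization support
CPU	8+ cores	16+ cores
RAM	32GB	64GB
Storage	100GB SSD	200GB NVMe (for caching models)

Note: Qwen3-14B's 16-bit (FP16/BF16) weights take about 28GB and the FP8-quantized weights about 14GB, so a 24GB card such as the RTX 4090 needs a quantized build to hold the whole model in VRAM.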
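The sizes in the note follow from simple arithmetic: parameter count times bytes per parameter. The sketch below reproduces them, assuming a round 14 billion parameters (taken from the "14B" in the model name, not from the checkpoint itself). It counts weights only; the KV cache and activations need additional memory at runtime.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: a round 14e9 parameters, matching the "14B" in the model name.
NUM_PARAMS = 14e9

for label, bits in [("FP16/BF16", 16), ("FP8", 8), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

The 4-bit figure is a lower bound: quantized files also store scales and keep some layers at higher precision, which is why a 4-bit build takes more than 7GB in practice.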

2.2 Software Dependencies

    # Create a virtual environment
    conda create -n qwen3-chat python=3.11 -y
    conda activate qwen3-chat

    # Install PyTorch (CUDA 12.1 as an example)
    pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121

    # Install Transformers and Accelerate (Qwen3 requires transformers 4.51.0 or newer)
    pip install "transformers>=4.51.0" accelerate==0.27.2

    # Install quantization support (AutoGPTQ)
    pip install auto-gptq==0.5.0 --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu121

    # Install the LangChain core libraries (langchain-openai is used in section 5.2)
    pip install langchain langchain-community langchain-core langchain-text-splitters langchain-openai

    # Install vector database support (optional)
    pip install chromadb

    # Install FastAPI (for serving an API)
    pip install fastapi uvicorn python-multipart
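After installing, it helps to confirm which versions actually landed in the environment. The following is a minimal sketch using only the standard library; the package list is copied from the pip commands above and can be adjusted freely.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names as they appear in the pip commands of this section.
PACKAGES = [
    "torch", "transformers", "accelerate", "auto-gptq",
    "langchain", "langchain-community", "langchain-core",
    "chromadb", "fastapi", "uvicorn",
]

for name in PACKAGES:
    print(f"{name:<22} {installed_version(name) or 'NOT INSTALLED'}")
```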

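3. Model Deployment: Ollama + Ollama WebUI

3.1 Installing Ollama

Ollama is one of the simplest tools for running large models locally, and it can pull Qwen3-14B with a single command.

    # Download and install Ollama (Linux)
    curl -fsSL https://ollama.com/install.sh | sh

    # Start the service
    ollama serve

macOS and Windows users should download the desktop version from https://ollama.com/download.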
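Before going further, it is worth checking that the Ollama server is reachable. The sketch below queries the server's `/api/tags` endpoint, which lists locally available models. It assumes the default address `http://localhost:11434`; the helper names are illustrative.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default listen address


def parse_model_names(payload: dict) -> list:
    """Extract model names from the JSON body returned by /api/tags."""
    return [model["name"] for model in payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 3.0):
    """Return the local model names, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        return None


if __name__ == "__main__":
    models = list_local_models()
    if models is None:
        print("Ollama is not reachable; start it with `ollama serve`.")
    else:
        print("Local models:", models or "(none pulled yet)")
```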

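3.2 Loading the Qwen3-14B Model

    # Pull the default build (4-bit quantized, saves VRAM)
    ollama pull qwen3:14b

    # Or pull the 8-bit quantized build (higher precision, more VRAM)
    ollama pull qwen3:14b-q8_0

Note: Ollama serves GGUF quantizations rather than GPTQ or FP8 checkpoints. `qwen3:14b` is a 4-bit build that needs roughly 10GB of VRAM; `qwen3:14b-q8_0` keeps more precision and suits reasoning tasks. Check the Ollama library page for the tags currently published, and run `ollama list` to confirm what you pulled.

3.3 Enabling Dual-Mode Inference

Qwen3 reasons in Thinking mode by default and accepts the soft switches /think and /no_think in a system or user message. Create a custom Modelfile that pins Thinking mode:

    FROM qwen3:14b-q8_0

    # Default parameters
    PARAMETER temperature 0.7
    # 128K context window; lower this value if the KV cache does not fit in memory
    PARAMETER num_ctx 131072

    # The chat template is inherited from the base model; pin Thinking mode with the soft switch
    SYSTEM """/think"""

Save it as Modelfile.thinking, then build it:

    ollama create qwen3-14b-think -f Modelfile.thinking

You can now choose between the two modes:

qwen3:14b-q8_0 with /no_think added to the prompt → fast responses (Non-thinking)
qwen3-14b-think → deep reasoning (Thinking)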
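In Thinking mode the model wraps its reasoning in `<think>` and `</think>` tags before the final answer, so an application usually wants to separate the two. The following is a minimal sketch of that post-processing step; the sample response is made up for illustration, and some serving setups return the reasoning in a separate field, in which case this step is unnecessary.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple:
    """Split a raw model response into (reasoning, answer)."""
    reasoning = "\n".join(part.strip() for part in THINK_BLOCK.findall(text))
    answer = THINK_BLOCK.sub("", text).strip()
    return reasoning, answer


# Hypothetical raw response, for illustration only.
raw = "<think>\n17 is only divisible by 1 and itself.\n</think>\n\nYes, 17 is prime."
reasoning, answer = split_thinking(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

Showing only `answer` to the user while logging `reasoning` keeps the chat interface clean without discarding the trace.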

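3.4 Installing Ollama WebUI

Ollama WebUI (since renamed Open WebUI) provides a graphical interface that makes testing and debugging easier.

    git clone https://github.com/ollama-webui/ollama-webui.git
    cd ollama-webui
    docker-compose up -d

Open http://localhost:3000 to chat with the model. The interface supports chat history, export, and sharing.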

4. Building the Chat System: LangChain Integration

4.1 Initializing the LangChain Connection

LangChain provides an Ollama wrapper class that connects directly to the local service.

    from langchain_community.llms import Ollama
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.output_parsers import StrOutputParser

    # Initialize the LLM
    llm = Ollama(
        model="qwen3:14b-q8_0",
        temperature=0.7,
        num_ctx=131072,  # 128K context
    )

    # Build the prompt template (/no_think selects Non-thinking mode)
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are Qwen3-14B, a powerful Chinese-language conversational model. Answer questions in clear, accurate language. /no_think"),
        ("human", "{question}")
    ])

    # Output parser
    output_parser = StrOutputParser()

    # Build the chain
    chain = prompt | llm | output_parser
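The `|` operator in the chain composes steps so that the output of one becomes the input of the next. The toy below illustrates that idea in plain Python with a fake model, so it runs without Ollama. It is an illustration of the composition pattern only, not LangChain's actual implementation, and every name in it is hypothetical.

```python
class Step:
    """Wraps a function so steps can be chained with `|`, loosely mimicking a LangChain chain."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Run this step first, then feed its output into the next step.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Three stand-ins for prompt, model, and output parser.
fill_prompt = Step(lambda inputs: f"system: answer clearly\nhuman: {inputs['question']}")
fake_llm = Step(lambda prompt_text: f"  (echo) {prompt_text.splitlines()[-1]}  ")
to_text = Step(str.strip)

toy_chain = fill_prompt | fake_llm | to_text
print(toy_chain.invoke({"question": "What is LangChain?"}))
```

With the real chain, the equivalent call is `chain.invoke({"question": "..."})`, which requires the Ollama service to be running.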

4.2 Adding Conversation Memory

Use ConversationBufferWindowMemory to keep the most recent N rounds of conversation.

    from langchain.memory import ConversationBufferWindowMemory

    # Keep only the 5 most recent exchanges
    memory = ConversationBufferWindowMemory(k=5)

    # Prompt template that includes the history
    prompt_with_history = ChatPromptTemplate.from_messages([
        ("system", "You are Qwen3-14B. Answer the user's question based on the conversation history below.\n{history}"),
        ("human", "{question}")
    ])

    # Build a chain from the history-aware prompt
    chain_with_history = prompt_with_history | llm | output_parser

    # Fill in the history on every call
    def get_response(question):
        history = memory.load_memory_variables({})["history"]
        response = chain_with_history.invoke({
            "history": history,
            "question": question
        })
        memory.save_context({"input": question}, {"output": response})
        return response
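To make the windowing behavior concrete, here is a small standalone sketch of a k-turn memory built on `collections.deque`. It only illustrates the idea of dropping the oldest exchanges; it is not how LangChain implements its memory class, and the class name is hypothetical.

```python
from collections import deque


class WindowMemory:
    """Keeps only the k most recent (human, ai) exchanges."""

    def __init__(self, k: int):
        self.turns = deque(maxlen=k)  # older turns fall off automatically

    def save(self, user_text: str, ai_text: str) -> None:
        self.turns.append((user_text, ai_text))

    def history(self) -> str:
        return "\n".join(f"Human: {u}\nAI: {a}" for u, a in self.turns)


memory_demo = WindowMemory(k=2)
for i in range(1, 4):
    memory_demo.save(f"question {i}", f"answer {i}")

# Only exchanges 2 and 3 remain; exchange 1 was dropped.
print(memory_demo.history())
```

A small window keeps the prompt short and cheap, at the cost of the model forgetting anything said more than k turns ago.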

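4.3 Function Calling

Qwen3 supports function calling with tools described in JSON Schema, which is useful for scenarios such as weather lookups and database retrieval.

    import json

    # Tool description
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the weather for a given city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string", "description": "City name"}
                    },
                    "required": ["city"]
                }
            }
        }
    ]

    # Serialize the tools and escape braces so the template does not treat the JSON as variables
    tools_json = json.dumps(tools, ensure_ascii=False, indent=2)
    tools_json = tools_json.replace("{", "{{").replace("}", "}}")

    # Inject the tool definition into the prompt
    system_text = (
        "You are an assistant that can use the following tools:\n"
        + tools_json
        + "\nIf you need to call a tool, output it in this format:\n"
        + '<tool_call>{{"name": "get_weather", "arguments": {{"city": "Beijing"}}}}</tool_call>'
    )
    tool_prompt = ChatPromptTemplate.from_messages([
        ("system", system_text),
        ("human", "{question}")
    ])

    tool_chain = tool_prompt | llm | output_parser

Note: this prompt-based approach needs no extra Ollama configuration. A Modelfile has no `tool_choice` parameter; Ollama's native tool calling is instead requested per call through the `tools` field of its chat API, for models whose template supports tools.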
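The model only emits a description of the call; the application still has to parse it and run the function. The sketch below shows one way to do that for the `<tool_call>` format used in the prompt above. The registry, the stub weather function, and the error strings are assumptions made for illustration.

```python
import json
import re

TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def fetch_weather(city: str) -> str:
    # Stub: a real implementation would query a weather service.
    return f"Sunny in {city}, 20°C"


# Hypothetical registry mapping tool names to Python callables.
TOOL_REGISTRY = {"get_weather": fetch_weather}


def run_tool_calls(model_output: str) -> list:
    """Execute every <tool_call> block in the model output and collect the results."""
    results = []
    for raw_call in TOOL_CALL.findall(model_output):
        try:
            call = json.loads(raw_call)
        except json.JSONDecodeError:
            results.append("error: tool call is not valid JSON")
            continue
        if not isinstance(call, dict):
            results.append("error: tool call is not a JSON object")
            continue
        func = TOOL_REGISTRY.get(call.get("name"))
        if func is None:
            results.append(f"error: unknown tool {call.get('name')!r}")
            continue
        results.append(func(**call.get("arguments", {})))
    return results


sample = 'Let me check. <tool_call>{"name": "get_weather", "arguments": {"city": "Beijing"}}</tool_call>'
print(run_tool_calls(sample))
```

Returning error strings instead of raising lets the application feed the failure back to the model so it can retry with a corrected call.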

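4.4 Implementing an Agent Workflow

Use LangChain's create_react_agent to build an agent that decides on its own when to call tools.

    from langchain.agents import create_react_agent, AgentExecutor
    from langchain.tools import Tool
    from langchain_core.prompts import PromptTemplate

    # The actual tool function (stubbed here)
    def fetch_weather(city: str) -> str:
        return f"{city}: sunny today, 20°C"

    # Wrap it as a LangChain Tool
    weather_tool = Tool(
        name="get_weather",
        func=fetch_weather,
        description="Get the weather for a city"
    )

    # create_react_agent needs a prompt with {tools}, {tool_names}, {input} and {agent_scratchpad}
    react_prompt = PromptTemplate.from_template(
        "Answer the question as best you can. You have access to these tools:\n\n"
        "{tools}\n\n"
        "Use this format:\n"
        "Question: the question to answer\n"
        "Thought: what to do next\n"
        "Action: one of [{tool_names}]\n"
        "Action Input: the input to the action\n"
        "Observation: the result of the action\n"
        "(Thought/Action/Action Input/Observation may repeat)\n"
        "Thought: I now know the final answer\n"
        "Final Answer: the answer to the question\n\n"
        "Question: {input}\n"
        "Thought:{agent_scratchpad}"
    )

    # Create the agent
    agent = create_react_agent(llm=llm, tools=[weather_tool], prompt=react_prompt)
    agent_executor = AgentExecutor(
        agent=agent,
        tools=[weather_tool],
        verbose=True,
        handle_parsing_errors=True,  # recover when the model strays from the format
    )

    # Run it
    result = agent_executor.invoke({"input": "What will the weather be like in Beijing tomorrow?"})
    print(result["output"])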

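5. Performance Tuning and Common Problems

5.1 Fixing Out-of-Memory Problems

Problem	Solution
Errors when the model needs more than 24GB of VRAM	Use a GPTQ 4-bit or AWQ quantized build
Slow inference	Enable vLLM acceleration (see the next section)
Truncated context	Make sure `num_ctx=131072` is set and check the prompt length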
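Long contexts cost memory beyond the weights, because the KV cache grows linearly with the context length. The sketch below estimates that cost. The architecture numbers (40 layers, 8 key/value heads, head dimension 128, 16-bit cache) are assumptions about Qwen3-14B that should be checked against the model's `config.json` before relying on the result.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence of context_len tokens."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # 2 = keys + values
    return per_token * context_len


# Assumed architecture values; verify them against the model's config.json.
LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128

for ctx in (8192, 32768, 131072):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"num_ctx={ctx:>6}: about {gib:.1f} GiB of KV cache")
```

Under these assumptions, a full 131072-token context needs about 20 GiB for the cache alone, so lowering `num_ctx` is often the quickest fix for out-of-memory errors on a 24GB card.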

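5.2 Improving Throughput with vLLM

vLLM is an efficient inference engine whose PagedAttention memory management significantly improves throughput under concurrent load.

    # Install vLLM
    pip install vllm

    # Start the API server (YaRN rope scaling extends Qwen3's native 32K context to 128K)
    python -m vllm.entrypoints.openai.api_server \
        --model Qwen/Qwen3-14B \
        --tensor-parallel-size 1 \
        --gpu-memory-utilization 0.9 \
        --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
        --max-model-len 131072

LangChain can connect through the OpenAI-compatible API:

    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(
        base_url="http://localhost:8000/v1",
        api_key="EMPTY",
        model="Qwen/Qwen3-14B"  # must match the name passed to --model
    )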
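Because the server speaks the OpenAI chat-completions protocol, it can also be called without LangChain, which is handy for debugging. The sketch below builds such a request with the standard library only; the helper names are illustrative, and the base URL and model name are the ones used in the commands above.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_text: str,
                       system_text: str = "", temperature: float = 0.7) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /chat/completions endpoint."""
    messages = []
    if system_text:
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    payload = {"model": model, "messages": messages, "temperature": temperature}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": "Bearer EMPTY"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of a chat-completions response body."""
    return response_body["choices"][0]["message"]["content"]


request = build_chat_request("http://localhost:8000/v1", "Qwen/Qwen3-14B", "Hello")
print(request.full_url)
print(request.data.decode("utf-8"))
```

With the server running, the request can be sent with `urllib.request.urlopen(request)` and the decoded JSON body passed to `extract_reply`.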

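5.3 Troubleshooting Common Errors

Error	Cause	Fix
`CUDA out of memory`	Not enough VRAM	Use a quantized model or upgrade the hardware
`Connection refused`	Ollama is not running	Run `ollama serve`
`Model not found`	Misspelled model name	Run `ollama list` to see the available models
`Context length exceeded`	Input is too long	Split the input into chunks or use a sliding window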
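For the last row, splitting an overlong input into overlapping chunks is a common workaround. The following is a minimal character-based sketch. A real pipeline would more likely count tokens or use a dedicated text splitter, so treat the sizes here as placeholders.

```python
def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> list:
    """Split text into chunks of at most chunk_size characters, sharing `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

The overlap keeps a little shared context between neighboring chunks, so a sentence cut at a boundary still appears whole in one of them.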

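6. Summary

6.1 Key Takeaways

This tutorial covered the full workflow for building a chat system on Qwen3-14B and LangChain:

Local deployment: load Qwen3-14B quickly through Ollama, with 128K long-context support and dual-mode inference
Visual interaction: add Ollama WebUI for a friendly front end
Orchestration: use LangChain for memory management, function calling, and autonomous agents
Performance tuning: use vLLM to improve inference efficiency for production-grade deployments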

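6.2 Best Practices

Prefer quantized builds (FP8/GPTQ) to balance performance against resource use.
Enable Thinking mode when processing long documents to improve the accuracy of logical reasoning.
In production, pair vLLM with FastAPI to improve concurrency and response times.
Commercial projects must comply with the Apache 2.0 license: the model is free to use, but the copyright and license notices must be retained.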
