Phi-3.5-mini-instruct Deployment Optimization Tutorial: Reducing VRAM Fragmentation and Running Stably for Over 24 Hours - 深圳市維司達科技有限公司


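1. Model Overview

Phi-3.5-mini-instruct is a lightweight, open-source, instruction-tuned large language model from Microsoft. It performs well on benchmarks such as long-context code understanding (RepoQA) and multilingual MMLU, clearly outperforming models of similar size and, on some tasks, rivaling larger models.

The model is well suited to local and edge deployment: it runs on a single NVIDIA RTX 4090, using about 7GB of VRAM. This tutorial focuses on optimizing the deployment to reduce VRAM fragmentation and keep the service running stably for more than 24 hours.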
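
The figure of about 7GB is consistent with a back-of-the-envelope estimate of weight storage alone. The sketch below assumes the 3.8 billion parameter count given on the published model card and 2 bytes per parameter for float16. It ignores activations, the CUDA context, and any cache, so real usage will be somewhat higher.

```python
def estimate_weight_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Return the approximate size of the model weights in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumption: about 3.8e9 parameters stored as float16 (2 bytes each).
weights_gib = estimate_weight_gib(3.8e9, bytes_per_param=2)
print(f"Estimated weight size: {weights_gib:.2f} GiB")
```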

2. Environment Setup

2.1 Hardware Requirements

GPU: NVIDIA GeForce RTX 4090 D (24GB VRAM, roughly 23GB usable)
Memory: 32GB or more recommended
Storage: at least 20GB of free space

2.2 Software Dependencies

Package	Version
Python	3.10+
transformers	4.57.6
torch	2.8.0+cu128
gradio	6.6.0
protobuf	7.34.1

We recommend creating a dedicated conda environment:

    conda create -n torch28 python=3.10
    conda activate torch28
    pip install transformers==4.57.6 torch==2.8.0 gradio==6.6.0 protobuf==7.34.1
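
After installation it can help to confirm that the environment actually matches the pins. The following is a minimal sketch using only the standard library; the `EXPECTED` mapping and the `check_versions` helper are illustrative names, not part of the original setup, and the version strings are copied from the table above.

```python
from importlib import metadata

# Pins copied from the dependency table; adjust if your environment differs.
EXPECTED = {
    "transformers": "4.57.6",
    "torch": "2.8.0",
    "gradio": "6.6.0",
    "protobuf": "7.34.1",
}


def check_versions(expected: dict) -> dict:
    """Map each package to 'ok', 'missing', or 'mismatch (found X)'."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # Ignore local build suffixes such as "+cu128" when comparing.
        base = found.split("+", 1)[0]
        report[name] = "ok" if base == wanted else f"mismatch (found {found})"
    return report


for package, status in check_versions(EXPECTED).items():
    print(f"{package}: {status}")
```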

3. Basic Deployment Steps

3.1 Downloading and Placing the Model

Place the model files in the target directory:

    mkdir -p /root/ai-models/AI-ModelScope/
    mv Phi-3___5-mini-instruct /root/ai-models/AI-ModelScope/
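
Before starting the service, a quick check that the move produced a usable checkpoint directory can save a failed launch. This sketch assumes a typical Hugging Face checkpoint layout (a `config.json`, a `tokenizer_config.json`, and weight files ending in `.safetensors` or `.bin`); the exact file list of your download may differ, and both helper names are hypothetical.

```python
from pathlib import Path

# Assumption: a typical Hugging Face checkpoint layout; adjust to your download.
EXPECTED_FILES = ("config.json", "tokenizer_config.json")


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


def has_weights(model_dir):
    """Return True if at least one weight file is present in model_dir."""
    root = Path(model_dir)
    return any(root.glob("*.safetensors")) or any(root.glob("*.bin"))
```

For example, `missing_model_files("/root/ai-models/AI-ModelScope/Phi-3___5-mini-instruct")` should return an empty list when the directory is complete.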

3.2 Preparing the Project Structure

Create the following project directory structure:

    /root/Phi-3.5-mini-instruct/
    ├── webui.py          # Gradio WebUI main program
    ├── logs/
    │   ├── phi35.log     # stdout log
    │   └── phi35.err     # stderr log
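
The layout above can also be created programmatically, which is convenient because Supervisor expects the log directory to exist before it starts the program. A minimal sketch follows; `create_project_layout` is an illustrative helper name.

```python
from pathlib import Path


def create_project_layout(root):
    """Create the logs/ directory and empty log files under root."""
    logs = Path(root) / "logs"
    logs.mkdir(parents=True, exist_ok=True)
    for name in ("phi35.log", "phi35.err"):
        (logs / name).touch(exist_ok=True)
    return logs
```

Calling `create_project_layout("/root/Phi-3.5-mini-instruct")` is safe to repeat, since existing directories and files are left in place.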

3.3 Writing the Startup Script

Create webui.py with the following basic startup code:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    import gradio as gr
    import torch

    model_path = "/root/ai-models/AI-ModelScope/Phi-3___5-mini-instruct"

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        device_map="auto"
    )

    def generate(text, max_length=256, temperature=0.3, top_p=0.8, top_k=20, repetition_penalty=1.1):
        # Move the inputs to the device chosen by device_map
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs,
            max_length=int(max_length),  # total length, prompt included; sliders may pass floats
            do_sample=True,              # needed for temperature/top_p/top_k to take effect
            temperature=temperature,
            top_p=top_p,
            top_k=int(top_k),
            repetition_penalty=repetition_penalty,
            use_cache=False  # avoids the DynamicCache bug, at the cost of slower generation
        )
        return tokenizer.decode(outputs[0], skip_special_tokens=True)

    iface = gr.Interface(
        fn=generate,
        inputs=[
            gr.Textbox(label="Input text"),
            gr.Slider(32, 1024, value=256, label="Max length"),
            gr.Slider(0.1, 1.0, value=0.3, label="Temperature"),
            gr.Slider(0.1, 1.0, value=0.8, label="Top-p"),
            gr.Slider(1, 100, value=20, label="Top-k"),
            gr.Slider(1.0, 2.0, value=1.1, label="Repetition penalty")
        ],
        outputs="text",
        title="Phi-3.5-mini-instruct Demo"
    )

    iface.launch(server_name="0.0.0.0", server_port=7860)
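
Because the web UI is exposed on 0.0.0.0, it may be worth validating the slider values before they reach model.generate. The sketch below is one possible approach, not part of the original script: `sanitize_generation_params` is a hypothetical helper that clamps each value to the slider ranges used above and casts the integer-valued parameters.

```python
def sanitize_generation_params(max_length, temperature, top_p, top_k, repetition_penalty):
    """Clamp UI values to the slider ranges and cast to the expected types."""

    def clamp(value, low, high):
        return max(low, min(high, value))

    return {
        "max_length": int(clamp(max_length, 32, 1024)),
        "temperature": float(clamp(temperature, 0.1, 1.0)),
        "top_p": float(clamp(top_p, 0.1, 1.0)),
        "top_k": int(clamp(top_k, 1, 100)),
        "repetition_penalty": float(clamp(repetition_penalty, 1.0, 2.0)),
    }
```

The returned dictionary could then be unpacked into the generation call, for example `model.generate(**inputs, **params)`.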

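4. Deployment Optimization Tips

4.1 Key Settings for Reducing VRAM Fragmentation

Add the following options when loading the model. Note that attn_implementation="flash_attention_2" requires the separate flash-attn package, which is not in the dependency list in section 2.2:

    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        device_map="auto",
        low_cpu_mem_usage=True,                   # lower peak CPU RAM while loading
        attn_implementation="flash_attention_2",  # use Flash Attention
        max_memory={0: "20GiB"}                   # cap the VRAM used for model placement on GPU 0
    )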
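
Two further adjustments are often combined with these loading options, shown here as a hedged sketch rather than as part of the original configuration. First, PyTorch's caching allocator reads the PYTORCH_CUDA_ALLOC_CONF environment variable, and its expandable_segments:True option is aimed at reducing fragmentation; it has to be set before the first CUDA allocation, so the simplest place is before `import torch`. Second, when the flash-attn package is not installed, falling back to the "sdpa" attention implementation avoids a load-time failure. Both helper names below are hypothetical.

```python
import importlib.util
import os


def configure_cuda_allocator():
    """Set the allocator option unless the user already chose a value."""
    return os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")


def pick_attn_implementation():
    """Use Flash Attention 2 only if the flash_attn package can be imported."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


configure_cuda_allocator()  # call this before `import torch`
print("attn_implementation:", pick_attn_implementation())
```

The value returned by `pick_attn_implementation()` can be passed as the `attn_implementation` argument of `from_pretrained`.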

4.2 Supervisor Configuration

Create the file /etc/supervisor/conf.d/phi-3.5-mini-instruct.conf:

    [program:phi-3.5-mini-instruct]
    command=/opt/miniconda3/envs/torch28/bin/python /root/Phi-3.5-mini-instruct/webui.py
    directory=/root/Phi-3.5-mini-instruct
    user=root
    autostart=true
    autorestart=true
    startretries=5
    stopwaitsecs=30
    stdout_logfile=/root/Phi-3.5-mini-instruct/logs/phi35.log
    stderr_logfile=/root/Phi-3.5-mini-instruct/logs/phi35.err
    environment=PATH="/opt/miniconda3/envs/torch28/bin:%(ENV_PATH)s"

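4.3 Periodic Memory Cleanup

Add a periodic cleanup function to webui.py:

    import gc
    import threading
    import time

    def memory_cleaner():
        while True:
            time.sleep(3600)          # clean up once an hour
            gc.collect()              # free unreachable Python objects first
            torch.cuda.empty_cache()  # then return unused cached blocks to the driver

    cleaner_thread = threading.Thread(target=memory_cleaner, daemon=True)
    cleaner_thread.start()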
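
To judge whether the cleanup is doing anything, it helps to log how much of the memory reserved by PyTorch is not currently backing live tensors. The sketch below uses `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()`; the ratio is only a rough proxy for fragmentation, and both helper names are illustrative.

```python
def unused_reserved_fraction(allocated_bytes, reserved_bytes):
    """Fraction of reserved VRAM that is not occupied by live tensors."""
    if reserved_bytes <= 0:
        return 0.0
    return max(0.0, (reserved_bytes - allocated_bytes) / reserved_bytes)


def sample_cuda_memory():
    """Return (allocated, reserved) in bytes, or None without torch or CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.memory_allocated(), torch.cuda.memory_reserved()


sample = sample_cuda_memory()
if sample is not None:
    print(f"unused reserved fraction: {unused_reserved_fraction(*sample):.2%}")
```

A fraction that keeps rising between cleanups suggests the allocator is holding blocks it cannot reuse, which is the situation the hourly cleanup is meant to relieve.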

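5. Troubleshooting

5.1 Handling the DynamicCache Error

If you hit the error 'DynamicCache' object has no attribute 'seen_tokens', there are two fixes:

Downgrade transformers (the 4.57.6 pin in section 2.2 already satisfies this constraint):

    pip install "transformers<5.0.0"

Disable the cache during generation. This trades speed for compatibility, since every step then recomputes attention over the full sequence:

    outputs = model.generate(..., use_cache=False)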
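
Rather than disabling the cache unconditionally, one option is to try cached generation first and fall back only when this specific error appears. The following is a sketch under the assumption that the failure surfaces as an AttributeError whose message mentions seen_tokens, as in the error quoted above. `generate_with_cache_fallback` is a hypothetical wrapper, and `generate_fn` stands for any callable that accepts `use_cache`, such as `model.generate`.

```python
def generate_with_cache_fallback(generate_fn, **kwargs):
    """Call generate_fn with the KV cache on, retrying without it on the known error."""
    try:
        return generate_fn(use_cache=True, **kwargs)
    except AttributeError as exc:
        # Only swallow the specific DynamicCache incompatibility.
        if "seen_tokens" not in str(exc):
            raise
        return generate_fn(use_cache=False, **kwargs)
```

In webui.py this might be called as `generate_with_cache_fallback(model.generate, **inputs, max_length=256)`.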

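5.2 Diagnosing VRAM Leaks

Monitor VRAM usage with the following command:

    watch -n 1 nvidia-smi --query-gpu=memory.used,utilization.gpu --format=csv

If VRAM usage keeps growing, try:

Lowering the max_length parameter
Reducing the number of concurrent requests
Confirming that use_cache=False is actually set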
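
Watching the numbers by eye works for short sessions, but for a 24-hour run it can be easier to capture the CSV output and check the trend afterwards. The sketch below parses rows in the `7423 MiB, 35 %` form that the command above prints and flags steady growth. The 100 MiB threshold and both function names are arbitrary illustrative choices.

```python
def parse_memory_used_mib(csv_text):
    """Extract memory.used values (MiB) from nvidia-smi CSV output, skipping headers."""
    samples = []
    for line in csv_text.splitlines():
        parts = line.split(",")[0].split()
        if len(parts) == 2 and parts[0].isdigit() and parts[1] == "MiB":
            samples.append(int(parts[0]))
    return samples


def grows_steadily(samples, min_increase_mib=100):
    """True if usage never drops and rises by at least min_increase_mib overall."""
    if len(samples) < 3:
        return False
    non_decreasing = all(b >= a for a, b in zip(samples, samples[1:]))
    return non_decreasing and samples[-1] - samples[0] >= min_increase_mib
```

A `True` result from `grows_steadily` is a hint to investigate, not proof of a leak, since usage also rises legitimately while longer requests are being served.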

5.3 Checking Service Stability

Check the service status:

    supervisorctl status phi-3.5-mini-instruct

View the error log:

    tail -f /root/Phi-3.5-mini-instruct/logs/phi35.err
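
Supervisor reports whether the process is alive, but not whether the web UI is actually accepting connections. A simple TCP probe against the port used in webui.py (7860) can complement the checks above. This is a minimal sketch; `is_port_open` is an illustrative helper, and a successful connection does not guarantee that generation itself works.

```python
import socket


def is_port_open(host="127.0.0.1", port=7860, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Such a probe could be run from cron or a monitoring script and combined with `supervisorctl restart phi-3.5-mini-instruct` when it fails repeatedly.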

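6. Summary

With the optimizations in this tutorial, Phi-3.5-mini-instruct can run stably on an RTX 4090 for more than 24 hours, with VRAM usage holding at 7-8GB. The key points are:

Using flash_attention_2 to make attention computation more efficient
Setting low_cpu_mem_usage to lower peak CPU memory during model loading
Clearing cached VRAM periodically
Using Supervisor for automatic restarts
Handling the DynamicCache compatibility issue correctly

These optimizations are not specific to Phi-3.5-mini-instruct; they also apply to deploying other LLMs of similar size.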

