Phi-3.5-mini-instruct RTX 4090部署教程：7860端口WebUI访问+API测试全步骤-深圳市維司達科技有限公司

Phi-3.5-mini-instruct RTX 4090部署教程：7860端口WebUI访问+API测试全步骤

1. 项目介绍

Phi-3.5-mini-instruct是微软推出的轻量级开源指令微调大模型，在长上下文代码理解（RepoQA）、多语言MMLU等基准测试中表现优异，显著超越同规模模型，部分任务甚至能与更大模型媲美。该模型特别适合本地或边缘部署，在RTX 4090单卡上仅需约7GB显存即可流畅运行。

核心优势：

轻量化：7.6GB模型大小，7.7GB显存占用
高性能：在代码理解和多语言任务中表现突出
易部署：支持Gradio WebUI和API访问

2. 环境准备

2.1 硬件要求

GPU：NVIDIA RTX 4090（23GB VRAM）
显存：至少8GB可用显存
存储：至少15GB可用空间（模型+环境）

2.2 软件依赖

conda create -n torch28 python=3.9 conda activate torch28 pip install transformers==4.57.6 protobuf==7.34.1 gradio==6.6.0 torch==2.8.0+cu128

重要提示：避免使用transformers 5.5.0版本，该版本存在DynamicCache bug会导致生成错误。

3. 模型部署

3.1 项目结构准备

mkdir -p /root/Phi-3.5-mini-instruct/logs cd /root/Phi-3.5-mini-instruct

3.2 下载模型

将模型放置到指定路径：

mkdir -p /root/ai-models/AI-ModelScope/ # 假设模型已下载到/root/ai-models/AI-ModelScope/Phi-3___5-mini-instruct

3.3 创建WebUI主程序

创建webui.py文件：

import gradio as gr from transformers import AutoModelForCausalLM, AutoTokenizer model_path = "/root/ai-models/AI-ModelScope/Phi-3___5-mini-instruct" tokenizer = AutoTokenizer.from_pretrained(model_path) model = AutoModelForCausalLM.from_pretrained(model_path).cuda() def generate(text, max_length=256, temperature=0.3, top_p=0.8, top_k=20, repetition_penalty=1.1): inputs = tokenizer(text, return_tensors="pt").to("cuda") outputs = model.generate( **inputs, max_length=max_length, temperature=temperature, top_p=top_p, top_k=top_k, repetition_penalty=repetition_penalty, use_cache=False # 避免transformers 5.5.0的bug ) return tokenizer.decode(outputs[0], skip_special_tokens=True) iface = gr.Interface( fn=generate, inputs=[ gr.Textbox(label="输入文本"), gr.Slider(32, 1024, value=256, label="最大长度"), gr.Slider(0.1, 1.0, value=0.3, label="Temperature"), gr.Slider(0.1, 1.0, value=0.8, label="Top-p"), gr.Slider(1, 100, value=20, label="Top-k"), gr.Slider(1.0, 2.0, value=1.1, label="重复惩罚") ], outputs="text", title="Phi-3.5-mini-instruct 演示" ) iface.launch(server_name="0.0.0.0", server_port=7860)

4. 服务管理

4.1 Supervisor配置

创建配置文件/etc/supervisor/conf.d/phi-3.5-mini-instruct.conf：

[program:phi-3.5-mini-instruct] command=/opt/miniconda3/envs/torch28/bin/python /root/Phi-3.5-mini-instruct/webui.py directory=/root/Phi-3.5-mini-instruct user=root autostart=true autorestart=true stdout_logfile=/root/Phi-3.5-mini-instruct/logs/phi35.log stderr_logfile=/root/Phi-3.5-mini-instruct/logs/phi35.err environment=PATH="/opt/miniconda3/envs/torch28/bin:%(ENV_PATH)s"

4.2 服务控制命令

# 启动服务 supervisorctl start phi-3.5-mini-instruct # 停止服务 supervisorctl stop phi-3.5-mini-instruct # 重启服务 supervisorctl restart phi-3.5-mini-instruct # 查看状态 supervisorctl status phi-3.5-mini-instruct # 查看日志 tail -f /root/Phi-3.5-mini-instruct/logs/phi35.log

5. 访问与测试

5.1 WebUI访问

服务启动后，通过浏览器访问：

http://服务器IP:7860

界面提供以下参数调节：

最大长度：控制生成文本长度（32-1024）
Temperature：控制生成随机性（0.1-1.0）
Top-p：核采样概率（0.1-1.0）
Top-k：Top-k采样（1-100）
重复惩罚：避免重复（1.0-2.0）

5.2 API测试

使用curl测试API接口：

curl -X POST http://localhost:7860/gradio_api/call/generate \ -H "Content-Type: application/json" \ -d '{"data":["你好，请介绍一下Phi-3.5模型",256,0.3,0.8,20,1.1]}'

6. 常见问题解决

6.1 服务启动失败

检查错误日志：

tail /root/Phi-3.5-mini-instruct/logs/phi35.err

常见原因：

端口冲突：检查7860端口是否被占用
```
ss -tlnp | grep 7860
```

GPU不可用：验证CUDA是否可用

python -c "import torch; print(torch.cuda.is_available())"

6.2 生成质量不佳

调整参数：

降低temperature（0.1-0.3）
减小max_length
增加repetition_penalty（1.2-1.5）

6.3 显存不足

检查GPU使用情况：

nvidia-smi --query-gpu=memory.used,utilization.gpu --format=csv

优化建议：

减小max_length
使用更低精度的模型（如4bit量化）

7. 总结

通过本教程，您已经完成了Phi-3.5-mini-instruct在RTX 4090上的完整部署流程，包括：

环境准备与依赖安装
模型部署与WebUI配置
Supervisor服务管理
WebUI和API访问测试
常见问题解决方法

该模型在保持轻量化的同时提供了优秀的性能表现，特别适合需要本地部署的开发者使用。通过Gradio提供的友好界面，即使没有编程经验的用户也能轻松体验模型能力。

获取更多AI镜像
想探索更多AI镜像和应用场景？访问 CSDN星图镜像广场，提供丰富的预置镜像，覆盖大模型推理、图像生成、视频生成、模型微调等多个领域，支持一键部署。

Phi-3.5-mini-instruct RTX 4090部署教程：7860端口WebUI访问+API测试全步骤