Elastic Scaling for LLM Serving: Metric-Driven HPA and GPU Resource Pooling in Engineering Practice - 深圳市維司達科技有限公司

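1. Why Scaling LLM Services Is Different: Cold Starts, GPU Fragmentation, and Metric Lag

Elastic scaling for LLM inference differs fundamentally from scaling traditional microservices. A traditional microservice scales out by starting a new Pod and registering it with the service registry, which usually completes within 5-10 seconds. LLM inference services face three additional challenges.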

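Cold-start bottleneck: after a new Pod starts, it must load the model weights from disk into GPU memory, which takes roughly 30 seconds for a 7B model and roughly 2 minutes for a 70B model. During that time the Pod cannot serve any inference requests. After the native Kubernetes HPA triggers a scale-up, it still has to wait for the Pod to become ready, and a readiness probe (readinessProbe) that only checks whether the serving process is up cannot tell "Pod started but model not loaded" apart from "Pod ready to take traffic". The probe has to target an endpoint that reports whether the model has finished loading.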
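One way to make readiness reflect model state is to have the probe hit an endpoint that returns 503 until the weights are in GPU memory. The sketch below shows only that state logic. `ModelState` and `readiness_status` are hypothetical names, and wiring the function to the HTTP path that the readinessProbe targets is left to whatever serving framework is in use.

```python
import threading


class ModelState:
    """Tracks whether model weights have finished loading (hypothetical helper)."""

    def __init__(self) -> None:
        self._loaded = threading.Event()

    def mark_loaded(self) -> None:
        # Call this once the weights are resident in GPU memory.
        self._loaded.set()

    @property
    def loaded(self) -> bool:
        return self._loaded.is_set()


def readiness_status(state: ModelState) -> tuple[int, str]:
    """HTTP status and body that a readiness endpoint could return."""
    if state.loaded:
        return 200, "ready"
    # 503 keeps the Pod out of the Service endpoints while the process is already up.
    return 503, "model loading"
```

With this split, "process started" and "model loaded" become two observable states instead of one.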

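GPU resource fragmentation: the default Kubernetes scheduler places Pods by CPU, memory, and the number of GPUs requested; it is not aware of GPU model or topology. In a cluster that mixes GPU models (for example A100 and V100), the scheduler may place an inference Pod that needs an A100 on a V100 node unless the Pod spec constrains node selection, which leads to insufficient GPU memory or substandard inference performance. In addition, GPUs are allocated exclusively: once a Pod holds a whole GPU and the node has no GPU left, the node's remaining CPU and memory cannot be used by other GPU Pods, which strands those resources.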

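Metric lag: the HPA controller evaluates metrics every 15 seconds by default, and the values it sees are usually averages over the past 1-5 minutes, depending on how the metrics pipeline aggregates them. For pulse-like traffic (such as the instant a major promotion opens), the combined collection and decision lag makes scaling respond too late: the scale-up completes only after the traffic peak has passed.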
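To see how these delays stack up, the following sketch adds them into a rough upper bound on reaction time. The purely additive model and the function name are assumptions for illustration; the input figures are the ones quoted in this section (15 s evaluation interval, a 1-minute metric window, about 10 s to start a Pod, about 2 minutes to load a 70B model).

```python
def worst_case_reaction_seconds(
    sync_period_s: float,
    metric_window_s: float,
    pod_start_s: float,
    model_load_s: float,
) -> float:
    """Rough upper bound on time from burst start to a new Pod serving traffic."""
    return sync_period_s + metric_window_s + pod_start_s + model_load_s


# 70B model with a 1-minute metric window, figures taken from the text above.
delay = worst_case_reaction_seconds(15, 60, 10, 120)
print(f"{delay:.0f} s")  # 205 s
```

Even with the shortest metric window, model loading dominates the total, which is why the architecture in the next section attacks the cold start directly.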

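2. Elastic Scaling Architecture: Predictive Scaling + Warm Pool + GPU Topology-Aware Scheduling

Addressing these three challenges requires an elastic scaling system that goes beyond the native HPA.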

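flowchart TD
    subgraph INGRESS["Traffic ingress"]
        REQ["Inference request"] --> GATEWAY["API gateway"]
        GATEWAY --> QUEUE["Request queue"]
    end
    subgraph METRICS_LAYER["Metrics collection layer"]
        METRICS["Metrics collector<br/>queue depth / TTFT / GPU utilization"]
        PREDICTOR["Traffic predictor<br/>forecast from historical cycles"]
    end
    QUEUE --> METRICS
    subgraph DECISION["Scaling decision layer"]
        HPA_ENHANCED["Enhanced HPA controller"]
        WARM_POOL["Warm pool manager"]
        SCHEDULER["GPU topology-aware scheduler"]
    end
    METRICS --> HPA_ENHANCED
    PREDICTOR --> HPA_ENHANCED
    HPA_ENHANCED -->|scale-up command| WARM_POOL
    HPA_ENHANCED -->|scale-up command| SCHEDULER
    subgraph GPU_POOL["GPU resource pool"]
        subgraph ACTIVE["Active instances"]
            ACTIVE1["Inference Pod 1<br/>model loaded"]
            ACTIVE2["Inference Pod 2<br/>model loaded"]
        end
        subgraph WARM["Warm pool"]
            WARM1["Warm Pod 1<br/>model preloaded<br/>not registered in routing"]
            WARM2["Warm Pod 2<br/>model preloaded<br/>not registered in routing"]
        end
        subgraph COLD["Cold pool"]
            COLD1["GPU node<br/>unallocated"]
        end
    end
    WARM_POOL -->|activate in seconds| WARM1
    WARM_POOL -->|activate in seconds| WARM2
    SCHEDULER -->|GPU-aware scheduling| COLD1
    WARM1 -->|register in routing after activation| ACTIVE1
    style HPA_ENHANCED fill:#e74c3c,color:#fff
    style WARM_POOL fill:#27ae60,color:#fff
    style SCHEDULER fill:#3498db,color:#fff
    style WARM1 fill:#e67e22,color:#fff
    style WARM2 fill:#e67e22,color:#fff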

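Predictive scaling: forecast traffic trends from historical cycles (for example daily peak hours or weekly promotion days) and trigger a pre-scale 5 minutes before the peak arrives. Predictive scaling complements the reactive HPA rather than replacing it: predictive scaling handles foreseeable traffic patterns, and the reactive HPA handles unexpected bursts.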
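A minimal sketch of the predictive side, assuming the simplest possible forecast: the mean load observed at the same minute of day on previous days. The function names, the `history` layout, and the per-replica capacity of 4 QPS are all hypothetical; a production predictor would likely use a richer seasonal model.

```python
import math
from statistics import mean


def forecast_qps(history: dict[int, list[float]], minute_of_day: int) -> float:
    """Forecast load as the mean of the same minute on previous days."""
    samples = history.get(minute_of_day, [])
    return mean(samples) if samples else 0.0


def prescale_replicas(
    history: dict[int, list[float]],
    now_minute: int,
    lead_minutes: int = 5,
    qps_per_replica: float = 4.0,  # assumed per-replica capacity
    min_replicas: int = 2,
) -> int:
    """Replica count that should be ready `lead_minutes` from now."""
    target_minute = (now_minute + lead_minutes) % 1440  # wrap around midnight
    expected = forecast_qps(history, target_minute)
    return max(min_replicas, math.ceil(expected / qps_per_replica))
```

One possible way to combine the two mechanisms is to treat this result as a floor on the replica count, so the reactive HPA can still scale higher when a burst exceeds the forecast.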

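Warm pool: maintain a set of Pods that have loaded the model but are not registered in routing. A warm Pod already holds the model weights in GPU memory but does not serve external traffic. When the HPA triggers a scale-up, the warm Pod only needs to be registered in service routing to start taking traffic, which cuts activation time from minutes to seconds. The warm pool size is adjusted dynamically according to historical scale-up frequency and the cost budget.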
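The sizing rule can be sketched as "cover the observed demand, capped by what the budget affords". The function and parameter names are hypothetical, and using the peak number of scale-ups in a recent window as the demand signal is one assumption among several reasonable ones.

```python
import math


def warm_pool_size(
    peak_scaleups_per_window: int,
    budget_per_hour: float,
    gpu_price_per_hour: float,
    gpus_per_pod: int = 1,
) -> int:
    """Warm Pods to keep: driven by demand, capped by the hourly budget."""
    affordable = math.floor(budget_per_hour / (gpu_price_per_hour * gpus_per_pod))
    return max(0, min(peak_scaleups_per_window, affordable))
```

For example, with a peak of 3 scale-ups per window, a budget of 60 yuan per hour, and a GPU price of 25 yuan per hour, the budget caps the pool at 2 warm Pods.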

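GPU topology-aware scheduling: a custom Kubernetes scheduler that considers GPU model, GPU memory size, and NVLink topology when placing inference Pods. For models that need multi-GPU parallelism (for example a 70B model that needs 4 A100s), the scheduler ensures that the 4 GPUs are on the same node and connected through NVLink, avoiding the performance loss of cross-node communication.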
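The NVLink requirement can be expressed as a small graph check: among the free GPUs on a node, find a group in which every pair is directly linked. The sketch below assumes the link map has already been collected (for instance from the matrix that `nvidia-smi topo -m` prints) and is symmetric; the function name and data layout are hypothetical.

```python
from itertools import combinations
from typing import Optional


def find_nvlink_group(
    free_gpus: list[int],
    nvlink: dict[int, set[int]],
    count: int,
) -> Optional[tuple[int, ...]]:
    """Return `count` free GPUs that are pairwise NVLink-connected, if any."""
    for group in combinations(sorted(free_gpus), count):
        # Every pair inside the candidate group must share a direct link.
        if all(b in nvlink.get(a, set()) for a, b in combinations(group, 2)):
            return group
    return None
```

The brute-force search is acceptable here because a node rarely has more than 8 GPUs.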

3. Production-Grade Implementation

3.1 Enhanced HPA: Scaling on Custom Metrics

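# HPA driven by custom metrics
# Core design: scale on queue depth and TTFT rather than the traditional
# CPU utilization, because CPU utilization does not reflect the real load
# of GPU inference.
# The Pods metrics below must be served through the custom metrics API
# (for example by a Prometheus adapter).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
  namespace: ai-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # Metric 1: request queue depth
    # A depth above 10 means the current instances cannot keep up.
    # Queue depth is used instead of QPS because inference requests are
    # long-running, so QPS does not show how much work is waiting.
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth
        target:
          type: AverageValue
          averageValue: "10"
    # Metric 2: time to first token (TTFT)
    # TTFT above 2 s indicates insufficient GPU capacity.
    # TTFT measures user experience directly, so it carries more business
    # meaning than resource utilization.
    - type: Pods
      pods:
        metric:
          name: inference_ttft_seconds
        target:
          type: AverageValue
          averageValue: "2"
    # Metric 3: GPU memory utilization
    # Utilization above 85% means the KV cache is close to its limit;
    # adding more concurrency would cause OOM.
    - type: Pods
      pods:
        metric:
          name: gpu_memory_utilization
        target:
          type: AverageValue
          averageValue: "85"
  behavior:
    scaleUp:
      # Fast scale-up: act within 30 s of a metric exceeding its target,
      # because inference is latency-sensitive. The Kubernetes default
      # scale-up window is 0 s; 30 s filters out one-off spikes.
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      # Slow scale-down: observe for 5 minutes first, so that traffic
      # fluctuations do not cause repeated scale-down and scale-up.
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120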
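To make the effect of the three targets concrete, the sketch below reproduces the replica calculation described in the Kubernetes HPA documentation: the desired count is ceil(currentReplicas × currentValue / targetValue), changes within a tolerance (0.1 by default) are skipped, and with several metrics the largest proposal wins. The function names are hypothetical, and the sketch ignores the `behavior` rate limits and stabilization windows.

```python
import math


def desired_replicas(current: int, value: float, target: float,
                     tolerance: float = 0.1) -> int:
    """ceil(current * value / target), skipped when the ratio is within tolerance."""
    ratio = value / target
    if abs(ratio - 1.0) <= tolerance:
        return current
    return math.ceil(current * ratio)


def hpa_decision(current: int, metrics: dict[str, tuple[float, float]],
                 min_replicas: int = 2, max_replicas: int = 10) -> int:
    """With several metrics, the largest proposed replica count wins."""
    proposed = max(desired_replicas(current, v, t) for v, t in metrics.values())
    return max(min_replicas, min(max_replicas, proposed))
```

With 4 replicas, an average queue depth of 25 against a target of 10 proposes 10 replicas even if TTFT and GPU memory are below target, so a single breached metric is enough to drive a scale-up. The `scaleUp` policy above would then limit the actual change to 2 Pods per 60 seconds.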

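3.2 Warm Pool Manager

"""
Warm pool manager.

Core design: keep a set of warm Pods with the model already loaded.
When the HPA triggers a scale-up, activate a warm Pod first instead of
creating a new one.

Assumptions: the Deployment's Pod template carries the label warm=true,
the Deployment selector does not include the warm label, and the Service
selects Pods with warm=false.
"""
import asyncio
import logging
from dataclasses import dataclass
from typing import List, Optional

from kubernetes import client, config

logger = logging.getLogger(__name__)


@dataclass
class WarmPod:
    """State of a warm Pod."""
    name: str
    namespace: str
    model_loaded: bool
    registered: bool  # whether the Pod is registered in service routing


class WarmPoolManager:
    """Warm pool manager."""

    def __init__(
        self,
        namespace: str,
        deployment_name: str,
        target_warm_count: int = 2,
    ):
        # Load the in-cluster Kubernetes configuration
        config.load_incluster_config()
        self.apps_api = client.AppsV1Api()
        self.core_api = client.CoreV1Api()
        self.namespace = namespace
        self.deployment_name = deployment_name
        self.target_warm_count = target_warm_count
        self.warm_pods: List[WarmPod] = []

    def _refresh_warm_pods(self) -> None:
        """
        Rebuild the warm Pod list from the cluster.

        A Pod counts as model-loaded once it is Ready, which assumes the
        readiness probe reflects model load. Also assumes that only this
        Deployment's Pods carry the warm label in the namespace.
        """
        pods = self.core_api.list_namespaced_pod(
            namespace=self.namespace, label_selector="warm=true"
        )
        self.warm_pods = [
            WarmPod(
                name=p.metadata.name,
                namespace=self.namespace,
                model_loaded=any(
                    c.type == "Ready" and c.status == "True"
                    for c in (p.status.conditions or [])
                ),
                registered=False,
            )
            for p in pods.items
        ]

    async def maintain_warm_pool(self):
        """
        Keep the warm pool at its target size.

        Periodically count warm Pods and create new ones when the pool is short.
        Warm Pods carry a dedicated label and do not take part in service routing.
        """
        while True:
            try:
                # The Kubernetes client is blocking, so run it in a worker thread
                await asyncio.to_thread(self._refresh_warm_pods)
                # Pods that are still loading count toward the pool, so the
                # loop does not over-provision while they start up
                current_warm = len(
                    [p for p in self.warm_pods if not p.registered]
                )
                deficit = self.target_warm_count - current_warm
                if deficit > 0:
                    logger.info(
                        f"Warm pool short, current={current_warm}, "
                        f"target={self.target_warm_count}, "
                        f"adding={deficit}"
                    )
                    # Raise the Deployment replica count; new Pods start
                    # with warm=true and are not selected by the Service
                    await self._scale_up(deficit)
            except Exception as e:
                logger.error(f"Warm pool maintenance failed: {e}")
            await asyncio.sleep(30)

    async def activate_warm_pod(self) -> Optional[str]:
        """
        Activate one warm Pod.

        Registers the Pod in service routing so it can take traffic.
        Returns the name of the activated Pod, or None if none is available.
        """
        for pod in self.warm_pods:
            if pod.model_loaded and not pod.registered:
                # Flip the warm label so the Service selector picks up the Pod
                await asyncio.to_thread(
                    self.core_api.patch_namespaced_pod,
                    name=pod.name,
                    namespace=self.namespace,
                    body={
                        "metadata": {
                            "labels": {"warm": "false", "ready": "true"}
                        }
                    },
                )
                pod.registered = True
                logger.info(f"Warm Pod activated: {pod.name}")
                return pod.name
        logger.warning("No warm Pod available")
        return None

    async def _scale_up(self, count: int):
        """
        Raise the Deployment replica count.

        If an HPA targets the same Deployment it also writes spec.replicas,
        so the two controllers need to be coordinated.
        """
        deployment = await asyncio.to_thread(
            self.apps_api.read_namespaced_deployment,
            name=self.deployment_name,
            namespace=self.namespace,
        )
        current_replicas = deployment.spec.replicas or 0
        new_replicas = current_replicas + count
        await asyncio.to_thread(
            self.apps_api.patch_namespaced_deployment,
            name=self.deployment_name,
            namespace=self.namespace,
            body={"spec": {"replicas": new_replicas}},
        )
        logger.info(
            f"Replica count changed: {current_replicas} -> {new_replicas}"
        )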
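When a scale-up asks for more Pods than the warm pool holds, the request has to be split between activation and cold creation. The helper below is a hypothetical sketch of that split, kept free of Kubernetes calls so the policy can be tested in isolation; it is not part of the manager above.

```python
def plan_scale_up(needed: int, warm_available: int) -> tuple[int, int]:
    """Split a scale-up into (warm Pods to activate, cold Pods to create)."""
    if needed < 0 or warm_available < 0:
        raise ValueError("counts must be non-negative")
    activate = min(needed, warm_available)
    return activate, needed - activate
```

For a request of 3 Pods with 2 warm Pods available, 2 are activated within seconds and 1 goes through the full cold start.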

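3.3 GPU Topology-Aware Scheduling

"""
GPU topology-aware scheduler.

Core design: take GPU model and NVLink topology into account when placing
inference Pods, so that the GPUs of a multi-GPU model sit on one node and
are NVLink-connected. Same-node placement is used here as a proxy for
NVLink connectivity; the link topology itself is not inspected.
"""
import logging
from typing import Dict, List, Optional

from kubernetes import client, config

logger = logging.getLogger(__name__)


class GPUTopologyScheduler:
    """GPU topology-aware scheduler."""

    def __init__(self):
        config.load_incluster_config()
        self.core_api = client.CoreV1Api()

    def find_best_node(
        self,
        required_gpu_count: int,
        required_gpu_model: str,
        required_vram_gb: float,
    ) -> Optional[str]:
        """
        Find the best GPU node for an inference Pod.

        Strategy:
        1. Prefer nodes whose GPU model matches.
        2. For multi-GPU requests, prefer GPUs on a single node (NVLink).
        3. Avoid placing the Pod on nodes with fragmented GPUs.
        """
        nodes = self.core_api.list_node()
        gpu_nodes = self._filter_gpu_nodes(
            nodes, required_gpu_model, required_vram_gb
        )
        if not gpu_nodes:
            logger.error(
                f"No GPU node meets the requirements: "
                f"model={required_gpu_model}, "
                f"count={required_gpu_count}"
            )
            return None

        # Sort by free GPU count, descending.
        # For multi-GPU requests, prefer a single node with enough GPUs:
        # cross-node GPU communication latency is 10x or more that of NVLink.
        gpu_nodes.sort(key=lambda x: x['available_gpus'], reverse=True)
        for node_info in gpu_nodes:
            if node_info['available_gpus'] >= required_gpu_count:
                logger.info(
                    f"Selected node: {node_info['name']}, "
                    f"free GPUs={node_info['available_gpus']}"
                )
                return node_info['name']

        # No single node can satisfy the request. Return the node with the
        # most free GPUs as the starting point for a cross-node deployment.
        # A single Pod cannot span nodes, so a Pod that requests more GPUs
        # than this node has free would stay Pending.
        logger.warning(
            f"No single node has {required_gpu_count} free GPUs, "
            f"falling back to cross-node deployment, performance may drop"
        )
        return gpu_nodes[0]['name']

    def _filter_gpu_nodes(
        self, nodes, gpu_model: str, required_vram: float
    ) -> List[Dict]:
        """Select GPU nodes that meet the requirements."""
        result = []
        for node in nodes.items:
            labels = node.metadata.labels or {}
            # Check that the node has the requested GPU model
            gpu_product = labels.get('nvidia.com/gpu.product', '')
            if gpu_model not in gpu_product:
                continue
            # Per-GPU memory in MiB, assuming GPU Feature Discovery sets this
            # label. Nodes without it are treated as not meeting the
            # requirement. The reported value can be slightly below the
            # nominal size, so leave headroom in required_vram.
            vram_gb = int(labels.get('nvidia.com/gpu.memory', '0')) / 1024
            if vram_gb < required_vram:
                continue
            # Allocatable GPU count on the node
            allocatable_gpus = int(
                (node.status.allocatable or {}).get('nvidia.com/gpu', '0')
            )
            # GPUs already allocated to Pods
            allocated_gpus = self._get_allocated_gpus(node.metadata.name)
            available_gpus = allocatable_gpus - allocated_gpus
            if available_gpus > 0:
                result.append({
                    'name': node.metadata.name,
                    'available_gpus': available_gpus,
                    'gpu_model': gpu_product,
                    'total_vram_gb': vram_gb,
                })
        return result

    def _get_allocated_gpus(self, node_name: str) -> int:
        """Count the GPUs allocated to Pods on a node."""
        # Skip finished Pods: they no longer hold GPUs
        pods = self.core_api.list_pod_for_all_namespaces(
            field_selector=(
                f"spec.nodeName={node_name},"
                "status.phase!=Succeeded,status.phase!=Failed"
            )
        )
        allocated = 0
        for pod in pods.items:
            for container in pod.spec.containers:
                limits = (
                    container.resources.limits if container.resources else None
                )
                if limits and 'nvidia.com/gpu' in limits:
                    allocated += int(limits['nvidia.com/gpu'])
        return allocated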
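The scheduler above sorts candidates by free GPU count in descending order, which places each Pod on the emptiest matching node. A different ordering, best fit, picks the node with the fewest free GPUs that still satisfies the request, which tends to keep large nodes intact for multi-GPU models. The sketch below illustrates that alternative on plain dictionaries shaped like the output of `_filter_gpu_nodes`; `best_fit_node` is a hypothetical helper, not part of the original scheduler.

```python
from typing import Optional


def best_fit_node(nodes: list[dict], required_gpus: int) -> Optional[str]:
    """Pick the node with the fewest free GPUs that still fits the request."""
    candidates = [n for n in nodes if n["available_gpus"] >= required_gpus]
    if not candidates:
        return None
    # Tie-break on name so the choice is deterministic.
    return min(candidates, key=lambda n: (n["available_gpus"], n["name"]))["name"]
```

Which ordering is better depends on the workload mix: best fit reduces fragmentation, while spreading leaves more per-node headroom for bursts.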

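4. The Costs of Elastic Scaling: Warm-Pool Cost, Scheduling Delay, and Scale-Down Risk

Warm-pool cost: each warm Pod holds a GPU exclusively and consumes it even when serving no traffic. Two warm Pods mean two GPUs sitting permanently idle; at an A100 price of 25 yuan per hour, that adds about 36,000 yuan per month (2 × 25 × 24 × 30). The warm pool size is a trade-off between business SLA and cost budget: the stricter the SLA, the larger the pool and the higher the cost.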
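The cost figure is simple arithmetic, and writing it as a function makes it easy to rerun for other pool sizes. The function name is hypothetical, and the 30-day month is an assumption matching the estimate in the text.

```python
def warm_pool_monthly_cost(
    warm_pods: int,
    gpus_per_pod: int = 1,
    price_per_gpu_hour: float = 25.0,  # yuan, the A100 price used in the text
    hours_per_month: int = 24 * 30,
) -> float:
    """Monthly cost in yuan of GPUs held by idle warm Pods."""
    return warm_pods * gpus_per_pod * price_per_gpu_hour * hours_per_month


print(warm_pool_monthly_cost(2))  # 36000.0
```

For a model that needs 4 GPUs per Pod, the same two-Pod pool costs four times as much, `warm_pool_monthly_cost(2, gpus_per_pod=4)`, which is 144,000 yuan per month.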

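Scheduling delay: even with a warm pool, 5-10 seconds pass between the HPA triggering a scale-up and the warm Pod being active (route registration plus health checks). When traffic spikes instantly, this 5-10 second gap can cause requests to queue up and latency to surge. The remedy is a request queue at the gateway layer that buffers requests during the gap and processes them once the warm Pod is active.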
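A minimal sketch of the buffering idea, assuming an asyncio-based gateway: requests go into a bounded queue, and a drain step waits until a backend signals readiness. `BufferingGateway` and `demo` are hypothetical names; a real gateway would also need per-request timeouts and would forward requests rather than return their IDs.

```python
import asyncio


class BufferingGateway:
    """Buffers requests while no backend is ready (hypothetical sketch)."""

    def __init__(self, max_buffered: int = 100) -> None:
        self._queue = asyncio.Queue(maxsize=max_buffered)

    def submit(self, request_id: str) -> bool:
        """Enqueue a request; return False (shed load) when the buffer is full."""
        try:
            self._queue.put_nowait(request_id)
            return True
        except asyncio.QueueFull:
            return False

    async def drain(self, backend_ready: asyncio.Event) -> list[str]:
        """Wait for a backend to become ready, then hand over buffered requests."""
        await backend_ready.wait()
        served = []
        while not self._queue.empty():
            served.append(self._queue.get_nowait())
        return served


async def demo() -> tuple[list[bool], list[str]]:
    gateway = BufferingGateway(max_buffered=2)
    ready = asyncio.Event()
    accepted = [gateway.submit(f"req-{i}") for i in range(3)]
    # Simulate the warm Pod finishing activation a moment later.
    asyncio.get_running_loop().call_later(0.01, ready.set)
    return accepted, await gateway.drain(ready)
```

The bound on the queue matters: an unbounded buffer would turn a short activation gap into unbounded latency, so requests beyond the limit are rejected instead.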

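Scale-down risk: if traffic surges again after a scale-down, the whole scale-up process has to run again. Frequent scale-down and scale-up cycles hurt service stability and repeat the cost of loading the model onto the GPU each time. A conservative scale-down policy is recommended: an observation window of at least 5 minutes, removing only 1 Pod at a time.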
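The conservative policy can be sketched as two rules applied together: act on the highest replica recommendation seen inside the observation window (which is how the HPA scale-down stabilization window behaves), and never remove more than one Pod per step. The function name and defaults are hypothetical.

```python
def stabilized_scale_down(
    current: int,
    recent_recommendations: list[int],
    max_step: int = 1,
    min_replicas: int = 2,
) -> int:
    """Apply a scale-down window and a per-step limit.

    Using the highest recommendation seen in the window means a single
    low reading cannot trigger a scale-down.
    """
    if not recent_recommendations:
        return current
    target = max(recent_recommendations)
    if target >= current:
        return current  # this sketch only handles scale-down
    return max(min_replicas, target, current - max_step)
```

With 6 replicas and window recommendations of 3, 4, and 3, the result is 5: the window would allow going down to 4, but the step limit removes only one Pod.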

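5. Summary

Elastic scaling of LLM inference services faces three challenges: cold starts, GPU fragmentation, and metric lag. An enhanced HPA driven by business metrics such as queue depth and TTFT scales more accurately than the traditional CPU utilization metric; a warm pool cuts scale-up response time from minutes to seconds; GPU topology-aware scheduling keeps the GPUs of a multi-GPU model on the same node and avoids the performance loss of cross-node communication. Warm-pool cost and scale-down risk remain unavoidable, so a balance has to be found between SLA and cost.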

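Suggested rollout path: first, deploy a custom metrics collector for the inference service that exposes queue depth and TTFT; second, configure an HPA on those custom metrics in place of the default CPU utilization metric; third, build the warm pool manager and keep 2-3 warm Pods as a fast scale-up reserve; fourth, implement GPU topology-aware scheduling so that the GPUs of a multi-GPU model sit on the same node; fifth, build a monitoring dashboard for scaling events and warm pool state, and keep tuning the scaling parameters.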