Tsinghua Team Pioneers Lip Forcing: Real-Time Lip Sync in Just 2 Steps! - 深圳市維司達科技有限公司

Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization

Authors: Lip Forcing Team | Year: 2026 | arXiv: 2606.11180

II. Research Background

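Diffusion-based lip synchronization methods achieve strong visual quality and audio-visual alignment, but full-sequence bidirectional attention and multi-step denoising make them impractical for real-time inference.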

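Core challenge: how can generation be sped up from the teacher model's more than one second per frame (~0.8 FPS) to real time (>25 FPS) while preserving visual quality?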

IV. Experiments

Model scales: two student scales, 1.3B and 14B (both distilled from the 14B teacher)

Metrics: FPS (real-time throughput), SyncNet score (synchronization), LPIPS/SSIM (reference fidelity), time-to-first-frame
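Of these metrics, SSIM is simple enough to sketch directly; LPIPS and SyncNet scores depend on pretrained networks and are not reproduced here. The snippet below is a simplified, single-window (global) SSIM over two flattened grayscale frames. The function name `global_ssim` is hypothetical, and standard SSIM implementations instead average the same statistic over local Gaussian-weighted windows, so its values will differ from library results.

```python
from typing import Sequence


def global_ssim(x: Sequence[float], y: Sequence[float], data_range: float = 255.0) -> float:
    """Single-window SSIM between two equally sized, flattened grayscale frames.

    Simplification (assumption): standard SSIM averages this statistic over local
    (typically 11x11 Gaussian-weighted) windows; here one window covers the frame.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("frames must be non-empty and the same size")
    n = len(x)
    c1 = (0.01 * data_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizes the contrast/structure term
    mx = sum(x) / n
    my = sum(y) / n
    var_x = sum((a - mx) * (a - mx) for a in x) / n
    var_y = sum((b - my) * (b - my) for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx * mx + my * my + c1) * (var_x + var_y + c2)
    return num / den


if __name__ == "__main__":
    frame = [10.0, 50.0, 90.0, 200.0, 30.0, 120.0]
    brighter = [v + 5.0 for v in frame]  # same structure, shifted luminance
    print(global_ssim(frame, frame), global_ssim(frame, brighter))
```

Identical frames score 1.0; a pure brightness shift keeps the structure term at 1 but lowers the luminance term, so the score drops below 1.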

Model	FPS	Relative speedup
14B teacher	~0.8	1×
1.3B bidirectional baseline	~1.8	2.3× (vs. teacher)
1.3B Lip Forcing	31	17.6× (vs. same-scale bidirectional baseline)
14B Lip Forcing	~32	39.8× (vs. teacher)
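The table's numbers can be sanity-checked with a little arithmetic: at 25 FPS each frame has a 40 ms budget, while the teacher's ~0.8 FPS corresponds to roughly 1.25 s per frame. A minimal Python sketch using only the approximate values from the table (the helper names are illustrative, not from the paper):

```python
# Throughput figures copied from the table above; "~" entries are approximate.
TABLE_FPS = {
    "14B teacher": 0.8,
    "1.3B bidirectional baseline": 1.8,
    "1.3B Lip Forcing": 31.0,
    "14B Lip Forcing": 32.0,
}
REALTIME_FPS = 25.0  # real-time threshold quoted in the background section


def frame_budget_ms(fps: float) -> float:
    """Milliseconds available per frame at a given throughput."""
    return 1000.0 / fps


def implied_fps(reference_fps: float, speedup: float) -> float:
    """FPS of a model that is `speedup` times faster than a reference model."""
    return reference_fps * speedup


if __name__ == "__main__":
    for name, fps in TABLE_FPS.items():
        status = "real-time" if fps >= REALTIME_FPS else "offline"
        print(f"{name:30s} {fps:6.1f} FPS {frame_budget_ms(fps):8.1f} ms/frame  {status}")
    # Cross-check the reported speedups against the approximate table values.
    print("14B student implied by 39.8x:", implied_fps(TABLE_FPS["14B teacher"], 39.8))
    print("1.3B baseline implied by 17.6x:", TABLE_FPS["1.3B Lip Forcing"] / 17.6)
```

The implied values (about 31.8 FPS for the 14B student and about 1.76 FPS for the 1.3B bidirectional baseline) agree with the rounded table entries.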

Time-to-first-frame: sub-millisecond at both scales (far below every diffusion baseline).

Report generated: 2026-06-11 | Paper source: arXiv:2606.11180

Original abstract: Diffusion-based lip synchronization models achieve strong visual quality and audio-visual alignment, but full-sequence bidirectional attention and many denoising steps make them impractical for real-time inference. We present Lip Forcing, to our knowledge the first autoregressive diffusion method for video-to-video (V2V) lip synchronization, which distills a 14B audio-conditioned bidirectional video diffusion teacher into causal students. At inference, the students generate each chunk in only two denoising steps without inference-time CFG, enabling real-time lip synchronization. A lip-sync-specific teacher-trajectory analysis reveals a CFG fidelity-sync tradeoff: no-CFG predictions favor reference fidelity, whereas CFG-guided predictions favor synchronization within a mid-trajectory band. Lip Forcing translates this finding into three analysis-derived components: Sync-Window DMD, a two-step inference schedule, and a SyncNet-based reward. We validate Lip Forcing at two student scales, both distilled from the 14B teacher. The 1.3B student crosses into real-time streaming at 31 FPS, 17.6× faster than its same-scale bidirectional model. The 14B student, the largest diffusion model reported for V2V lip synchronization, runs 39.8× faster than its teacher at comparable reference fidelity. Time-to-first-frame is sub-millisecond at both scales, far below every diffusion baseline.
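The inference recipe in the abstract (causal chunk-by-chunk generation, two denoising steps per chunk, no inference-time CFG) can be illustrated with a toy loop that counts network function evaluations (NFEs). This is a minimal sketch under stated assumptions: `generate_stream`, `toy_model`, and the placeholder update rule are hypothetical and do not reproduce the paper's Sync-Window DMD, inference schedule, or SyncNet reward; only `cfg_combine` follows the standard classifier-free guidance formula, uncond + scale × (cond − uncond).

```python
import random
from typing import Callable, List, Optional, Tuple

Chunk = List[float]
# Hypothetical causal denoiser: (noisy_chunk, step, audio_chunk, past_chunks, use_audio) -> estimate
Denoiser = Callable[[Chunk, int, Chunk, List[Chunk], bool], Chunk]


def cfg_combine(uncond: Chunk, cond: Chunk, scale: float) -> Chunk:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def generate_stream(model: Denoiser, audio_chunks: List[Chunk], num_steps: int = 2,
                    cfg_scale: Optional[float] = None,
                    seed: int = 0) -> Tuple[List[Chunk], int]:
    """Generate chunks in order, each conditioned only on already-generated chunks.

    cfg_scale=None means no inference-time CFG (one NFE per step); any number
    enables CFG, which costs a second, audio-free NFE per step.
    Returns the generated chunks and the total NFE count.
    """
    rng = random.Random(seed)
    history: List[Chunk] = []
    nfe = 0
    for audio in audio_chunks:
        x = [rng.gauss(0.0, 1.0) for _ in audio]  # each chunk starts from noise
        for step in range(num_steps):
            pred = model(x, step, audio, history, True)
            nfe += 1
            if cfg_scale is not None:
                uncond = model(x, step, audio, history, False)
                nfe += 1
                pred = cfg_combine(uncond, pred, cfg_scale)
            x = pred  # placeholder: a real sampler would re-noise to the next timestep
        history.append(x)  # causal: later chunks may see this one, never the reverse
    return history, nfe


def toy_model(x: Chunk, step: int, audio: Chunk, history: List[Chunk], use_audio: bool) -> Chunk:
    """Toy stand-in that pulls the noisy chunk toward the audio features."""
    target = audio if use_audio else [0.0] * len(audio)
    return [0.5 * xi + 0.5 * ti for xi, ti in zip(x, target)]


if __name__ == "__main__":
    audio = [[0.1 * i, 0.2 * i] for i in range(10)]  # 10 hypothetical audio chunks
    _, few_step = generate_stream(toy_model, audio, num_steps=2)
    _, many_step = generate_stream(toy_model, audio, num_steps=50, cfg_scale=3.0)
    print(few_step, many_step)
```

For 10 chunks, two steps without CFG cost 20 NFEs, whereas a hypothetical 50-step sampler with CFG would spend 2 × 50 × 10 = 1,000 NFEs on the same chunks. Because each chunk depends only on earlier ones, the first chunk can be emitted after just its own two steps rather than after denoising the full sequence.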

PDF link: https://arxiv.org/pdf/2606.11180v1
