Text-to-Speech With Lip Synchronization Based on Speech-Assisted Text-to-Video Alignment and Masked Unit Prediction
- Abstract
- Text-to-speech (TTS) with lip synchronization (TTSLS) is the task of generating a speech signal synchronized with the lip movements in a video, given the text transcription and the video without speech. Previous approaches to TTSLS aligned the phoneme sequence and video frames using scaled dot-product attention with a diagonal constraint loss, which was employed to prevent a phoneme from being assigned to video frames that are too far away in time. However, the diagonal constraint loss implicitly assumes that all phonemes have roughly the same duration, which does not always hold because speaking styles differ. In this letter, we propose a TTSLS system based on speech-assisted text-to-video alignment and masked unit prediction. By utilizing the ground-truth speech signal available in the training phase, we construct a loss function for text-to-video alignment using the text-to-speech alignment obtained by a pre-trained TTS model. To deal with video frames without frontal lip images, we employ a masked unit prediction loss so that the unit predictor in the proposed system can estimate the masked units from the remaining units. In addition, we modify the probability distribution of the unit predictor using a learnable null embedding for video, inspired by classifier-free guidance. Experimental results demonstrate that the proposed method outperforms previous TTSLS systems in both lip-speech synchronization and speech recognition performance. (Illustrative sketches of the masked unit prediction and null-embedding components are given below the record.)
- Author(s)
- Ahn, Youngdo; Chae, Jongwook; Shin, Jong Won
- Issued Date
- 2025-02
- Type
- Article
- DOI
- 10.1109/LSP.2025.3537949
- URI
- https://scholar.gist.ac.kr/handle/local/9047
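- Illustrative sketches

Two components described in the abstract lend themselves to short illustrations: the masked unit prediction loss and the classifier-free-guidance-inspired use of a learnable null video embedding. The PyTorch sketches below are minimal, hypothetical readings of these ideas rather than the authors' implementation; all module names, dimensions, and hyperparameters are assumptions, and the full system would additionally condition the unit predictor on video and text features.

The first sketch shows a BERT-style masked unit prediction objective: some positions in a sequence of discrete speech units are replaced with a mask token, and the predictor is trained with cross-entropy to recover them from the remaining units.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedUnitPredictor(nn.Module):
    """Hypothetical unit predictor: discrete speech units are partially masked
    and the model recovers them from the unmasked context."""

    def __init__(self, num_units=1000, dim=256, num_layers=4, num_heads=4):
        super().__init__()
        self.mask_id = num_units                      # extra index used as [MASK]
        self.unit_embed = nn.Embedding(num_units + 1, dim)
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(dim, num_units)

    def forward(self, units, mask):
        # units: (B, T) ground-truth discrete units; mask: (B, T) bool, True = masked
        inp = units.masked_fill(mask, self.mask_id)
        return self.head(self.encoder(self.unit_embed(inp)))  # (B, T, num_units)


def masked_unit_loss(model, units, mask_ratio=0.15):
    """Cross-entropy computed only on the masked positions."""
    mask = torch.rand(units.shape, device=units.device) < mask_ratio
    logits = model(units, mask)
    return F.cross_entropy(logits[mask], units[mask])
```

The second sketch shows one plausible reading of the classifier-free-guidance-style adjustment: unit logits conditioned on the real video features are combined with logits obtained by substituting a learnable null embedding for the video.

```python
import torch
import torch.nn as nn

class NullVideoGuidance(nn.Module):
    """Hypothetical CFG-style adjustment: combine logits conditioned on video
    features with logits produced when the video is replaced by a learnable
    null embedding."""

    def __init__(self, predictor, video_dim, guidance_scale=1.0):
        super().__init__()
        self.predictor = predictor                    # (text_feats, video_feats) -> logits
        self.null_video = nn.Parameter(torch.zeros(1, 1, video_dim))
        self.w = guidance_scale

    def forward(self, text_feats, video_feats):
        logits_cond = self.predictor(text_feats, video_feats)
        logits_null = self.predictor(text_feats, self.null_video.expand_as(video_feats))
        # CFG-style combination of the two conditional distributions in logit space
        return (1.0 + self.w) * logits_cond - self.w * logits_null
```

The guidance scale, the logit-space combination rule, and where the null embedding is injected are illustrative assumptions; the abstract states only that the unit predictor's probability distribution is modified using a learnable null embedding for video.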