Text-to-Speech With Lip Synchronization Based on Speech-Assisted Text-to-Video Alignment and Masked Unit Prediction
- Abstract
- Text-to-speech (TTS) with lip synchronization (TTSLS) is the task of generating a speech signal synchronized with the lip movements in a video, given the text transcription and the video without speech. Previous approaches to TTSLS aligned the phoneme sequence and video frames using scaled dot-product attention with a diagonal constraint loss, which was employed to prevent a phoneme from being assigned to video frames that are too far away in time. However, the diagonal constraint loss implicitly assumes that all phonemes have roughly the same duration, which does not always hold because speaking styles differ. In this letter, we propose a TTSLS system based on speech-assisted text-to-video alignment and masked unit prediction. By utilizing the ground-truth speech signal available in the training phase, we construct a loss function for text-to-video alignment using the text-to-speech alignment obtained by a pre-trained TTS model. To deal with video frames without frontal lip images, we employ a masked unit prediction loss so that the unit predictor in the proposed system can estimate the masked units from the remaining units. In addition, we modify the probability distribution of the unit predictor using a learnable null embedding for video, inspired by classifier-free guidance. Experimental results demonstrate that the proposed method outperforms previous TTSLS systems in both lip-speech synchronization and speech recognition performance. (Illustrative sketches of the masked unit prediction and null-embedding components are given below the record.)
- Author(s)
- Ahn, Youngdo; Chae, Jongwook; Shin, Jong Won
- Issued Date
- 2025-02
- Type
- Article
- DOI
- 10.1109/LSP.2025.3537949
- URI
- https://scholar.gist.ac.kr/handle/local/9047
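- Illustrative sketches

Two components described in the abstract lend themselves to short illustrations: the masked unit prediction loss and the classifier-free-guidance-inspired use of a learnable null video embedding. The PyTorch sketches below are minimal, hypothetical readings of these ideas rather than the authors' implementation; all module names, dimensions, and hyperparameters are assumptions, and the full system would additionally condition the unit predictor on video and text features.

The first sketch shows a BERT-style masked unit prediction objective: some positions in a sequence of discrete speech units are replaced with a mask token, and the predictor is trained with cross-entropy to recover them from the remaining units.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedUnitPredictor(nn.Module):
    """Hypothetical unit predictor: discrete speech units are partially masked
    and the model recovers them from the unmasked context."""

    def __init__(self, num_units=1000, dim=256, num_layers=4, num_heads=4):
        super().__init__()
        self.mask_id = num_units                      # extra index used as [MASK]
        self.unit_embed = nn.Embedding(num_units + 1, dim)
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(dim, num_units)

    def forward(self, units, mask):
        # units: (B, T) ground-truth discrete units; mask: (B, T) bool, True = masked
        inp = units.masked_fill(mask, self.mask_id)
        return self.head(self.encoder(self.unit_embed(inp)))  # (B, T, num_units)


def masked_unit_loss(model, units, mask_ratio=0.15):
    """Cross-entropy computed only on the masked positions."""
    mask = torch.rand(units.shape, device=units.device) < mask_ratio
    logits = model(units, mask)
    return F.cross_entropy(logits[mask], units[mask])
```

The second sketch shows one plausible reading of the classifier-free-guidance-style adjustment: unit logits conditioned on the real video features are combined with logits obtained by substituting a learnable null embedding for the video.

```python
import torch
import torch.nn as nn

class NullVideoGuidance(nn.Module):
    """Hypothetical CFG-style adjustment: combine logits conditioned on video
    features with logits produced when the video is replaced by a learnable
    null embedding."""

    def __init__(self, predictor, video_dim, guidance_scale=1.0):
        super().__init__()
        self.predictor = predictor                    # (text_feats, video_feats) -> logits
        self.null_video = nn.Parameter(torch.zeros(1, 1, video_dim))
        self.w = guidance_scale

    def forward(self, text_feats, video_feats):
        logits_cond = self.predictor(text_feats, video_feats)
        logits_null = self.predictor(text_feats, self.null_video.expand_as(video_feats))
        # CFG-style combination of the two conditional distributions in logit space
        return (1.0 + self.w) * logits_cond - self.w * logits_null
```

The guidance scale, the logit-space combination rule, and where the null embedding is injected are illustrative assumptions; the abstract states only that the unit predictor's probability distribution is modified using a learnable null embedding for video.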