Speech Synthesis Based on Multi-Speaker Adaptation and Phonetic Synchronization for Automatic Dubbing
- Abstract
- The advancement of text-to-speech (TTS) technology has enabled the generation of personalized voices for new speakers, leading to a significant increase in demand for automatic dubbing. Effective automatic dubbing requires both generating high-quality personalized voices and ensuring synchronization across different languages.
This paper proposes an optimization method for low-resource speaker adaptation using the end-to-end Variational Inference with Adversarial Learning for Text-to-Speech (VITS) model to address the challenge of synthesizing personalized voices. Our approach optimizes critical components of the VITS model, such as the encoder modules, flow network, generator, speaker embedding layers, and projection layers, using parameter-efficient modules such as low-rank adaptation (LoRA), adapters, and conditional layer normalization (CLN). Additionally, we extend the proposed method to cross-lingual models, demonstrating the capability to generate personalized voices across different languages using the YourTTS model.
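As a rough illustration of the parameter-efficient adaptation described above, the sketch below wraps a frozen linear layer (e.g., a projection layer of a VITS-style model) with a trainable low-rank correction. The module name `LoRALinear`, the helper `add_lora_to_projections`, and the rank/alpha values are illustrative assumptions, not the exact modules or hyperparameters used in the thesis.

```python
# Minimal sketch of low-rank adaptation (LoRA) applied to a linear layer.
# Names (LoRALinear, rank, alpha) are illustrative assumptions only.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        # Freeze the pretrained weights; only the low-rank factors are trained.
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # the adapter starts as a zero update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pretrained path plus a trainable low-rank correction.
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


def add_lora_to_projections(model: nn.Module, rank: int = 8) -> nn.Module:
    """Recursively replace every nn.Linear with a LoRA-wrapped version."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, LoRALinear(child, rank=rank))
        else:
            add_lora_to_projections(child, rank)
    return model
```

In such a setup only the low-rank factors (and, analogously, adapter or CLN parameters) would be updated during speaker adaptation, which is how the fraction of trainable parameters can be kept to roughly a tenth of the full model.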
Moreover, synchronization in automatic dubbing is achieved through techniques such as isochrony, which matches speech rate and pauses, and lip synchronization. Rather than targeting language pairs with the same word order, we propose a method for synchronizing languages with different word orders. Our method identifies pauses by detecting pause segments in the source speech based on energy changes, determines the speaking rate of the target speech based on the desired speed, and improves lip synchronization through phonetic alignment of vowel positions between the source and target texts.
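A minimal sketch of the energy-based pause-detection step is shown below, assuming 16 kHz mono audio loaded with librosa; the frame sizes, dB threshold, and minimum pause duration are illustrative assumptions rather than the settings used in the thesis.

```python
# Sketch: detect low-energy (pause) segments in the source speech.
# Thresholds and frame parameters are assumptions for illustration only.
import librosa
import numpy as np


def detect_pauses(wav_path: str,
                  frame_length: int = 1024,
                  hop_length: int = 256,
                  threshold_db: float = -40.0,
                  min_pause_sec: float = 0.2):
    """Return (start, end) times, in seconds, of low-energy segments."""
    y, sr = librosa.load(wav_path, sr=16000)
    # Short-time RMS energy per frame, converted to dB relative to the peak.
    rms = librosa.feature.rms(y=y, frame_length=frame_length,
                              hop_length=hop_length)[0]
    energy_db = librosa.amplitude_to_db(rms, ref=np.max)
    silent = energy_db < threshold_db

    pauses, start = [], None
    for i, is_silent in enumerate(silent):
        t = i * hop_length / sr
        if is_silent and start is None:
            start = t
        elif not is_silent and start is not None:
            if t - start >= min_pause_sec:
                pauses.append((start, t))
            start = None
    end_time = len(silent) * hop_length / sr
    if start is not None and end_time - start >= min_pause_sec:
        pauses.append((start, end_time))
    return pauses
```

The detected pause boundaries delimit the speech segments whose durations constrain the target-language speaking rate, for example by dividing the target syllable count by the non-pause duration of the corresponding source segment.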
The results are based on experiments using the LibriTTS, VCTK, and Korean multi-speaker speech synthesis (KMSSS) datasets, movie data, and Korean-English lip reading data. The low-resource multi-speaker adaptation experiments demonstrate that the proposed method achieves results comparable to fully fine-tuned models while adjusting only 10% of the model parameters for personalized speech. Furthermore, the phonetic synchronization experiments show that the proposed method achieves higher performance than voice actors for Korean-to-English dubbing, with a Lip Sync Error - Distance (LSE-D) score of 10.56 and a Lip Sync Error - Confidence (LSE-C) score of 1.55, and slightly lower performance for English-to-Korean dubbing, with an LSE-D score of 11.13 and an LSE-C score of 1.39.
- Author(s)
- Changi Hong
- Issued Date
- 2024
- Type
- Thesis
- URI
- https://scholar.gist.ac.kr/handle/local/19689
- Availability and License
-
- File List
-