Speech Synthesis Based on Multi-Speaker Adaptation and Phonetic Synchronization for Automatic Dubbing
- Abstract
- The advancement of text-to-speech (TTS) technology has enabled the generation of personalized voices for new speakers, leading to a significant increase in demand for automatic dubbing. Effective automatic dubbing requires both generating high-quality personalized voices and ensuring synchronization across different languages.
This paper proposes an optimization method for low-resource speaker adaptation using the end-to-end Variational Inference with Adversarial Learning for Text-to-Speech (VITS) model to address the challenge of synthesizing personalized voices. Our approach optimizes critical components of the VITS model, such as the encoder modules, flow network, generator, speaker embedding layers, and projection layers, using parameter-efficient modules such as low-rank adaptation (LoRA), adapters, and conditional layer normalization (CLN). Additionally, we extend the proposed method to cross-lingual models, demonstrating the capability to generate personalized voices across different languages using the YourTTS model.
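As a rough illustration of the parameter-efficient adaptation described above, the sketch below wraps a frozen linear layer (e.g., a projection layer of a VITS-style model) with a trainable low-rank correction. The module name `LoRALinear`, the helper `add_lora_to_projections`, and the rank/alpha values are illustrative assumptions, not the exact modules or hyperparameters used in the thesis.

```python
# Minimal sketch of low-rank adaptation (LoRA) applied to a linear layer.
# Names (LoRALinear, rank, alpha) are illustrative assumptions only.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        # Freeze the pretrained weights; only the low-rank factors are trained.
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # the adapter starts as a zero update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pretrained path plus a trainable low-rank correction.
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


def add_lora_to_projections(model: nn.Module, rank: int = 8) -> nn.Module:
    """Recursively replace every nn.Linear with a LoRA-wrapped version."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, LoRALinear(child, rank=rank))
        else:
            add_lora_to_projections(child, rank)
    return model
```

In such a setup only the low-rank factors (and, analogously, adapter or CLN parameters) would be updated during speaker adaptation, which is how the fraction of trainable parameters can be kept to roughly a tenth of the full model.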
Moreover, synchronization in automatic dubbing is achieved through techniques such as isochrony, which matches speech rate and pauses, and lip synchronization. Rather than targeting language pairs with the same word order, we propose a method for synchronizing languages with different word orders. Our method identifies pauses by detecting pause segments in the source speech based on energy changes, determines the speaking rate of the target speech based on the desired speed, and improves lip synchronization through phonetic alignment of vowel positions between the source and target texts.
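A minimal sketch of the energy-based pause-detection step is shown below, assuming 16 kHz mono audio loaded with librosa; the frame sizes, dB threshold, and minimum pause duration are illustrative assumptions rather than the settings used in the thesis.

```python
# Sketch: detect low-energy (pause) segments in the source speech.
# Thresholds and frame parameters are assumptions for illustration only.
import librosa
import numpy as np


def detect_pauses(wav_path: str,
                  frame_length: int = 1024,
                  hop_length: int = 256,
                  threshold_db: float = -40.0,
                  min_pause_sec: float = 0.2):
    """Return (start, end) times, in seconds, of low-energy segments."""
    y, sr = librosa.load(wav_path, sr=16000)
    # Short-time RMS energy per frame, converted to dB relative to the peak.
    rms = librosa.feature.rms(y=y, frame_length=frame_length,
                              hop_length=hop_length)[0]
    energy_db = librosa.amplitude_to_db(rms, ref=np.max)
    silent = energy_db < threshold_db

    pauses, start = [], None
    for i, is_silent in enumerate(silent):
        t = i * hop_length / sr
        if is_silent and start is None:
            start = t
        elif not is_silent and start is not None:
            if t - start >= min_pause_sec:
                pauses.append((start, t))
            start = None
    end_time = len(silent) * hop_length / sr
    if start is not None and end_time - start >= min_pause_sec:
        pauses.append((start, end_time))
    return pauses
```

The detected pause boundaries delimit the speech segments whose durations constrain the target-language speaking rate, for example by dividing the target syllable count by the non-pause duration of the corresponding source segment.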
The results are based on experiments using the LibriTTS, VCTK, and Korean multi-speaker speech synthesis (KMSSS) datasets, movie data, and Korean-English lip reading data. The low-resource multi-speaker adaptation experiments demonstrate that the proposed method achieves results comparable to fully fine-tuned models while adjusting only 10% of the model parameters for personalized speech. Furthermore, the phonetic synchronization experiments show that the proposed method achieves higher performance than voice actors for Korean-to-English dubbing, with a Lip Sync Error - Distance (LSE-D) score of 10.56 and a Lip Sync Error - Confidence (LSE-C) score of 1.55, and slightly lower performance for English-to-Korean dubbing, with an LSE-D score of 11.13 and an LSE-C score of 1.39.
- Author(s)
- Changi Hong
- Issued Date
- 2024
- Type
- Thesis
- URI
- https://scholar.gist.ac.kr/handle/local/19689
- Availability and License
-
- File List
-