
Training Strategies for End-to-End Noise-Robust Speech Recognition

Abstract
Automatic speech recognition (ASR) systems convert speech audio signals into text and are widely used in various applications. Traditional ASR consists of an acoustic model (AM) for extracting speech features and a language model (LM) for grammar and lexicon information. Recently, end-to-end (E2E) ASR models based on neural networks (NNs) have outperformed modular architectures. However, these models often perform poorly in low signal-to-noise ratio (SNR) conditions, as they are typically developed in high SNR environments. Speech enhancement (SE) or feature enhancement modules have been studied to improve low SNR performance, but they can introduce artifacts that increase error rates. Alternatively, multi-condition training (MCT) and noise-aware training (NAT) incorporate acoustic noise as a training condition. While MCT is simple and efficient, it has limitations in low SNR conditions. Joint training of SE and ASR models has been proposed to address these issues, but conflicting gradients and frame mismatch problems make performance improvement challenging. This dissertation proposes training approaches that mitigate these joint training problems and enhance ASR performance.

First, to prevent the distinct tasks of the SE and ASR models, which are two separate models, from conflicting with each other, a training approach that separates the training procedure is proposed. The proposed approach consists of two steps. In the first step, the parameters of the ASR model are frozen, and only the parameters of the SE model are updated using an objective function for speech quality. During this process, a regularization term based on feature vectors extracted from the ASR encoder is applied. In the second step, the parameters of both models are updated using the objective functions of SE and ASR.
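As a concrete illustration, a minimal PyTorch-style sketch of this two-step procedure is given below. All names (se_model, asr_model, asr_model.encoder, alpha, lam) are hypothetical stand-ins, and the MSE and asr_model.loss calls are placeholders for whatever speech-quality and ASR objectives the dissertation actually uses.

    import torch
    import torch.nn.functional as F

    def step1_se_only(se_model, asr_model, noisy, clean, alpha=1.0):
        """Step 1: freeze the ASR model; update only the SE model."""
        for p in asr_model.parameters():
            p.requires_grad = False                      # ASR parameters stay fixed
        enhanced = se_model(noisy)
        loss_se = F.mse_loss(enhanced, clean)            # speech-quality objective (placeholder)
        # Regularization: match ASR encoder features of enhanced and clean speech
        with torch.no_grad():
            feat_clean = asr_model.encoder(clean)
        feat_enh = asr_model.encoder(enhanced)           # gradients reach only the SE model
        return loss_se + alpha * F.mse_loss(feat_enh, feat_clean)

    def step2_joint(se_model, asr_model, noisy, clean, transcript, lam=0.5):
        """Step 2: update both models with the SE and ASR objectives."""
        for p in asr_model.parameters():
            p.requires_grad = True                       # unfreeze the ASR model
        enhanced = se_model(noisy)
        loss_se = F.mse_loss(enhanced, clean)
        loss_asr = asr_model.loss(enhanced, transcript)  # e.g., a CTC/attention loss
        return loss_se + lam * loss_asr

Each returned loss would be backpropagated and applied by the respective optimizer; the point of the split is that step 1 lets the SE model adapt to the frozen ASR encoder before the two objectives are optimized jointly.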

Second, to address the conflicting gradient and frame mismatch problems, an approach that interprets the pipeline consisting of the SE and ASR models as a teacher-student model is proposed. That is, the ASR model is interpreted as the teacher model to leverage its linguistic knowledge, and the SE model is trained using this fine-grained information. In addition, to transfer frame-wise linguistic information, an acoustic tokenizer is employed as a surrogate model. The acoustic tokenizer is optimized to predict cluster indices obtained by k-means clustering of the latent vectors of the ASR encoder. The optimized acoustic tokenizer and the ASR encoder, acting as teacher models, transfer linguistic information to the SE model, whose parameters are updated accordingly.
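The sketch below illustrates this teacher-student setup under the same caveats: tokenizer, asr_encoder, centroids, and se_model are hypothetical names, the k-means centroids are assumed to be fitted beforehand on clean-speech encoder latents, and the specific loss terms are illustrative choices rather than the dissertation's exact formulation.

    import torch
    import torch.nn.functional as F

    def assign_clusters(latents, centroids):
        """Nearest-centroid assignment: (B, T, D) latents, (K, D) centroids -> (B, T) IDs."""
        dists = torch.cdist(latents, centroids.unsqueeze(0))        # (B, T, K)
        return dists.argmin(dim=-1)

    def train_tokenizer_step(tokenizer, asr_encoder, centroids, speech):
        """Fit the tokenizer to predict frame-wise k-means cluster IDs of encoder latents."""
        with torch.no_grad():
            targets = assign_clusters(asr_encoder(speech), centroids)   # (B, T)
        logits = tokenizer(speech)                                      # (B, T, K)
        return F.cross_entropy(logits.transpose(1, 2), targets)

    def distill_to_se_step(se_model, tokenizer, asr_encoder, noisy, clean):
        """Teachers (tokenizer + ASR encoder) pass frame-wise linguistic info to the SE model."""
        enhanced = se_model(noisy)
        with torch.no_grad():
            teacher_feat = asr_encoder(clean)
            teacher_ids = tokenizer(clean).argmax(dim=-1)               # frame-wise token targets
        # Only the SE model's optimizer updates parameters from these losses
        loss_feat = F.mse_loss(asr_encoder(enhanced), teacher_feat)
        loss_tok = F.cross_entropy(tokenizer(enhanced).transpose(1, 2), teacher_ids)
        return loss_feat + loss_tok

Because both targets are computed per frame from the clean signal, the SE model receives frame-aligned supervision, which is how this setup sidesteps the frame mismatch between the SE and ASR objectives.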

Finally, to mitigate the limitations of the cross-entropy loss used in the acoustic tokenizer, a pairwise distance-based loss function is proposed. In addition, to enhance the contextual representation, a contrastive learning-based relational representation between acoustic tokens and their sequences is proposed. First, samples belonging to the same/different clusters in the acoustic tokenizer are defined as positive/negative samples, and a cluster-based pairwise distance loss is applied to optimize the acoustic tokenizer. Additionally, for contextual representation, contrastive learning is used to match the relationship between acoustic tokens and the token sequences extracted from the acoustic tokenizer.
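Both losses can be sketched as follows. The hinge-margin form of the pairwise loss and the InfoNCE-style contrastive term are illustrative assumptions rather than the dissertation's exact formulations, and all tensor shapes and names are hypothetical.

    import torch
    import torch.nn.functional as F

    def cluster_pairwise_distance_loss(embed, cluster_ids, margin=1.0):
        """Pull same-cluster frame embeddings together; push different clusters apart."""
        # embed: (N, D) frame embeddings; cluster_ids: (N,) k-means cluster IDs
        dist = torch.cdist(embed, embed)                         # (N, N) pairwise distances
        same = cluster_ids.unsqueeze(0) == cluster_ids.unsqueeze(1)
        pos = dist[same].mean()                                  # positives: same cluster (self-pairs add zero)
        neg = F.relu(margin - dist[~same]).mean()                # negatives: hinge on a margin
        return pos + neg

    def token_sequence_contrastive(token_emb, seq_emb, temperature=0.1):
        """InfoNCE-style loss matching pooled acoustic tokens to their own sequence."""
        # token_emb, seq_emb: (B, D); matched pairs lie on the diagonal of the score matrix
        scores = F.normalize(token_emb, dim=-1) @ F.normalize(seq_emb, dim=-1).T
        labels = torch.arange(token_emb.size(0), device=token_emb.device)
        return F.cross_entropy(scores / temperature, labels)

Unlike plain cross-entropy, which scores each frame's cluster prediction independently, the pairwise term shapes the embedding geometry directly, and the contrastive term ties individual tokens to their surrounding sequence for contextual representation.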

The proposed training approaches were evaluated for ASR and SE performance using simulated noisy environments and a real-world audio dataset. For ASR, the proposed training approaches for noise-robust ASR achieved lower word error rates (WERs) than conventional training approaches. Moreover, for SE, the proposed approaches that interpret the pipeline as a teacher-student model achieved improved results on speech quality-related metrics compared to a separately trained SE model. These results indicate that the conflicting gradient and frame mismatch problems were addressed. Furthermore, comprehensive performance evaluations were conducted to verify the effectiveness of the proposed training approaches across different SE and ASR model architectures. The proposed training approaches consistently outperformed conventional training approaches for these different SE and ASR model architectures as well.
Author(s)
Geon Woo Lee
Issued Date
2024
Type
Thesis
URI
https://scholar.gist.ac.kr/handle/local/19852
Alternative Author(s)
이건우
Department
AI Graduate School
Advisor
Kim, Hong Kook
Degree
Doctor
Appears in Collections:
Department of AI Convergence > 4. Theses(Ph.D)
