
A study on speech signal processing to improve speech quality and recognition accuracy

Abstract
As the need for devices capable of duplex communication and human-computer interaction rapidly increases in the real world, research on speech communication and speech recognition has attracted considerable attention. However, interfering signals such as background noise, reverberation, and acoustic echo encountered in real environments enter the microphone along with the speech, degrading speech quality and reducing speech recognition accuracy. To address this problem, speech signal processing modules such as single- and multi-channel speech enhancement, linear acoustic echo cancellation, and residual echo suppression are used to improve speech quality and speech recognition accuracy. Although these modules have shown good performance in terms of speech quality and speech recognition accuracy, each suffers from problems such as spatial aliasing, background noise, reverberation, a large number of parameters, or high computational complexity. In this dissertation, we propose improved speech processing modules that enhance speech quality and speech recognition accuracy in the presence of noise, reverberation, and acoustic echo. The modules improved in this dissertation are as follows: (i) an acoustic echo cancellation module that removes the loudspeaker output signal that is fed back into the microphone, (ii) a sound source localization module that estimates the speaker's direction from the delay information of a microphone array under background noise and reverberation, and (iii) a speech enhancement module that removes background noise while preserving the speech signal.
First, in the sound source localization module, we propose a method that considers all phase-wrapping candidates to alleviate the spatial aliasing problem caused by long spacing between the microphones. The interchannel phase difference, widely used as a spatial cue, can be wrapped in the high-frequency band depending on the distance between the microphones and the direction of the sound source. This can lead to incorrect direction-of-arrival estimates, which in turn degrade the performance of multi-channel speech enhancement methods that rely on direction-of-arrival information. To solve this problem, we propose a probabilistic voting method in which each frequency contributes equally to the direction-of-arrival histogram while all phase-wrapping candidates that may occur in the high-frequency band are taken into account. In addition, by introducing a signal-to-noise-ratio-based mask and a coherence-based mask, we select the interchannel phase differences that are less corrupted by background noise and reverberation. As a result, we can extract direction-of-arrival information that is robust to spatial aliasing, background noise, and reverberation, and this information can be used by the multi-channel speech enhancement module.
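To make the voting scheme concrete, the following NumPy sketch distributes one vote per frequency bin over all feasible phase-wrapping candidates for a simplified two-microphone, far-field model. The function name, parameters, and uniform vote weighting are illustrative assumptions, not the dissertation's implementation; the SNR- and coherence-based masks would enter as per-bin weights on the votes.

import numpy as np

def doa_histogram(ipd, freqs, mic_dist, c=343.0, n_bins=181):
    """Accumulate a DOA histogram from interchannel phase differences (IPDs),
    spreading each frequency's single vote over all phase-wrapping candidates.

    ipd      : observed wrapped phase differences in [-pi, pi), shape (F,)
    freqs    : frequency of each bin in Hz, shape (F,)
    mic_dist : spacing between the two microphones in metres
    """
    hist = np.zeros(n_bins)                       # angles 0..180 degrees
    angles = np.linspace(0.0, 180.0, n_bins)
    for phi, f in zip(ipd, freqs):
        if f <= 0:
            continue
        # Largest wrapping order physically possible at this frequency:
        # |phi + 2*pi*k| <= 2*pi*f*mic_dist/c must hold for a real source.
        k_max = int(np.floor(f * mic_dist / c + 0.5))
        cands = []
        for k in range(-k_max, k_max + 1):
            cos_theta = (phi + 2 * np.pi * k) * c / (2 * np.pi * f * mic_dist)
            if -1.0 <= cos_theta <= 1.0:          # keep only feasible candidates
                cands.append(np.degrees(np.arccos(cos_theta)))
        if not cands:
            continue
        w = 1.0 / len(cands)                      # each frequency contributes one vote in total;
        for theta in cands:                       # a per-bin mask weight could scale w here
            hist[int(round(theta / 180.0 * (n_bins - 1)))] += w
    return angles, hist

The peak of the returned histogram gives the DOA estimate; splitting each bin's vote equally among its candidates is what keeps high-frequency bins from dominating despite their multiple wrapping hypotheses.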
Second, in the acoustic echo cancellation module, we propose a residual echo suppression method that removes the residual echo caused by misalignment of the linear echo canceller or by non-linear components that linear acoustic echo cancellation cannot model. To model the residual echo, we consider the harmonic distortion caused by external vibration of the device driven by the loudspeaker output. The method also estimates the residual echo by taking temporal correlations into account, using the past estimates of the linear acoustic echo, the past estimates of the residual echo, and the microphone signal. Based on the output of a double-talk detector, the residual echo is estimated separately for single-talk and double-talk situations, which reduces distortion of the estimated target speech signal and removes the acoustic echo effectively.
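As an illustration of the temporal-context idea, the sketch below estimates the residual-echo magnitude in one frequency bin by regression over past linear-echo, microphone, and previously estimated residual-echo frames, switching weight sets on a double-talk flag. All names and the linear-regression form are assumptions made for illustration; the dissertation's estimator and its harmonic-distortion features are not reproduced here.

import numpy as np

def estimate_residual_echo(lin_echo_mag, mic_mag, dtd, w_single, w_double, context=5):
    """Frame-wise residual-echo magnitude estimate for a single frequency bin.

    lin_echo_mag, mic_mag : magnitude spectra over time, shape (T,)
    dtd                   : double-talk flags, shape (T,), True where double talk
    w_single, w_double    : regression weights, shape (3*context,), one set per talk state
    """
    T = len(mic_mag)
    res = np.zeros(T)
    for t in range(context, T):
        # Temporal context: past linear-echo estimates, past microphone frames,
        # and the method's own past residual-echo estimates (recursive features).
        feat = np.concatenate([lin_echo_mag[t - context:t],
                               mic_mag[t - context:t],
                               res[t - context:t]])
        w = w_double if dtd[t] else w_single   # separate models for single/double talk
        res[t] = max(float(w @ feat), 0.0)     # magnitudes are non-negative
    return res

Switching weight sets on the double-talk flag is what lets the suppressor stay aggressive during single talk while avoiding near-end speech distortion during double talk.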
Third, we propose two deep-learning-based single-channel speech enhancement modules: (i) a multilayer-perceptron-based cgMLP-SE module that combines a convolutional token mixing module with a squeeze-and-excitation network, and (ii) a WavLM-based speech enhancement module. The cgMLP-SE model, which has low computational complexity and a small model size, is a gMLP-based architecture whose convolutional token mixing modules and squeeze-and-excitation network exploit both local and global contextual information, as in the Conformer, which has shown good performance across various speech processing tasks. Although WavLM is a self-supervised speech representation model that has achieved state-of-the-art performance on various speech signal processing tasks, its effectiveness for speech enhancement had not been demonstrated. We therefore propose two WavLM modifications for speech enhancement: a regression-based training objective and a noise mixing strategy during pre-training. By combining these WavLM modifications with two high-quality speech enhancement systems (LSTM and Conformer), we show notable improvements in both speech quality and speech recognition accuracy.
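The two building blocks named for cgMLP-SE can be sketched in PyTorch as follows. This is a minimal illustration of a depthwise-convolutional token mixer and a squeeze-and-excitation gate over (batch, time, channels) features, with arbitrarily chosen hyperparameters; it is not the dissertation's cgMLP-SE architecture.

import torch
import torch.nn as nn

class ConvTokenMixer(nn.Module):
    """Depthwise 1-D convolution mixing nearby frames: the local-context path."""
    def __init__(self, channels, kernel_size=15):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels)

    def forward(self, x):                  # x: (B, T, C)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

class SqueezeExcite(nn.Module):
    """Channel reweighting from a global time average: the global-context path."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                  # x: (B, T, C)
        scale = self.fc(x.mean(dim=1))     # squeeze: average over all frames -> (B, C)
        return x * scale.unsqueeze(1)      # excite: gate each channel

# Example: x = torch.randn(4, 100, 64); y = SqueezeExcite(64)(ConvTokenMixer(64)(x))

Pairing the two is what gives the model Conformer-like access to both local and global context without self-attention, which keeps the parameter count and computational cost low.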
Author(s)
Hyungchan Song
Issued Date
2023
Type
Thesis
URI
https://scholar.gist.ac.kr/handle/local/18934
Access and License
  • Access Type: Open
File List
  • No related files are available.
