
A study on speech signal processing to improve speech quality and recognition accuracy

Abstract
As the need for devices capable of duplex communication and human-computer interaction rapidly increases in the real world, research on speech communication and speech recognition has attracted considerable attention. However, interfering signals such as background noise, reverberation, and acoustic echo encountered in real environments enter the microphone along with the speech, degrading speech quality and reducing speech recognition accuracy. To address this problem, speech signal processing modules such as single- and multi-channel speech enhancement, linear acoustic echo cancellation, and residual echo suppression are used to improve speech quality and speech recognition accuracy. Although these modules have shown good performance in terms of speech quality and speech recognition accuracy, each suffers from problems such as spatial aliasing, background noise, reverberation, a large number of parameters, or high computational complexity. In this dissertation, we propose improved speech processing modules that enhance speech quality and speech recognition accuracy in the presence of noise, reverberation, and acoustic echo. The modules improved in this dissertation are as follows: (i) an acoustic echo cancellation module that removes the loudspeaker output signal that is fed back into the microphone, (ii) a sound source localization module that estimates the speaker's direction from the delay information of a microphone array under background noise and reverberation, and (iii) a speech enhancement module that removes background noise while preserving the speech signal.
First, in the sound source localization module, we propose a method that considers all phase-wrapping candidates to alleviate the spatial aliasing problem caused by long spacing between the microphones. The interchannel phase difference, widely used as a spatial cue, can be wrapped in the high-frequency band depending on the distance between the microphones and the direction of the sound source. This can lead to incorrect direction-of-arrival estimates, which in turn degrade the performance of multi-channel speech enhancement methods that rely on direction-of-arrival information. To solve this problem, we propose a probabilistic voting method in which each frequency contributes equally to the direction-of-arrival histogram while all phase-wrapping candidates that may occur in the high-frequency band are taken into account. In addition, by introducing a signal-to-noise-ratio-based mask and a coherence-based mask, we select the interchannel phase differences that are less corrupted by background noise and reverberation. As a result, we can extract direction-of-arrival information that is robust to spatial aliasing, background noise, and reverberation, and this information can be used by the multi-channel speech enhancement module.
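To make the voting scheme concrete, the following NumPy sketch distributes one vote per frequency bin over all feasible phase-wrapping candidates for a simplified two-microphone, far-field model. The function name, parameters, and uniform vote weighting are illustrative assumptions, not the dissertation's implementation; the SNR- and coherence-based masks would enter as per-bin weights on the votes.

import numpy as np

def doa_histogram(ipd, freqs, mic_dist, c=343.0, n_bins=181):
    """Accumulate a DOA histogram from interchannel phase differences (IPDs),
    spreading each frequency's single vote over all phase-wrapping candidates.

    ipd      : observed wrapped phase differences in [-pi, pi), shape (F,)
    freqs    : frequency of each bin in Hz, shape (F,)
    mic_dist : spacing between the two microphones in metres
    """
    hist = np.zeros(n_bins)                       # angles 0..180 degrees
    angles = np.linspace(0.0, 180.0, n_bins)
    for phi, f in zip(ipd, freqs):
        if f <= 0:
            continue
        # Largest wrapping order physically possible at this frequency:
        # |phi + 2*pi*k| <= 2*pi*f*mic_dist/c must hold for a real source.
        k_max = int(np.floor(f * mic_dist / c + 0.5))
        cands = []
        for k in range(-k_max, k_max + 1):
            cos_theta = (phi + 2 * np.pi * k) * c / (2 * np.pi * f * mic_dist)
            if -1.0 <= cos_theta <= 1.0:          # keep only feasible candidates
                cands.append(np.degrees(np.arccos(cos_theta)))
        if not cands:
            continue
        w = 1.0 / len(cands)                      # each frequency contributes one vote in total;
        for theta in cands:                       # a per-bin mask weight could scale w here
            hist[int(round(theta / 180.0 * (n_bins - 1)))] += w
    return angles, hist

The peak of the returned histogram gives the DOA estimate; splitting each bin's vote equally among its candidates is what keeps high-frequency bins from dominating despite their multiple wrapping hypotheses.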
Second, in the acoustic echo cancellation module, we propose a residual echo suppression method that removes the residual echo caused by misalignment of the linear echo canceller or by non-linear components that linear acoustic echo cancellation cannot model. To model the residual echo, we consider the harmonic distortion caused by external vibration of the device driven by the loudspeaker output. The method also estimates the residual echo by taking temporal correlations into account, using the past estimates of the linear acoustic echo, the past estimates of the residual echo, and the microphone signal. Based on the output of a double-talk detector, the residual echo is estimated separately for single-talk and double-talk situations, which reduces distortion of the estimated target speech signal and removes the acoustic echo effectively.
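As an illustration of the temporal-context idea, the sketch below estimates the residual-echo magnitude in one frequency bin by regression over past linear-echo, microphone, and previously estimated residual-echo frames, switching weight sets on a double-talk flag. All names and the linear-regression form are assumptions made for illustration; the dissertation's estimator and its harmonic-distortion features are not reproduced here.

import numpy as np

def estimate_residual_echo(lin_echo_mag, mic_mag, dtd, w_single, w_double, context=5):
    """Frame-wise residual-echo magnitude estimate for a single frequency bin.

    lin_echo_mag, mic_mag : magnitude spectra over time, shape (T,)
    dtd                   : double-talk flags, shape (T,), True where double talk
    w_single, w_double    : regression weights, shape (3*context,), one set per talk state
    """
    T = len(mic_mag)
    res = np.zeros(T)
    for t in range(context, T):
        # Temporal context: past linear-echo estimates, past microphone frames,
        # and the method's own past residual-echo estimates (recursive features).
        feat = np.concatenate([lin_echo_mag[t - context:t],
                               mic_mag[t - context:t],
                               res[t - context:t]])
        w = w_double if dtd[t] else w_single   # separate models for single/double talk
        res[t] = max(float(w @ feat), 0.0)     # magnitudes are non-negative
    return res

Switching weight sets on the double-talk flag is what lets the suppressor stay aggressive during single talk while avoiding near-end speech distortion during double talk.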
Third, we propose two deep-learning-based single-channel speech enhancement modules: (i) a multilayer-perceptron-based cgMLP-SE module that combines a convolutional token mixing module with a squeeze-and-excitation network, and (ii) a WavLM-based speech enhancement module. The cgMLP-SE model, which has low computational complexity and a small model size, is a gMLP-based architecture whose convolutional token mixing modules and squeeze-and-excitation network exploit both local and global contextual information, as in the Conformer, which has shown good performance across various speech processing tasks. Although WavLM is a self-supervised speech representation model that has achieved state-of-the-art performance on various speech signal processing tasks, its effectiveness for speech enhancement had not been demonstrated. We therefore propose two WavLM modifications for speech enhancement: a regression-based training objective and a noise mixing strategy during pre-training. By combining these WavLM modifications with two high-quality speech enhancement systems (LSTM and Conformer), we show notable improvements in both speech quality and speech recognition accuracy.
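The two building blocks named for cgMLP-SE can be sketched in PyTorch as follows. This is a minimal illustration of a depthwise-convolutional token mixer and a squeeze-and-excitation gate over (batch, time, channels) features, with arbitrarily chosen hyperparameters; it is not the dissertation's cgMLP-SE architecture.

import torch
import torch.nn as nn

class ConvTokenMixer(nn.Module):
    """Depthwise 1-D convolution mixing nearby frames: the local-context path."""
    def __init__(self, channels, kernel_size=15):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels)

    def forward(self, x):                  # x: (B, T, C)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

class SqueezeExcite(nn.Module):
    """Channel reweighting from a global time average: the global-context path."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                  # x: (B, T, C)
        scale = self.fc(x.mean(dim=1))     # squeeze: average over all frames -> (B, C)
        return x * scale.unsqueeze(1)      # excite: gate each channel

# Example: x = torch.randn(4, 100, 64); y = SqueezeExcite(64)(ConvTokenMixer(64)(x))

Pairing the two is what gives the model Conformer-like access to both local and global context without self-attention, which keeps the parameter count and computational cost low.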
Author(s)
Hyungchan Song
Issued Date
2023
Type
Thesis
URI
https://scholar.gist.ac.kr/handle/local/18934
Access and License
  • Access Type: Open
File List
  • No related files are available.
