OAK

Phase-Aware Processing for Speech Enhancement, Reinforcement, and Sound Source Localization

Author(s)
Junhyeong Pak
Type
Thesis
Degree
Doctor
Department
Graduate School, School of Electrical Engineering and Computer Science
Advisor
Shin, Jong Won
Abstract
Phase information plays an important role in various applications related to speech and audio processing. For speech enhancement, the phase is essential, together with the magnitude, for reconstructing the speech signal. In applications that use a multi-microphone array, the direction-of-arrival (DoA) of a directional sound source can be derived in the time-frequency (TF) domain from the interchannel phase difference (IPD), so the phase also serves as important spatial information.
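
As a minimal sketch of how an IPD maps to a DoA, the following assumes a two-microphone array, a far-field source, and a free-field model in which the IPD at frequency f equals 2*pi*f*d*sin(theta)/c. The function names, the microphone spacing, and the speed of sound are illustrative assumptions and are not taken from the thesis.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def interchannel_phase_difference(stft_ref, stft_other):
    """IPD per TF bin (radians, principal value) from two complex STFT arrays."""
    return np.angle(stft_ref * np.conj(stft_other))


def doa_from_ipd(ipd, freq_hz, mic_distance_m):
    """Far-field DoA in degrees from broadside, assuming no spatial aliasing."""
    sin_theta = ipd * SPEED_OF_SOUND / (2.0 * np.pi * freq_hz * mic_distance_m)
    # Clip so that noise cannot push the argument outside the domain of arcsin.
    return np.degrees(np.arcsin(np.clip(sin_theta, -1.0, 1.0)))


# Hypothetical example: one TF bin at 1 kHz, 4 cm spacing, source at 30 degrees.
freq_hz, mic_distance_m, true_doa_deg = 1000.0, 0.04, 30.0
delay_s = mic_distance_m * np.sin(np.radians(true_doa_deg)) / SPEED_OF_SOUND
bin_ref = np.exp(1j * 0.3)  # arbitrary phase at the reference microphone
bin_other = bin_ref * np.exp(-2j * np.pi * freq_hz * delay_s)  # delayed channel
ipd = interchannel_phase_difference(bin_ref, bin_other)
estimated_doa_deg = doa_from_ipd(ipd, freq_hz, mic_distance_m)
```

Under this model the mapping is unambiguous only while 2*pi*f*d/c stays below pi; above that frequency the wrapped IPD aliases and several angles become indistinguishable.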

Over the past decades, the phase has been regarded as less important than the magnitude in speech enhancement. Recently, however, studies that improve speech enhancement performance through phase estimation have been reported continuously. Accordingly, several studies have also proposed phase-aware objective metrics to replace the conventional measures, which consider only the magnitude of the enhanced speech.

In general, because the phase is confined to a range of 2π, it is discontinuous (wrapped) and shows random or ambiguous patterns in the TF domain. This ambiguity makes the phase more difficult to estimate with deep neural networks (DNNs) than magnitude-related quantities such as the log power spectrum or the ratio mask, which show distinct patterns. For this reason, some research groups have proposed indirect approaches that estimate a complex ratio mask instead of estimating the phase directly. For multichannel speech enhancement and DoA estimation using deep learning, several studies have also used the IPD as an input feature of the DNN to take spatial information into account. In most of these cases, however, classes corresponding to the directional angles of the sound sources are predefined before training, so the angular resolution of these approaches is limited by the number of DoA classes.
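
The two ideas in this paragraph, phase wrapping and the complex ratio mask, can be illustrated with a short sketch. It assumes the common definition of the complex ratio mask as the ratio of the clean STFT to the noisy STFT; the function names and sample values are hypothetical.

```python
import numpy as np


def wrap_phase(phase):
    """Wrap an arbitrary phase (radians) into the principal interval around zero."""
    return np.angle(np.exp(1j * phase))


def complex_ratio_mask(clean_stft, noisy_stft, eps=1e-8):
    """Complex mask M such that M * noisy_stft approximates clean_stft."""
    return clean_stft * np.conj(noisy_stft) / (np.abs(noisy_stft) ** 2 + eps)


# A smooth phase ramp becomes a sawtooth with jumps of almost 2*pi once wrapped.
smooth_phase = np.linspace(0.0, 4.0 * np.pi, 50)
wrapped_phase = wrap_phase(smooth_phase)
largest_jump = np.max(np.abs(np.diff(wrapped_phase)))

# Hypothetical clean and noisy STFT bins; the mask corrects magnitude and phase.
clean = np.array([1.0 + 1.0j, 0.5 - 2.0j, -0.3 + 0.1j])
noisy = clean + np.array([0.2 - 0.1j, -0.4 + 0.3j, 0.1 + 0.2j])
mask = complex_ratio_mask(clean, noisy)
enhanced = mask * noisy
```

Because the mask is complex-valued, a network that predicts its real and imaginary parts corrects the phase implicitly without ever regressing a wrapped angle.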

In this thesis, we introduce new approaches that take the phase information of the speech signal into account for speech enhancement, reinforcement, and sound source localization. For phase-aware speech enhancement, we introduce a DNN-based algorithm that directly estimates the clean phase from the noisy phase. More specifically, we aim to show that a DNN can directly estimate the unwrapped phase of the clean speech, which is obtained by phase decomposition and has a more distinct pattern than the instantaneous phase; the instantaneous phase estimate is then recovered by adding the linear phase. For source localization and DoA estimation, we propose a DNN regression approach that exploits both the noisy and clean IPD, as opposed to most conventional approaches, which are based on DNN classification. For multichannel speech enhancement, we introduce a reinforcement algorithm that reflects the binaural unmasking phenomenon by using a DoA calculation based on the IPD. The performance of the multichannel speech reinforcement algorithm is expected to improve when it is combined with the proposed phase-aware processing techniques.
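
The decomposition into a linear phase and an unwrapped phase can be sketched as follows. The abstract does not spell out the decomposition, so the linear-phase definition used here (the phase advance of a component with constant frequency across STFT frames) and all names and parameter values are assumptions for illustration only.

```python
import numpy as np


def wrap_phase(phase):
    """Wrap an arbitrary phase (radians) into the principal interval around zero."""
    return np.angle(np.exp(1j * phase))


def linear_phase(f0_hz, num_frames, hop_samples, sample_rate_hz):
    """Phase advance of a constant-frequency component across STFT frames."""
    frame_index = np.arange(num_frames)
    return 2.0 * np.pi * f0_hz * frame_index * hop_samples / sample_rate_hz


def decompose_phase(instantaneous, linear):
    """Residual phase left after the linear phase is removed."""
    return wrap_phase(instantaneous - linear)


def recover_instantaneous(unwrapped, linear):
    """Add the linear phase back to obtain the instantaneous phase."""
    return wrap_phase(unwrapped + linear)


# Hypothetical stationary sinusoid: 220 Hz with an initial phase of 0.7 rad.
f0_hz, initial_phase = 220.0, 0.7
linear = linear_phase(f0_hz, num_frames=40, hop_samples=128, sample_rate_hz=16000)
instantaneous = wrap_phase(initial_phase + linear)  # erratic from frame to frame
unwrapped = decompose_phase(instantaneous, linear)  # nearly constant at 0.7 rad
recovered = recover_instantaneous(unwrapped, linear)
```

In this toy case the residual is constant while the instantaneous phase jumps from frame to frame, which illustrates why the residual may be an easier regression target for a DNN.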

A series of experiments was performed to evaluate each proposed approach, and the results showed that speech signal processing techniques that consider phase information outperform the conventional methods. This confirms that the phase information of the speech signal is a very important cue for enhancement, reinforcement, and sound source localization.
URI
https://scholar.gist.ac.kr/handle/local/32757
Fulltext
http://gist.dcollection.net/common/orgView/200000909145
Access and License
  • Access type: Public
File List
  • No related files exist.

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.