Single and multi-channel speech enhancement incorporating statistical model and deep-learning approach
- Author(s)
- Minseung Kim
- Type
- Thesis
- Degree
- Doctor
- Department
- School of Electrical Engineering and Computer Science (Graduate School)
- Advisor
- Shin, Jong Won
- Abstract
- In many applications such as voice communication, hearing aids, speech recognition, speaker verification, and meeting summarization, environmental noise often degrades the perceived quality and intelligibility of the speech signal. To address this problem, many speech enhancement techniques have been developed over the past decades. Statistical model-based speech enhancement techniques, which employ clean speech estimators derived from various optimization criteria including the Wiener filter, minimum mean square error short-time spectral amplitude (MMSE-STSA), and minimum mean square error log-spectral amplitude (MMSE-LSA), are essential for mobile-device applications that require low algorithmic delay and low computational complexity. Recently, approaches incorporating deep learning techniques into the statistical speech enhancement framework have been proposed, including Deep Xi and DeepMMSE, in which \textit{a priori} signal-to-noise ratios (SNRs) are estimated by deep neural networks (DNNs), and the noise power spectral density (PSD) and spectral gain functions are computed from the estimated parameters. Meanwhile, devices with multiple microphones have become widespread, enabling multi-channel speech enhancement that exploits spatial information as well as the spectro-temporal characteristics of the input signals. In this dissertation, we introduce single- and multi-channel speech enhancement frameworks based on statistical models, propose improved parameter estimation schemes, and complete these frameworks by incorporating deep learning approaches.
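As background for the statistical estimators named above, a minimal sketch (not the dissertation's implementation) of the classic Wiener spectral gain together with the decision-directed \textit{a priori} SNR estimate of Ephraim and Malah might look like:

```python
import numpy as np

def wiener_gain(xi):
    """Wiener spectral gain G(xi) = xi / (1 + xi) for a priori SNR xi."""
    return xi / (1.0 + xi)

def decision_directed_xi(prev_clean_power, noise_psd, gamma, alpha=0.98):
    """Decision-directed a priori SNR estimate (Ephraim-Malah):
    xi = alpha * |S_prev|^2 / sigma_n^2 + (1 - alpha) * max(gamma - 1, 0),
    where gamma = |Y|^2 / sigma_n^2 is the a posteriori SNR."""
    return (alpha * prev_clean_power / noise_psd
            + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0))
```

Both functions operate elementwise per time-frequency bin; MMSE-STSA and MMSE-LSA replace `wiener_gain` with their own gain functions of `xi` and `gamma`.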
Recently, speech enhancement adopting a speech PSD uncertainty model has been proposed. This approach distinguishes the true speech PSD from its estimate and treats both as random variables. It incorporates a prior distribution of speech spectra and speech PSD estimators to derive PSD uncertainty-aware counterparts to conventional clean speech estimators, which results in performance improvement. However, the speech PSD uncertainty model has not yet been adopted for parameter estimation tasks in the speech enhancement framework, such as the estimation of the \textit{a posteriori} speech presence probability, the noise PSD, and the speech power spectrum. In this dissertation, we incorporate the speech PSD uncertainty model into all components of the statistical model-based speech enhancement framework by deriving PSD uncertainty-aware counterparts to the conventional parameter estimators. Specifically, we derive the \textit{a posteriori} speech presence probability (SPP) in which the likelihood function for each hypothesis accounts for the speech PSD uncertainty. From this \textit{a posteriori} SPP, a novel SPP-based noise PSD estimator is derived. We also derive the MMSE estimator for the power spectrum of the clean speech in the current frame under speech PSD uncertainty, which is exploited to refine the speech PSD estimate. Finally, the refined speech PSD estimate is incorporated into the spectral gain function based on the speech PSD uncertainty model. The proposed approach showed improved noise PSD estimation performance in terms of the averaged logarithmic error distance, and improved speech enhancement performance in terms of noise reduction, segmental signal-to-noise ratio, perceptual evaluation of speech quality (PESQ) scores, and short-time objective intelligibility (STOI) in our experiments.
It also exhibited performance comparable to that of a real-time deep learning-based speech enhancement system in terms of PESQ scores and composite measures on the VoiceBank-DEMAND dataset.
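The dissertation's uncertainty-aware estimators are not reproduced here; as conventional background for SPP-driven noise tracking (in the spirit of soft-decision estimators such as Gerkmann and Hendriks'), a recursive noise PSD update controlled by the SPP can be sketched as follows (function and parameter names are illustrative):

```python
import numpy as np

def spp_noise_update(noise_psd, noisy_power, spp, alpha=0.8):
    """Soft-decision recursive noise PSD update driven by the a posteriori
    speech presence probability (SPP). Where speech is likely present
    (spp -> 1) the previous noise estimate is kept; where speech is likely
    absent (spp -> 0) the estimate tracks the noisy periodogram |Y|^2."""
    target = spp * noise_psd + (1.0 - spp) * noisy_power
    return alpha * noise_psd + (1.0 - alpha) * target
```

An uncertainty-aware SPP, as derived in the dissertation, would replace the `spp` input; the recursive structure itself is standard.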
Secondly, we propose an improved DeepMMSE (iDeepMMSE), which uses a DNN to estimate the speech PSD and the SPP as well as the \textit{a priori} SNR for MMSE estimation of the speech and noise PSDs. The \textit{a priori} and \textit{a posteriori} SNRs are refined with the estimated PSDs, which in turn are used to compute the spectral gain function. We also replace the DNN architecture with the Conformer, which efficiently captures both local and global sequential information.
Experimental results on the VoiceBank-DEMAND and Deep Xi datasets showed that the proposed iDeepMMSE outperformed DeepMMSE in terms of PESQ scores and composite objective measures.
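The refinement step described above can be illustrated schematically; assuming DNN-estimated speech and noise PSDs are available (names hypothetical), the refined SNRs and a gain follow directly (a Wiener gain stands in here for whatever gain function the actual system uses):

```python
import numpy as np

def refine_snrs_and_gain(speech_psd, noise_psd, noisy_power):
    """Form the refined a priori SNR xi and a posteriori SNR gamma from
    estimated PSDs, then a Wiener-type spectral gain as a stand-in for
    the MMSE gain function of a DeepMMSE-style system."""
    xi = speech_psd / noise_psd          # a priori SNR
    gamma = noisy_power / noise_psd      # a posteriori SNR
    gain = xi / (1.0 + xi)               # Wiener gain (illustrative choice)
    return xi, gamma, gain
```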
Online multi-channel speech enhancement aims to extract the target speech from multiple noisy inputs with low latency, exploiting spatial information as well as spectro-temporal characteristics. Acoustic parameters such as the acoustic transfer function and the speech and noise spatial covariance matrices (SCMs) must be estimated in a causal manner to enable online estimation of the clean speech spectra. Thirdly, we propose an improved estimator for the speech SCM, which can be parametrized by the speech PSD and the relative transfer function (RTF). Specifically, we adopt the temporal cepstrum smoothing (TCS) scheme to estimate the speech PSD, which has conventionally been estimated with temporal smoothing. We also propose a novel RTF estimator based on a time difference of arrival (TDoA) estimate obtained by the cross-correlation method. Furthermore, we refine the initial estimate of the speech SCM utilizing estimates of the clean speech spectrum and the clean speech power spectrum. The proposed approach showed superior performance in terms of PESQ scores, extended STOI (eSTOI), and scale-invariant signal-to-distortion ratio (SI-SDR) in our experiments on the CHiME-4 database.
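The cross-correlation-based TDoA estimation mentioned above can be illustrated with a generic GCC-PHAT sketch, a common choice for cross-correlation TDoA (the dissertation's exact estimator may differ):

```python
import numpy as np

def gcc_phat_tdoa(x, y, fs, max_tau=None, eps=1e-12):
    """Estimate the time difference of arrival between channels x and y via
    generalized cross-correlation with phase transform (GCC-PHAT).
    Returns a positive delay (seconds) when x is delayed relative to y."""
    n = len(x) + len(y)                      # zero-pad to avoid circular wrap
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = X * np.conj(Y)
    R /= np.abs(R) + eps                     # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(round(fs * max_tau)), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = np.argmax(np.abs(cc)) - max_shift  # lag in samples
    return lag / fs
```

The estimated TDoA can then parametrize a far-field RTF per frequency bin as a phase ramp e^{-j2&#960;f&#964;}.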
Multi-channel speech enhancement systems usually consist of spatial filtering, such as minimum variance distortionless response (MVDR) beamforming, followed by post-processing; both require acoustic parameters including the RTF, the noise SCM, and the \textit{a priori} SNR.
Fourthly, we propose DNN-based parameter estimation for MVDR beamforming and post-filtering. Specifically, we utilize a DNN to estimate the interchannel phase differences of the clean speech and the \textit{a posteriori} speech presence probability, which are used to estimate the RTF and the noise SCM for MVDR beamforming. For the post-processing, we adopt the Deep Xi framework, in which another DNN estimates the \textit{a priori} SNR used to compute the spectral gains. The proposed method exhibited performance superior to previous approaches of similar size, especially in terms of PESQ scores, on the CHiME-4 dataset.
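For reference, once the RTF d and the noise SCM R_n have been estimated (by whatever means), the MVDR weights follow the standard closed form w = R_n^{-1} d / (d^H R_n^{-1} d); a minimal per-frequency-bin sketch:

```python
import numpy as np

def mvdr_weights(noise_scm, rtf):
    """MVDR beamformer weights for one frequency bin:
    w = R_n^{-1} d / (d^H R_n^{-1} d),
    distortionless toward the RTF d while minimizing noise power."""
    rn_inv_d = np.linalg.solve(noise_scm, rtf)   # R_n^{-1} d
    return rn_inv_d / (np.conj(rtf) @ rn_inv_d)  # normalize: w^H d = 1
```

The beamformer output for a multi-channel spectrum y is then w^H y, to which a single-channel post-filter gain (e.g. from the Deep Xi \textit{a priori} SNR) is applied.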
- URI
- https://scholar.gist.ac.kr/handle/local/19670
- Fulltext
- http://gist.dcollection.net/common/orgView/200000883065
- Access & License
-
- File List
-
Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.