
Machine Learning-based Speech Enhancement Using Spectro-Temporal Sparsity Analysis

Author(s): Kwang Myung Jeon
Type: Thesis
Degree: Doctor
Department: School of Electrical Engineering and Computer Science (Graduate School)
Advisor: Kim, Hong Kook
Abstract
Speech enhancement is a classic signal processing problem that has been studied for over 40 years; it aims to improve the intelligibility and perceptual quality of speech signals distorted by channel effects. Classical speech enhancement methods are mostly based on statistical approaches, which are not well suited to the complex and varied noise environments found in the real world. To overcome these limitations, speech enhancement methods based on machine learning have been developed in recent years. Owing to increased training data and computational power, machine learning-based speech enhancement methods have outperformed conventional statistical methods in many cases. However, most machine learning-based speech enhancement methods developed so far share a limitation: their performance drops substantially when the characteristics of the training data set, such as the audio recording conditions, the types of noise, and the noise mixing criteria, differ from those of the evaluation set.
To overcome these limitations, this thesis first proposes three machine learning-based speech enhancement methods that commonly utilize a novel spectro-temporal sparsity analysis under data mismatch conditions: 1) online noise dictionary learning, 2) spectral reconstruction using a sparse binary mask, and 3) sparsity-based phase spectrum compensation. First, the proposed online noise learning method measures the separation reliability of each time-frequency region by calculating its sparseness with respect to the posterior signal-to-noise ratio of the speech and noise estimated by non-negative matrix factorization (NMF). The time-frequency sparsity and the ratio between speech and noise activations are then analyzed to decide, in an online manner, whether the noise bases need to be updated. Finally, for the noise bases determined to be updated, a new noise basis is learned from the current noise estimates and activation matrices through a discriminative non-negative matrix decomposition technique. Experimental results showed that the proposed online noise learning method achieved superior speech separation and quality compared to a conventional exemplar-based sound source separation method and a semi-supervised NMF method trained on identical data.
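To make the online update decision concrete, the following Python sketch illustrates one way such a sparsity-gated decision could look. The shapes, thresholds, and the use of the Hoyer sparseness measure are illustrative assumptions, not the thesis implementation.

```python
# A minimal sketch of an online noise-dictionary update decision, assuming
# hypothetical thresholds and the Hoyer sparseness measure.
import numpy as np

def hoyer_sparsity(x, eps=1e-12):
    """Hoyer (2004) sparseness of a non-negative vector: 1 when a single
    bin is active, 0 when the vector is perfectly flat."""
    n = x.size
    l1 = np.abs(x).sum() + eps
    l2 = np.sqrt(np.square(x).sum()) + eps
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0)

def should_update_noise_bases(V, W_s, H_s, W_n, H_n,
                              sparsity_thr=0.6, activation_ratio_thr=0.5):
    """Decide, per frame, whether the frame looks noise-dominated enough
    to learn new noise bases from. Both thresholds are assumptions."""
    S_hat = W_s @ H_s                      # estimated speech magnitude
    N_hat = W_n @ H_n                      # estimated noise magnitude
    post_snr = S_hat / (N_hat + 1e-12)     # per-bin posterior SNR proxy

    update_flags = []
    for t in range(V.shape[1]):
        frame_sparsity = hoyer_sparsity(post_snr[:, t])
        act_ratio = H_s[:, t].sum() / (H_n[:, t].sum() + 1e-12)
        # A low speech/noise activation ratio combined with a flat (low-
        # sparsity) posterior SNR suggests a noise-dominated frame, which
        # is safe to adapt the noise bases from.
        update_flags.append(act_ratio < activation_ratio_thr
                            and frame_sparsity < sparsity_thr)
    return np.array(update_flags)
```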
Second, a spectral reconstruction method based on a sparsity-based binary mask is proposed to improve the quality of speech enhancement under speech/noise overlapping conditions. The proposed method first estimates a sparse binary mask that divides the time-frequency plane into regions of successful and failed speech separation using the spectro-temporal sparsity analysis. Unlike conventional binary mask estimation, which uses a fixed threshold, the proposed method adapts the threshold to the sparsity value of each region. After estimating the sparse binary mask, the proposed method reconstructs the speech components of the speech-loss regions using unsupervised NMF on the speech components observed through the mask. Experimental results showed that the proposed spectral reconstruction method improved speech enhancement performance by partially restoring speech components that failed to be separated under data mismatch conditions.
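As an illustration of this second method, the sketch below pairs an adaptive, sparsity-driven binary mask with masked (weighted) NMF imputation of the unreliable bins. The thresholding rule and the Euclidean NMF cost are assumptions chosen for brevity, not the thesis code.

```python
# A minimal sketch of adaptive-threshold masking plus NMF-based imputation.
import numpy as np

def adaptive_binary_mask(S_hat, N_hat, alpha=1.0):
    """Mark a time-frequency bin reliable when the estimated speech
    dominates noise by a margin scaled by the frame's sparsity."""
    snr = S_hat / (N_hat + 1e-12)
    # Per-frame L2/L1 energy ratio as a simple sparsity proxy in (0, 1]:
    # sparser frames tolerate a higher reliability threshold.
    l1 = np.abs(S_hat).sum(axis=0) + 1e-12
    l2 = np.sqrt(np.square(S_hat).sum(axis=0))
    thr = alpha * (l2 / l1)                 # adaptive per-frame threshold
    return (snr > thr[None, :]).astype(float)

def reconstruct_missing(V, mask, rank=20, n_iter=100, eps=1e-12):
    """Unsupervised NMF fitted only to the reliable (mask==1) bins; the
    unreliable bins are filled from the learned low-rank model."""
    F, T = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        R = W @ H + eps
        # Masked multiplicative updates (Euclidean cost on observed bins).
        W *= ((mask * V) @ H.T) / ((mask * R) @ H.T + eps)
        R = W @ H + eps
        H *= (W.T @ (mask * V)) / (W.T @ (mask * R) + eps)
    R = W @ H
    return mask * V + (1.0 - mask) * R      # keep observed, impute missing
```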
Third, a sparsity-based phase spectrum compensation (SPSC) function for single-channel source separation is proposed to improve the quality of reconstructed signals. While conventional approaches to the reconstruction of separated sources reuse the input phase spectrum, the proposed SPSC function modifies each source's phase spectrum using the estimated magnitude spectra of the multiple sources and their spectro-temporal sparsity. In particular, the spectro-temporal sparsity is estimated from the signal-to-interference ratio between the magnitude spectrum of the source to be separated and those of the other sources, including the background noise. To evaluate the effectiveness of the proposed SPSC function, it is first embedded into the reconstruction stage of magnitude-based source separation methods that employ a deep recurrent neural network (DRNN) and sparse non-negative matrix factorization (SNMF), respectively. Then, speech denoising is performed under four different noise conditions with signal-to-noise ratios (SNRs) in the range of 0–15 dB. Objective and subjective tests show that both the DRNN- and SNMF-based speech separation methods using the proposed SPSC function substantially outperform those using the noisy input phase spectrum, particularly under lower SNR conditions.
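The following sketch gives a rough feel for sparsity-weighted phase compensation of this kind. The anti-symmetric additive compensation is modeled on the phase spectrum compensation literature and is an assumption here, not the exact SPSC formulation from the thesis.

```python
# A minimal sketch of sparsity-weighted phase compensation.
import numpy as np

def spsc_reconstruct(X, mag_target, mag_others, lam=0.5):
    """X: complex STFT of the noisy mixture (F x T).
    mag_target: estimated magnitude of the source to reconstruct.
    mag_others: summed estimated magnitudes of all other sources
    (interference plus background noise)."""
    # Spectro-temporal sparsity proxy: per-bin signal-to-interference
    # ratio mapped to (0, 1); bins dominated by the target receive
    # little compensation, interference-dominated bins receive more.
    sir = mag_target / (mag_others + 1e-12)
    weight = 1.0 / (1.0 + sir)

    # Anti-symmetric compensation of the complex spectrum, so the
    # time-domain signal stays real after the inverse STFT.
    F = X.shape[0]
    anti = np.ones((F, 1))
    anti[F // 2:] = -1.0
    X_comp = X + lam * weight * mag_others * anti

    # Combine the compensated phase with the separated magnitude.
    return mag_target * np.exp(1j * np.angle(X_comp))
```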
In addition to the three proposed methods based on spectro-temporal sparsity analysis, a novel neural network architecture that hybridizes U-Net and NMF is proposed to further improve speech enhancement in unseen noise environments. The proposed method takes advantage of both the accurate separation of known noise environments by the U-Net and the quick responsiveness to unseen noises of NMF with an online dictionary learning technique. To merge the two architectures, a modified U-Net with a temporal activation layer (TAU-Net) is jointly optimized with NMF models that represent universal speech and noise, respectively. At inference time, the proposed method first estimates temporal activations from the encoder of the TAU-Net. Then, NMF with online dictionary learning adjusts the initial temporal activations to suppress cross-talk caused by unseen noises that were not available during the training of the TAU-Net. Finally, clean speech is obtained by feeding the adjusted temporal activations to the TAU-Net's decoder. The effectiveness of the proposed speech enhancement method is evaluated in various unseen noise environments. The proposed method outperforms the state of the art across all test conditions in terms of signal-to-distortion ratio (SDR) and perceptual evaluation of speech quality (PESQ).
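The inference flow described above could be sketched as follows; tau_encoder, tau_decoder, nmf_refine, and the dictionary shapes are hypothetical stand-ins for the TAU-Net components and the universal speech/noise NMF models.

```python
# A minimal sketch of the TAU-Net/NMF hybrid inference flow, assuming
# numpy-valued encoder/decoder callables.
import numpy as np

def nmf_refine(H0, V, W_speech, W_noise, n_iter=50, eps=1e-12):
    """Adjust the encoder's temporal activations with NMF multiplicative
    updates against the observed magnitude spectrogram V, so that energy
    leaked into speech activations by unseen noise is re-attributed to
    the noise bases."""
    W = np.concatenate([W_speech, W_noise], axis=1)   # joint dictionary
    H = np.concatenate([H0, np.full((W_noise.shape[1], V.shape[1]), eps)])
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ (W @ H) + eps)        # Euclidean NMF update
    return H[:H0.shape[0]]                 # refined speech activations

def enhance(noisy_spec, tau_encoder, tau_decoder, W_speech, W_noise):
    H0 = tau_encoder(noisy_spec)           # initial temporal activations
    H = nmf_refine(H0, np.abs(noisy_spec), W_speech, W_noise)
    return tau_decoder(H)                  # clean speech estimate
```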
Finally, a speech enhancement method that integrates all the proposed element technologies was developed and compared with state-of-the-art speech enhancement methods, including deep recurrent neural networks, U-Net, and generative adversarial networks. Objective experiments measuring SDR, signal-to-interference ratio (SIR), PESQ, short-term objective intelligibility (STOI), and segmental SNR showed that the proposed integrated speech enhancement method outperformed these recent deep learning-based methods, which rely on large amounts of training data, under various unseen noise conditions.
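For reference, a minimal sketch of one of the cited metrics, segmental SNR, is given below; the frame length and per-frame clamping limits follow common practice rather than the thesis's exact protocol.

```python
# A minimal sketch of segmental SNR with per-frame clamping.
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256,
                  snr_min=-10.0, snr_max=35.0):
    n_frames = min(len(clean), len(enhanced)) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = s - enhanced[i * frame_len:(i + 1) * frame_len]  # error signal
        num = np.sum(s ** 2) + 1e-12
        den = np.sum(e ** 2) + 1e-12
        # Clamp per-frame SNR so silent frames do not dominate the mean.
        snrs.append(np.clip(10.0 * np.log10(num / den), snr_min, snr_max))
    return float(np.mean(snrs))
```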
URI: https://scholar.gist.ac.kr/handle/local/32748
Fulltext: http://gist.dcollection.net/common/orgView/200000909114
Access and License
  • Access: Open
File List
  • No related files available.

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.