
Machine Learning-based Speech Enhancement Using Spectro-Temporal Sparsity Analysis

Author(s): Kwang Myung Jeon
Type: Thesis
Degree: Doctor
Department: School of Electrical Engineering and Computer Science (Graduate School)
Advisor: Kim, Hong Kook
Abstract
Speech enhancement is a classic signal processing problem that has been studied for over 40 years; it aims to improve the intelligibility and perceptual quality of speech signals distorted by channel effects. Classical speech enhancement methods are mostly based on statistical approaches, which are not well suited to the complex and varied noise environments found in the real world. To overcome these limitations, speech enhancement methods based on machine learning have been developed in recent years. Owing to increased training data and computational power, machine learning-based speech enhancement methods have outperformed conventional statistical methods in many cases. However, most machine learning-based speech enhancement methods developed so far share a limitation: their performance drops substantially when the characteristics of the training data set, such as the audio recording conditions, the types of noise, and the noise mixing criteria, differ from those of the evaluation set.
To overcome these limitations, this thesis first proposes three machine learning-based speech enhancement methods that commonly utilize a novel spectro-temporal sparsity analysis under data mismatch conditions: 1) online noise dictionary learning, 2) spectral reconstruction using a sparse binary mask, and 3) sparsity-based phase spectrum compensation. First, the proposed online noise learning method measures the separation reliability of each time-frequency region by calculating its sparseness with respect to the posterior signal-to-noise ratio of the speech and noise estimated by non-negative matrix factorization (NMF). The time-frequency sparsity and the ratio between speech and noise activations are then analyzed to decide, in an online manner, whether the noise bases need to be updated. Finally, for the noise bases determined to be updated, a new noise basis is learned from the current noise estimates and activation matrices through a discriminative non-negative matrix decomposition technique. Experimental results showed that the proposed online noise learning method achieved superior speech separation and quality compared to a conventional exemplar-based sound source separation method and a semi-supervised NMF method trained on identical data.
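To make the online update decision concrete, the following Python sketch illustrates one way such a sparsity-gated decision could look. The shapes, thresholds, and the use of the Hoyer sparseness measure are illustrative assumptions, not the thesis implementation.

```python
# A minimal sketch of an online noise-dictionary update decision, assuming
# hypothetical thresholds and the Hoyer sparseness measure.
import numpy as np

def hoyer_sparsity(x, eps=1e-12):
    """Hoyer (2004) sparseness of a non-negative vector: 1 when a single
    bin is active, 0 when the vector is perfectly flat."""
    n = x.size
    l1 = np.abs(x).sum() + eps
    l2 = np.sqrt(np.square(x).sum()) + eps
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0)

def should_update_noise_bases(V, W_s, H_s, W_n, H_n,
                              sparsity_thr=0.6, activation_ratio_thr=0.5):
    """Decide, per frame, whether the frame looks noise-dominated enough
    to learn new noise bases from. Both thresholds are assumptions."""
    S_hat = W_s @ H_s                      # estimated speech magnitude
    N_hat = W_n @ H_n                      # estimated noise magnitude
    post_snr = S_hat / (N_hat + 1e-12)     # per-bin posterior SNR proxy

    update_flags = []
    for t in range(V.shape[1]):
        frame_sparsity = hoyer_sparsity(post_snr[:, t])
        act_ratio = H_s[:, t].sum() / (H_n[:, t].sum() + 1e-12)
        # A low speech/noise activation ratio combined with a flat (low-
        # sparsity) posterior SNR suggests a noise-dominated frame, which
        # is safe to adapt the noise bases from.
        update_flags.append(act_ratio < activation_ratio_thr
                            and frame_sparsity < sparsity_thr)
    return np.array(update_flags)
```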
Second, a spectral reconstruction method based on a sparsity-based binary mask is proposed to improve the quality of speech enhancement under speech/noise overlapping conditions. The proposed method first estimates a sparse binary mask that divides the time-frequency plane into regions of successful and failed speech separation using the spectro-temporal sparsity analysis. Unlike conventional binary mask estimation, which uses a fixed threshold, the proposed method adapts the threshold to the sparsity value of each region. After estimating the sparse binary mask, the proposed method reconstructs the speech components of the speech-loss regions using unsupervised NMF on the speech components observed through the mask. Experimental results showed that the proposed spectral reconstruction method improved speech enhancement performance by partially restoring speech components that failed to be separated under data mismatch conditions.
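As an illustration of this second method, the sketch below pairs an adaptive, sparsity-driven binary mask with masked (weighted) NMF imputation of the unreliable bins. The thresholding rule and the Euclidean NMF cost are assumptions chosen for brevity, not the thesis code.

```python
# A minimal sketch of adaptive-threshold masking plus NMF-based imputation.
import numpy as np

def adaptive_binary_mask(S_hat, N_hat, alpha=1.0):
    """Mark a time-frequency bin reliable when the estimated speech
    dominates noise by a margin scaled by the frame's sparsity."""
    snr = S_hat / (N_hat + 1e-12)
    # Per-frame L2/L1 energy ratio as a simple sparsity proxy in (0, 1]:
    # sparser frames tolerate a higher reliability threshold.
    l1 = np.abs(S_hat).sum(axis=0) + 1e-12
    l2 = np.sqrt(np.square(S_hat).sum(axis=0))
    thr = alpha * (l2 / l1)                 # adaptive per-frame threshold
    return (snr > thr[None, :]).astype(float)

def reconstruct_missing(V, mask, rank=20, n_iter=100, eps=1e-12):
    """Unsupervised NMF fitted only to the reliable (mask==1) bins; the
    unreliable bins are filled from the learned low-rank model."""
    F, T = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        R = W @ H + eps
        # Masked multiplicative updates (Euclidean cost on observed bins).
        W *= ((mask * V) @ H.T) / ((mask * R) @ H.T + eps)
        R = W @ H + eps
        H *= (W.T @ (mask * V)) / (W.T @ (mask * R) + eps)
    R = W @ H
    return mask * V + (1.0 - mask) * R      # keep observed, impute missing
```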
Third, a sparsity-based phase spectrum compensation (SPSC) function for single-channel source separation is proposed to improve the quality of reconstructed signals. While conventional approaches to the reconstruction of separated sources reuse the input phase spectrum, the proposed SPSC function modifies each source's phase spectrum using the estimated magnitude spectra of the multiple sources and their spectro-temporal sparsity. In particular, the spectro-temporal sparsity is estimated from the signal-to-interference ratio between the magnitude spectrum of the source to be separated and those of the other sources, including the background noise. To evaluate the effectiveness of the proposed SPSC function, it is first embedded into the reconstruction stage of magnitude-based source separation methods that employ a deep recurrent neural network (DRNN) and sparse non-negative matrix factorization (SNMF), respectively. Then, speech denoising is performed under four different noise conditions with signal-to-noise ratios (SNRs) in the range of 0–15 dB. Objective and subjective tests show that both the DRNN- and SNMF-based speech separation methods using the proposed SPSC function substantially outperform those using the noisy input phase spectrum, particularly under lower SNR conditions.
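The following sketch gives a rough feel for sparsity-weighted phase compensation of this kind. The anti-symmetric additive compensation is modeled on the phase spectrum compensation literature and is an assumption here, not the exact SPSC formulation from the thesis.

```python
# A minimal sketch of sparsity-weighted phase compensation.
import numpy as np

def spsc_reconstruct(X, mag_target, mag_others, lam=0.5):
    """X: complex STFT of the noisy mixture (F x T).
    mag_target: estimated magnitude of the source to reconstruct.
    mag_others: summed estimated magnitudes of all other sources
    (interference plus background noise)."""
    # Spectro-temporal sparsity proxy: per-bin signal-to-interference
    # ratio mapped to (0, 1); bins dominated by the target receive
    # little compensation, interference-dominated bins receive more.
    sir = mag_target / (mag_others + 1e-12)
    weight = 1.0 / (1.0 + sir)

    # Anti-symmetric compensation of the complex spectrum, so the
    # time-domain signal stays real after the inverse STFT.
    F = X.shape[0]
    anti = np.ones((F, 1))
    anti[F // 2:] = -1.0
    X_comp = X + lam * weight * mag_others * anti

    # Combine the compensated phase with the separated magnitude.
    return mag_target * np.exp(1j * np.angle(X_comp))
```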
In addition to the three proposed methods based on spectro-temporal sparsity analysis, a novel neural network architecture that hybridizes U-Net and NMF is proposed to further improve speech enhancement in unseen noise environments. The proposed method takes advantage of both the accurate separation of known noise environments by the U-Net and the quick responsiveness to unseen noises of NMF with an online dictionary learning technique. To merge the two architectures, a modified U-Net with a temporal activation layer (TAU-Net) is jointly optimized with NMF models that represent universal speech and noise, respectively. At inference time, the proposed method first estimates temporal activations from the encoder of the TAU-Net. Then, NMF with online dictionary learning adjusts the initial temporal activations to suppress cross-talk caused by unseen noises that were not available during the training of the TAU-Net. Finally, clean speech is obtained by feeding the adjusted temporal activations to the TAU-Net's decoder. The effectiveness of the proposed speech enhancement method is evaluated in various unseen noise environments. The proposed method outperforms the state of the art across all test conditions in terms of signal-to-distortion ratio (SDR) and perceptual evaluation of speech quality (PESQ).
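The inference flow described above could be sketched as follows; tau_encoder, tau_decoder, nmf_refine, and the dictionary shapes are hypothetical stand-ins for the TAU-Net components and the universal speech/noise NMF models.

```python
# A minimal sketch of the TAU-Net/NMF hybrid inference flow, assuming
# numpy-valued encoder/decoder callables.
import numpy as np

def nmf_refine(H0, V, W_speech, W_noise, n_iter=50, eps=1e-12):
    """Adjust the encoder's temporal activations with NMF multiplicative
    updates against the observed magnitude spectrogram V, so that energy
    leaked into speech activations by unseen noise is re-attributed to
    the noise bases."""
    W = np.concatenate([W_speech, W_noise], axis=1)   # joint dictionary
    H = np.concatenate([H0, np.full((W_noise.shape[1], V.shape[1]), eps)])
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ (W @ H) + eps)        # Euclidean NMF update
    return H[:H0.shape[0]]                 # refined speech activations

def enhance(noisy_spec, tau_encoder, tau_decoder, W_speech, W_noise):
    H0 = tau_encoder(noisy_spec)           # initial temporal activations
    H = nmf_refine(H0, np.abs(noisy_spec), W_speech, W_noise)
    return tau_decoder(H)                  # clean speech estimate
```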
Finally, a speech enhancement method that integrates all the proposed element technologies was developed and compared with state-of-the-art speech enhancement methods, including deep recurrent neural networks, U-Net, and generative adversarial networks. Objective experiments measuring SDR, signal-to-interference ratio (SIR), PESQ, short-term objective intelligibility (STOI), and segmental SNR showed that the proposed integrated speech enhancement method outperformed these recent deep learning-based methods, which rely on large amounts of training data, under various unseen noise conditions.
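For reference, a minimal sketch of one of the cited metrics, segmental SNR, is given below; the frame length and per-frame clamping limits follow common practice rather than the thesis's exact protocol.

```python
# A minimal sketch of segmental SNR with per-frame clamping.
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256,
                  snr_min=-10.0, snr_max=35.0):
    n_frames = min(len(clean), len(enhanced)) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = s - enhanced[i * frame_len:(i + 1) * frame_len]  # error signal
        num = np.sum(s ** 2) + 1e-12
        den = np.sum(e ** 2) + 1e-12
        # Clamp per-frame SNR so silent frames do not dominate the mean.
        snrs.append(np.clip(10.0 * np.log10(num / den), snr_min, snr_max))
    return float(np.mean(snrs))
```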
URI: https://scholar.gist.ac.kr/handle/local/32748
Fulltext: http://gist.dcollection.net/common/orgView/200000909114
Access and License
  • Access: Open
File List
  • No related files available.

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.