Multi-channel Speech Separation with Gammatone Filterbank in Reverberant Environments
- Abstract
- Recently, various deep-learning-based multi-channel speech separation (MCSS) models have been proposed to address the performance degradation that single-channel models suffer in reverberant environments. Several of these approaches feed the spectral feature of a reference-channel signal, together with additional inter-channel features, into a masking-based single-channel separator such as the fully convolutional time-domain audio separation network (Conv-TasNet). Some use hand-crafted spatial features such as the inter-channel phase difference (IPD), while others extract cross-channel features in a data-driven manner (e.g., the inter-channel convolution difference, ICD); a minimal sketch of IPD computation follows the abstract.
In this paper, we propose a multi-channel version of a multi-phase gammatone filterbank based speech separation network (a sketch of such a filterbank also appears after the abstract). We show that separation performance varies with the choice of optimization technique. Our experimental results show that the multi-phase gammatone filterbank based feature achieves performance comparable to that of a feature extracted by a learnable encoder while being more explainable. Moreover, evaluation results show that the proposed gammatone feature-based MCSS model outperforms existing MCSS models on both the Wall Street Journal 0 (WSJ0) 2-mix and LibriSpeech 2-mix datasets in reverberant environments. In addition, we show that the model trained on English data also performs well on a Korean dataset without any fine-tuning.
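As context for the hand-crafted spatial features mentioned above, the sketch below computes cos/sin-encoded IPD on an STFT grid. It is a minimal illustration assuming two time-aligned microphone signals; the function name, FFT size, and hop length are illustrative choices, not taken from the thesis.

```python
import numpy as np
from scipy.signal import stft

def ipd_features(ref, other, fs=16000, nfft=512, hop=256):
    """Inter-channel phase difference (IPD) between a reference-channel
    signal and another microphone channel, on the STFT grid.
    All names and parameter values here are illustrative."""
    _, _, R = stft(ref, fs=fs, nperseg=nfft, noverlap=nfft - hop)
    _, _, O = stft(other, fs=fs, nperseg=nfft, noverlap=nfft - hop)
    ipd = np.angle(O) - np.angle(R)  # raw per-bin phase difference
    # cos/sin encoding avoids the 2*pi wrap-around discontinuity
    return np.cos(ipd), np.sin(ipd)
```

In masking-based MCSS models of this kind, the cos/sin IPD maps are typically concatenated with the reference-channel spectral feature before being passed to the separator.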
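The following sketch shows one way to build a multi-phase gammatone analysis filterbank that could stand in for the learnable 1-D convolutional encoder of Conv-TasNet. The filter order, ERB-based bandwidth rule, center-frequency grid, and phase spacing are assumptions for illustration; the thesis may use different design parameters.

```python
import numpy as np

def gammatone_kernel(fc, phase, fs=8000, length=32, order=4):
    """One gammatone impulse response
    t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t + phi),
    with bandwidth b tied to the equivalent rectangular bandwidth (ERB).
    Defaults are illustrative, not taken from the thesis."""
    t = np.arange(1, length + 1) / fs
    erb = 24.7 + fc / 9.265              # Glasberg-Moore ERB at fc (Hz)
    b = 1.019 * erb                      # common bandwidth scaling
    g = t**(order - 1) * np.exp(-2 * np.pi * b * t) \
        * np.cos(2 * np.pi * fc * t + phase)
    return g / np.linalg.norm(g)         # unit-energy kernel

def multiphase_gammatone_bank(n_filters=512, n_phases=8, fs=8000, length=32):
    """Fixed analysis filterbank: center frequencies on a log-spaced grid,
    each replicated at several phase shifts (the 'multi-phase' part).
    A sketch of the idea, not the exact design used in the thesis."""
    fcs = np.geomspace(100, 0.9 * fs / 2, n_filters // n_phases)
    phases = np.linspace(0, np.pi, n_phases, endpoint=False)
    bank = np.stack([gammatone_kernel(fc, ph, fs, length)
                     for fc in fcs for ph in phases])
    return bank                           # shape: (n_filters, length)
```

Because every kernel is an analytic gammatone shape with a known center frequency and phase, the resulting feature is directly interpretable, which is the explainability advantage the abstract refers to.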
- Author(s)
- Jinwoo Oh
- Issued Date
- 2022
- Type
- Thesis
- URI
- https://scholar.gist.ac.kr/handle/local/19508
- Availability and License
- File List
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.