Abstract: Several recent target speaker extraction (TSE) models use the enrollment speech directly, without explicitly extracting low-dimensional speaker embeddings. However, these methods typically inject the speaker information only once, at the input of the speaker extraction network, which may be insufficient because the conditioning information can become diluted as it propagates through repeated separator blocks. In this letter, we propose a TSE model built upon TF-GridNet, a speech separation model that performs dual-path modeling in the time-frequency domain with cross-frame self-attention modules. In the proposed model, the self-attention modules in the first $M$ separator blocks are replaced by cross-attention between the enrollment speech and the mixture signal, which provides speaker information at multiple stages without introducing additional parameters or computation compared with the original TF-GridNet blocks. In addition, the initial hidden and cell states of the inter-frame long short-term memory (LSTM) modules are determined for each frequency from the enrollment speech. Because the pattern of temporal correlation may differ across frequencies depending on the speaker's pitch and speaking style, speaker-dependent, frequency-wise state initialization is expected to be helpful. Experimental results show that the proposed TSE model achieves the best PESQ scores and comparable SI-SDRs with lower computational complexity.
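To illustrate the idea of replacing self-attention with cross-attention, the following is a minimal sketch and not the authors' implementation. It assumes single-head scaled dot-product attention with identity projections over toy two-dimensional frame features; the names `attention`, `self_attention`, `cross_attention`, `mixture`, and `enrollment` are hypothetical. The only difference between the two variants is the source of the keys and values, which is consistent with the statement that no additional parameters are introduced.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of feature vectors.

    Assumption: identity projections and a single head. A real
    separator block would apply learned projections before this step.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity of this query frame to every key frame.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value frames.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def self_attention(mixture):
    """Queries, keys, and values all come from the mixture frames."""
    return attention(mixture, mixture, mixture)


def cross_attention(mixture, enrollment):
    """Queries come from the mixture; keys and values from the enrollment."""
    return attention(mixture, enrollment, enrollment)


# Toy frame features (hypothetical values, chosen only for illustration).
mixture = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
enrollment = [[2.0, 0.0], [0.0, 2.0]]

conditioned = cross_attention(mixture, enrollment)
```

In this sketch, each output frame is a convex combination of the enrollment frames, weighted by how similar each enrollment frame is to the corresponding mixture frame. Applying such a step in several consecutive blocks is one way to read "providing speaker information in multiple stages".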

Appears in Collections: Department of Electrical Engineering and Computer Science > 1. Journal Articles
