Target Speaker Extraction Using Multi-Stage Cross-Attention and Frequency-Wise State Initialization

Author(s)
Kim, Hyeonseung; Shin, Jong Won
Type
Article
Citation
IEEE SIGNAL PROCESSING LETTERS, v.33, pp.773 - 777
Issued Date
2026-01
Abstract
Several recent target speaker extraction (TSE) models directly utilize enrollment speech without explicitly extracting low-dimensional speaker embeddings. However, these methods typically inject the speaker information only once at the input of the speaker extraction network, which may be insufficient because the conditioning information can become diluted as it propagates through repeated separator blocks. In this letter, we propose a TSE model built upon the TF-GridNet, which is a speech separation model performing dual-path modeling in the time-frequency domain with cross-frame self-attention modules. In the proposed TSE model, the self-attention modules in the first $M$ separator blocks are replaced by cross-attention between the enrollment speech and the mixture signal, providing speaker information in multiple stages without introducing additional parameters or computation compared with the original TF-GridNet blocks. In addition, the initial hidden and cell states of the inter-frame long short-term memory (LSTM) modules are determined for each frequency from the enrollment speech. As the pattern of the temporal correlation may be different for each frequency depending on the pitch and speaking style, speaker-dependent frequency-wise state initialization would be helpful. Experimental results showed that the proposed TSE model demonstrated the best PESQ scores and comparable SI-SDRs with lower computational complexity.
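The two mechanisms described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes single-head attention on flattened per-frame features, queries taken from the mixture with keys and values taken from the enrollment speech (the abstract does not state which side supplies the queries), and a hypothetical time-average-plus-linear mapping for the LSTM state initialization. All function names, shapes, and weights are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(query_src, key_value_src, w_q, w_k, w_v):
    """Single-head attention across frames.

    query_src: (T_q, D), key_value_src: (T_kv, D).
    Passing the same array twice gives self-attention; passing the
    enrollment features as key_value_src gives cross-attention with
    the very same projection weights (no extra parameters).
    """
    q = query_src @ w_q
    k = key_value_src @ w_k
    v = key_value_src @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def frequency_wise_initial_states(enroll_feat, w_h, w_c):
    """Map enrollment features (F, T_e, D) to per-frequency LSTM states.

    Assumption: average over time, then a linear map with tanh.
    Returns h0 and c0, each of shape (F, H), one state per frequency.
    """
    pooled = enroll_feat.mean(axis=1)   # (F, D)
    h0 = np.tanh(pooled @ w_h)          # (F, H)
    c0 = np.tanh(pooled @ w_c)          # (F, H)
    return h0, c0

rng = np.random.default_rng(0)
D, H, F, T_mix, T_enr = 8, 6, 5, 10, 7   # illustrative sizes only

w_q, w_k, w_v = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
mixture = rng.standard_normal((T_mix, D))
enrollment = rng.standard_normal((T_enr, D))

# Original block: self-attention over the mixture frames.
self_out = cross_frame_attention(mixture, mixture, w_q, w_k, w_v)
# Modified block: same weights, keys/values from the enrollment speech.
cross_out = cross_frame_attention(mixture, enrollment, w_q, w_k, w_v)

w_h = rng.standard_normal((D, H)) * 0.1
w_c = rng.standard_normal((D, H)) * 0.1
enroll_tf = rng.standard_normal((F, T_enr, D))
h0, c0 = frequency_wise_initial_states(enroll_tf, w_h, w_c)
```

In this sketch the output of the attention keeps the mixture's frame count regardless of the enrollment length, and the inter-frame LSTM, which treats each frequency as a separate sequence, would receive a different initial hidden and cell state for each of the F frequencies.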
Publisher
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
ISSN
1070-9908
DOI
10.1109/LSP.2026.3657998
URI
https://scholar.gist.ac.kr/handle/local/33657
Access and License
  • Access type: Open
File List
  • No related files exist.