Target Speaker Extraction Using Multi-Stage Cross-Attention and Frequency-Wise State Initialization
- Author(s)
- Kim, Hyeonseung; Shin, Jong Won
- Type
- Article
- Citation
- IEEE SIGNAL PROCESSING LETTERS, v.33, pp.773 - 777
- Issued Date
- 2026-01
- Abstract
- Several recent target speaker extraction (TSE) models directly utilize enrollment speech without explicitly extracting low-dimensional speaker embeddings. However, these methods typically inject the speaker information only once, at the input of the speaker extraction network, which may be insufficient because the conditioning information can become diluted as it propagates through repeated separator blocks. In this letter, we propose a TSE model built upon TF-GridNet, a speech separation model that performs dual-path modeling in the time-frequency domain with cross-frame self-attention modules. In the proposed TSE model, the self-attention modules in the first $M$ separator blocks are replaced by cross-attention between the enrollment speech and the mixture signal, providing speaker information at multiple stages without introducing additional parameters or computation compared with the original TF-GridNet blocks. In addition, the initial hidden and cell states of the inter-frame long short-term memory (LSTM) modules are determined for each frequency from the enrollment speech. Because the pattern of temporal correlation may differ across frequencies depending on the pitch and speaking style, speaker-dependent frequency-wise state initialization is expected to be helpful. Experimental results show that the proposed TSE model achieves the best PESQ scores and comparable SI-SDRs with lower computational complexity.
- Publisher
- IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
- ISSN
- 1070-9908
- DOI
- 10.1109/LSP.2026.3657998
- URI
- https://scholar.gist.ac.kr/handle/local/33657
Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.
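
The two mechanisms described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the function names, the tensor shapes, taking queries from the mixture and keys/values from the enrollment speech, and mean-pooling the enrollment features over time before projecting them to initial LSTM states. The record does not specify any of these details. The sketch only shows why swapping self-attention for cross-attention needs no extra parameters (the same projection matrices are reused) and what "one initial state per frequency" looks like in terms of shapes.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query_src, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention.

    Passing context = query_src gives self-attention; passing a different
    context gives cross-attention with the same projection matrices.
    """
    q = query_src @ w_q                      # (T_q, D)
    k = context @ w_k                        # (T_c, D)
    v = context @ w_v                        # (T_c, D)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (T_q, T_c)
    return softmax(scores, axis=-1) @ v      # (T_q, D)


def init_states_per_frequency(enroll_tf, w_h, w_c):
    """Hypothetical frequency-wise state initialization.

    enroll_tf: enrollment features of shape (F, T_e, C).
    Assumption: mean-pool over time, then apply a projection to obtain one
    initial hidden state and one initial cell state per frequency bin.
    """
    pooled = enroll_tf.mean(axis=1)          # (F, C)
    h0 = np.tanh(pooled @ w_h)               # (F, H)
    c0 = pooled @ w_c                        # (F, H)
    return h0, c0


rng = np.random.default_rng(0)
T_mix, T_enr, C, D = 50, 30, 16, 8           # illustrative sizes only
F_bins, H = 5, 4

mixture = rng.standard_normal((T_mix, C))    # mixture frames (one frequency)
enroll = rng.standard_normal((T_enr, C))     # enrollment frames (one frequency)
w_q, w_k, w_v = (rng.standard_normal((C, D)) for _ in range(3))

# Same weights, different context: no parameters are added.
self_out = attention(mixture, mixture, w_q, w_k, w_v)
cross_out = attention(mixture, enroll, w_q, w_k, w_v)

enroll_tf = rng.standard_normal((F_bins, T_enr, C))
w_h = rng.standard_normal((C, H))
w_c = rng.standard_normal((C, H))
h0, c0 = init_states_per_frequency(enroll_tf, w_h, w_c)
```

In this sketch both attention calls return one output vector per mixture frame, so the cross-attention variant can stand in for self-attention without changing the shapes seen by the rest of a block. The initial states `h0` and `c0` have one row per frequency bin, which would then seed the LSTM that runs along the time axis for that frequency.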