Abstract: Speaker verification systems have shown impressive performance on long utterances. However, their performance significantly degrades on short-duration utterances. To address this issue, a multi-resolution encoder has been proposed to extract low-level representations at multiple temporal resolutions and condition the hidden representations of ECAPA-TDNN, achieving state-of-the-art performance for utterances shorter than 2 seconds. In this work, we propose a multi-perspective feature fusion method to enhance the baseline system. The method extracts fused features by multiplying a learnable matrix with representations from a pre-trained self-supervised model and injects them as conditional information into each SE-Res2Block of ECAPA-TDNN. Experimental results on VoxCeleb1-O demonstrate that our method further improves performance on short utterances compared to the baseline.
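The fusion step described in the abstract can be illustrated with a small sketch. This is not the thesis implementation. It assumes the "representations from a pre-trained self-supervised model" are per-layer hidden states of shape (frames x dims), that the learnable matrix holds one row of layer weights per fused view, and that conditioning is a plain element-wise addition to a block's hidden representation. The names `fuse_layers` and `condition` are hypothetical, and a real system would also need projection layers to match the channel and time dimensions of ECAPA-TDNN.

```python
# Minimal pure-Python sketch of weighted layer fusion and additive conditioning.
# All shapes and the additive injection are assumptions made for illustration.

def fuse_layers(layer_feats, weights):
    """Combine L layer outputs into K fused views.

    layer_feats: list of L layers, each a list of T frames of D floats.
    weights:     K x L matrix (list of K rows), standing in for the
                 learnable matrix; here the values are fixed by hand.
    Returns a list of K views, each T x D.
    """
    num_layers = len(layer_feats)
    num_frames = len(layer_feats[0])
    dim = len(layer_feats[0][0])
    fused = []
    for row in weights:
        if len(row) != num_layers:
            raise ValueError("each weight row needs one entry per layer")
        view = [
            [sum(row[l] * layer_feats[l][t][d] for l in range(num_layers))
             for d in range(dim)]
            for t in range(num_frames)
        ]
        fused.append(view)
    return fused


def condition(hidden, view):
    """Inject a fused view into a block's hidden state by element-wise addition."""
    return [[h + v for h, v in zip(h_row, v_row)]
            for h_row, v_row in zip(hidden, view)]


# Toy input: 3 layers, 2 frames, 2 dimensions.
feats = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 4.0], [5.0, 6.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
# Two fused views, e.g. one per conditioned block (an assumption).
matrix = [
    [1.0, 0.0, 0.0],   # view 0 copies layer 0
    [0.5, 0.5, 0.0],   # view 1 averages layers 0 and 1
]
views = fuse_layers(feats, matrix)
hidden = [[0.0, 0.0], [0.0, 0.0]]
conditioned = condition(hidden, views[1])
```

Giving each conditioned block its own row of the matrix lets different blocks draw on different mixtures of layers, which is one plausible reading of "multi-perspective"; the abstract does not spell out the exact shapes.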

Appears in Collections: Department of Electrical Engineering and Computer Science > 3. Theses (Master)