OAK

Short-Segment Speaker Verification via Multi-Perspective Feature Fusion from Self-Supervised Speech Representations

Author(s)
Jisoo Myoung
Type
Thesis
Degree
Master
Department
College of Information and Computing, Department of Electrical Engineering and Computer Science
Advisor
Shin, Jong Won
Abstract
Speaker verification systems perform well on long utterances, but their performance degrades significantly on short ones. To address this issue, a multi-resolution encoder has been proposed that extracts low-level representations at multiple temporal resolutions and uses them to condition the hidden representations of ECAPA-TDNN, achieving state-of-the-art performance for utterances shorter than 2 seconds. In this work, we propose a multi-perspective feature fusion method that enhances this baseline system. The method obtains fused features by multiplying representations from a pre-trained self-supervised model by a learnable matrix and injects them as conditional information into each SE-Res2Block of ECAPA-TDNN. Experimental results on VoxCeleb1-O show that our method further improves performance on short utterances compared with the baseline.
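The fusion step described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not say which axis the learnable matrix acts on, so everything below is an assumption made for illustration: the matrix is taken to mix the layer-wise outputs of the self-supervised model into one fused view per SE-Res2Block, the fused view is assumed to share the block's frame rate, and it is projected and added to the block input. The names `fuse_layers`, `condition_block_input`, `mix_matrix`, and `proj`, and all dimensions, are hypothetical and not taken from the thesis.

```python
import numpy as np


def fuse_layers(hidden_states, mix_matrix):
    """Mix layer-wise representations into several fused views.

    hidden_states: (L, T, D) array, one (T, D) matrix per model layer.
    mix_matrix:    (K, L) array, assumed learnable; row k weights the L layers.
    Returns a (K, T, D) array, one fused view per row of mix_matrix.
    """
    return np.einsum("kl,ltd->ktd", mix_matrix, hidden_states)


def condition_block_input(block_input, fused_view, proj):
    """Add a projected fused view to a block input (assumed additive conditioning).

    block_input: (T, C), fused_view: (T, D), proj: (D, C).
    """
    return block_input + fused_view @ proj


# Toy dimensions, chosen only to keep the example small.
num_layers, num_frames, feat_dim, channels, num_blocks = 5, 20, 16, 8, 3

rng = np.random.default_rng(0)
hidden_states = rng.standard_normal((num_layers, num_frames, feat_dim))
mix_matrix = rng.standard_normal((num_blocks, num_layers))
proj = rng.standard_normal((feat_dim, channels))

fused = fuse_layers(hidden_states, mix_matrix)

# One conditioned input per block; in a real network each block would
# consume the output of the previous block rather than fresh noise.
block_inputs = rng.standard_normal((num_blocks, num_frames, channels))
conditioned = np.stack(
    [condition_block_input(block_inputs[k], fused[k], proj) for k in range(num_blocks)]
)

# Sanity check: a one-hot mixing matrix simply selects individual layers.
one_hot = np.eye(num_blocks, num_layers)
selected = fuse_layers(hidden_states, one_hot)
```

In this sketch each row of `mix_matrix` gives one block its own weighting of the layers, which is one possible reading of "multi-perspective"; the thesis itself may define the fusion and the injection differently.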
URI
https://scholar.gist.ac.kr/handle/local/31952
Fulltext
http://gist.dcollection.net/common/orgView/200000900138
Access and License
  • Access type: Open
File List
  • No related files exist.

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.