Short-Segment Speaker Verification via Multi-Perspective Feature Fusion from Self-Supervised Speech Representations
- Author(s)
- Jisoo Myoung
- Type
- Thesis
- Degree
- Master
- Department
- College of Information and Computing, Department of Electrical Engineering and Computer Science
- Advisor
- Shin, Jong Won
- Abstract
- Speaker verification systems perform well on long utterances, but their performance degrades significantly on short-duration utterances. To address this issue, a multi-resolution encoder has been proposed that extracts low-level representations at multiple temporal resolutions and uses them to condition the hidden representations of ECAPA-TDNN, achieving state-of-the-art performance for utterances shorter than 2 seconds. In this work, we take that system as the baseline and propose a multi-perspective feature fusion method to enhance it. The method extracts fused features by multiplying a learnable matrix with representations from a pre-trained self-supervised model and injects them as conditional information into each SE-Res2Block of ECAPA-TDNN. Experimental results on VoxCeleb1-O show that our method further improves performance on short utterances compared to the baseline.
- URI
- https://scholar.gist.ac.kr/handle/local/31952
- Fulltext
- http://gist.dcollection.net/common/orgView/200000900138
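The fusion step described in the abstract (a learnable matrix multiplied with representations from a pre-trained self-supervised model) can be illustrated with a toy sketch. This is only one possible reading, not the thesis implementation: the function names `fuse_layers` and `condition`, the tensor shapes, the use of one matrix row per conditioned block, and the additive form of the conditioning are all assumptions made for illustration. The weights are fixed numbers here, whereas in a real system they would be trained.

```python
# Hypothetical sketch. Names, shapes, and the additive conditioning are
# assumptions for illustration; they are not taken from the thesis.

def fuse_layers(layer_reps, weights):
    """Mix L layer representations into K fused streams.

    layer_reps: list of L layers, each a list of T frames of D floats.
    weights:    K x L matrix (stands in for the learnable matrix).
    Returns a list of K fused streams, each of shape T x D.
    """
    num_layers = len(layer_reps)
    num_frames = len(layer_reps[0])
    dim = len(layer_reps[0][0])
    fused = []
    for row in weights:
        assert len(row) == num_layers, "one weight per layer is required"
        # Weighted sum over layers, computed per frame and per dimension.
        stream = [
            [sum(row[l] * layer_reps[l][t][d] for l in range(num_layers))
             for d in range(dim)]
            for t in range(num_frames)
        ]
        fused.append(stream)
    return fused


def condition(hidden, fused_stream):
    """Assumed conditioning: add the fused features to the hidden frames."""
    return [
        [h + f for h, f in zip(h_frame, f_frame)]
        for h_frame, f_frame in zip(hidden, fused_stream)
    ]


# Toy input: 3 layers, 2 frames, 2 feature dimensions.
layer_reps = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
    [[0.0, 4.0], [4.0, 0.0]],
]
# Two rows, i.e. two fused streams (for example, one per conditioned block).
weights = [
    [1.0, 0.0, 0.0],
    [0.5, 0.25, 0.25],
]

fused = fuse_layers(layer_reps, weights)
hidden = [[0.0, 0.0], [1.0, 1.0]]
conditioned = condition(hidden, fused[1])
```

In this toy setup the first row of `weights` simply selects the first layer, while the second row blends all three layers. A trained matrix would presumably learn a different mixture for each block it conditions.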
Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.