Multimodal audiovisual speech recognition architecture using a three-feature multi-fusion method for noise-robust systems

Abstract
Exposure to varied noisy environments impairs the recognition performance of artificial intelligence-based speech recognition technologies. Services with degraded performance can still be deployed as limited systems that assure good performance only in certain environments, but this impairs the general quality of speech recognition services. This study introduces an audiovisual speech recognition (AVSR) model that is robust to various noise settings by mimicking the elements humans use to recognize dialogue. For audio recognition, the model converts word embeddings and log-Mel spectrograms into feature vectors. A dense spatial-temporal convolutional neural network model extracts features from the log-Mel spectrograms transformed for visual-based recognition. This approach improves both aural and visual recognition capabilities. We assess performance across signal-to-noise ratios in nine synthesized noise environments, where the proposed model exhibits lower average error rates. The error rate of the AVSR model using the three-feature multi-fusion method is 1.711%, compared with 3.939% for the general model. Owing to its enhanced stability and recognition rate, this model is applicable in noise-affected environments.
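The abstract describes the fusion step only at a high level. As a rough illustration of what a three-feature multi-fusion stage could look like, the following minimal PyTorch sketch projects audio (log-Mel), text (word-embedding), and visual (dense spatial-temporal CNN) feature vectors into a shared space and fuses them by concatenation before classification. All module names, dimensions, and the concatenation-based fusion are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): fusing three modality-specific
# feature vectors by projection + concatenation, then classifying.
import torch
import torch.nn as nn

class ThreeFeatureFusion(nn.Module):
    def __init__(self, audio_dim=256, text_dim=256, visual_dim=256,
                 hidden_dim=512, num_classes=500):
        super().__init__()
        # One projection per modality: log-Mel audio features, word-embedding
        # (text) features, and dense spatial-temporal CNN visual features.
        # Dimensions here are placeholders, not values from the paper.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # Fusion by concatenation of the three projected vectors,
        # followed by a small classifier head.
        self.classifier = nn.Sequential(
            nn.Linear(3 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, audio_feat, text_feat, visual_feat):
        fused = torch.cat([
            torch.relu(self.audio_proj(audio_feat)),
            torch.relu(self.text_proj(text_feat)),
            torch.relu(self.visual_proj(visual_feat)),
        ], dim=-1)
        return self.classifier(fused)

# Usage with dummy pre-extracted feature vectors (batch of 2):
model = ThreeFeatureFusion()
logits = model(torch.randn(2, 256), torch.randn(2, 256), torch.randn(2, 256))
print(logits.shape)  # torch.Size([2, 500])
```

Concatenation is only one plausible reading of "multi-fusion"; attention-based or gated fusion would slot into the same interface by replacing the `torch.cat` step.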
Author(s)
Jeon, Sanghun; Lee, Jieun; Yeo, Dohyeon; Lee, Yong-Ju; Kim, SeungJun
Issued Date
2024-02
Type
Article
DOI
10.4218/etrij.2023-0266
URI
https://scholar.gist.ac.kr/handle/local/9718
Publisher
Electronics and Telecommunications Research Institute (ETRI)
Citation
ETRI Journal, v.46, no.1, pp. 22-34
ISSN
1225-6463
Appears in Collections:
Department of AI Convergence > 1. Journal Articles
Access and License
  • Access status: Open
File List
  • No associated files are available.

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.