Multimodal Audio-Visual Speech Recognition using a Hybrid Approach Combining Open Cloud-based Speech API with Deep Neural Network-based Lipreading
- Abstract
- Recent advances in deep learning technology have enabled its application to various industries with excellent results. Specifically, deep learning-based voice-recognition technology has been notably successful. Speech recognition technology translates human vocal signals into a machine-understandable language and enables the automation of services in different fields, including security and finance, thereby facilitating the adoption of a comprehensive digital lifestyle. However, speech recognition technology still faces a critical limitation: the difficulty of performing recognition in noisy environments. Current speech recognition technology performs well in some situations, such as indoors and in automobiles, but not in others. In the work presented herein, we attempt to resolve this problem by merging visual information into a speech recognition system comprising three modules, each addressing the additional issues that arise during the process.
The first part is visual speech recognition (VSR), which transcribes speech using only visual input to interpret tongue and teeth motions. Recent research has shown that deep learning-based approaches outperform conventional lipreading methods in VSR, delivering higher accuracy when tested on benchmark datasets. However, the use of VSR systems continues to pose some challenges. One of them is word ambiguity, which results from the inability to distinguish words with similar pronunciations, known as homophones. Another technical drawback of typical VSR systems is that the visual information does not provide adequate data for learning words whose vocalization lasts less than 0.02 s, such as “a,” “an,” “eight,” and “bin.” This paper proposes a novel lipreading architecture that combines three different convolutional neural networks (CNNs), namely, a three-dimensional CNN, a densely connected three-dimensional CNN, and a multilayer feature fusion three-dimensional CNN, followed by a two-layer bidirectional gated recurrent unit. The entire network was trained with a connectionist temporal classification (CTC) loss. Conventional assessment metrics for automatic speech recognition indicate that the proposed architecture reduced the character and word error rates of the baseline model on the unseen-speaker dataset by 5.681% and 11.282%, respectively. In the presence of visual ambiguity, the proposed design performs better, increasing the reliability of VSR in real-world applications.
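As a rough illustration of this type of front-end, the sketch below (in PyTorch) combines a small three-dimensional CNN feature extractor with a two-layer bidirectional GRU and a CTC head. The layer widths, kernel sizes, and 28-symbol character vocabulary are illustrative assumptions, not the configuration used in the thesis.

```python
# Minimal sketch (not the thesis implementation) of a lipreading network:
# a 3D-CNN spatiotemporal front-end, a 2-layer bidirectional GRU, and a CTC head.
import torch
import torch.nn as nn

class LipReadingNet(nn.Module):
    def __init__(self, vocab_size=28, hidden=256):  # vocab includes the CTC blank
        super().__init__()
        # Spatiotemporal front-end: Conv3d over (time, height, width)
        self.frontend = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.BatchNorm3d(32), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm3d(64), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
        )
        # Two-layer bidirectional GRU over the time dimension
        self.gru = nn.GRU(64, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, vocab_size)

    def forward(self, x):               # x: (B, C=3, T, H, W)
        f = self.frontend(x)            # (B, 64, T, H', W')
        f = f.mean(dim=(3, 4))          # global spatial pooling -> (B, 64, T)
        f = f.transpose(1, 2)           # (B, T, 64)
        out, _ = self.gru(f)            # (B, T, 2*hidden)
        return self.fc(out).log_softmax(-1)  # per-frame log-probabilities for CTC

# Illustrative CTC training step with dummy data
model = LipReadingNet()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
video = torch.randn(2, 3, 75, 64, 128)           # batch of 75-frame mouth crops
logp = model(video).transpose(0, 1)              # CTC expects (T, B, V)
targets = torch.randint(1, 28, (2, 20))          # character indices (0 = blank)
loss = ctc(logp, targets,
           input_lengths=torch.full((2,), logp.size(0), dtype=torch.long),
           target_lengths=torch.full((2,), 20, dtype=torch.long))
loss.backward()
```

The CTC loss lets the network learn frame-to-character alignments without per-frame labels, which is why the recurrent output is reshaped to (time, batch, vocabulary) before the loss is computed.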
The second part addresses noise-robust automatic speech recognition (ASR), a research area that deep learning has greatly stimulated. The combination of cloud computing and artificial intelligence has considerably enhanced the performance of open cloud-based speech recognition (OCSR) application programming interfaces (APIs), and ASR that is noise-robust and suitable for various situations has been developed. This study proposes a noise-robust extension of OCSR APIs based on an end-to-end lipreading architecture for practical applications in various contexts. We used Google's Voice Command Dataset v2 to analyze different OCSR APIs, including those from Google, Microsoft, Amazon, and Naver, and determine their optimal performance. To improve performance and provide comprehensive semantic information for keywords, we combined the Microsoft API with Google's pretrained Word2Vec model. We then incorporated the retrieved word vector into the lipreading architecture developed for audio-visual speech recognition. Lastly, we concatenated the API and vision vectors and classified the fused representation. The proposed design increased the average accuracy of the OCSR APIs by 14.42% based on conventional ASR assessment metrics and the signal-to-noise ratio, outperforming other models in a variety of noisy environments and increasing the dependability of OCSR APIs in practical applications.
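A minimal sketch of this fusion step is shown below, assuming the OCSR API returns a transcript string, a pretrained Word2Vec model (e.g., loaded with gensim) maps it to a 300-dimensional vector, and the lipreading network supplies a 512-dimensional visual feature. The dimensions, the 35-keyword output, and the helper names are hypothetical, not taken from the thesis.

```python
# Minimal sketch of API word-vector / vision-vector fusion and classification.
import numpy as np
import torch
import torch.nn as nn

NUM_KEYWORDS = 35            # assumed size of the keyword set
W2V_DIM, VIS_DIM = 300, 512  # assumed embedding / visual feature sizes

def word_vector(transcript, w2v, dim=W2V_DIM):
    """Average the Word2Vec vectors of the recognized words (zeros if all OOV)."""
    vecs = [w2v[w] for w in transcript.lower().split() if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim, dtype=np.float32)

class FusionClassifier(nn.Module):
    """Concatenates the API word vector and the visual feature, then classifies."""
    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(W2V_DIM + VIS_DIM, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, NUM_KEYWORDS),
        )

    def forward(self, api_vec, vis_vec):
        return self.head(torch.cat([api_vec, vis_vec], dim=-1))

# Usage with dummy tensors standing in for real API / vision outputs
api_vec = torch.randn(1, W2V_DIM)   # would come from word_vector(...) on the API result
vis_vec = torch.randn(1, VIS_DIM)   # would come from the lipreading encoder
logits = FusionClassifier()(api_vec, vis_vec)
keyword_id = logits.argmax(dim=-1)
```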
Both ASR and VSR have garnered considerable attention owing to recent breakthroughs in deep learning. Although VSR systems must recognize speech from both frontal and profile face images in real-world situations, most VSR research focuses only on frontal face images. Therefore, the third part of our system is a multi-view VSR architecture for face images collected from four distinct views (frontal, 30°, 45°, and 60°). In the proposed architecture, we used multiple CNNs along with a spatial attention module, dropout, and spatial dropout designed specifically to detect subtle differences in the mouth patterns corresponding to identically uttered phrases, as well as an encoder capable of capturing both short-term and long-term spatiotemporal information. Additionally, a cascaded local self-attention connectionist temporal classification decoder generates output messages that partially resolve the conditional independence assumption within the hidden neural layers, resulting in significantly improved performance and faster convergence. The local self-attention module also captures extensive contextual information from the immediate neighborhood. Experimental results on the OuluVS2 dataset indicate that the proposed architecture is on average 9.1% more accurate than the baseline, improving the performance and utility of multi-view VSR for real-world applications.
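The spatial attention component can be sketched as a CBAM-style block that reweights each spatial location of a per-frame feature map using pooled channel statistics, followed by spatial (channel-wise) dropout. The kernel size, dropout rate, and feature dimensions below are illustrative assumptions, not the thesis configuration.

```python
# Minimal sketch of a spatial attention block with spatial dropout.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Weights each spatial location using channel-pooled statistics."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                       # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)       # average over channels -> (B, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)      # max over channels     -> (B, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn                         # reweighted feature map

block = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    SpatialAttention(),
    nn.Dropout2d(0.2),                          # spatial (channel-wise) dropout
)
frames = torch.randn(8, 64, 28, 28)             # per-frame features from one view
out = block(frames)                             # (8, 64, 28, 28)
```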
This thesis proposes a multimodal interaction approach based on audio and visual information that makes speech-based virtual aquarium systems noise-robust. In the proposed approach, the list of words recognized by a voice API is expressed as word vectors using a pretrained model for audio-based speech recognition, and a composite end-to-end deep neural network is utilized for vision-based speech recognition. The vectors obtained from the API and vision are then concatenated and classified. The signal-to-noise ratio of the proposed system was calculated on the basis of information from four different types of noisy settings, and the system was compared with single-mode techniques for visual feature extraction and audio speech recognition to evaluate its accuracy and efficacy. When using only voice, its average recognition rate was 91.42%; this increased by 6.7% to 98.12% when both audio and visual information were used. The proposed system could be invaluable in real-world contexts where voice recognition is often used, such as cafés, museums, music venues, and kiosks.
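For reference, the signal-to-noise ratio used in such evaluations is the ratio of signal power to noise power expressed in decibels. The sketch below uses synthetic 16 kHz audio and made-up noise scales for four settings purely to illustrate the calculation; it does not reproduce the thesis measurements.

```python
# Minimal sketch of a per-setting SNR estimate from clean speech and additive noise.
import numpy as np

def snr_db(signal, noise):
    """SNR in decibels between a clean signal and the noise added to it."""
    p_signal = np.mean(np.square(signal.astype(np.float64)))
    p_noise = np.mean(np.square(noise.astype(np.float64)))
    return 10.0 * np.log10(p_signal / p_noise)

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)   # 1 s tone at 16 kHz
for name, scale in [("cafe", 0.5), ("museum", 0.2),
                    ("concert", 0.8), ("kiosk", 0.3)]:        # hypothetical settings
    noise = scale * rng.standard_normal(clean.shape)          # synthetic noise
    print(f"{name}: {snr_db(clean, noise):.1f} dB")
```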
In conclusion, this thesis proposes a system consisting of deep learning-based visual, audio-visual, and multi-view visual speech recognition. This state-of-the-art system achieved outstanding results when tested on two benchmark datasets and demonstrated stable performance in various environments, unlike existing voice-recognition technology, which delivers high performance only in specific environments. In addition, applying the proposed system to a virtual aquarium environment to evaluate its efficacy resulted in improved recognition performance. Consequently, the proposed system demonstrates the potential to permeate and improve many aspects of our daily lives, including healthcare.
- Author(s)
- Sanghun Jeon
- Issued Date
- 2023
- Type
- Thesis
- URI
- https://scholar.gist.ac.kr/handle/local/19510
- Access & License
-
- File List
-
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.