Real-time Dynamic Subtitling for Group Conversation in Augmented Reality

Author(s): Junyoung Park

Type: Thesis

Degree: Master

Department: Graduate School, School of Integrated Technology (Culture Technology Program)

Advisor: Moon, Bochang

Abstract: Proof-of-concept applications of dynamic subtitling have recently been proposed for augmented reality (AR) environments. Dynamic subtitling detects the person who is speaking from visual cues (i.e., speaker detection) and places the subtitle at the detected speaker. Previous dynamic subtitling techniques perform well on offline multimedia content; however, they are not suitable for streaming visual data because the computations required by conventional methods are expensive. In addition, we find that the frame-to-frame consistency of the computed visual information is not guaranteed when the camera or the speaker moves continuously in the viewing direction. In this paper, we propose a new dynamic subtitling technique that estimates the active speaker in group conversations in a real-time AR environment. To this end, we design a speaker matching process that exploits both visual and speech information, so that visual information is properly matched to the detected speech signals. Our method converts the speech signal obtained through the built-in microphone into the smallest phonetic units and uses the converted signal as a reference against which each speaker candidate's mouth shape, obtained from visual cues, is compared to find the active speaker. We demonstrate our dynamic subtitling method on a head-mounted display (HMD), and the results show a numerical improvement over the previous speaker detection method in a variety of verbal situations.

Abstract (translated from Korean): This study proposes a new dynamic subtitling method for real-time speaker detection in augmented reality environments. Dynamic subtitling is generally divided into speaker detection, which uses visual information to find who is speaking, and methods for placing subtitles optimally at the detected speaker. Thanks to recent advances in goggle-type displays, papers proposing proof-of-concept frameworks based on speaker detection in AR environments have drawn attention. Compared with the extensively studied speaker detection methods for offline multimedia, however, little research has addressed how to process the streaming video data available in real-time settings effectively. This is a technical limitation, distinct from user-preference-based human-computer interaction research on subtitle placement, and it arises from fundamental problems in processing visual information in real time. Existing speaker detection methods for dynamic subtitling are not suitable for real-time use because processing the visual information obtained from camera input requires a large amount of computation. In addition, in AR environments where the user wears a goggle-type display and the camera or the speaker keeps moving, the frame-to-frame consistency of the computed visual information is not robustly maintained. To address these problems, we studied a real-time speaker detection method that uses a machine-learning-based classifier. The proposed method splits the speech obtained through the built-in microphone into the smallest units defined from a phonetic standpoint, then compares the converted speech information with the image information obtained from the built-in camera, thereby detecting the speaker in multi-party conversations in real time.
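The speaker matching idea described in the abstract (phonetic units from the microphone compared against each candidate's mouth shape) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the thesis: the `PHONEME_TO_VISEME` table, the mouth-shape labels, the function names, and the `threshold` value are all hypothetical. The abstract mentions a machine-learning-based classifier; this sketch swaps in plain per-frame label agreement instead, so it shows only the matching step, not the author's method.

```python
# Hypothetical phoneme-to-mouth-shape table (illustrative only, not from the thesis).
PHONEME_TO_VISEME = {
    "p": "closed", "b": "closed", "m": "closed",
    "a": "open", "e": "spread", "i": "spread",
    "o": "rounded", "u": "rounded",
}


def expected_visemes(phonemes):
    """Map each recognized phoneme to the mouth shape it would be expected to produce."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]


def match_score(expected, observed):
    """Fraction of expected mouth shapes that the observed sequence reproduces."""
    if not expected:
        return 0.0
    hits = sum(1 for e, o in zip(expected, observed) if e == o)
    return hits / len(expected)


def pick_active_speaker(phonemes, candidates, threshold=0.5):
    """Return (best_candidate_or_None, scores) for the given phoneme sequence.

    `candidates` maps a candidate id to the mouth-shape labels observed for
    that face over the same time window (assumed to be already aligned).
    """
    expected = expected_visemes(phonemes)
    scores = {name: match_score(expected, obs) for name, obs in candidates.items()}
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] < threshold:
        return None, scores  # nobody matches the speech well enough
    return best, scores


if __name__ == "__main__":
    speech = ["m", "a", "p", "o"]
    faces = {
        "left": ["closed", "open", "closed", "rounded"],
        "right": ["neutral", "neutral", "closed", "neutral"],
    }
    print(pick_active_speaker(speech, faces))
```

Using the speech signal as the reference means only one phoneme sequence has to be computed per time window, and each face is scored against it independently; a real system would also need temporal alignment between audio and video frames, which this sketch assumes has already been done.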

URI: https://scholar.gist.ac.kr/handle/local/33070

Fulltext: http://gist.dcollection.net/common/orgView/200000909008

Alternative Author(s): 박준영

Appears in Collections: Department of AI Convergence > 3. Theses(Master)

Access and License

Access status: Public
