Evolution of Neural Image Caption Generation: From RNN towards Transformers
- Author(s)
- Olimov Farrukh
- Type
- Thesis
- Degree
- Master
- Department
- School of Electrical Engineering and Computer Science, Graduate School
- Advisor
- Jeon, Moongu
- Abstract
- Image caption generation is one of the core problems in artificial intelligence, drawing on advances in both computer vision and natural language processing. Describing an image is difficult because a model must identify the correct relationships among objects and then generate semantically and syntactically correct descriptions. This study covers generative models ranging from simple recurrent neural networks to more sophisticated transformer-based architectures. The main objective is to show that an image, being limited to a single frame, conceals information about actions, and that disregarding these cues degrades a model's ability to interpret the scene correctly. In contrast to pioneering models that consider only the image as a whole, recent algorithms attend to the distinct objects in the image together with their geometric properties. To tackle these problems, we propose models that use transformers for both object detection and sentence generation. The relationships among objects are learned from visual information and from geometric properties such as relative width and height, and the associations between objects and words are learned through the self-attention mechanism; our experiments visualize these associations. The proposed models are trained on the MSCOCO dataset and validated with the Bilingual Evaluation Understudy (BLEU) metric. (Illustrative code sketches of these components follow this record.)
- URI
- https://scholar.gist.ac.kr/handle/local/33188
- Fulltext
- http://gist.dcollection.net/common/orgView/200000907503
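
The abstract notes that object relationships are learned from geometric properties such as relative width and height. The sketch below shows one plausible way to compute pairwise geometry features from bounding boxes; the function name and the exact feature set (log size ratios and normalized center offsets) are illustrative assumptions, not the thesis's actual formulation.

```python
import numpy as np

def pairwise_geometry_features(boxes):
    """Compute pairwise geometric features for object bounding boxes.

    boxes: (N, 4) array of [x, y, w, h] in absolute image coordinates.
    Returns an (N, N, 4) array holding, for each ordered pair (i, j),
    log relative width, log relative height, and normalized center
    offsets. This particular feature set is an illustrative assumption.
    """
    x, y, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    cx, cy = x + w / 2.0, y + h / 2.0  # box centers

    # Relative width/height between every pair of boxes
    # (log ratio, so the feature is symmetric up to sign).
    rel_w = np.log(w[:, None] / w[None, :])
    rel_h = np.log(h[:, None] / h[None, :])

    # Center offsets normalized by the size of the reference box.
    dx = (cx[:, None] - cx[None, :]) / w[:, None]
    dy = (cy[:, None] - cy[None, :]) / h[:, None]

    return np.stack([rel_w, rel_h, dx, dy], axis=-1)
```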
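The abstract also states that associations between objects and words are learned with self-attention. Below is a minimal NumPy sketch of standard scaled dot-product attention from the transformer literature; the thesis's exact architecture (heads, projections, layer counts) is not specified here, so this only illustrates the core operation whose weight matrix can be visualized as object-word association maps.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard scaled dot-product attention.

    Q: (n_q, d) queries; K: (n_k, d) keys; V: (n_k, d_v) values.
    Returns the attended values and the attention weights, whose rows
    can be visualized as association maps (e.g., words attending to
    detected objects).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # (n_q, n_k) similarity scores

    # Softmax over keys, with max subtraction for numerical stability.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    return weights @ V, weights
```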
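Finally, the models are validated with the BLEU metric. A minimal example of scoring a generated caption against reference captions using NLTK's implementation follows; the example sentences are made up, and the thesis may use a different BLEU variant or toolkit.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Tokenized reference captions (MSCOCO provides several per image)
# and a generated caption; these sentences are invented for illustration.
references = [
    "a man riding a horse on a beach".split(),
    "a person rides a horse along the shore".split(),
]
candidate = "a man rides a horse on the beach".split()

# BLEU-4 with uniform n-gram weights; smoothing avoids a zero score
# when some higher-order n-gram has no match.
score = sentence_bleu(
    references,
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {score:.3f}")
```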