Label-attention transformer with geometrically coherent objects for image captioning
- Abstract
- Encoder-decoder-based image captioning techniques are generally used to describe the meaningful information present in an image. In this work, we investigate two previously unexplored ideas for transformer-based image captioning: 1) an object-focused label attention module (LAM), and 2) a geometrically coherent proposal (GCP) module that exploits the scale and position of objects so that the transformer model attains better image perception. These modules enforce objects' relevance to their surrounding environment and explore the effectiveness of learning an explicit association between vision and language constructs. LAM and GCP tolerate variation in object classes and their association with labels in multi-label classification. The proposed framework, the label-attention transformer with geometrically coherent objects (LATGeO), acquires proposals of geometrically coherent objects using a deep neural network (DNN) and generates captions by investigating their relationships with LAM. LAM associates the extracted object classes with the available dictionary using self-attention layers, while object coherence is captured in the GCP module through localized ratios of the proposals' geometrical features. Experiments are conducted on the MSCOCO dataset. The evaluation of LATGeO on MSCOCO shows that binding objects' relevance in their surroundings and their visual features with geometrically localized ratios and associated labels yields improved and meaningful captions. © 2022
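The abstract describes the GCP module as encoding object coherence through localized ratios of the proposals' geometrical features. A minimal sketch of what such pairwise relative-geometry features might look like is given below; the exact formulation (log-ratios of box offsets and scales, as used in common relative-position encodings) is an illustrative assumption, not the authors' definition, and the function name `geometric_relation` is hypothetical.

```python
import numpy as np

def geometric_relation(box_a, box_b):
    """Illustrative relative-geometry features between two object proposals.

    Boxes are (x, y, w, h). Returns log-ratios capturing the relative
    position and scale of box_b with respect to box_a. This is a sketch
    of the kind of localized geometric ratios the GCP module could use,
    not the paper's actual formulation.
    """
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    return np.array([
        np.log(abs(xa - xb) / wa + 1e-6),  # horizontal offset ratio
        np.log(abs(ya - yb) / ha + 1e-6),  # vertical offset ratio
        np.log(wb / wa),                   # width scale ratio
        np.log(hb / ha),                   # height scale ratio
    ])
```

Features of this form can be computed for every proposal pair and fed to the transformer alongside visual features and label embeddings, letting attention weights account for how objects are positioned and scaled relative to one another.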
- Author(s)
- Dubey, Shikha; Olimov, F.; Rafique, M.A.; Kim, J.; Jeon, Moongu
- Issued Date
- 2023-04
- Type
- Article
- DOI
- 10.1016/j.ins.2022.12.018
- URI
- https://scholar.gist.ac.kr/handle/local/10276
- Open Access & License
-
- File List
-
Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.