
Advancing Image Captioning with Regional Attention in Vision-Language Transformers and Multimodal Learning
Zubia Naz
School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology

Abstract
Visual Large Language Models (VLLMs) are transforming medical image captioning by addressing the challenges of interpreting complex medical images and generating accurate, context-aware textual descriptions. In this work, we enhance the capabilities of the Swin Transformer and BART by incorporating regional attention, a mechanism that dynamically focuses on specific regions of medical images. This ensures that critical details are emphasized during caption generation, overcoming the limitations of traditional models, which often struggle to capture intricate, localized features of medical images. By leveraging regional attention, our framework achieves higher precision in identifying medically significant details, thereby enhancing clinical decision-making and generating more reliable, detailed captions to support radiologists in diagnosing conditions from medical scans. Evaluated on the ROCO dataset, our model demonstrates significant improvements on the ROUGE and BERTScore metrics. ROUGE measures the overlap of essential words between generated captions and reference texts, reflecting linguistic accuracy, while BERTScore evaluates semantic similarity, indicating the contextual relevance and clinical utility of the generated captions. The integration of regional attention directly contributes to these advancements by enabling the model to capture fine-grained visual details and contextual nuances. Other metrics, such as BLEU and CIDEr, remained consistent, confirming the robustness of the approach. This advancement in medical image captioning, driven by regional attention, offers substantial benefits for improving automated medical diagnostics and supporting radiologists with enhanced precision and interpretability.
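The abstract outlines the pipeline at a high level: a Swin Transformer encodes the image, a regional attention module reweights the patch features, and a BART decoder cross-attends to them to generate the caption. The exact formulation is not given here, so the following is a minimal PyTorch sketch of one plausible reading. The checkpoints (microsoft/swin-base-patch4-window7-224, facebook/bart-base), the RegionalAttention scoring network, and the projection bridging the two hidden sizes are illustrative assumptions, not the author's published implementation.

```python
import torch
import torch.nn as nn
from transformers import SwinModel, BartTokenizer, BartForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

class RegionalAttention(nn.Module):
    """Scores each encoder patch token and reweights it so that salient
    image regions dominate the features the decoder attends to.
    Illustrative formulation; the thesis does not spell out the design here."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.Tanh(), nn.Linear(dim // 2, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, dim); weights sum to 1 over patches.
        weights = torch.softmax(self.score(tokens), dim=1)
        # Rescale by the patch count so feature magnitudes stay comparable.
        return tokens * weights * tokens.size(1)

class SwinBartCaptioner(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = SwinModel.from_pretrained(
            "microsoft/swin-base-patch4-window7-224")
        self.regional = RegionalAttention(self.encoder.config.hidden_size)
        self.bart = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
        # Bridge Swin's hidden size (1024) to BART's d_model (768).
        self.proj = nn.Linear(self.encoder.config.hidden_size, self.bart.config.d_model)

    def forward(self, pixel_values, labels=None):
        feats = self.encoder(pixel_values).last_hidden_state   # (B, 49, 1024)
        feats = self.proj(self.regional(feats))                # (B, 49, 768)
        enc = BaseModelOutput(last_hidden_state=feats)
        # BART's decoder cross-attends to the reweighted regional features.
        return self.bart(encoder_outputs=enc, labels=labels)

if __name__ == "__main__":
    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    model = SwinBartCaptioner()
    labels = tokenizer("CT scan of the chest.", return_tensors="pt").input_ids
    out = model(pixel_values=torch.randn(1, 3, 224, 224), labels=labels)
    print(out.loss)
```

Routing the projected features in through encoder_outputs reuses BART's stock cross-attention, so the decoder needs no structural changes; the biomedical embeddings mentioned in Section 4.1.2 would presumably slot in at the decoder's tokenizer and embedding layer.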
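The reported gains are measured with ROUGE (lexical overlap with the reference) and BERTScore (embedding-based semantic similarity). Below is a minimal sketch of how such scores are commonly computed, assuming the widely used rouge-score and bert-score packages (the thesis may rely on different implementations) and made-up caption strings.

```python
from rouge_score import rouge_scorer          # pip install rouge-score
from bert_score import score as bert_score    # pip install bert-score

generated = "chest x ray shows consolidation in the right lower lobe"
reference = "frontal chest radiograph demonstrating right lower lobe consolidation"

# ROUGE: lexical overlap (unigrams, longest common subsequence) with the reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print(scorer.score(reference, generated))

# BERTScore: semantic similarity computed from contextual token embeddings.
P, R, F1 = bert_score([generated], [reference], lang="en")
print(f"BERTScore F1 = {F1.item():.3f}")
```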
Author(s)
NAZ ZUBIA
Issued Date
2025
Type
Thesis
URI
https://scholar.gist.ac.kr/handle/local/18842
Alternative Author(s)
Zubia Naz
Department
School of Electrical Engineering and Computer Science (Graduate School)
Advisor
Jeon, Moongu
Table Of Contents
Abstract (English) i
Abstract (Korean) iii
List of Contents v
List of Tables vii
List of Figures viii
List of Algorithms ix
1 Introduction 1
1.1 Introduction 1
1.2 Motivation 2
2 Preliminary 4
2.1 Visual Language Models (VLMs) 4
2.1.1 Image Features Representation 4
2.1.2 Natural Language Representation 6
2.2 Major Architectures 7
2.2.1 CLIP Architecture in Image Captioning 8
2.2.2 BLIP Architecture in Image Captioning 8
2.3 Limitations of Unified Approach 9
3 Baseline Method 10
3.1 Overview of GIT Architecture 10
3.1.1 Image Encoder 10
3.1.2 Text Decoder 11
3.1.3 Vision-Language Interaction Mechanism 11
3.1.4 Training and Fine-Tuning 12
3.2 Similar Approaches Following GIT Architecture 12
3.2.1 CMRE-UoG team at ImageCLEFmedical Caption 2022: Concept Detection and Image Captioning 12
3.2.2 Transferring Pre-Trained Large Language-Image Model for Medical Image Captioning 14
4 Proposed Methodology 16
4.1 Architectural Overview 16
4.1.1 Image Encoder: Swin Transformer with Regional Attention 17
4.1.2 Text Decoder: BART-Based Model with Biomedical Embeddings and Regional Features 20
4.1.3 Visual Interaction and Training Strategy 22
4.1.4 Dataset and Data Processing 22
4.1.5 Advantages of Model Choices 23
4.1.6 Limitations and Challenges 23
4.1.7 Practical Implications and Future Directions 24
5 Results and Conclusions 25
5.1 Overview of Evaluation Metrics 25
5.2 Results and Performance Analysis 26
5.2.1 Comparative Analysis 27
5.2.2 Qualitative Analysis 28
5.3 Limitations and Future Directions 29
5.4 Conclusion 30
Summary 32
References 34
A Abbreviations 38
Acknowledgements 39
Degree
Master
Appears in Collections:
Department of Electrical Engineering and Computer Science > 3. Theses(Master)
Visibility and License
  • Visibility: Public
File List
  • No related files are available.

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.