
Exploring the Limit of Vision-Language Model’s Wordy Text-Image Representations

Author(s)
Shin, Dongmin
Type
Thesis
Degree
Master
Department
College of Information and Computing, Department of AI Convergence
Advisor
Jeon, Hae-Gon
Abstract
With the advent of Vision-Language Models (VLMs) such as CLIP, it has become possible to embed text and image features in a shared space, enabling significant progress in zero-shot downstream tasks. Thanks to this joint text-image encoding, VLMs have been widely used as encoders for vision-language tasks, with CLIP being a representative model. Subsequently, it has been found that CLIP learns biases from its training data. In this work, we mathematically explore a new bias of CLIP that leads to ambiguous discrimination when detailed descriptions are given. While detailed descriptions increase the entropy of text features, they also lead to higher uncertainty in image-text similarity. In particular, when a wrong word for either an object or an attribute is inserted into a long description, CLIP fails to detect it. To provide technical evidence of this bias, we construct a wordy description dataset and offer an evaluation protocol. To address this limitation, we then introduce a lightweight yet effective refinement module that preserves object and attribute features in wordy descriptions. Leveraging the refinement module improves the model's ability to generalize in distinguishing positives from negatives. Experiments on our wordy description dataset and on 21 zero-shot image classification tasks validate the effectiveness of the proposed approach over related works.
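The failure mode described above can be illustrated with an off-the-shelf CLIP model: score one image against a long description and against the same description with a single wrong attribute word, and compare the similarities. The sketch below is only illustrative and is not the thesis's evaluation protocol; the checkpoint name, the image file, and the two captions are assumptions made for the example.

```python
# Minimal sketch: probe CLIP's sensitivity to one wrong word in a wordy caption.
# Checkpoint, image path, and captions are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical image: a red mug on a wooden desk

captions = [
    # correct wordy description
    "A photo of a red ceramic mug with a curved handle sitting on a wooden desk "
    "next to a closed silver laptop under warm indoor lighting.",
    # same description with one wrong attribute word ("blue" instead of "red")
    "A photo of a blue ceramic mug with a curved handle sitting on a wooden desk "
    "next to a closed silver laptop under warm indoor lighting.",
]

inputs = processor(text=captions, images=image, return_tensors="pt",
                   padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-to-text similarity logits, one score per caption. The bias discussed in
# the abstract would appear as nearly identical scores for the correct caption
# and the caption containing the wrong attribute word.
print(outputs.logits_per_image.squeeze().tolist())
```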
URI
https://scholar.gist.ac.kr/handle/local/31893
Fulltext
http://gist.dcollection.net/common/orgView/200000896014
Alternative Author(s)
신동민
Appears in Collections:
Department of AI Convergence > 3. Theses(Master)
Access and License
  • Access type: Open
File List
  • No related files exist.

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.