Exploring the Limit of Vision-Language Model’s Wordy Text-Image Representations
- Author(s)
- Shin, Dongmin
- Type
- Thesis
- Degree
- Master
- Department
- College of Information and Computing, Department of AI Convergence (정보컴퓨팅대학 AI융합학과)
- Advisor
- Jeon, Hae-Gon
- Abstract
- With the advent of Vision-Language Models (VLMs) such as CLIP, embedding text and image features in a shared space has become possible, driving significant progress in zero-shot downstream tasks. Thanks to this joint text-image encoding, VLMs have been widely used as encoders for vision-language tasks, with CLIP being a representative model. It has since been found that CLIP learns biases from its training data. In this work, we mathematically explore a new bias of CLIP that leads to ambiguous discrimination when detailed descriptions are given. Although detailed descriptions increase the entropy of text features, they also lead to higher uncertainty in image-text similarity. In particular, when a wrong word for either an object or an attribute is inserted into a long description, CLIP fails to detect it. To provide technical evidence of this bias, we construct a wordy description dataset and offer an evaluation protocol. To address this limitation, we then introduce a lightweight yet effective refinement module to ensure that object and attribute features are preserved in wordy descriptions. Leveraging the refinement module improves the model's generalization ability to distinguish positives from negatives. Experiments on our wordy description dataset and 21 zero-shot image classification tasks validate the effectiveness of the proposed approach over relevant works.
- URI
- https://scholar.gist.ac.kr/handle/local/31893
- Fulltext
- http://gist.dcollection.net/common/orgView/200000896014
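The abstract describes probing CLIP's image-text similarity with wordy descriptions in which a single object or attribute word is wrong. Below is a minimal, illustrative sketch of that kind of probe, not the thesis's actual code or dataset: it assumes the public openai/clip-vit-base-patch32 checkpoint accessed through Hugging Face transformers, and the image path and captions are made-up placeholders.

```python
# Illustrative sketch (assumption, not the thesis implementation):
# compare CLIP image-text similarity for short captions vs. wordy
# descriptions that differ by one wrong attribute word.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image of, e.g., a red car

texts = [
    "a red car",                                             # short positive
    "a blue car",                                            # short negative (wrong attribute)
    "a red car parked on a quiet street next to a bakery "
    "with striped awnings on a sunny afternoon",             # wordy positive
    "a blue car parked on a quiet street next to a bakery "
    "with striped awnings on a sunny afternoon",             # wordy negative (one wrong word)
]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds cosine similarities scaled by CLIP's learned temperature;
# one row per image, one column per text.
sims = outputs.logits_per_image.squeeze(0)
for text, score in zip(texts, sims.tolist()):
    print(f"{score:7.3f}  {text}")

# The bias discussed in the abstract would show up as a smaller similarity gap
# between the wordy positive and wordy negative than between the short pair.
```

Note that CLIP's text encoder truncates inputs at 77 tokens, so any such wordy-description probe has to keep descriptions within that limit.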
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.