Cost-efficient Active Learning for Referring Image Segmentation and Grounding
- Author(s)
- Junbeom Hong
- Type
- Thesis
- Degree
- Master
- Department
Department of AI Convergence, College of Information and Computing
- Advisor
- Kim, Sundong
- Abstract
- Visual grounding (VG) suffers from prohibitive annotation costs because it requires not only precise region labels (i.e., masks or boxes) but also detailed descriptions of those regions. We tackle this annotation bottleneck by formulating active learning (AL) for VG under the realistic setting where the unlabeled pool consists of only raw images without accompanying text. However, estimating sample informativeness without ground-truth text remains challenging, as the model must still assess how well each image disambiguates the referred region from visually similar distractors. To address this, we generate auxiliary region-text pairs using foundation models and introduce Text-Grounded Region Entropy, a new acquisition function that measures whether the model's confidence collapses onto a single region or disperses across multiple candidates. This allows our method to prioritize images with strong cross-region competition, i.e., visually ambiguous yet highly informative ones. We further design a cost-efficient annotation interface that reduces the labor-intensive labeling of both masks and expressions to just a few clicks. In experiments, our AL framework consistently outperforms several AL baselines on RIS and REC benchmarks, while achieving up to 6× faster mask labeling and 1.4× faster text labeling in a user study.
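- The abstract only names the acquisition function, so the following is a minimal sketch of the underlying idea: entropy over a model's confidence distribution across candidate regions, averaged over generated expressions. The exact formula, the softmax normalization, and the names `region_entropy` and `acquisition_score` are illustrative assumptions, not the thesis's definition.

```python
# Sketch of an entropy-based acquisition score over candidate regions,
# assuming Text-Grounded Region Entropy aggregates per-region confidences
# for each auxiliary expression generated for an image. All names and the
# softmax/mean choices below are assumptions for illustration.
import numpy as np

def region_entropy(region_scores: np.ndarray) -> float:
    """Entropy of the confidence distribution over candidate regions.

    region_scores: raw matching scores between one generated expression
    and each candidate region in the image (shape: [num_regions]).
    Higher entropy means confidence is dispersed across regions (strong
    cross-region competition); lower means it collapsed onto one region.
    """
    # Normalize scores into a probability distribution over regions.
    exp = np.exp(region_scores - region_scores.max())
    probs = exp / exp.sum()
    # Shannon entropy; the epsilon guards against log(0).
    return float(-(probs * np.log(probs + 1e-12)).sum())

def acquisition_score(per_expression_scores: list[np.ndarray]) -> float:
    """Image-level score: mean region entropy over all auxiliary
    region-text pairs generated for the image (an illustrative choice)."""
    return float(np.mean([region_entropy(s) for s in per_expression_scores]))

# Example: an expression competing over three similar regions vs. an
# unambiguous one; the ambiguous image is prioritized for labeling.
ambiguous = [np.array([2.0, 1.9, 1.8])]
confident = [np.array([5.0, 0.1, 0.2])]
assert acquisition_score(ambiguous) > acquisition_score(confident)
```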
- URI
- https://scholar.gist.ac.kr/handle/local/33703
- Fulltext
- http://gist.dcollection.net/common/orgView/200000944972