Cost-efficient Active Learning for Referring Image Segmentation and Grounding
- Author(s)
- Junbeom Hong
- Type
- Thesis
- Degree
- Master
- Department
Department of AI Convergence, College of Information and Computing
- Advisor
- Kim, Sundong
- Abstract
- Visual grounding (VG) suffers from prohibitive annotation costs because it requires not only precise region labels (i.e., masks or boxes) but also detailed descriptions of those regions. We tackle this annotation bottleneck by formulating active learning (AL) for VG under the realistic setting where the unlabeled pool consists of only raw images without accompanying text. However, estimating sample informativeness without ground-truth text remains challenging, as the model must still assess how well each image disambiguates the referred region from visually similar distractors. To address this, we generate auxiliary region-text pairs using foundation models and introduce Text-Grounded Region Entropy, a new acquisition function that measures whether the model's confidence collapses onto a single region or disperses across multiple candidates. This allows our method to prioritize images with strong cross-region competition, i.e., visually ambiguous yet highly informative ones. We further design a cost-efficient annotation interface that reduces the labor-intensive labeling of both masks and expressions to just a few clicks. In experiments, our AL framework consistently outperforms several AL baselines on RIS and REC benchmarks, while achieving up to 6× faster mask labeling and 1.4× faster text labeling in a user study.
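- The abstract only names the acquisition function, so the following is a minimal sketch of the underlying idea: entropy over a model's confidence distribution across candidate regions, averaged over generated expressions. The exact formula, the softmax normalization, and the names `region_entropy` and `acquisition_score` are illustrative assumptions, not the thesis's definition.

```python
# Sketch of an entropy-based acquisition score over candidate regions,
# assuming Text-Grounded Region Entropy aggregates per-region confidences
# for each auxiliary expression generated for an image. All names and the
# softmax/mean choices below are assumptions for illustration.
import numpy as np

def region_entropy(region_scores: np.ndarray) -> float:
    """Entropy of the confidence distribution over candidate regions.

    region_scores: raw matching scores between one generated expression
    and each candidate region in the image (shape: [num_regions]).
    Higher entropy means confidence is dispersed across regions (strong
    cross-region competition); lower means it collapsed onto one region.
    """
    # Normalize scores into a probability distribution over regions.
    exp = np.exp(region_scores - region_scores.max())
    probs = exp / exp.sum()
    # Shannon entropy; the epsilon guards against log(0).
    return float(-(probs * np.log(probs + 1e-12)).sum())

def acquisition_score(per_expression_scores: list[np.ndarray]) -> float:
    """Image-level score: mean region entropy over all auxiliary
    region-text pairs generated for the image (an illustrative choice)."""
    return float(np.mean([region_entropy(s) for s in per_expression_scores]))

# Example: an expression competing over three similar regions vs. an
# unambiguous one; the ambiguous image is prioritized for labeling.
ambiguous = [np.array([2.0, 1.9, 1.8])]
confident = [np.array([5.0, 0.1, 0.2])]
assert acquisition_score(ambiguous) > acquisition_score(confident)
```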
- URI
- https://scholar.gist.ac.kr/handle/local/33703
- Fulltext
- http://gist.dcollection.net/common/orgView/200000944972