Revisiting data imbalance in token-based self-supervised learning

Author(s)
Han, Daeyoung; Jung, Hyung Rok; Li, Tianhong; Katabi, Dina; Son, Jeany; Kim, Hong Kook; Jeon, Moongu
Type
Article
Citation
Neurocomputing, v.682
Issued Date
2026-06
Abstract
Token-based self-supervised learning (SSL) has emerged as a powerful paradigm for leveraging large-scale unlabeled data, yet it suffers from a previously overlooked challenge: token-class imbalance. We show that visual token distributions are highly skewed; a small subset of frequent tokens, often representing uninformative backgrounds, dominates the training process. Conversely, semantically rich but rare tokens are severely underrepresented. This imbalance distorts the learning objective, hindering the model's ability to learn robust representations and impairing generalization. To address this, we introduce two solutions adapted from imbalanced learning: a class-balanced cross-entropy loss that re-weights the training signal based on token rarity, and semantic-aware label smoothing (SLS), a novel regularization technique that leverages token embedding similarity to create more meaningful soft targets. We validate our methods on MAGE for representation learning and MaskGIT for image generation. Our experiments demonstrate that these techniques significantly enhance both discriminative and generative performance, evidenced by improved linear separability of the representation space and better mode coverage in image synthesis, respectively. This study underscores the necessity of mitigating token-class imbalance, offering scalable solutions that contribute to more robust and generalizable visual learning.
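The two remedies named in the abstract can be illustrated with a minimal sketch. The PyTorch snippet below is not the authors' released implementation: `token_counts` (per-token frequencies), `codebook` (the token embedding table), and all hyperparameters are assumed, illustrative names, and the effective-number re-weighting and similarity-based soft targets merely stand in for the paper's class-balanced cross-entropy and semantic-aware label smoothing.

```python
# Illustrative sketch only; assumes a PyTorch setup with precomputed
# per-token frequencies (token_counts, shape [K]) and a token embedding
# table (codebook, shape [K, D]). Names and defaults are hypothetical.
import torch
import torch.nn.functional as F

def class_balanced_weights(token_counts: torch.Tensor, beta: float = 0.999) -> torch.Tensor:
    """Effective-number style re-weighting: rarer tokens receive larger weights."""
    effective_num = 1.0 - torch.pow(beta, token_counts.float())
    weights = (1.0 - beta) / effective_num.clamp_min(1e-8)
    return weights * (len(token_counts) / weights.sum())  # normalize to mean 1

def semantic_soft_targets(codebook: torch.Tensor, targets: torch.Tensor,
                          temperature: float = 0.1, eps: float = 0.1) -> torch.Tensor:
    """Soft targets from token-embedding similarity instead of uniform smoothing."""
    sim = codebook[targets] @ codebook.t()            # (N, K) similarity to all tokens
    soft = F.softmax(sim / temperature, dim=-1)       # spread mass onto similar tokens
    hard = F.one_hot(targets, codebook.size(0)).float()
    return (1.0 - eps) * hard + eps * soft

def balanced_sls_loss(logits: torch.Tensor, targets: torch.Tensor,
                      token_counts: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Soft cross-entropy with rarity-based re-weighting of each target token."""
    weights = class_balanced_weights(token_counts).to(logits.device)
    soft = semantic_soft_targets(codebook, targets)
    log_probs = F.log_softmax(logits, dim=-1)
    per_token = -(soft * log_probs).sum(dim=-1)       # soft CE per masked position
    return (weights[targets] * per_token).mean()      # up-weight rare target tokens
```

In such a setup, the loss would replace the standard cross-entropy over masked-token predictions in a MAGE- or MaskGIT-style training loop, with `token_counts` estimated once from tokenized training images.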
Publisher
Elsevier BV
ISSN
0925-2312
DOI
10.1016/j.neucom.2026.133408
URI
https://scholar.gist.ac.kr/handle/local/34014
Access & License
  • Access type: Open
File List
  • No related files exist.

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.