OAK

Enhancing vision-language models through pre-training-free knowledge fusion with TransferCVLM

Author(s)
Choi, Dongha; Kim, Jung-jae; Lee, Hyunju
Type
Article
Citation
KNOWLEDGE-BASED SYSTEMS, v.327
Issued Date
2025-10
Abstract
Recent large vision-language multimodal models pre-trained on huge amounts of image-text pairs show remarkable performance on downstream tasks. However, multimodal pre-training has limitations in terms of resources and training time when it comes to obtaining new models that surpass existing ones. To address these issues, we propose TransferCVLM, a method for efficiently constructing an advanced vision-language model without extensive multimodal pre-training. TransferCVLM integrates existing pre-trained unimodal models and a cross-modal fusion module into a combinative vision-language model (CVLM). For each task application, the CVLM is fine-tuned and further enhanced through knowledge distillation, in which multimodal knowledge from a teacher vision-language model is transferred to the CVLM. We demonstrate that (1) the fine-tuned CVLM performs comparably to other vision-language models of similar size, (2) the multimodal knowledge transfer consistently enhances the CVLM, and the knowledge-transferred CVLM outperforms the teacher multimodal model in most downstream tasks, and (3) TransferCVLM can also be used for model compression when small-size unimodal models are used, achieving better retainability than existing pre-training-based knowledge distillation methods. We estimate that training TransferCVLM takes only 6% of the pre-training of other vision-language models. Our code is available at https://github.com/DMCB-GIST/TransferCVLM.
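Illustration. The abstract describes joining pre-trained unimodal encoders with a cross-modal fusion module and then transferring knowledge from a teacher vision-language model via distillation. The following is a minimal sketch of that general pattern in PyTorch; the class names, dimensions, loss weighting, and choice of a logit-level KL distillation term are illustrative assumptions and do not reproduce the authors' actual TransferCVLM implementation (see the linked repository for that).

import torch
import torch.nn as nn
import torch.nn.functional as F

class CombinativeVLM(nn.Module):
    # Hypothetical combinative vision-language model: pre-trained unimodal
    # encoders joined by a small cross-modal fusion module and a task head.
    def __init__(self, vision_encoder, text_encoder, hidden=768, num_classes=3129):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a pre-trained image encoder (assumed)
        self.text_encoder = text_encoder       # e.g. a pre-trained text encoder (assumed)
        fusion_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=2)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, image, text_ids):
        v = self.vision_encoder(image)         # (B, Nv, hidden) patch features
        t = self.text_encoder(text_ids)        # (B, Nt, hidden) token features
        fused = self.fusion(torch.cat([v, t], dim=1))
        return self.head(fused[:, 0])          # pooled logits for the downstream task

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Task cross-entropy plus KL divergence to the teacher's softened logits;
    # the temperature T and mixing weight alpha are illustrative assumptions.
    task = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return (1 - alpha) * task + alpha * kd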
Publisher
ELSEVIER
ISSN
0950-7051
DOI
10.1016/j.knosys.2025.113986
URI
https://scholar.gist.ac.kr/handle/local/31691
Access and License
  • Access type: Open
File List
  • No related files are available.

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.