Enhancing vision-language models through pre-training-free knowledge fusion with TransferCVLM
- Author(s)
- Choi, Dongha; Kim, Jung-jae; Lee, Hyunju
- Type
- Article
- Citation
- KNOWLEDGE-BASED SYSTEMS, v.327
- Issued Date
- 2025-10
- Abstract
- Recent large vision-language multimodal models pre-trained on huge amounts of image-text pairs show remarkable performance on downstream tasks. However, multimodal pre-training is costly in resources and training time when the goal is to obtain new models that surpass existing ones. To address these issues, we propose TransferCVLM, a method for efficiently constructing an advanced vision-language model without extensive multimodal pre-training. TransferCVLM integrates existing pre-trained unimodal models and a cross-modal fusion module into a combinative vision-language model (CVLM). For each task application, the CVLM is fine-tuned and further enhanced through knowledge distillation, in which multimodal knowledge from a teacher vision-language model is transferred to the CVLM. We demonstrate that (1) the fine-tuned CVLM performs comparably to other vision-language models of similar size, (2) the multimodal knowledge transfer consistently enhances the CVLM, and the knowledge-transferred CVLM outperforms the teacher multimodal model on most downstream tasks, and (3) TransferCVLM can also be used for model compression when small unimodal models are used, achieving better performance retention than existing pre-training-based knowledge distillation methods. We estimate that training TransferCVLM takes only 6% of the cost of pre-training other vision-language models. Our code is available at https://github.com/DMCB-GIST/TransferCVLM.
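- Illustration: the sketch below shows the general idea described in the abstract, i.e. combining pre-trained unimodal encoders through a small cross-modal fusion module and training the result with a task loss plus a knowledge-distillation loss from a teacher vision-language model. It is a minimal PyTorch-style sketch under assumed names and shapes (`CrossModalFusion`, `CVLM`, `distillation_loss` are all hypothetical), not the authors' implementation; see the linked GitHub repository for the actual code.

```python
# Minimal, hypothetical sketch of a combinative vision-language model (CVLM):
# pre-trained unimodal encoders + a cross-modal fusion module, trained with a
# task loss blended with a knowledge-distillation loss from a teacher VLM.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalFusion(nn.Module):
    """Single cross-attention block letting text tokens attend to image features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, text_feats, image_feats):
        attended, _ = self.cross_attn(text_feats, image_feats, image_feats)
        x = self.norm(text_feats + attended)
        return self.norm(x + self.ffn(x))


class CVLM(nn.Module):
    """Combinative VLM: existing unimodal encoders joined by a fusion head."""

    def __init__(self, text_encoder, image_encoder, dim: int, num_classes: int):
        super().__init__()
        self.text_encoder = text_encoder    # e.g. a pre-trained language model
        self.image_encoder = image_encoder  # e.g. a pre-trained vision transformer
        self.fusion = CrossModalFusion(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_inputs, image_inputs):
        text_feats = self.text_encoder(text_inputs)     # (B, T, dim)
        image_feats = self.image_encoder(image_inputs)  # (B, P, dim)
        fused = self.fusion(text_feats, image_feats)
        return self.classifier(fused[:, 0])             # predict from first token


def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, tau=2.0):
    """Blend the downstream task loss with a soft-label KD loss from the teacher."""
    task = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau * tau
    return alpha * task + (1 - alpha) * kd
```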
- Publisher
- ELSEVIER
- ISSN
- 0950-7051
- DOI
- 10.1016/j.knosys.2025.113986
- URI
- https://scholar.gist.ac.kr/handle/local/31691