Enhancing vision-language models through pre-training-free knowledge fusion with TransferCVLM
- Author(s)
- Choi, Dongha; Kim, Jung-jae; Lee, Hyunju
- Type
- Article
- Citation
- KNOWLEDGE-BASED SYSTEMS, v.327
- Issued Date
- 2025-10
- Abstract
- Recent large vision-language multimodal models pre-trained on huge amounts of image-text pairs show remarkable performance on downstream tasks. However, multimodal pre-training is costly in resources and training time when the goal is to obtain new models that surpass existing ones. To address these issues, we propose TransferCVLM, a method for efficiently constructing an advanced vision-language model without extensive multimodal pre-training. TransferCVLM integrates existing pre-trained unimodal models and a cross-modal fusion module into a combinative vision-language model (CVLM). For each task application, the CVLM is fine-tuned and further enhanced through knowledge distillation, in which multimodal knowledge from a teacher vision-language model is transferred to the CVLM. We demonstrate that (1) the fine-tuned CVLM performs comparably to other vision-language models of similar size, (2) the multimodal knowledge transfer consistently enhances the CVLM, and the knowledge-transferred CVLM outperforms the teacher multimodal model on most downstream tasks, and (3) TransferCVLM can also be used for model compression when small unimodal models are used, achieving better performance retention than existing pre-training-based knowledge distillation methods. We estimate that training TransferCVLM takes only 6% of the cost of pre-training other vision-language models. Our code is available at https://github.com/DMCB-GIST/TransferCVLM.
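- Illustration: the sketch below shows the general idea described in the abstract, i.e. combining pre-trained unimodal encoders through a small cross-modal fusion module and training the result with a task loss plus a knowledge-distillation loss from a teacher vision-language model. It is a minimal PyTorch-style sketch under assumed names and shapes (`CrossModalFusion`, `CVLM`, `distillation_loss` are all hypothetical), not the authors' implementation; see the linked GitHub repository for the actual code.

```python
# Minimal, hypothetical sketch of a combinative vision-language model (CVLM):
# pre-trained unimodal encoders + a cross-modal fusion module, trained with a
# task loss blended with a knowledge-distillation loss from a teacher VLM.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalFusion(nn.Module):
    """Single cross-attention block letting text tokens attend to image features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, text_feats, image_feats):
        attended, _ = self.cross_attn(text_feats, image_feats, image_feats)
        x = self.norm(text_feats + attended)
        return self.norm(x + self.ffn(x))


class CVLM(nn.Module):
    """Combinative VLM: existing unimodal encoders joined by a fusion head."""

    def __init__(self, text_encoder, image_encoder, dim: int, num_classes: int):
        super().__init__()
        self.text_encoder = text_encoder    # e.g. a pre-trained language model
        self.image_encoder = image_encoder  # e.g. a pre-trained vision transformer
        self.fusion = CrossModalFusion(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_inputs, image_inputs):
        text_feats = self.text_encoder(text_inputs)     # (B, T, dim)
        image_feats = self.image_encoder(image_inputs)  # (B, P, dim)
        fused = self.fusion(text_feats, image_feats)
        return self.classifier(fused[:, 0])             # predict from first token


def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, tau=2.0):
    """Blend the downstream task loss with a soft-label KD loss from the teacher."""
    task = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau * tau
    return alpha * task + (1 - alpha) * kd
```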
- Publisher
- ELSEVIER
- ISSN
- 0950-7051
- DOI
- 10.1016/j.knosys.2025.113986
- URI
- https://scholar.gist.ac.kr/handle/local/31691