
Leveraging contrastive learning for cross-modal (X-modal) person identification

Author(s)
Fatima, Unse; Khan, Zafran; Indyk, Piotr; Yow, Kin Choong; Jeon, Moongu
Type
Article
Citation
Neural Computing and Applications, v.38, no.4
Issued Date
2026-02
Abstract
Person identification plays a critical role in event detection, person tracking, and public security. Over the years, various methods have been introduced for this purpose, such as face identification/recognition and face retrieval. Typically, these methods are designed to classify a query image as a specific identity within an image database. However, this approach encounters significant limitations in scenarios where only a textual description of the query or an attribute set is available. To address such challenges, we propose a contrastive-learning-based cross-modal (X-Modal) framework tailored for face recognition and retrieval with two query modalities: textual queries and combined text-image queries. The framework performs person identification even when given reference images whose appearance has changed, with the modifications described in text. It leverages vision transformers and convolutional neural networks (ResNet-50) for visual feature extraction, while sentence transformers are employed for textual embeddings. The proposed framework is driven by deep contrastive learning, which aligns visual and textual representations within a shared embedding space for retrieval tasks. The model is trained with a purpose-designed Multiple Negative Ranking Loss (MNRL) that accounts for easy, semi-hard, and hard query difficulty levels. In extensive experiments, the X-Modal framework outperforms state-of-the-art approaches both quantitatively and qualitatively. The proposed approach is tested on the CelebA and LFW datasets, achieving R@10 of 73.8% and 73.3%, respectively. It enhances the versatility of person identification techniques across various real-world scenarios, including video surveillance and suspect identification based on eyewitness descriptions. © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2026.
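For readers unfamiliar with the training objective and metric named in the abstract, the following is a minimal sketch of a standard in-batch multiple negatives ranking loss and of R@k, written with NumPy. It is not the authors' implementation: the abstract does not specify how the easy, semi-hard, and hard difficulty levels enter the loss, so that part is omitted. The function names, the `scale` value, and the assumption that row i of the text embeddings matches row i of the image embeddings are all illustrative choices made here.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def mnr_loss(text_emb, image_emb, scale=20.0):
    """In-batch multiple negatives ranking loss (assumed standard form).

    Row i of text_emb is assumed to describe row i of image_emb; every
    other image in the batch serves as a negative for that text query.
    `scale` is a hypothetical temperature-like factor, not from the paper.
    """
    t = l2_normalize(np.asarray(text_emb, dtype=float))
    v = l2_normalize(np.asarray(image_emb, dtype=float))
    logits = scale * (t @ v.T)                      # shape (B, B)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the matching pair (the diagonal) as the target.
    return float(-np.mean(np.diag(log_probs)))


def recall_at_k(text_emb, image_emb, k=10):
    """Fraction of text queries whose matching image is among the top k."""
    t = l2_normalize(np.asarray(text_emb, dtype=float))
    v = l2_normalize(np.asarray(image_emb, dtype=float))
    sims = t @ v.T
    top_k = np.argsort(-sims, axis=1)[:, :k]        # most similar first
    targets = np.arange(len(t))[:, None]
    return float((top_k == targets).any(axis=1).mean())
```

With well-aligned embeddings the loss is close to zero and R@k approaches 1; with mismatched pairs the loss grows and R@k drops, which is the behaviour a shared embedding space is trained to avoid.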
Publisher
Springer Science and Business Media Deutschland GmbH
ISSN
0941-0643
DOI
10.1007/s00521-025-11693-6
URI
https://scholar.gist.ac.kr/handle/local/33867
Access and License
  • Access type: Open
File List
  • No related files exist.

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.