Leveraging contrastive learning for cross-modal (X-modal) person identification

Author(s): Fatima, Unse; Khan, Zafran; Indyk, Piotr; Yow, Kin Choong; Jeon, Moongu

Type: Article

Citation: Neural Computing and Applications, v.38, no.4

Issued Date: 2026-02

Abstract: Person identification plays a critical role in event detection, person tracking, and public security. Over the years, various methods have been introduced for this purpose, such as face identification/recognition and person face retrieval. Typically, these methods are designed to classify a query image as a specific identity within an image database. However, this approach encounters significant limitations in scenarios where only a textual description of the query or an attribute set is available. To address such challenges, we propose a contrastive-learning-based cross-modal (X-Modal) framework tailored for face recognition and retrieval with two query modalities: textual queries and combined text-image queries. The framework performs well in person identification tasks even when presented with reference images whose appearance has changed, with the modifications described textually. It leverages vision transformers and convolutional neural networks (ResNet-50) for visual feature extraction, while sentence transformers are employed for textual embeddings. The proposed framework is driven by deep contrastive learning principles, enabling it to align visual and textual representations within a shared embedding space for retrieval tasks. The model is trained with a specially designed Multiple Negative Ranking Loss (MNRL) that accounts for easy, semi-hard, and hard query difficulty levels. In extensive experiments, the X-Modal framework outperforms state-of-the-art approaches both quantitatively and qualitatively. The proposed approach is tested on the CelebA and LFW datasets, achieving R@10 of 73.8% and 73.3%, respectively. It enhances the versatility of person identification techniques across various real-world scenarios, including video surveillance and suspect identification based on eyewitness descriptions. © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2026.
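The abstract's two central technical ideas, a ranking loss with in-batch negatives and the R@10 retrieval metric, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `scale` value, and the toy two-dimensional embeddings are all assumptions made for illustration, and the easy/semi-hard/hard difficulty handling described in the abstract is not reproduced here. The sketch only assumes that text query i is paired with gallery image i and that every other image in the batch serves as a negative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_ranking_loss(text_embs, image_embs, scale=20.0):
    """Generic in-batch-negatives ranking loss (hypothetical helper).

    Text i is assumed to match image i; all other images in the batch
    act as negatives. The loss is the mean cross-entropy over the rows
    of the scaled cosine-similarity matrix. `scale` is an assumed value.
    """
    total = 0.0
    for i, text in enumerate(text_embs):
        logits = [scale * cosine(text, image) for image in image_embs]
        # Numerically stable log-sum-exp over the row.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum - logits[i]
    return total / len(text_embs)


def recall_at_k(query_embs, gallery_embs, true_indices, k):
    """Fraction of queries whose true gallery item ranks in the top k."""
    hits = 0
    for query, target in zip(query_embs, true_indices):
        ranked = sorted(
            range(len(gallery_embs)),
            key=lambda j: cosine(query, gallery_embs[j]),
            reverse=True,
        )
        if target in ranked[:k]:
            hits += 1
    return hits / len(query_embs)


# Toy data: two text embeddings and two image galleries (illustrative only).
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[0.9, 0.1], [0.1, 0.9]]   # image i is close to text i
swapped_images = [[0.1, 0.9], [0.9, 0.1]]   # image i is close to the other text

loss_aligned = in_batch_ranking_loss(texts, aligned_images)
loss_swapped = in_batch_ranking_loss(texts, swapped_images)
recall_aligned = recall_at_k(texts, aligned_images, [0, 1], k=1)
recall_swapped = recall_at_k(texts, swapped_images, [0, 1], k=1)
```

On this toy data the loss is lower when matching text and image embeddings are close in the shared space, and recall at k=1 drops when they are not. The reported R@10 figures correspond to the same metric with k=10 over a much larger gallery.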

Publisher: Springer Science and Business Media Deutschland GmbH

ISSN: 0941-0643

DOI: 10.1007/s00521-025-11693-6

URI: https://scholar.gist.ac.kr/handle/local/33867

Appears in Collections: Department of Electrical Engineering and Computer Science > 1. Journal Articles

Access and License

Access type: Open
