NCWP: Unsupervised Semantic Embedding Alignment Post-processing for Improving RAG in Language Models
- Author(s)
- GangHo Lee
- Type
- Thesis
- Degree
- Master
- Department
- Department of AI Convergence, College of Information and Computing (정보컴퓨팅대학 AI융합학과)
- Advisor
- Lee, Yong-Gu
- Abstract
- This paper investigates structural limitations of large language model embeddings in Retrieval-Augmented Generation (RAG) systems, focusing on anisotropy and high dimensionality. When most of the variance is concentrated in a few principal directions, cosine similarity becomes distorted; at the same time, thousand-dimensional vectors incur substantial memory, indexing, and latency costs. Classical post-processing methods such as mean-centering, PCA/LPP, whitening, and random projection can partially restore isotropy by rescaling variances, but, lacking labels, they do not explicitly learn to preserve neighborhood structure or retrieval rankings. Contrastive fine-tuning of encoders can improve retrieval, but it requires updating the entire model and is expensive to deploy.
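A common way to quantify the anisotropy discussed here (a minimal numpy sketch on toy data, not taken from the thesis) is the mean pairwise cosine similarity of an embedding set: it approaches 1 when all vectors share a dominant direction, and mean-centering alone already removes much of that shared component:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "anisotropic" embeddings: a shared dominant direction plus small noise,
# mimicking variance concentrated in a few principal directions.
d, n = 64, 500
common = rng.normal(size=d)
X = common + 0.1 * rng.normal(size=(n, d))

def mean_cosine(X):
    """Average cosine similarity over all distinct pairs (a simple anisotropy score)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    iu = np.triu_indices(len(X), k=1)
    return S[iu].mean()

print(mean_cosine(X))                   # close to 1: highly anisotropic
print(mean_cosine(X - X.mean(axis=0)))  # near 0 after mean-centering
```

On real sentence embeddings the effect is milder than in this toy example, but the same score motivates whitening-style post-processing.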
To address this, the paper proposes NCWP (Neighbour-Contrastive Whitening Projection), a purely post-hoc method that keeps the backbone language model frozen and learns only a single linear projection W. NCWP first applies ZCA-shrink whitening to obtain an approximately isotropic initial space, then constructs positive pairs and hard negatives from k-nearest neighbors and trains W with an InfoNCE-style contrastive loss. Output covariance regularization, orthogonal regularization, and periodic QR retraction prevent collapse and maintain isotropy even at low dimensions. Experiments on a synthetic sentence corpus with STS-based labels (STS-Embed) and traditional IR-based labels (U2/U3 from TF-IDF, Jaccard, and BM25) show that NCWP consistently outperforms PCA-Whitening, LPP, and Random Projection in mAP and nDCG@10, with particularly large gains at low dimensions (r ≤ 64). While the base model exhibits an anisotropy_mean of around 0.39, NCWP reduces this value to near zero across all tested dimensions, and self_sim decreases, indicating stronger separation between non-matching sentences. At the same time, dimensionality reduction with NCWP reduces latency and increases QPS by up to 2–3×, yielding a better quality–efficiency Pareto trade-off than existing post-processing methods. These results demonstrate that NCWP is a practical embedding post-processing strategy for improving RAG retrieval quality without modifying or fine-tuning the underlying language model.
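The pipeline described in the abstract can be sketched roughly as follows. This is an illustrative reconstruction from the abstract alone, not the thesis implementation: the shrinkage coefficient `alpha`, temperature `tau`, target dimension, and toy data are all made-up, and the gradient loop that actually trains W (with covariance and orthogonality regularizers) is omitted — only the forward computations are shown:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 32)) @ rng.normal(size=(32, 32))  # anisotropic toy embeddings

# Step 1: ZCA whitening with shrinkage (alpha is a hypothetical value).
def zca_shrink(X, alpha=0.1, eps=1e-8):
    Xc = X - X.mean(axis=0)
    C = np.cov(Xc, rowvar=False)
    # Shrink the covariance toward a scaled identity before inverting its square root.
    C = (1 - alpha) * C + alpha * (np.trace(C) / C.shape[0]) * np.eye(C.shape[0])
    vals, vecs = np.linalg.eigh(C)
    W_zca = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T  # symmetric (ZCA) form
    return Xc @ W_zca

Z = zca_shrink(X)

# Step 2: positive pairs from k-nearest neighbours in the whitened space.
def knn_positives(Z, k=3):
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    S = Zn @ Zn.T
    np.fill_diagonal(S, -np.inf)            # never pick a point as its own neighbour
    return np.argsort(-S, axis=1)[:, :k]    # indices of the k nearest neighbours per row

pos = knn_positives(Z)

# Step 3: InfoNCE-style loss for a linear projection W (forward pass only).
def info_nce(Z, W, pos, tau=0.07):
    P = Z @ W
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    logits = P @ P.T / tau
    np.fill_diagonal(logits, -np.inf)       # exclude self-similarity from the softmax
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(Z)), pos[:, 0]].mean()  # nearest neighbour as positive

# QR retraction keeps the learned projection column-orthogonal between updates.
W = np.linalg.qr(rng.normal(size=(32, 16)))[0]
print(info_nce(Z, W, pos))
```

In the actual method W would be updated by gradient descent on this loss plus the regularizers; the QR step here only illustrates how retraction restores orthogonality.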
- URI
- https://scholar.gist.ac.kr/handle/local/33792
- Fulltext
- http://gist.dcollection.net/common/orgView/200000952386
- Access and License
-
- File List
-
Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.