CDIM: Document Clustering by Discrimination Information Maximization

Abstract
Ideally, document clustering methods should produce clusters that are semantically relevant and readily understandable as collections of documents belonging to particular contexts or topics. However, existing popular document clustering methods often ignore term-document corpus-based semantics while relying upon generic measures of similarity. In this paper, we present CDIM, an algorithmic framework for partitional clustering of documents that maximizes the sum of the discrimination information provided by documents. CDIM exploits the semantic that term discrimination information provides better understanding of contextual topics than term-to-term relatedness to yield clusters that are describable by their highly discriminating terms. We evaluate the proposed clustering algorithm using well-known discrimination/semantic measures including Relative Risk (RR), Measurement of Discrimination Information (MDI), Domain Relevance (DR), and Domain Consensus (DC) on twelve data sets to prove that CDIM produces high-quality clusters comparable to the best methods. We also illustrate the understandability and efficiency of CDIM, suggesting its suitability for practical document clustering. (C) 2015 Elsevier Inc. All rights reserved.
Author(s)
Hassan, Malik Tahir; Karim, Asim; Kim, Jeong-Bae; Jeon, Moongu
Issued Date
2015-09
Type
Article
DOI
10.1016/j.ins.2015.04.009
URI
https://scholar.gist.ac.kr/handle/local/14597
Publisher
ELSEVIER SCIENCE INC
Citation
INFORMATION SCIENCES, v.316, pp.87-106
ISSN
0020-0255
Appears in Collections:
Department of Electrical Engineering and Computer Science > 1. Journal Articles
Access and License
  • Access type: Open
File List
  • No related files exist.

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.