KnowSim: A Document Similarity Measure on Structured Heterogeneous Information Networks
Wang Chenguang ; Song Yangqiu ; Li Haoran ; Zhang Ming ; Han Jiawei
Publication title: Proceedings. IEEE International Conference on Data Mining
Year: 2015
DOI: 10.1109/ICDM.2015.131
Indexed in: EI; PubMed
Pages: 1015-1020
Abstract (English): As a fundamental task, document similarity measurement has a broad impact on document-based classification, clustering, and ranking. Traditional approaches represent documents as bags of words and compute document similarities using measures such as cosine, Jaccard, and Dice. However, entity phrases, rather than single words, can be critical for evaluating document relatedness. Moreover, the types of entities and the links between entities and words are also informative. We propose a method to represent a document as a typed heterogeneous information network (HIN), in which the entities and relations are annotated with types. Multiple documents can be linked by the words and entities in the HIN. Consequently, we convert the document similarity problem into a graph distance problem. Intuitively, there can be multiple paths between a pair of documents. We propose to use the meta-paths defined in the HIN to compute the distance between documents. Instead of burdening users with defining meaningful meta-paths, an automatic method is proposed to rank the meta-paths. Given the meta-paths and their associated ranking scores, an HIN-based similarity measure, KnowSim, is proposed to compute document similarities. Using Freebase, a well-known world knowledge base, to conduct semantic parsing and construct HINs for documents, our experiments on the 20Newsgroups and RCV1 datasets show that KnowSim generates impressively high-quality document clustering.
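The abstract describes the idea but does not state the KnowSim formula, so the following is only a minimal sketch of a weighted meta-path similarity between documents. It assumes a PathSim-style normalization (shared path instances divided by each document's self-path instances), one symmetric Document-T-Document meta-path per node type T, and hand-picked weights in place of the automatic meta-path ranking. The documents, entities, type names, weights, and function names are all hypothetical.

```python
from collections import Counter

# Hypothetical toy network: for each document, the typed nodes it links to
# and how many times each link occurs.
docs = {
    "d1": {"Person": Counter({"Barack Obama": 2}),
           "Word": Counter({"election": 1, "vote": 1})},
    "d2": {"Person": Counter({"Barack Obama": 1}),
           "Word": Counter({"vote": 2})},
    "d3": {"Word": Counter({"hockey": 3})},
}

# Assumed weights, one per Document-T-Document meta-path.
# The paper ranks meta-paths automatically; these values are made up.
weights = {"Person": 0.7, "Word": 0.3}


def path_count(a, b, node_type):
    """Count path instances a -> x -> b over all nodes x of the given type."""
    links_a = docs[a].get(node_type, Counter())
    links_b = docs[b].get(node_type, Counter())
    return sum(n * links_b[x] for x, n in links_a.items())


def weighted_path_total(a, b):
    """Sum the path counts over all meta-paths, scaled by meta-path weight."""
    return sum(w * path_count(a, b, t) for t, w in weights.items())


def weighted_similarity(a, b):
    """PathSim-style score in [0, 1]; an assumed stand-in for KnowSim."""
    denom = weighted_path_total(a, a) + weighted_path_total(b, b)
    return 2 * weighted_path_total(a, b) / denom if denom else 0.0


if __name__ == "__main__":
    for pair in [("d1", "d2"), ("d1", "d3"), ("d1", "d1")]:
        print(pair, round(weighted_similarity(*pair), 4))
```

In this sketch, documents that share a typed entity score higher than documents that only share a word, because the entity meta-path carries the larger weight. Documents with no shared nodes score zero.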
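Language: English
Content type: Journal article
Source URL: http://ir.pku.edu.cn/handle/20.500.11897/434289
Collection: School of Electronics Engineering and Computer Science (信息科学技术学院), Peking University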
Recommended citation
GB/T 7714
Wang Chenguang, Song Yangqiu, Li Haoran, et al. KnowSim: A Document Similarity Measure on Structured Heterogeneous Information Networks[J]. Proceedings. IEEE International Conference on Data Mining, 2015.
APA Wang Chenguang, Song Yangqiu, Li Haoran, Zhang Ming, & Han Jiawei. (2015). KnowSim: A Document Similarity Measure on Structured Heterogeneous Information Networks. Proceedings. IEEE International Conference on Data Mining.
MLA Wang Chenguang, et al. "KnowSim: A Document Similarity Measure on Structured Heterogeneous Information Networks." Proceedings. IEEE International Conference on Data Mining (2015).