A comparative study on feature weight in text categorization
Deng, ZH ; Tang, SW ; Yang, DQ ; Zhang, M ; Li, LY ; Xie, KQ
2004
English abstract: Text categorization is the process of automatically assigning predefined categories to free-text documents. Feature weighting, which calculates feature (term) values in documents, is one of the important preprocessing techniques in text categorization. This paper is a comparative study of feature weighting methods in statistical learning of text categorization. Four methods were evaluated: tf*idf, tf*CRF, tf*OddsRatio, and tf*CHI. We evaluated these methods on the Reuters-21578 benchmark collection with Support Vector Machine (SVM) classifiers. We found that tf*CHI is the most effective in our experiments. Using tf*CHI with an SVM classifier yielded a very high classification accuracy (87.5% for micro-average F-1 and 87.8% for micro-average break-even point). tf*idf, which is widely used in text categorization, compares favorably with tf*CRF but is not as effective as tf*CHI and tf*OddsRatio.
Computer Science, Information Systems; Computer Science, Software Engineering; Computer Science, Theory & Methods; SCI(E); CPCI-S(ISTP); 10
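The abstract names the weighting schemes but does not give their formulas. As a rough illustration only, the sketch below contrasts tf*idf with tf*CHI on a toy corpus. It assumes the chi-square statistic commonly used for term/category association (computed from a 2x2 contingency table) and a plain log(N/df) idf; the paper's exact definitions may differ, and tf*CRF and tf*OddsRatio are not shown. All function names and the four toy documents are hypothetical.

```python
import math
from collections import Counter


def chi_square(n_docs, a, b, c, d):
    """Chi-square statistic for a term/category 2x2 contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n_docs * (a * d - c * b) ** 2 / denom


def term_category_chi(docs, labels, category):
    """Return {term: chi-square score} for one category (assumed formula)."""
    n = len(docs)
    in_cat = sum(1 for y in labels if y == category)
    df_all, df_cat = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        for t in set(tokens):  # document frequency: count each term once per doc
            df_all[t] += 1
            if y == category:
                df_cat[t] += 1
    scores = {}
    for t, df in df_all.items():
        a = df_cat[t]
        b = df - a
        c = in_cat - a
        d = n - in_cat - b
        scores[t] = chi_square(n, a, b, c, d)
    return scores


def idf(docs):
    """Return {term: log(N / df)}, one common idf variant."""
    n = len(docs)
    df = Counter(t for tokens in docs for t in set(tokens))
    return {t: math.log(n / k) for t, k in df.items()}


def weight(tokens, global_weight):
    """Multiply raw term frequency by a per-term global weight."""
    tf = Counter(tokens)
    return {t: f * global_weight.get(t, 0.0) for t, f in tf.items()}


# Hypothetical toy corpus, not taken from Reuters-21578.
docs = [
    ["wheat", "grain", "export"],
    ["wheat", "harvest", "grain", "grain"],
    ["oil", "crude", "export"],
    ["oil", "price", "crude"],
]
labels = ["grain", "grain", "oil", "oil"]

chi = term_category_chi(docs, labels, "grain")
idf_w = idf(docs)
tf_chi = weight(docs[1], chi)    # tf*CHI vector for the second document
tf_idf = weight(docs[1], idf_w)  # tf*idf vector for the same document
```

In this toy corpus "export" and "wheat" each occur in two of the four documents, so they receive the same idf, but "export" is spread evenly across both categories and gets a chi-square score of zero. This is the kind of class information that a supervised weight such as CHI can use and idf cannot.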
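Language: English
Content type: Other
Source URL: http://ir.pku.edu.cn/handle/20.500.11897/292265
Collection: School of Information Science and Technology (信息科学技术学院)
Recommended citation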
GB/T 7714
Deng, ZH,Tang, SW,Yang, DQ,et al. A comparative study on feature weight in text categorization. 2004-01-01.