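Title: 基于I-Match算法的垃圾邮件过滤研究
Author: 招立军
Degree type: Master's
Defense date: 2008-06-04
Degree-granting institution: Graduate University of Chinese Academy of Sciences
Place of conferral: Institute of Software, Chinese Academy of Sciences
Supervisor: 淮晓永
Keywords: Computer software, Computer software::Operating systems and operating environments
Alternative title: Research on Spam Filtering Based on I-Match Algorithm
Degree discipline: Computer Application Technology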
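Abstract (translated from Chinese): Electronic mail (E-Mail) is currently the most widely used Internet application. As the Internet has grown at an astonishing rate, e-mail has become a major channel for distributing malicious information, and spam has become the worst scourge threatening the Internet. Because spamming techniques are so varied, a spam filtering system usually has to combine several filtering techniques to remain effective. Digest techniques have become one of the important filtering methods: a digest is used to measure how similar a message is to known spam, and the message is classified accordingly. Judging whether a spam filtering algorithm is effective requires weighing its recall, precision, and time performance together. The I-Match algorithm decides whether the text of two messages is similar by exact matching of their digest values, and it is notably efficient. In practical use, however, I-Match still has many problems, including the way lexicon generation constrains the algorithm's performance and its insufficient robustness under attack. Optimizing lexicon generation and improving robustness are therefore the two key problems in applying the algorithm to a real system. The main work of this thesis is as follows: (1) Analysis of spam similarity, covering the causes of similarity among spam messages and the similarity features spam exhibits in both time and content. These similarity features are one of the necessary conditions for filtering spam with digest algorithms. (2) Improvement of the lexicon generation process of I-Match. The mutual information of features is proposed as the feature selection criterion for lexicon generation, and the effect of several feature selection methods on algorithm performance is compared. (3) Analysis of the robustness of I-Match and of how several improved I-Match variants raise it; the improved algorithms are evaluated on a real e-mail corpus and their practicality is analyzed comprehensively. (4) Completion of the KSpam system prototype, which integrates multiple mail filtering methods as plug-ins, together with an implementation scheme for I-Match within KSpam. The system also implements a new automatic mail recycling function that effectively reduces the corpus collection work of mail administrators.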
Call number: N/A
English abstract: E-mail (electronic mail) is one of the most popular Internet applications. As the Internet grows at an amazing rate, e-mail has become a significant channel for posting malicious information, and spam has become a tumor that harms the health of the Internet. Filtering the many varieties of spam effectively requires combining several filtering technologies. An important one is digest-based filtering, which classifies a message by comparing its digest with those of known spam. Judging the effectiveness of a spam filtering algorithm requires considering its recall, precision, and time performance. Although the I-Match algorithm is efficient because it matches digest values exactly, many problems remain in practical applications, including the overhead introduced by the lexicon generation process and a lack of robustness against spam attacks. Optimizing lexicon generation and improving robustness are therefore the important problems in putting I-Match into practice. The major contributions of this thesis are: (1) An analysis of the similarity of spam, including its causes and its time-specific and content-specific features. This similarity is the premise for filtering spam with digest algorithms. (2) An improved lexicon generation process for I-Match, which uses the mutual information of features for feature selection, with a comparison of the performance of different selection methods. (3) An analysis of the robustness of I-Match and of the robustness gained by refinements of the algorithm, with an evaluation of several improved algorithms on actual e-mail corpora and a comprehensive analysis of their practicality. (4) The KSpam prototype system, which integrates several spam filtering methods as plug-ins and implements I-Match in a real system. In addition, a new automatic e-mail recycling system is presented that effectively reduces the work of collecting e-mail corpora.
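Illustrative sketch (not taken from the thesis): the abstract describes I-Match as exact matching of digest values over a lexicon, with the lexicon chosen by mutual information. The Python below shows one plausible reading of that pipeline. The tokenizer, the document-level mutual information over term presence and spam/ham labels, the SHA-1 digest, and all function names are assumptions made for illustration; the thesis may define each of these differently.

```python
import hashlib
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Return the set of lowercase alphanumeric tokens (hypothetical tokenizer)."""
    return set(TOKEN_RE.findall(text.lower()))


def mutual_information(n11, n10, n01, n00):
    """Mutual information (bits) between term presence and the spam label.

    n11: spam with term, n10: ham with term,
    n01: spam without term, n00: ham without term.
    """
    n = n11 + n10 + n01 + n00
    cells = (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    )
    # Empty cells contribute nothing (0 * log 0 is taken as 0).
    return sum((c / n) * math.log2(n * c / (t * k)) for c, t, k in cells if c)


def build_lexicon(spam_texts, ham_texts, size):
    """Keep the `size` terms with the highest mutual information (assumed criterion)."""
    spam_sets = [tokenize(t) for t in spam_texts]
    ham_sets = [tokenize(t) for t in ham_texts]
    spam_df = Counter(tok for s in spam_sets for tok in s)
    ham_df = Counter(tok for s in ham_sets for tok in s)
    scores = {}
    for term in set(spam_df) | set(ham_df):
        n11, n10 = spam_df[term], ham_df[term]
        scores[term] = mutual_information(
            n11, n10, len(spam_sets) - n11, len(ham_sets) - n10
        )
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:size])


def imatch_digest(text, lexicon):
    """Hash the sorted lexicon terms that occur in the text; None if none occur."""
    kept = sorted(tokenize(text) & lexicon)
    if not kept:
        return None
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


def is_known_spam(text, lexicon, spam_digests):
    """Exact digest match against the digests of previously seen spam."""
    digest = imatch_digest(text, lexicon)
    return digest is not None and digest in spam_digests
```

Because only lexicon terms enter the digest, two messages that differ solely in out-of-lexicon words hash to the same value. The same property hints at the robustness problem the abstract raises: a spammer who inserts or removes a single lexicon term changes the digest and evades the exact match.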
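Release date: 2011-03-17
Classification number: N/A
Content type: Thesis
Source URL: [http://124.16.136.157/handle/311060/6516]
Collection: Institute of Software_National Engineering Research Center of Fundamental Software_Theses
Recommended citation
GB/T 7714
招立军. 基于I-Match算法的垃圾邮件过滤研究[D]. Institute of Software, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences. 2008.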