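Title: Research on Fine-Grained Opinion Mining Methods for Large-Scale Internet Data (面向大规模互联网数据的细粒度观点挖掘方法研究)
Author: Xu Liheng (徐立恒)
Degree type: Doctor of Engineering
Defense date: 2014-05-29
Degree-granting institution: University of Chinese Academy of Sciences
Place of conferral: Institute of Automation, Chinese Academy of Sciences
Supervisor: Zhao Jun (赵军)
Keywords: Opinion Mining; Sentiment Polarity Analysis; Opinion Word Extraction; Opinion Target Extraction; Product Feature Mining
Alternative title: Fine-Grained Opinion Mining Methods for Large-Scale Online Reviews
Major: Computer Application Technology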
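Abstract (translated from Chinese): With the rapid expansion of the mobile internet, online shopping has greatly improved people's quality of life. Against this background, many e-commerce websites provide product review platforms so that users can share their experience with products and rate their satisfaction. These reviews are an important reference for both consumers and businesses. However, their sheer volume makes manual reading impractical, which has motivated automatic opinion mining systems.

Opinion mining studies methods for automatically analyzing product review text and summarizing users' sentiment toward each function of a product. The opinion information mined in this thesis consists of two parts: opinion words (words that express a user's sentiment polarity) and opinion targets (usually functions or attributes of a product). Conventional opinion mining methods rely mainly on dependency parsing and extract opinion information by capturing the modification relations between opinion words and opinion targets. Syntax-based methods, however, have many problems. This thesis addresses the shortcomings of existing syntax-based methods and studies automatic opinion word and opinion target extraction for large-scale online review text. The specific contributions are as follows:

(1) This thesis proposes a two-step algorithm that remedies some shortcomings of conventional syntax-based opinion mining. Conventional methods depend on many syntactic patterns; because patterns differ in accuracy, low-quality patterns tend to introduce many noise words. To address this, the first step merges the syntactic patterns into an opinion relation graph and estimates a confidence for each pattern, so that low-quality patterns receive low confidence. In addition, conventional methods tend to rank candidates by term frequency, which cannot filter out high-frequency noise words and easily loses low-frequency words. To address this, the second step filters the opinion target list with a semi-supervised binary classifier, so that the algorithm does not depend on term frequency. Experiments show that the first step effectively improves precision and the second step effectively reduces the adverse effects of term frequency.

(2) This thesis proposes using a monolingual word alignment model in place of syntactic parsing tools. Existing parsers are often not accurate enough on complex online review text. To address this, the thesis uses a monolingual word alignment model that captures the modification relations between opinion words and opinion targets through unsupervised word co-occurrence statistics. Compared with syntax-based methods, the word alignment model effectively reduces erroneous modification relations when analyzing colloquial text and effectively improves recall. However, unsupervised word alignment models are easily affected by insufficient training data. The thesis therefore further proposes an opinion mining algorithm based on a semi-supervised word alignment model, which fuses a subset of reliable dependency relations with the word alignment model. Experiments show that this effectively improves performance on small corpora.

(3) This thesis proposes using word embedding learning in place of syntactic parsing tools. Existing syntax-based methods treat words as discrete variables, which makes them prone to data sparsity. To address this, the thesis introduces word embedding learning in place of parsing to capture contextual semantics. Because semantically similar words have similar word vectors, the adverse effects of data sparsity can be effectively reduced. The thesis also uses word vector distance to measure semantic similarity between words, replacing the pattern-word co-occurrence relations used in conventional graph-based methods. Experiments show that, for product feature extraction, word vector distance significantly outperforms pattern-word co-occurrence.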
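As an illustration of contribution (3), the sketch below shows how a distance between word vectors can stand in for a semantic similarity score. It is only a minimal example under stated assumptions: the abstract does not say which distance the thesis uses, so cosine similarity is assumed here, and the three-dimensional vectors, the function names `cosine_similarity` and `rank_by_similarity`, and the example words are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(seed, candidates, vectors):
    """Sort candidate words by similarity to a seed word, most similar first."""
    return sorted(
        candidates,
        key=lambda word: cosine_similarity(vectors[seed], vectors[word]),
        reverse=True,
    )


# Toy vectors invented for illustration; real embeddings would be learned from review text.
toy_vectors = {
    "screen": [0.9, 0.1, 0.0],
    "display": [0.8, 0.2, 0.1],
    "yesterday": [0.0, 0.1, 0.9],
}

ranking = rank_by_similarity("screen", ["yesterday", "display"], toy_vectors)
```

Because similar words sit close together in the vector space, a rare word such as "display" can still be scored through its proximity to a known product feature, which is the sparsity argument made in the abstract.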
Abstract (English): With the rapid growth of the mobile internet, online shopping has greatly improved life for consumers. Against this background, many e-commerce websites provide online review platforms where consumers can share their purchase experiences and opinions on products. These reviews are of great value to both consumers and business organizations. However, manually reading through large volumes of review text is an arduous task, which is why automatic opinion mining systems have emerged. Generally, opinion mining systems summarize consumers' opinions through automatic analysis of review texts. In this thesis, we mainly focus on mining opinion words (terms that indicate sentiment polarity) and opinion targets (often attributes or functions of products). Conventional opinion mining methods often rely on syntactic dependency parsing to capture modification relations between opinion words and opinion targets, which has many limitations. This thesis provides several opinion mining methods to overcome the shortcomings of conventional syntax-based opinion mining systems. The main contents and contributions of this thesis include:

(1) This thesis proposes a two-stage method to improve conventional syntax-based opinion mining methods. Previous works often use many syntactic patterns to mine opinion words and opinion targets. However, some patterns are of low quality and may introduce many noise terms. To alleviate this issue, we incorporate syntactic patterns in a Sentiment Graph and apply a random walk on the graph to estimate the confidence of each pattern. In this way, low-quality patterns receive low confidence, which improves accuracy. On the other hand, previous works tend to rank candidates by term frequency, which may introduce high-frequency noise terms and lose low-frequency opinion terms. To solve this problem, we employ a semi-supervised binary classifier to refine opinion targets, which does not rely on term frequencies to rank candidates. Experimental results show that the first stage effectively improves precision and the second stage significantly reduces the adverse effects of term frequencies.

(2) This thesis introduces a monolingual word alignment model, which replaces the syntactic parser for capturing opinion relations. Current syntactic parsers can easily suffer from informal expressions in online reviews. To tackle this problem, instead of using syntactic parsers, this thesis employs an unsupervised monol...
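The pattern-confidence idea in contribution (1) can be illustrated with a small random walk over a bipartite graph of words and syntactic patterns. This is a sketch under stated assumptions, not the thesis's algorithm: the abstract does not give the graph construction or update rule, so a random walk with restart from a handful of trusted seed words is assumed, and the function name `random_walk_confidence`, the restart probability, the pattern labels, and the counts are all hypothetical.

```python
from collections import defaultdict


def random_walk_confidence(edges, seeds, restart=0.2, iterations=100):
    """Random walk with restart on a bipartite word-pattern graph.

    edges: iterable of (word, pattern, count) triples with positive counts.
    seeds: words trusted a priori; the walk restarts from them.
    Returns (word_scores, pattern_scores) as two dicts.
    """
    # Store each co-occurrence as a weighted edge in both directions.
    neighbors = defaultdict(dict)
    for word, pattern, count in edges:
        neighbors[("word", word)][("pattern", pattern)] = count
        neighbors[("pattern", pattern)][("word", word)] = count

    seed_nodes = {("word", s) for s in seeds if ("word", s) in neighbors}
    if not seed_nodes:
        raise ValueError("at least one seed word must occur in the graph")

    # Restart distribution: uniform over the seed words.
    prior = {node: 0.0 for node in neighbors}
    for node in seed_nodes:
        prior[node] = 1.0 / len(seed_nodes)

    score = dict(prior)
    for _ in range(iterations):
        updated = {node: restart * prior[node] for node in neighbors}
        for node, out_edges in neighbors.items():
            total = sum(out_edges.values())
            # Each node passes its score to neighbors in proportion to edge weight.
            for target, weight in out_edges.items():
                updated[target] += (1.0 - restart) * score[node] * weight / total
        score = updated

    words = {name: s for (kind, name), s in score.items() if kind == "word"}
    patterns = {name: s for (kind, name), s in score.items() if kind == "pattern"}
    return words, patterns


# Toy counts invented for illustration: "amod" mostly links to real product features,
# while "conj" mostly links to the generic noise word "thing".
toy_edges = [
    ("screen", "amod", 5),
    ("battery", "amod", 4),
    ("screen", "conj", 1),
    ("thing", "conj", 6),
]
word_scores, pattern_scores = random_walk_confidence(toy_edges, ["screen", "battery"])
```

In this toy graph the pattern that mostly connects to seed words accumulates more score than the pattern that mostly connects to the noise word, which mirrors the stated goal of giving low-quality patterns low confidence.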
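Language: Chinese
Other identifier: 201118014629094
Content type: Dissertation
Source URL: [http://ir.ia.ac.cn/handle/173211/6643]
Collection: Graduates_Doctoral Dissertations
Recommended citation
GB/T 7714
Xu Liheng. Research on Fine-Grained Opinion Mining Methods for Large-Scale Internet Data [D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences. 2014.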
 
