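Title: 特定领域网络信息处理关键技术研究
Author: 李蕾 (Li Lei)
Degree: Doctoral
Defense date: 2007-05-25
Degree-granting institution: Institute of Acoustics, Chinese Academy of Sciences
Place of conferral: Institute of Acoustics
Keywords: Web technology; information extraction; fast Fourier transform; information syndication; interactive knowledge system
Title (English): Research on Key Technologies for Domain-Specific Network Information Processing
Major: Signal and Information Processing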
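Chinese abstract (translated): With the rapid development and spread of the Internet and Web technologies, information acquisition has moved from manual acquisition, through computer-based acquisition, to network-based acquisition. In the computer-based period, domain-specific information processing mainly meant professionals in a field acquiring, integrating and retrieving information with the help of domain-specific software. In the network era, more and more kinds of information are published on the network, and people can interact and cooperate on a single platform. How to make domain-specific information processing more professionally accurate, richer in function, and more convenient and user-friendly in this network-centric era has therefore become a research topic of considerable value.

Picking out the required information from the vast network world requires the main tool of modern information acquisition, the search engine. Traditional search engines, however, return too much information, answer queries imprecisely, and lack depth, so they cannot meet the needs of domain-specific information processing. This thesis therefore combines the characteristics of new media with the ideas of Web 2.0, studies the key technologies of network information processing, and proposes an overall network information processing solution for a specific domain (travel information services) that covers information collection, extraction, query, retrieval, syndication and presentation. The system exploits the geographic location features found in Internet information, improves the organization of retrieval results, raises precision, and enriches the retrieval modes. The domain-specific network information processing system emphasizes acquiring, managing and sharing useful information, values user experience and user participation, adapts better to the new characteristics of network information, and meets users' personalized needs.

The main work and results of this thesis are as follows:

1. A vertical search engine system model and a strongly structured processing method are designed. By analyzing the page structure of "Content-Dominated" Web pages, an algorithm based on the fast Fourier transform (FFT) for extracting the useful information of a Web page is proposed and implemented. The algorithm segments the page into windows and finds the best body-text interval using statistical principles and the FFT. Experiments show that the method extracts the useful information of "Content-Dominated" pages fairly accurately, and that it extracts the body text without analyzing the structure of any particular page, so it generalizes well.

2. A two-dimensional network information syndication model is designed and implemented, which enriches the retrieval modes and improves the organization of retrieval results. The architecture description, the system design for the travel domain, and the experimental demonstration show that combining a two-dimensional geographic model with network information syndication not only enriches the syndication modes but also improves the user's interactive experience and raises retrieval efficiency.

3. An interactive question-and-answer knowledge system is built. It incorporates Web 2.0 ideas such as user participation, user contribution and user experience, and uses natural language understanding to search the knowledge base intelligently and select the best answer automatically, so that users can find answers to their questions conveniently, quickly and accurately. Users' ratings and feedback are used to improve the dynamic knowledge base.

4. The results of the thesis are applied to the design and implementation of a commercial platform, a new-media value-added travel service website. As the research and practice platform for the network information processing system in this thesis, this application also clarified the research direction and application prospects. The website has been in operation since June 2006 and averages more than 30,000 unique visiting IPs per day (as of the end of April 2007). Using syndication, two-dimensional syndication and vertical search, it provides rich and effective travel information and search functions. Users can use a personal computer or mobile phone to search for attractions, hotels, tour routes, air tickets, train tickets and other travel information, and to interact with one another in real time.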
English abstract: With the rapid development and wide adoption of the Internet and Web technologies, information retrieval has evolved through three phases: manual retrieval, computer-aided retrieval, and network-based retrieval. In the computer-aided phase, domain-specific information processing meant professional staff using domain-specific software to extract, integrate and retrieve information. In the age of network-centric information retrieval, massive amounts of information are published on the Internet, and people interact and cooperate on a common platform. An important research topic in this age is therefore how to make domain-specific information processing more professionally accurate, richer in functionality, and easier and more personalized to use.

To fetch the information of interest from the massive network cyberspace, a search engine, the key tool of modern information retrieval, is indispensable. However, the traditional search engine does not meet the requirements of domain-specific information processing, because it returns overabundant results that are inaccurate in the domain context. This thesis studies the key technologies of network information processing. By combining the features of new media and Web 2.0, it proposes a complete solution for network information processing in a specific domain (travel information services), covering information acquisition, extraction, enquiry, retrieval, syndication and presentation. The solution improves the presentation of retrieval results, improves search accuracy, and enriches the retrieval patterns. Compared with the traditional search engine, the domain-specific network information processing system adapts better to the demand for user personalization, and useful information is easy to obtain, manage and share.

The main contributions of this thesis are:

1. A vertical search engine system model is designed and a strongly structured processing method is proposed. An FFT-based algorithm for extracting the useful information of a Web page is proposed and implemented, based on an analysis of the structural characteristics of "Content-Dominated" Web pages. By applying window segmentation, statistics and the FFT, the algorithm selects the best text range and presents the result. The experimental results show that the algorithm extracts the useful information of "Content-Dominated" Web pages fairly accurately. The algorithm generalizes well because it extracts the content of a "Content-Dominated" Web page without analyzing its structure.

2. The model of Two-Dimensional (2-D) Web Information Syndication (2DWIS) is designed and implemented. Under this model, the retrieval patterns are enriched and the presentation of the retrieved results is improved. The system architecture, the system design for the travel service domain, and experimental demonstrations are presented. 2DWIS, which combines a 2-D geographic model with the Web information syndication technique, enriches the patterns of Web information syndication, improves the user interaction experience, and increases retrieval effectiveness.

3. An Interactive Question & Answer Knowledge System is presented, which incorporates Web 2.0 ideas by focusing on user involvement, user experience, user contribution, and so on. By applying natural language understanding technologies, the system intelligently searches the knowledge database for relevant answers and then chooses the most acceptable ones, which helps users find the answers to their questions conveniently, rapidly and accurately. At the same time, the dynamic knowledge database is improved by users' feedback.

4. The research achievements of this thesis were applied to the design and implementation of a commercial platform. As the experimental platform for this research, the commercial platform also gave a better understanding of the research direction and future applications. The website has been in operation since June 2006 and has been visited by more than 30,000 distinct IPs per day on average (as of the end of April 2007). Information syndication, 2-D information syndication, and vertical search are applied in this platform, which supplies abundant and useful travel information and search functionality. With this website, users can search for sights, hotels, travel itineraries, flight tickets, train tickets, and other travel information. Users can also interact with each other in real time using computers or mobile phones.
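The abstract names window segmentation, statistics and the FFT as the ingredients of the extraction algorithm but does not say how they are combined. The sketch below is one possible reading, not the thesis's algorithm. It assumes that the page is split into lines, that the visible-text length of each line is smoothed with a moving window computed as an FFT convolution, and that the best text range is the run of lines whose smoothed length exceeds the page mean by the largest total. All function names (`fft`, `smooth`, `text_lengths`, `best_interval`) and the default window width are hypothetical.

```python
import cmath
import re


def fft(x, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def smooth(signal, window):
    """Centred moving average of width `window`, computed as an FFT convolution."""
    size = 1
    while size < len(signal) + window:  # pad so circular == linear convolution
        size *= 2
    a = fft([complex(v) for v in signal] + [0j] * (size - len(signal)))
    b = fft([complex(1.0 / window)] * window + [0j] * (size - window))
    conv = fft([p * q for p, q in zip(a, b)], inverse=True)
    half = window // 2
    # The inverse transform above is unscaled, so divide by `size`.
    return [conv[i + half].real / size for i in range(len(signal))]


def text_lengths(html):
    """Visible-text length of each line after crudely stripping tags."""
    return [len(re.sub(r"<[^>]*>", "", line).strip()) for line in html.splitlines()]


def best_interval(html, window=3):
    """Return (start, end) line indices, end exclusive, of the densest text run."""
    lengths = text_lengths(html)
    if not lengths:
        return (0, 0)
    density = smooth(lengths, window)
    mean = sum(density) / len(density)  # assumed threshold: the page average
    best_score, best_start, best_end = 0.0, 0, 0
    run, start = 0.0, 0
    for i, d in enumerate(density):  # maximum-sum contiguous run (Kadane)
        if run <= 0:
            run, start = 0.0, i
        run += d - mean
        if run > best_score:
            best_score, best_start, best_end = run, start, i + 1
    return (best_start, best_end)
```

Because the score uses only per-line text lengths, the sketch never inspects the tag structure of a particular page, which matches the generality claimed in contribution 1. The smoothing also means the returned range can include a short line on either side of the body text.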
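Language: Chinese
Release date: 2011-05-07
Pages: 153
Content type: Thesis
Source URL: [http://159.226.59.140/handle/311008/198]
Collection: Institute of Acoustics_IOA Doctoral and Master's Theses_1981-2009 Doctoral and Master's Theses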
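Recommended citation
GB/T 7714
Li Lei. Research on Key Technologies for Domain-Specific Network Information Processing [D]. Institute of Acoustics. Institute of Acoustics, Chinese Academy of Sciences. 2007.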