题名化学结构信息的存储和检索技术研究
作者孙婉怡
学位类别硕士
答辩日期2009-05-31
授予单位中国科学院过程工程研究所
授予地点过程工程研究所
导师温浩
关键词分子结构数据库 存储 结构检索 VF2算法 GMA算法 偏序
其他题名Research on Chemical Structure Information Storage and Search
学位专业应用化学
中文摘要本文根据工程化学数据库(ECDB)中分子结构数据库的功能需求,在已有工作的基础上,对化学结构信息的存储、检索做出了改进和完善。 现有的分子结构数据库是以SQL Server 2000为管理平台,将每个化学结构的信息以字符串的形式逐条存储在数据库表中,该字符串的内容结构设计精巧,能使存储空间比相应的MOL文件节省70%以上;但是从该字符串解析为算法可用的化学结构信息的过程较为复杂,有碍数据库运行效率的提高。本文利用标准C++在读写二进制流时对数据类型的自动识别和转换功能,将结构信息直接从算法变量写入二进制文件,读取时将文件内容直接读到结构体变量中。对本文使用的一个含6万个化学结构的测试集进行测试,能够节省47.2%的存储空间和84.4%的读取时间;出于提高子结构检索效率的考虑,将原子种类和键型单独写入一个二进制文件,只进行子结构检索时,读取时间能节省93%。 针对分子二维子结构检索问题,本文对图同构算法中具有代表性的VF2算法和GMA算法进行了比较和分析。VF2算法的数据结构设计精巧,能有效地降低内存开销,但其在进行图匹配时没有保存提问结构的遍历偏序,所以存在大量重复计算,使匹配效率受到影响。GMA算法则利用了偏序不变的性质,采取预先计算偏序的策略指导图匹配的过程。本文将GMA算法的偏序行走策略和VF2算法的数据结构相结合,用标准C++语言实现了子结构检索算法的改进,同时保证了检索的正确性。通过分析改进算法对VF2算法的检索耗时下降率,证实了改进算法的检索效率比VF2算法有普遍大幅度的提升。对本文使用的一个含2万个化学结构的测试集进行检索,VF2算法的耗时平均为60ms,改进算法的耗时平均为22ms,检索时间节省了60%以上。
英文摘要This paper has made some improvement to the current chemical structure information storage and search functions, based on the requirements of MSD (Molecular Structural Database) in ECDB (Engineering Chemical DataBase). Currently, MSD is based on SQL Server 2000, where the information of each chemical structure is saved in a well designed string that costs very small space for storage (about 70% storage space cut down). However converting the string back to the chemical structure information is a tired job which often reduces the database query efficiency. This paper makes good use of the automatic data type recognition and conversion in C++ Binary Streams when it deals with information storage. Structure information is written into or is read from a binary file without being modification, which can cut down the storage space by 47.2%, and the reading time is 84.4% shorter than the previous one. Especially when the atom type and bond type are written into a separate file, the time savings can be up to 93%. Further more, two typical graph isomorphism algorithms VF2 and GMA are analyzed and discussed for 2-dimentional molecular structure search. VF2 has a very low memory requirement due to its specialized data structure, but there is still quite a lot of double counting for VF2 doesn’t save the partial order obtained by traveling the query structure. On the contrary, GMA takes advantage of the invariance of partial order and applies a partial order pre-computing strategy to the graph matching. In this paper, standard C++ language is used to implement algorithms with the advantages of VF2 on traversal rules and data structure, and those of GMA on partial order walking strategy. The improved algorithms operate correctly and exhibit much higher retrieval efficiency than VF2 does, by analyzing the time decent rate. To complete a substructure search within 20 thousand chemical structures, averagely VF2 algorithm costs 60ms and the improved ones cost 22ms, which means the time savings is more than 60%.
语种中文
公开日期2013-09-16
页码69
内容类型学位论文
源URL[http://ir.ipe.ac.cn/handle/122111/1300]  
专题过程工程研究所_研究所(批量导入)
推荐引用方式
GB/T 7714
孙婉怡. 化学结构信息的存储和检索技术研究[D]. 过程工程研究所. 中国科学院过程工程研究所. 2009.
个性服务
查看访问统计
相关权益政策
暂无数据
收藏/分享
所有评论 (0)
暂无评论
 

除非特别说明,本系统中所有内容都受版权保护,并保留所有权利。


©版权所有 ©2017 CSpace - Powered by CSpace