化学结构信息的存储和检索技术研究

CORC > 过程工程研究所 > 中国科学院过程工程研究所 > 研究所（批量导入）

题名	化学结构信息的存储和检索技术研究
作者	孙婉怡
学位类别	硕士
答辩日期	2009-05-31
授予单位	中国科学院过程工程研究所
授予地点	过程工程研究所
导师	温浩
关键词	分子结构数据库存储结构检索 VF2算法 GMA算法偏序
其他题名	Research on Chemical Structure Information Storage and Search
学位专业	应用化学
中文摘要	本文根据工程化学数据库（ECDB）中分子结构数据库的功能需求，在已有工作的基础上，对化学结构信息的存储、检索做出了改进和完善。现有的分子结构数据库是以SQL Server 2000为管理平台，将每个化学结构的信息以字符串的形式逐条存储在数据库表中，该字符串的内容结构设计精巧，能使存储空间比相应的MOL文件节省70%以上；但是从该字符串解析为算法可用的化学结构信息的过程较为复杂，有碍数据库运行效率的提高。本文利用标准C++在读写二进制流时对数据类型的自动识别和转换功能，将结构信息直接从算法变量写入二进制文件，读取时将文件内容直接读到结构体变量中。对本文使用的一个含6万个化学结构的测试集进行测试，能够节省47.2%的存储空间和84.4%的读取时间；出于提高子结构检索效率的考虑，将原子种类和键型单独写入一个二进制文件，只进行子结构检索时，读取时间能节省93%。针对分子二维子结构检索问题，本文对图同构算法中具有代表性的VF2算法和GMA算法进行了比较和分析。VF2算法的数据结构设计精巧，能有效地降低内存开销，但其在进行图匹配时没有保存提问结构的遍历偏序，所以存在大量重复计算，使匹配效率受到影响。GMA算法则利用了偏序不变的性质，采取预先计算偏序的策略指导图匹配的过程。本文将GMA算法的偏序行走策略和VF2算法的数据结构相结合，用标准C++语言实现了子结构检索算法的改进，同时保证了检索的正确性。通过分析改进算法对VF2算法的检索耗时下降率，证实了改进算法的检索效率比VF2算法有普遍大幅度的提升。对本文使用的一个含2万个化学结构的测试集进行检索，VF2算法的耗时平均为60ms，改进算法的耗时平均为22ms，检索时间节省了60%以上。
英文摘要	This paper has made some improvement to the current chemical structure information storage and search functions, based on the requirements of MSD (Molecular Structural Database) in ECDB (Engineering Chemical DataBase). Currently, MSD is based on SQL Server 2000, where the information of each chemical structure is saved in a well designed string that costs very small space for storage (about 70% storage space cut down). However converting the string back to the chemical structure information is a tired job which often reduces the database query efficiency. This paper makes good use of the automatic data type recognition and conversion in C++ Binary Streams when it deals with information storage. Structure information is written into or is read from a binary file without being modification, which can cut down the storage space by 47.2%, and the reading time is 84.4% shorter than the previous one. Especially when the atom type and bond type are written into a separate file, the time savings can be up to 93%. Further more, two typical graph isomorphism algorithms VF2 and GMA are analyzed and discussed for 2-dimentional molecular structure search. VF2 has a very low memory requirement due to its specialized data structure, but there is still quite a lot of double counting for VF2 doesn’t save the partial order obtained by traveling the query structure. On the contrary, GMA takes advantage of the invariance of partial order and applies a partial order pre-computing strategy to the graph matching. In this paper, standard C++ language is used to implement algorithms with the advantages of VF2 on traversal rules and data structure, and those of GMA on partial order walking strategy. The improved algorithms operate correctly and exhibit much higher retrieval efficiency than VF2 does, by analyzing the time decent rate. To complete a substructure search within 20 thousand chemical structures, averagely VF2 algorithm costs 60ms and the improved ones cost 22ms, which means the time savings is more than 60%.
语种	中文
公开日期	2013-09-16
页码	69
内容类型	学位论文
源URL	[http://ir.ipe.ac.cn/handle/122111/1300]
专题	过程工程研究所_研究所（批量导入）
推荐引用方式 GB/T 7714	孙婉怡. 化学结构信息的存储和检索技术研究[D]. 过程工程研究所. 中国科学院过程工程研究所. 2009.

个性服务

查看访问统计

相关权益政策

暂无数据

收藏/分享

所有评论 (0)

[发表评论/异议/意见]

暂无评论

评论
权益异议
反馈意见

评注功能仅针对注册用户开放，请您登录

您对该条目有什么异议，请向管理员反馈。
内容：
Email：	*
单位:
验证码：	刷新

您在知识库使用过程中有什么好的想法或者建议可以反馈给我们。
标题：	*
内容：
Email：	*
验证码：	刷新

相关链接

CORC

联系我们