A data cleaning method for heterogeneous attribute fusion and record linkage
Zhu, Hui-Juan 1,2,3; Jiang, Tong-Hai 1,3; Wang, Yi 1,3; Cheng, Li 1,3; Ma, Bo 1,3; Zhao, Fan 1,3
Journal: International Journal of Computational Science and Engineering
2019
Volume: 19, Issue: 3, Pages: 311-324
ISSN: 1742-7185
Abstract

In the big data era, massive heterogeneous data are generated from various data sources, and cleaning dirty data is critical for reliable data analysis. Existing rule-based methods are generally developed for a single-data-source environment; issues such as data standardisation and duplicate detection for attributes of different data types have not been fully studied. To address these challenges, we first introduce a method based on dynamically configurable rules that integrates data detection, modification and transformation. Secondly, we propose type-based blocking and a varying window size selection mechanism based on the classic sorted-neighbourhood algorithm. We present a reference implementation of our method in a real-life data fusion system and validate its effectiveness and efficiency using recall and precision metrics. Experimental results indicate that our method is suitable for scenarios involving multiple data sources with heterogeneous attribute properties.
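
For orientation, the classic sorted-neighbourhood algorithm that the abstract builds on can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the sample records, the fixed window of 3, the 0.85 threshold, and the use of difflib as the similarity measure are all hypothetical. The abstract does not detail the type-based blocking or the varying window size selection, so only the fixed-window baseline is shown.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] using difflib (an assumed stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sorted_neighbourhood(records, key, window=3, threshold=0.85):
    """Classic sorted-neighbourhood: sort by a key, then compare each record
    only with the next (window - 1) records in sorted order.

    Returns a sorted list of index pairs (i, j), i < j, judged to be duplicates.
    """
    # Sort record indices by the sorting key so likely duplicates become neighbours.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Slide a fixed-size window over the sorted order.
        for j in order[pos + 1 : pos + window]:
            if similarity(key(records[i]), key(records[j])) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


# Hypothetical sample records, for illustration only.
records = [
    {"name": "Jon Smith", "city": "Urumqi"},
    {"name": "Beijing Data Ltd", "city": "Beijing"},
    {"name": "John Smith", "city": "Urumqi"},
    {"name": "Beijing Data Ltd.", "city": "Beijing"},
]
matches = sorted_neighbourhood(records, key=lambda r: r["name"])
print(matches)
```

The window bounds the number of comparisons to roughly n × (window - 1) instead of all n × (n - 1) / 2 pairs, at the cost of missing duplicates whose keys sort far apart; that trade-off is what a varying window size aims to manage.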

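Content type: Journal article
Source URL: http://ir.xjipc.cas.cn/handle/365002/7792
Collection: Xinjiang Technical Institute of Physics and Chemistry_Multilingual Information Technology Research Laboratory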
Author affiliations: 1.Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi
2.100049, China
3.University of Chinese Academy of Sciences, Beijing
4.830011, China
5.Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, No. 40-1, Beijing South Road, Xin Shi Zone, Urumqi
Recommended citation
GB/T 7714
Zhu, Hui-Juan; Jiang, Tong-Hai; Wang, Yi; et al. A data cleaning method for heterogeneous attribute fusion and record linkage[J]. International Journal of Computational Science and Engineering, 2019, 19(3): 311-324.
APA Zhu, Hui-Juan, Jiang, Tong-Hai, Wang, Yi, Cheng, Li, Ma, Bo, & Zhao, Fan. (2019). A data cleaning method for heterogeneous attribute fusion and record linkage. International Journal of Computational Science and Engineering, 19(3), 311-324.
MLA Zhu, Hui-Juan, et al. "A data cleaning method for heterogeneous attribute fusion and record linkage". International Journal of Computational Science and Engineering 19.3 (2019): 311-324.
