A Similar Duplicate Record Detection Algorithm for
 Big Data Based on MapReduce

宋人杰1，余通1，陈宇红2，陈宇阳2，夏滨2

doi:10.16183/j.cnki.jsjtu.2018.02.014

Journal of Shanghai Jiaotong University >

2018 , Vol. 52 >Issue 2: 214 - 221

DOI: https://doi.org/10.16183/j.cnki.jsjtu.2018.02.014

A Similar Duplicate Record Detection Algorithm for Big Data Based on MapReduce

宋人杰1，余通1，陈宇红2，陈宇阳2，夏滨2

Expand

1. College of Information Engineering, Northeast Electric Power University, Jilin 132012, China; 2. State Grid Jilin Power Supply Company, Jilin 132000, China

Online published: 2018-03-01

Fold

Abstract

In view of the characteristics of multi-source, high dimension and large volume of big data, traditional algorithms have been unable to effectively complete the similar duplicate records detection for big data, therefore, a new parallel algorithm MP-SYYT for the detection of similar duplicate records of big data in the cloud environment is put forward. Firstly, Institute of computing technology chinese lexical analysis system (ICTCLAS) word segmentation technology, Delphi method and team frequency Inverse document frequency (TF-IDF) algorithm are used to improve the traditional SimHash algorithm, and these methods effectively solve the insufficiency of the traditional one, such as the low extraction speed, the imprecision of the keywords, and the low accuracy on weight calculation. Secondly, the inversed file retrieval algorithm is used to optimize the traditional SimHash algorithm to improve the matching efficiency of similar duplicate records. Finally, the Map function and the Reduce function based on the improved SimHash algorithm are defined on a cloud platform to realize the parallel detection of big data and the direct output of duplicate records in cloud environment with MapReduce model, and an experimental analysis about the multi-source measured data is made on a Hadoop platform. The results show that MP-SYYT is an efficient and accurate algorithm with good scalability and acceleration ratio, and it is suitable for similar duplicate record detection of big data.

Key words： cloud environment; big data; similar duplicate records; parallel detection; redundant identification

Cite this article

宋人杰1，余通1，陈宇红2，陈宇阳2，夏滨2 . A Similar Duplicate Record Detection Algorithm for Big Data Based on MapReduce[J]. Journal of Shanghai Jiaotong University, 2018 , 52(2) : 214 -221 . DOI: 10.16183/j.cnki.jsjtu.2018.02.014

References

［1］李建中, 王宏志, 高宏. 大数据可用性的研究进展［J］. 软件学报, 2016, 27(7): 1605-1625. LI Jianzhong, WANG Hongzhi, GAO Hong. State-of-the-art of research on big data usability［J］. Journal of Software, 2016, 27(7): 1605-1625. ［2］曲朝阳, 孙立擎, 许助庆, 等. 基于B+树的电力大数据分布式索引［J］. 东北电力大学学报, 2016, 36(5): 80-85. QU Zhaoyang, SUN Liqing, XUN Shaoqing, et al. Power big data distributed index based on B+ tree and inverted index［J］. Journal of Northeast Dianli University, 2016, 36(5): 80-85. ［3］LI A, JIWU S, MINGQINNG L. Data deduplication techniques［J］. Journal of Software, 2010, 21(5): 916-929. ［4］陈明. 桥梁预警系统的数据预处理［J］. 上海交通大学学报, 2012, 46(10):1680-1685. CHEN Ming. Data preprocess for bridge damage alarming system［J］. Journal of Shanghai Jiao Tong University, 2012, 46(10): 1680-1685. ［5］崔霞, 施光林, 沈伟. 基于分组数据处理神经网络气动人工肌肉迟滞特性［J］. 上海交通大学学报, 2012, 46(6): 931-935. CUI Xia, SHI Ganglin, SHEN Wei. Study on hysteresis of pneumatic artificial muscle based on group method of data handling neural network［J］. Journal of Shanghai Jiao Tong University, 2012, 46(6): 931-935. ［6］李建中, 刘显敏. 大数据的一个重要方面: 数据可用性［J］. 计算机研究与发展, 2013, 50(6): 1147-1162. LI Jianzhong, LIU Xianmin. An important aspect of big data: Data usability［J］. Journal of Computer Research and Development, 2013, 50(6): 1147-1162. ［7］HERNANDEZ M A, STOLFO S J. Real-world data is dirty: Data cleansing and the merge/purge problem［J］. Data Mining & Knowledge Discovery, 1998, 2(1): 9-37. ［8］CHAUDHURI S, GANTI V, KAUSHIK S, et al. Leveraging constraints for deduplication: US 8204866 ［P］. 2012-06-19 ［2016-09-01］. ［9］FAN W F, LI J Z, LUO J Z, et al. Incremental graph pattern matching［C］∥Proceedings of ACM SIGMOD. New York: ACM, 2011: 925-936. ［10］FAN W F, LI J Z, WANG X, et al. Query preserving graph compression［C］∥Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. New York: ACM, 2012: 157-168. ［11］CHAUDUTI S, GANTI V, MOTWANI R. Robust identification of fuzzy duplicates［C］∥21st International Conference on ICDE 2005. Tokoyo, Japan: IEEE, 2005: 865-876. ［12］GUHA S, KOUDAS N, MARATHE A, et al. Merging the results of approximate match operations［J］. Journal of Very Large Data Bases, 2004, 30(8): 636-647. ［13］CHEN Z, KALASHNIKOV D V, MEHROTRA S. Adaptive graphical approach to entity resolution［C］∥Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries. New York: ACM, 2007: 204-213. ［14］AUGSTEN N, BHLEN M H, DYRESON C, et al. Approximate joins for data-centric XML［C］∥24th International Conference on ICDE 2008. Cancún, Mexico: IEEE, 2008: 814-823. ［15］FERREIRA C L W, BUCHMANN E, HM K. Finding misplaced items in retail by clustering RFID data［C］∥Proceedings of the 13th Int Conf on Extending Database Technology. New York: ACM, 2010: 501-512. ［16］MANKU G S, JAIN A, DAS S A. Detecting near-duplicates for web crawling［C］∥Proceedings of the 16th international conference on World Wide Web. New York: ACM, 2007: 141-150. ［17］BUYRUKBILEN S, BAKIRAS S. Secure similar document detection with SimHash［M］. Berlin: Springer, 2014: 61-75. ［18］COHEN W W. Data integration using similarity joins and a word-based information representation language［J］. Acm Transactions on Information Systems, 2010,18(3): 288-321. ［19］曲朝阳, 陈帅, 杨帆, 等. 基于云计算技术的电力大数据预处理属性约简方法［J］. 电力系统自动化, 2014, 38(8):67-71. QU Zhaoyang, CHEN Shuai, YANG Fan, et al. An attribute reducing method for electric power big data preprocessing based on cloud computing technology［J］. Automation of Electric Power Systems, 2014, 38(8): 67-71. ［20］池子文, 张丰, 杜震洪, 等. 云环境下基于预分片的遥感数据并行重采样方法［J］. 上海交通大学学报, 2014, 48(11): 1627-1632. CHI Ziwen, ZHANG Feng, DU Zhenhong, et al. Parallel resampling method of remote sensing data based on pre-partitioning for cloud computing［J］. Journal of Shanghai Jiao Tong University, 2014, 48(11): 1627-1632. ［21］曲朝阳, 朱莉, 张士林. 基于Hadoop 的广域测量系统数据处理［J］. 电力系统自动化, 2013, 37(4): 92-97. QU Zhaoyang, ZHU Li, ZHANG Shilin. Data processing of hadoop-based wide area measurement system［J］. Automation of Electric Power Systems, 2013, 37(4): 92-97. ［22］KPCKE H, THOR A, RAHM E. Evaluation of entity resolution approaches on real-world match problems［C］∥Proceedings of the VLDB Endowment. Berlin: Springer, 2010, 3(1): 484-493. ［23］ARASU A, RE C, SUCIU D. Large-scale deduplication with constraints using dedupalog［C］∥IEEE 25th International Conference on Data Engineering. New York: IEEE, 2009: 952-963.

Options

Outlines

模态框（Modal）标题

Abstract

Cite this article

References