Journal of Shanghai Jiaotong University ›› 2018, Vol. 52 ›› Issue (2): 214-221. doi: 10.16183/j.cnki.jsjtu.2018.02.014


A Similar Duplicate Record Detection Algorithm for Big Data Based on MapReduce

SONG Renjie1, YU Tong1, CHEN Yuhong2, CHEN Yuyang2, XIA Bin2

  1. College of Information Engineering, Northeast Electric Power University, Jilin 132012, China; 2. State Grid Jilin Power Supply Company, Jilin 132000, China
  • Online: 2018-03-01 Published: 2018-03-01

Abstract: Big data is multi-source, high-dimensional, and large in volume, and traditional algorithms can no longer detect similar duplicate records in it effectively. A new parallel algorithm, MP-SYYT, is therefore proposed for detecting similar duplicate records in big data in a cloud environment. Firstly, the Institute of Computing Technology Chinese Lexical Analysis System (ICTCLAS) word segmentation technology, the Delphi method, and the term frequency-inverse document frequency (TF-IDF) algorithm are used to improve the traditional SimHash algorithm, addressing its shortcomings: slow keyword extraction, imprecise keywords, and inaccurate weight calculation. Secondly, an inverted file retrieval algorithm is used to optimize the traditional SimHash algorithm and improve the matching efficiency of similar duplicate records. Finally, Map and Reduce functions based on the improved SimHash algorithm are defined on a cloud platform so that, under the MapReduce model, big data is processed in parallel and duplicate records are output directly; an experimental analysis on multi-source measured data is carried out on a Hadoop platform. The results show that MP-SYYT is efficient and accurate, with good scalability and speedup ratio, and that it is suitable for similar duplicate record detection in big data.
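
The two core ideas named in the abstract, a weighted SimHash fingerprint and inverted-index matching, can be illustrated with a minimal Python sketch. This is not the authors' MP-SYYT implementation: the function names, the MD5-based token hash, the 64-bit fingerprint width, the four-band index, and the Hamming threshold of 3 are all assumptions chosen for illustration, and the keyword weights (for example TF-IDF values) are assumed to be computed beforehand.

```python
import hashlib
from collections import defaultdict

BITS = 64  # fingerprint width (assumed, not taken from the paper)


def token_hash(token: str) -> int:
    """Stable 64-bit hash of a token (truncated MD5; an illustrative choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(weights: dict) -> int:
    """Weighted SimHash: each keyword votes +w or -w on every bit position."""
    votes = [0.0] * BITS
    for token, w in weights.items():
        h = token_hash(token)
        for i in range(BITS):
            votes[i] += w if (h >> i) & 1 else -w
    fingerprint = 0
    for i, v in enumerate(votes):
        if v > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def bands(fingerprint: int, n_bands: int = 4):
    """Yield (band_index, band_value) keys for the inverted index."""
    width = BITS // n_bands
    mask = (1 << width) - 1
    for j in range(n_bands):
        yield j, (fingerprint >> (j * width)) & mask


def find_similar(fingerprints: dict, max_dist: int = 3) -> set:
    """Return pairs of record ids whose fingerprints differ by <= max_dist bits.

    Records are compared only if they share at least one band exactly.
    With 4 bands and max_dist <= 3, the pigeonhole principle guarantees
    that no qualifying pair is missed.
    """
    index = defaultdict(list)  # band key -> ids of records seen so far
    pairs = set()
    for rec_id, fp in fingerprints.items():
        candidates = set()
        for key in bands(fp):
            candidates.update(index[key])
            index[key].append(rec_id)
        for other in candidates:
            if hamming(fp, fingerprints[other]) <= max_dist:
                pairs.add(tuple(sorted((other, rec_id))))
    return pairs
```

In a MapReduce setting, one plausible arrangement (again an assumption, not a description of the paper's Map and Reduce functions) would be for the map step to emit each record under its band keys and for the reduce step to run the Hamming comparison within each key group.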

Key words: cloud environment, big data, similar duplicate records, parallel detection, redundant identification
