Journal of Shanghai Jiaotong University ›› 2014, Vol. 48 ›› Issue (07): 936-941.

• Automation Technique, Computer Technology •

Estimation of Three-Way Similarities Based on Connected Bit Minwise Hash

YUAN Xinpan1,SHENG Xinhai1,LONG Jun2,ZHANG Zuping2,GUI Weihua2
  

1. College of Computer and Communication, Hunan University of Technology, Zhuzhou 412000, Hunan, China; 2. School of Information Science and Engineering, Central South University, Changsha 410083, China
• Received: 2013-08-19 Online: 2014-07-28 Published: 2014-07-28

Abstract:

Computation of two-way and multi-way set similarities is a fundamental problem in information retrieval. This paper focuses on estimating three-way resemblance using connected bit Minwise Hash. As an efficient and accurate method for similarity measurement, connected bit Minwise Hash reduces the number of comparisons and exponentially improves performance. An unbiased estimator of the three-way resemblance is derived theoretically. In the experimental analysis, several key metrics (e.g., precision, recall and efficiency) are evaluated. The experimental results demonstrate that, when the sample size k=500 and the similarity threshold R0=0.8, the accuracy and recall of the algorithm can reach 95% or more, while using only 50% of the CPU running time of b-bit Minwise Hash for three-way estimation.
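
For orientation, the quantity being estimated can be illustrated with a minimal sketch. It assumes the usual definition of three-way resemblance, |A∩B∩C| / |A∪B∪C|, and uses plain Minwise Hash with explicit random permutations; it is not the connected bit estimator of the paper, whose details are not given in the abstract. The function names, the seed, and the example sets are hypothetical and chosen only for illustration.

```python
import random


def exact_three_way(a, b, c):
    """Exact three-way resemblance |A∩B∩C| / |A∪B∪C| (assumed definition)."""
    return len(a & b & c) / len(a | b | c)


def minhash_signatures(sets, universe_size, k, seed=0):
    """Build k-sample minwise signatures for non-empty sets of integers
    drawn from range(universe_size).

    Each sample uses one explicit random permutation of the universe,
    shared by all sets; the signature entry is the smallest rank that
    any element of the set receives under that permutation.
    """
    rng = random.Random(seed)
    signatures = [[] for _ in sets]
    perm = list(range(universe_size))
    for _ in range(k):
        rng.shuffle(perm)  # perm[x] is the rank of element x
        for sig, s in zip(signatures, sets):
            sig.append(min(perm[x] for x in s))
    return signatures


def estimate_three_way(sig_a, sig_b, sig_c):
    """Fraction of samples on which all three minimum ranks coincide."""
    matches = sum(1 for x, y, z in zip(sig_a, sig_b, sig_c) if x == y == z)
    return matches / len(sig_a)


if __name__ == "__main__":
    # Hypothetical example sets over the universe {0, ..., 1199}.
    A = set(range(0, 1000))
    B = set(range(100, 1100))
    C = set(range(200, 1200))
    sa, sb, sc = minhash_signatures([A, B, C], universe_size=1200, k=500)
    print("exact:   ", exact_three_way(A, B, C))
    print("estimate:", estimate_three_way(sa, sb, sc))
```

Under a uniformly random permutation, the three minima coincide exactly when the smallest-ranked element of the union lies in all three sets, so the match fraction is an unbiased estimate of the three-way resemblance. This baseline compares k full hash values per triple; the abstract's claim is that connected bit Minwise Hash reduces that number of comparisons.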
 

Key words: three-way resemblance; similarity estimation for three way; connected bit; information retrieval
