上海交通大学学报(自然版) ›› 2014, Vol. 48 ›› Issue (07): 936-941.

• 自动化技术、计算机技术 • 上一篇    下一篇

基于连接位Minwise Hash的三者相似性估计算法

袁鑫攀1,盛鑫海1,龙军2,张祖平2,桂卫华2
  

  1. (1.湖南工业大学 计算机与通信学院, 湖南 株洲 412000; 2.中南大学 信息科学与工程学院,长沙 410083)
     
  • 收稿日期:2013-08-19 出版日期:2014-07-28 发布日期:2014-07-28
  • 基金资助:

    国家自然科学基金(61121008,60970095,60873081,61350011),科技部科技支撑计划(2013BAJ10B145),湖南省杰出青年基金(11JJ1012),湖南省自然科学基金重点项目(12JJ2036),湖南省自然科学基金面上项目(14JJ2115)资助

Estimation of Three-Way Similarities Based on Connected Bit Minwise Hash

YUAN Xinpan1,SHENG Xinhai1,LONG Jun2,ZHANG Zuping2,GUI Weihua2
  

  1. (1. College of Computer and Communication, Hunan University of Technology, Zhuzhou 412000, Hunan, China; 2. School of Information Science and Engineering, Central South University, Changsha 410083, China)
  • Received:2013-08-19 Online:2014-07-28 Published:2014-07-28

摘要:

计算相似性是信息检索的一个核心基础问题,二者、三者甚至更多集合的相似性估计在相似文档检测、词语相关性、聚类、数据清理等领域有着广泛的应用. 连接位Minwise Hash算法作为一种高效、准确的相似性估计算法,能够成倍地减少比对的次数,提升算法性能. 通过理论推导,给出基于连接位Minwise Hash的三者相似度无偏估计公式.实验结果显示,在样本大小k=500、相似度阈值R0=0.8时,算法的准确率和召回率均能达到95%以上,并且所需的CPU运行时间仅为b位Minwise Hash三者估计算法的50%.
 
 

关键词: 三者相似度, 三者相似性估计, 连接位, 信息检索

Abstract:

Compution of two-way and multi-way set similarities is a fundamental problem in information retrieval. This paper focused on estimation  of threeway resemblance using connected bit Minwise Hash. As an efficient and accurate method for similarity measurement, connected bit Minwise Hash can reduce the number of comparison, and exponentially improve the performance. The unbiased estimator of the threeway resemblance was provided theoretically. In experimental result analysis, several key parameters (e.g., precision, recall and efficiency) were analyzed. Experimental results demonstrate that when the sample size k=500 and similarity threshold R0=0.8, the accuracy and recall of the algorithm could reach 95% or more, using just 50% of CPU running time of b-bit Minwise Hash for the three-way estimation.
 

Key words: three-way resemblance; similarity estimation for three way; connected bit, information retrieval

中图分类号: