上海交通大学学报(英文版) ›› 2012, Vol. 17 ›› Issue (2): 245-250.doi: 10.1007/s12204-012-1261-5

• 论文 • 上一篇    下一篇

Exploiting Empirical Variance for Data Stream Classification

ZIA-UR REHMAN Muhammad1, LI Tian-rui1 (李天瑞), LI Tao2 (李涛)   

  1. (1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031, China; 2. School of Computer Science, Florida International University, Miami 33199, USA)
  • 出版日期:2012-04-28 发布日期:2012-05-31

Exploiting Empirical Variance for Data Stream Classification

ZIA-UR REHMAN Muhammad1∗, LI Tian-rui1 (李天瑞), LI Tao2 (李涛)   

  1. (1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031, China; 2. School of Computer Science, Florida International University, Miami 33199, USA)
  • Online:2012-04-28 Published:2012-05-31

摘要: Classification, using the decision tree algorithm, is a widely studied problem in data streams. The challenge is when to split a decision node into multiple leaves. Concentration inequalities, that exploit variance information such as Bernstein’s and Bennett’s inequalities, are often substantially strict as compared with Hoeffding’s bound which disregards variance. Many machine learning algorithms for stream classification such as very fast decision tree (VFDT) learner, AdaBoost and support vector machines (SVMs), use the Hoeffding’s bound as a performance guarantee. In this paper, we propose a new algorithm based on the recently proposed empirical Bernstein’s bound to achieve a better probabilistic bound on the accuracy of the decision tree. Experimental results on four synthetic and two real world data sets demonstrate the performance gain of our proposed technique.

关键词: Hoeffding and Bernstein’s bounds, data stream classification, decision tree, anytime algorithm

Abstract: Classification, using the decision tree algorithm, is a widely studied problem in data streams. The challenge is when to split a decision node into multiple leaves. Concentration inequalities, that exploit variance information such as Bernstein’s and Bennett’s inequalities, are often substantially strict as compared with Hoeffding’s bound which disregards variance. Many machine learning algorithms for stream classification such as very fast decision tree (VFDT) learner, AdaBoost and support vector machines (SVMs), use the Hoeffding’s bound as a performance guarantee. In this paper, we propose a new algorithm based on the recently proposed empirical Bernstein’s bound to achieve a better probabilistic bound on the accuracy of the decision tree. Experimental results on four synthetic and two real world data sets demonstrate the performance gain of our proposed technique.

Key words: Hoeffding and Bernstein’s bounds, data stream classification, decision tree, anytime algorithm

中图分类号: