Exploiting Empirical Variance for Data Stream Classification

doi:10.1007/s12204-012-1261-5

上海交通大学学报（英文版） ›› 2012, Vol. 17 ›› Issue (2): 245-250.doi: 10.1007/s12204-012-1261-5

Exploiting Empirical Variance for Data Stream Classification

ZIA-UR REHMAN Muhammad1, LI Tian-rui1 (李天瑞), LI Tao2 (李涛)

(1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031, China; 2. School of Computer Science, Florida International University, Miami 33199, USA)

出版日期:2012-04-28 发布日期:2012-05-31

Exploiting Empirical Variance for Data Stream Classification

ZIA-UR REHMAN Muhammad1∗, LI Tian-rui1 (李天瑞), LI Tao2 (李涛)

(1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031, China; 2. School of Computer Science, Florida International University, Miami 33199, USA)

Online:2012-04-28 Published:2012-05-31

摘要/Abstract

摘要： Classification, using the decision tree algorithm, is a widely studied problem in data streams. The challenge is when to split a decision node into multiple leaves. Concentration inequalities, that exploit variance information such as Bernstein’s and Bennett’s inequalities, are often substantially strict as compared with Hoeffding’s bound which disregards variance. Many machine learning algorithms for stream classification such as very fast decision tree (VFDT) learner, AdaBoost and support vector machines (SVMs), use the Hoeffding’s bound as a performance guarantee. In this paper, we propose a new algorithm based on the recently proposed empirical Bernstein’s bound to achieve a better probabilistic bound on the accuracy of the decision tree. Experimental results on four synthetic and two real world data sets demonstrate the performance gain of our proposed technique.

关键词: Hoeffding and Bernstein’s bounds, data stream classification, decision tree, anytime algorithm

Abstract: Classification, using the decision tree algorithm, is a widely studied problem in data streams. The challenge is when to split a decision node into multiple leaves. Concentration inequalities, that exploit variance information such as Bernstein’s and Bennett’s inequalities, are often substantially strict as compared with Hoeffding’s bound which disregards variance. Many machine learning algorithms for stream classification such as very fast decision tree (VFDT) learner, AdaBoost and support vector machines (SVMs), use the Hoeffding’s bound as a performance guarantee. In this paper, we propose a new algorithm based on the recently proposed empirical Bernstein’s bound to achieve a better probabilistic bound on the accuracy of the decision tree. Experimental results on four synthetic and two real world data sets demonstrate the performance gain of our proposed technique.

Key words: Hoeffding and Bernstein’s bounds, data stream classification, decision tree, anytime algorithm

中图分类号:

TP 391

ZIA-UR REHMAN Muhammad1, LI Tian-rui1 (李天瑞), LI Tao2 (李涛). Exploiting Empirical Variance for Data Stream Classification[J]. 上海交通大学学报（英文版）, 2012, 17(2): 245-250.

ZIA-UR REHMAN Muhammad1, LI Tian-rui1 (李天瑞), LI Tao2 (李涛). Exploiting Empirical Variance for Data Stream Classification[J]. Journal of shanghai Jiaotong University (Science), 2012, 17(2): 245-250.

参考文献

[1] Gaber M M, Zaslavsky A, Krishnaswamy S. Mining data streams: A review [J]. ACM SIGMOD Record,2005, 34(2): 18-26.

[2] Stonebraker M, Cetintemel U, Zdonik S. The 8 requirements of real-time stream processing [J]. ACM SIGMOD Record, 2005, 34(4): 42-47.

[3] Han J, Kamber M. Data mining: Concepts and techniques [M]. San Francisco, USA: Morgan Kaufmann Publishers, 2006.

[4] Domingos P, Hulten G. Mining high-speed data streams [C]//Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and

Data Mining. Boston, USA: ACM, 2000: 71-80.

[5] Kirkby R. Improving Hoeffding trees [D]. Hamilton New Zealand: Department of Computer Science, University of Waikato, 2007.

[6] Jin R, Agrawal G. Efficient decision tree construction on streaming data [C]//Proceedings of the 9th ACM SIGKDD International Conference on Knowledge

Discovery and Data Mining. Washington, USA:ACM, 2003: 571-576.

[7] Bennett G. Probability inequalities for the sum of independent random variables [J]. Journal of the American Statistical Association, 1962, 57(297): 33-45.

[8] Hoeffding W. Probability inequalities for sums of bounded random variables [J]. Journal of the American Statistical Association, 1963, 58(301): 13-30.

[9] Audibert J Y, Munos R, Szepesv´ari C. Tuning bandit algorithms in stochastic environments [C]//Proceedings of the 18th International Conference

on Algorithmic Learning Theory. Sendai, Japan: LNCS, 2007: 150-165.

[10] Shivaswamy P K, Jebara T. Empirical Bernstein boosting [J]. Journal of Machine Learning Research, 2010, 9: 733-740.

[11] Cesa-Bianchi N, Conconi A, Gentile C. A secondorder perceptron algorithm [J]. SIAM Journal on Computing, 2005, 34(3): 640-668.

[12] Crammer K, Mohri M, Pereira F. Gaussian margin machines [C]// Proceedings of the 12th International Conference on Artificial Intelligence and Statistics.

Florida, USA: JMLR W&CP, 2009: 105-112.

[13] Mnih V, Szepesv´ari C, Audibert J Y. Empirical Bernstein stopping [C]//Proceedings of the 25th International Conference on Machine Learning. Helsinki,

Finland: ACM, 2008: 672-679.

[14] Jin R, Hong H, Wang H, et al. Computing label-constraint reachability in graph databases [C]//Proceedings of the International Conference on

Management of Data. Indiana, USA: ACM, 2010: 123-134.

[15] Heidrich-Meisner V, Igel C. Hoeffding and Bernstein races for selecting policies in evolutionary direct policy search [C]//Proceedings of the 26th Annual International

Conference on Machine Learning. Quebec, Canada: ACM, 2009: 401-408.

Exploiting Empirical Variance for Data Stream Classification

Exploiting Empirical Variance for Data Stream Classification

PDF (PC)

可视化

摘要/Abstract

引用本文

使用本文

参考文献

相关文章 2

编辑推荐

Metrics

本文评价

[1]	XU Wei-tao 1,3 (许伟涛), XU Yang 1,2(徐扬). Ideal Resolution Principle for Lattice-Valued First-Order Logic Based on Lattice Implication Algebra[J]. 上海交通大学学报（英文版）, 2012, 17(2): 178-181.
[2]	CAO Xue (曹雪), CAI Hong-ming (蔡鸿明), BU Feng-lin (步丰林). Semantic Driven Design Reuse for 3D Scene Modeling[J]. 上海交通大学学报（英文版）, 2012, 17(2): 233-236.