
Exploiting Empirical Variance for Data Stream Classification

  • (1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031, China; 2. School of Computer Science, Florida International University, Miami 33199, USA)

Online published: 2012-05-31

Abstract

Classification using decision tree algorithms is a widely studied problem in data streams. A key challenge is deciding when to split a decision node into multiple leaves. Concentration inequalities that exploit variance information, such as Bernstein's and Bennett's inequalities, are often substantially tighter than Hoeffding's bound, which disregards variance. Many machine learning algorithms for stream classification, such as the very fast decision tree (VFDT) learner, AdaBoost and support vector machines (SVMs), use Hoeffding's bound as a performance guarantee. In this paper, we propose a new algorithm based on the recently proposed empirical Bernstein bound to achieve a better probabilistic bound on the accuracy of the decision tree. Experimental results on four synthetic and two real-world data sets demonstrate the performance gain of our proposed technique.
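The gap between the two bounds can be sketched numerically. The following Python snippet is an illustration only: it uses the Maurer-Pontil form of the empirical Bernstein bound (with its ln(3/δ) confidence split), which may differ in constants from the exact variant used in the paper, and the function names and sample values are the author's assumptions, not the paper's.

```python
import math

def hoeffding_bound(R, n, delta):
    # Hoeffding: epsilon = sqrt(R^2 * ln(1/delta) / (2n)).
    # Depends only on the range R of the variable; ignores variance.
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def empirical_bernstein_bound(sample_var, R, n, delta):
    # Empirical Bernstein (Maurer-Pontil form):
    # epsilon = sqrt(2 * var_hat * ln(3/delta) / n) + 3 * R * ln(3/delta) / n.
    # The leading term scales with the *observed* variance, so the bound
    # tightens when the stream statistic (e.g. an information-gain estimate)
    # has low empirical variance.
    return (math.sqrt(2.0 * sample_var * math.log(3.0 / delta) / n)
            + 3.0 * R * math.log(3.0 / delta) / n)

# Hypothetical setting: a split statistic bounded in [0, 1] (R = 1),
# n = 10 000 stream examples, confidence parameter delta = 1e-6.
R, n, delta = 1.0, 10_000, 1e-6
print(hoeffding_bound(R, n, delta))                   # variance-blind epsilon
print(empirical_bernstein_bound(0.01, R, n, delta))   # epsilon with var_hat = 0.01
```

With a small empirical variance (0.01 here) the Bernstein-style epsilon is several times smaller than Hoeffding's, which is why a VFDT-style learner using it can commit to a split decision after seeing fewer stream examples.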

Cite this article

ZIA-UR REHMAN Muhammad1, LI Tian-rui1 (李天瑞), LI Tao2 (李涛). Exploiting Empirical Variance for Data Stream Classification [J]. Journal of Shanghai Jiaotong University (Science), 2012, 17(2): 245-250. DOI: 10.1007/s12204-012-1261-5

References

[1] Gaber M M, Zaslavsky A, Krishnaswamy S. Mining data streams: A review [J]. ACM SIGMOD Record,2005, 34(2): 18-26.

[2] Stonebraker M, Cetintemel U, Zdonik S. The 8 requirements of real-time stream processing [J]. ACM SIGMOD Record, 2005, 34(4): 42-47.

[3] Han J, Kamber M. Data mining: Concepts and techniques [M]. San Francisco, USA: Morgan Kaufmann Publishers, 2006.

[4] Domingos P, Hulten G. Mining high-speed data streams [C]//Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Boston, USA: ACM, 2000: 71-80.

[5] Kirkby R. Improving Hoeffding trees [D]. Hamilton, New Zealand: Department of Computer Science, University of Waikato, 2007.

[6] Jin R, Agrawal G. Efficient decision tree construction on streaming data [C]//Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington, USA: ACM, 2003: 571-576.

[7] Bennett G. Probability inequalities for the sum of independent random variables [J]. Journal of the American Statistical Association, 1962, 57(297): 33-45.

[8] Hoeffding W. Probability inequalities for sums of bounded random variables [J]. Journal of the American Statistical Association, 1963, 58(301): 13-30.

[9] Audibert J Y, Munos R, Szepesvári C. Tuning bandit algorithms in stochastic environments [C]//Proceedings of the 18th International Conference on Algorithmic Learning Theory. Sendai, Japan: LNCS, 2007: 150-165.

[10] Shivaswamy P K, Jebara T. Empirical Bernstein boosting [J]. Journal of Machine Learning Research, 2010, 9: 733-740.

[11] Cesa-Bianchi N, Conconi A, Gentile C. A second-order perceptron algorithm [J]. SIAM Journal on Computing, 2005, 34(3): 640-668.

[12] Crammer K, Mohri M, Pereira F. Gaussian margin machines [C]//Proceedings of the 12th International Conference on Artificial Intelligence and Statistics. Florida, USA: JMLR W&CP, 2009: 105-112.

[13] Mnih V, Szepesvári C, Audibert J Y. Empirical Bernstein stopping [C]//Proceedings of the 25th International Conference on Machine Learning. Helsinki, Finland: ACM, 2008: 672-679.

[14] Jin R, Hong H, Wang H, et al. Computing label-constraint reachability in graph databases [C]//Proceedings of the International Conference on Management of Data. Indiana, USA: ACM, 2010: 123-134.

[15] Heidrich-Meisner V, Igel C. Hoeffding and Bernstein races for selecting policies in evolutionary direct policy search [C]//Proceedings of the 26th Annual International Conference on Machine Learning. Quebec, Canada: ACM, 2009: 401-408.