Classification, using the decision tree algorithm, is a widely studied problem in data streams. The challenge is when to split a decision node into multiple leaves. Concentration inequalities, that exploit variance information such as Bernstein’s and Bennett’s inequalities, are often substantially strict as compared with Hoeffding’s bound which disregards variance. Many machine learning algorithms for stream classification such as very fast decision tree (VFDT) learner, AdaBoost and support vector machines (SVMs), use the Hoeffding’s bound as a performance guarantee. In this paper, we propose a new algorithm based on the recently proposed empirical Bernstein’s bound to achieve a better probabilistic bound on the accuracy of the decision tree. Experimental results on four synthetic and two real world data sets demonstrate the performance gain of our proposed technique.
ZIA-UR REHMAN Muhammad1, LI Tian-rui1 (李天瑞), LI Tao2 (李涛)
. Exploiting Empirical Variance for Data Stream Classification[J]. Journal of Shanghai Jiaotong University(Science), 2012
, 17(2)
: 245
-250
.
DOI: 10.1007/s12204-012-1261-5
[1] Gaber M M, Zaslavsky A, Krishnaswamy S. Mining data streams: A review [J]. ACM SIGMOD Record,2005, 34(2): 18-26.
[2] Stonebraker M, Cetintemel U, Zdonik S. The 8 requirements of real-time stream processing [J]. ACM SIGMOD Record, 2005, 34(4): 42-47.
[3] Han J, Kamber M. Data mining: Concepts and techniques [M]. San Francisco, USA: Morgan Kaufmann Publishers, 2006.
[4] Domingos P, Hulten G. Mining high-speed data streams [C]//Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining. Boston, USA: ACM, 2000: 71-80.
[5] Kirkby R. Improving Hoeffding trees [D]. Hamilton New Zealand: Department of Computer Science, University of Waikato, 2007.
[6] Jin R, Agrawal G. Efficient decision tree construction on streaming data [C]//Proceedings of the 9th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining. Washington, USA:ACM, 2003: 571-576.
[7] Bennett G. Probability inequalities for the sum of independent random variables [J]. Journal of the American Statistical Association, 1962, 57(297): 33-45.
[8] Hoeffding W. Probability inequalities for sums of bounded random variables [J]. Journal of the American Statistical Association, 1963, 58(301): 13-30.
[9] Audibert J Y, Munos R, Szepesv´ari C. Tuning bandit algorithms in stochastic environments [C]//Proceedings of the 18th International Conference
on Algorithmic Learning Theory. Sendai, Japan: LNCS, 2007: 150-165.
[10] Shivaswamy P K, Jebara T. Empirical Bernstein boosting [J]. Journal of Machine Learning Research, 2010, 9: 733-740.
[11] Cesa-Bianchi N, Conconi A, Gentile C. A secondorder perceptron algorithm [J]. SIAM Journal on Computing, 2005, 34(3): 640-668.
[12] Crammer K, Mohri M, Pereira F. Gaussian margin machines [C]// Proceedings of the 12th International Conference on Artificial Intelligence and Statistics.
Florida, USA: JMLR W&CP, 2009: 105-112.
[13] Mnih V, Szepesv´ari C, Audibert J Y. Empirical Bernstein stopping [C]//Proceedings of the 25th International Conference on Machine Learning. Helsinki,
Finland: ACM, 2008: 672-679.
[14] Jin R, Hong H, Wang H, et al. Computing label-constraint reachability in graph databases [C]//Proceedings of the International Conference on
Management of Data. Indiana, USA: ACM, 2010: 123-134.
[15] Heidrich-Meisner V, Igel C. Hoeffding and Bernstein races for selecting policies in evolutionary direct policy search [C]//Proceedings of the 26th Annual International
Conference on Machine Learning. Quebec, Canada: ACM, 2009: 401-408.