Automatically Finding the Number of Clusters Based on Simulated Annealing

  • YANG Zhengwu (杨政武), HUO Hong (霍宏), FANG Tao* (方涛) (Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, China)

Online published: 2017-04-04

Abstract

Based on simulated annealing (SA), automatically finding the number of clusters (AFNC) is proposed in this paper to determine the number of clusters and their initial centers. It is a simple and automatic method that combines local search with two widely accepted global analysis techniques, namely careful-seeding (CS) and distance-histogram (DH). The procedure for finding a cluster is formulated as mountain-climbing, where a mountain is defined as a convergent domain of SA. When it arrives at the peak of one mountain, AFNC has found one of the clusters in the dataset, and the peak is taken as that cluster's initial center. AFNC then climbs another mountain from a new starting point chosen by CS, until the termination condition is satisfied. While climbing a mountain, the local dense region in which SA searches for its next state is found by analyzing the distance histogram. Experimental results show that AFNC achieves consistent performance on a wide range of datasets.
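The abstract gives only a high-level outline of AFNC, so the Python sketch below illustrates that outline under stated assumptions rather than reproducing the authors' implementation: careful_seed uses the standard k-means++ seeding rule, density (a simple neighbour count) stands in for the distance-histogram analysis, and the radius, cooling schedule, and termination test are hypothetical parameters not specified in the abstract.

    import numpy as np

    def careful_seed(data, centers, rng):
        """k-means++-style careful seeding: pick a new starting point with
        probability proportional to its squared distance from the nearest
        center found so far (uniform if no center exists yet)."""
        if not centers:
            return data[rng.integers(len(data))]
        d2 = np.min([np.sum((data - c) ** 2, axis=1) for c in centers], axis=0)
        return data[rng.choice(len(data), p=d2 / d2.sum())]

    def density(data, point, radius):
        """Stand-in for the distance-histogram analysis: use the number of
        neighbours within `radius` as the "height" of a point."""
        return int(np.sum(np.linalg.norm(data - point, axis=1) < radius))

    def sa_climb(data, start, radius, rng, t0=1.0, cooling=0.95, steps=200):
        """Simulated-annealing hill climb: propose moves inside the local
        dense region and accept downhill moves with probability exp(delta/T)."""
        current, temp = start, t0
        for _ in range(steps):
            neighbours = data[np.linalg.norm(data - current, axis=1) < radius]
            if len(neighbours) == 0:
                break
            candidate = neighbours[rng.integers(len(neighbours))]
            delta = density(data, candidate, radius) - density(data, current, radius)
            if delta > 0 or rng.random() < np.exp(delta / temp):
                current = candidate          # move uphill (or occasionally downhill)
            temp *= cooling                  # cool down
        return current                       # the peak: an initial cluster center

    def afnc(data, radius, max_clusters=20, seed=0):
        """Repeat seed-and-climb until a climb converges onto a mountain
        that has already been found (a placeholder termination condition)."""
        rng = np.random.default_rng(seed)
        centers = []
        for _ in range(max_clusters):
            start = careful_seed(data, centers, rng)
            peak = sa_climb(data, start, radius, rng)
            if any(np.linalg.norm(peak - c) < radius for c in centers):
                break
            centers.append(peak)
        return centers                       # len(centers) estimates the number of clusters

Given a dataset X of shape (n_samples, n_features), afnc(X, radius=r) would return the estimated cluster centers; their count is the estimated number of clusters, and the centers could serve as the initial centers mentioned in the abstract, for example to seed k-means.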

Cite this article

YANG Zhengwu (杨政武), HUO Hong (霍宏), FANG Tao* (方涛). Automatically Finding the Number of Clusters Based on Simulated Annealing [J]. Journal of Shanghai Jiaotong University (Science), 2017, 22(2): 139-147. DOI: 10.1007/s12204-017-1813-9

