To improve the accuracy of text categorization and reduce the dimensionality of the feature space, this paper proposes a two-stage feature selection method based on a novel category correlation degree (CCD) measure and latent semantic indexing (LSI). In the first stage, the CCD measure ranks the candidate features and selects those most effective for text classification, and it proves more effective than traditional feature selection methods.
In the second stage, LSI is applied because the document representation obtained after feature selection still has a high-dimensional feature space and ignores the semantic relations between features, which degrades categorization accuracy. LSI addresses these problems by replacing individual terms with statistically derived conceptual indices, which uncovers the important correlations between features and further reduces the dimensionality of the feature space. In summary, each feature is first ranked by its importance for classification using the CCD measure, and a new semantic space among the features is then constructed using LSI.
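A minimal sketch of this second stage follows, assuming scikit-learn's TruncatedSVD as the LSI implementation and a linear SVM as the downstream classifier; the term and concept counts (2000 and 200) are illustrative values, not the paper's settings, and `select_top_k` refers to the stage-1 sketch above.

```python
# Stage 2 (sketch): project the terms kept in stage 1 into a low-dimensional
# concept space with truncated SVD (the standard realization of LSI), then
# train an ordinary classifier on the concept vectors.
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC

def two_stage_pipeline(X_train, y_train, X_test, k_terms=2000, k_concepts=200):
    # Stage 1: keep the k_terms highest-scoring features (sketch above).
    top = select_top_k(X_train, y_train, k_terms)
    X_tr, X_te = X_train[:, top], X_test[:, top]

    # Stage 2: LSI -- statistically derived concept indices replace raw terms.
    lsi = TruncatedSVD(n_components=k_concepts, random_state=0)
    Z_tr = lsi.fit_transform(X_tr)
    Z_te = lsi.transform(X_te)

    # Any standard text classifier can be trained on the concept vectors.
    clf = LinearSVC().fit(Z_tr, y_train)
    return clf.predict(Z_te)
```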
Experimental results show that our method effectively reduces the dimensionality of the text vectors and improves the performance of text categorization.
WANG Fei (王飞), LI Cai-hong* (李彩虹), WANG Jing-shan (王景山), XU Jiao (徐娇), LI Lian (李廉). A Two-Stage Feature Selection Method for Text Categorization by Using Category Correlation Degree and Latent Semantic Indexing [J]. Journal of Shanghai Jiaotong University (Science), 2015, 20(1): 44-50.
DOI: 10.1007/s12204-015-1586-y