Journal of Shanghai Jiaotong University (Science), 2015, Vol. 20, Issue (1): 44-50. doi: 10.1007/s12204-015-1586-y

A Two-Stage Feature Selection Method for Text Categorization by Using Category Correlation Degree and Latent Semantic Indexing

WANG Fei (王飞), LI Cai-hong* (李彩虹), WANG Jing-shan (王景山), XU Jiao (徐娇), LI Lian (李廉)

  1. (School of Information Science & Engineering, Lanzhou University, Lanzhou 73000, China)
  • Online: 2015-02-28 Published: 2015-03-10
  • Contact: LI Cai-hong (李彩虹) E-mail: lich6013@lzu.edu.cn

Abstract: To improve the accuracy of text categorization and reduce the dimension of the feature space, this paper proposes a two-stage feature selection method based on a novel category correlation degree (CCD) method and latent semantic indexing (LSI). In the first stage, the CCD method selects the most effective features for text classification and is more effective than traditional feature selection methods. A term-based document representation, however, still requires a high-dimensional feature space and does not take into account the semantic relations between features, which leads to poor categorization accuracy. In the second stage, LSI therefore replaces the individual terms with statistically derived conceptual indices, which uncovers the important correlations between features and reduces the dimension of the feature space. In the algorithm, each feature is first ranked by its importance for classification using the CCD method; a new semantic space is then constructed among the features using LSI. The experimental results show that the method effectively reduces the dimension of the text vectors and improves the performance of text categorization.

Key words: text categorization | feature selection | latent semantic indexing (LSI) | category correlation degree (CCD)
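
As a rough illustration of the two-stage pipeline described in the abstract, the Python sketch below first ranks terms and keeps the top-scoring ones, then projects the reduced document-term matrix onto a low-rank LSI space with a truncated SVD. The abstract does not give the CCD formula, so a chi-square score is swapped in as a stand-in for the first stage; the function names, the toy matrix, and the parameters `n_terms` and `n_concepts` are hypothetical choices for this example, not the authors' implementation.

```python
import numpy as np


def chi2_scores(X, y):
    """Stand-in term scorer (chi-square). This is NOT the paper's CCD formula,
    which the abstract does not specify."""
    X = np.asarray(X, dtype=float)          # documents x terms (term counts)
    y = np.asarray(y)
    classes = np.unique(y)
    Y = (y[:, None] == classes[None, :]).astype(float)  # one-hot class labels
    observed = Y.T @ X                      # per-class term counts
    expected = np.outer(Y.mean(axis=0), X.sum(axis=0))  # counts if independent
    with np.errstate(divide="ignore", invalid="ignore"):
        chi2 = np.where(expected > 0, (observed - expected) ** 2 / expected, 0.0)
    return chi2.sum(axis=0)                 # one score per term


def select_top_terms(scores, n_terms):
    """Stage 1: indices of the n_terms highest-scoring terms, in column order."""
    return np.sort(np.argsort(scores)[::-1][:n_terms])


def lsi_project(X, n_concepts):
    """Stage 2: truncated SVD of the document-term matrix (LSI)."""
    U, s, Vt = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
    doc_vectors = U[:, :n_concepts] * s[:n_concepts]  # documents in concept space
    return doc_vectors, Vt[:n_concepts]               # and concept-term loadings


def two_stage_reduce(X, y, n_terms, n_concepts):
    """Rank and filter terms, then build the LSI space on the kept terms."""
    X = np.asarray(X, dtype=float)
    keep = select_top_terms(chi2_scores(X, y), n_terms)
    doc_vectors, concepts = lsi_project(X[:, keep], n_concepts)
    return keep, doc_vectors, concepts


# Hypothetical toy corpus: 4 documents, 5 terms, 2 categories.
X_toy = [[2, 0, 1, 0, 1],
         [3, 0, 0, 0, 1],
         [0, 2, 0, 1, 1],
         [0, 3, 0, 2, 1]]
y_toy = [0, 0, 1, 1]

keep, doc_vectors, concepts = two_stage_reduce(X_toy, y_toy, n_terms=3, n_concepts=2)
print("kept term indices:", keep.tolist())
print("document vectors shape:", doc_vectors.shape)
```

In this toy run the term that occurs equally in both categories scores zero and is dropped along with the weakest remaining term, and the surviving terms are compressed to two concept dimensions. Replacing `chi2_scores` with an actual CCD scorer would leave the rest of the sketch unchanged.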
