Semi-Supervised Learning in Large Scale Text Categorization

XU Zewen1,2 (许泽文), LI Jianqiang1,2,3,4* (李建强), LIU Bo1 (刘博), BI Jing1 (毕敬), LI Rong1 (李蓉), MAO Rui3,4 (毛睿)

doi:10.1007/s12204-017-1835-3

Journal of Shanghai Jiaotong University (Science), 2017, Vol. 22, Issue 3: 291-302


(1. School of Software Engineering, Beijing University of Technology, Beijing 100124, China; 2. Beijing Engineering Research Center for IoT Software and Systems, Beijing University of Technology, Beijing 100124, China; 3. Guangdong Key Laboratory of Popular High Performance Computers, Shenzhen University, Shenzhen 518060, Guangdong, China; 4. Shenzhen Key Laboratory of Service Computing and Applications, Shenzhen University, Shenzhen 518060, Guangdong, China)

Online published: 2017-06-04


Abstract

The rapid development of the Internet has produced a wide variety of raw information, including text and audio. Because of its sheer volume, however, it is difficult to find the most useful knowledge quickly and accurately. Automatic text classification based on machine learning can assign large numbers of natural language documents to the appropriate subject categories according to their semantics, which helps readers grasp the text information directly. A traditional supervised classifier for text categorization (TC) is obtained by learning from a set of hand-labeled documents. However, labeling all of the data by hand is labor intensive and time consuming. To address this problem, some researchers have proposed semi-supervised learning methods for training the classifier, but these are infeasible for the great variety and volume of Web data because they still require some hand-labeled data. In 2012, Li et al. proposed a fully automatic categorization approach for text (FACT) based on supervised learning, in which no manual labeling effort is required. However, automatically labeling all of the data can introduce noise, so the results may fail to meet the accuracy requirement. We put forward a new idea: a portion of the data is automatically tagged with high accuracy based on the semantics of the category name, a classifier is then trained in a semi-supervised way on both the labeled and unlabeled data, and ultimately a precise classification of massive text data is achieved. Empirical experiments show that the method outperforms the supervised support vector machine (SVM) in terms of both F1 performance and classification accuracy in most cases, which demonstrates the effectiveness of the semi-supervised algorithm in automatic TC.

Key words: text data mining; semi-supervised; automatic tagging; classifier
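
The two-stage idea in the abstract (automatically tag only the documents that can be labeled with high confidence, then train on labeled and unlabeled data together) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the seed words standing in for "the semantics of the category name", the toy documents, the thresholds, and the use of plain self-training with a term-frequency centroid classifier in place of the semi-supervised SVM evaluated in the paper. It is not the authors' implementation.

```python
from collections import Counter

# Hypothetical seed words per category; a stand-in for category-name semantics.
SEED_WORDS = {
    "sports": {"sports", "football", "match"},
    "finance": {"finance", "bank", "stock"},
}

# Toy corpus (invented for this sketch).
DOCS = [
    "football match today football fans cheer",
    "bank stock prices rise as bank reports profit",
    "fans cheer when the team wins",
    "investors watch prices and profit reports",
]


def tokenize(text):
    return text.lower().split()


def auto_label(docs, seed_words, min_hits=2):
    """Stage 1: label a document only if it hits one category's seed words
    at least min_hits times and hits no other category at all."""
    labeled, unlabeled = [], []
    for doc in docs:
        tokens = tokenize(doc)
        hits = {c: sum(t in words for t in tokens) for c, words in seed_words.items()}
        best = max(hits, key=hits.get)
        other_hits = sum(n for c, n in hits.items() if c != best)
        if hits[best] >= min_hits and other_hits == 0:
            labeled.append((doc, best))
        else:
            unlabeled.append(doc)
    return labeled, unlabeled


def centroids(labeled):
    """Build a normalized term-frequency profile for each category."""
    counts = {}
    for doc, cat in labeled:
        counts.setdefault(cat, Counter()).update(tokenize(doc))
    model = {}
    for cat, counter in counts.items():
        total = sum(counter.values())
        model[cat] = {word: n / total for word, n in counter.items()}
    return model


def predict(doc, model):
    """Return (best category, margin over the runner-up score)."""
    tokens = tokenize(doc)
    scores = {
        cat: sum(profile.get(t, 0.0) for t in tokens) / max(len(tokens), 1)
        for cat, profile in model.items()
    }
    ranked = sorted(scores.values(), reverse=True)
    margin = ranked[0] - (ranked[1] if len(ranked) > 1 else 0.0)
    return max(scores, key=scores.get), margin


def self_train(labeled, unlabeled, rounds=5, min_margin=0.02):
    """Stage 2: repeatedly move confidently predicted documents from the
    unlabeled pool into the training set, then refit."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model = centroids(labeled)
        remaining = []
        for doc in pool:
            cat, margin = predict(doc, model)
            if margin >= min_margin:
                labeled.append((doc, cat))
            else:
                remaining.append(doc)
        if len(remaining) == len(pool):  # nothing new was labeled; stop early
            break
        pool = remaining
    return centroids(labeled)


labeled, unlabeled = auto_label(DOCS, SEED_WORDS)
model = self_train(labeled, unlabeled)
print(predict("the team wins the match", model)[0])
```

The strict stage-1 rule (several hits for one category, none for any other) trades coverage for label precision, which mirrors the abstract's point that tagging every document automatically introduces noise; the remaining documents contribute only through the semi-supervised stage.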

Cite this article

XU Zewen (许泽文), LI Jianqiang (李建强), LIU Bo (刘博), BI Jing (毕敬), LI Rong (李蓉), MAO Rui (毛睿). Semi-Supervised Learning in Large Scale Text Categorization [J]. Journal of Shanghai Jiaotong University (Science), 2017, 22(3): 291-302. DOI: 10.1007/s12204-017-1835-3

References

[1] LI J Q, ZHAO Y, LIU B. Exploiting semantic resources for large scale text categorization [J]. Journal of Intelligent Information Systems, 2012, 39(3): 763-788.
[2] MIYATO T, DAI A M, GOODFELLOW I. Virtual adversarial training for semi-supervised text classification [EB/OL]. (2016-07-22). https://arxiv.org/abs/1605.07725v1.
[3] YIN C Y, XIANG J, ZHANG H, et al. A new SVM method for short text classification based on semi-supervised learning [C]//2015 4th International Conference on Advanced Information Technology and Sensor Application. Dubai, UAE: IEEE, 2015: 100-103.
[4] JOHNSON R, ZHANG T. Semi-supervised convolutional neural networks for text categorization via region embedding [J]. Advances in Neural Information Processing Systems, 2015, 28: 919-927.
[5] JOHNSON R, ZHANG T. Supervised and semi-supervised text categorization using LSTM for region embeddings [C]//Proceedings of the 33rd International Conference on Machine Learning. New York, USA: JMLR W&CP, 2016: 1-9.
[6] SEBASTIANI F. Machine learning in automated text categorization [J]. ACM Computing Surveys, 2002, 34(1): 1-47.
[7] JOACHIMS T. Transductive inference for text classification using support vector machines [C]//Proceedings of the 16th International Conference on Machine Learning. Bled, Slovenia: [s.n.], 1999: 200-209.
[8] SIOLAS G, D'ALCHÉ-BUC F. Support vector machines based on a semantic kernel for text categorization [C]//Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. Washington, USA: IEEE, 2000: 205-209.
[9] BASILI R, CAMMISA M, MOSCHITTI A. Effective use of WordNet semantics via kernel-based learning [C]//Proceedings of the 9th Conference on Computational Natural Language Learning. Ann Arbor, USA: Association for Computational Linguistics, 2005: 1-8.
[10] GABRILOVICH E, MARKOVITCH S. Feature generation for text categorization using world knowledge [C]//International Joint Conference on Artificial Intelligence. [s.l.]: Morgan Kaufmann Publishers Inc, 2005: 1048-1053.
[11] WANG P, DOMENICONI C. Building semantic kernels for text classification using Wikipedia [C]//ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Las Vegas, USA: ACM, 2008: 713-721.
[12] CHAPELLE O, SCHÖLKOPF B, ZIEN A. Semi-supervised learning [M]. London, England: MIT Press, 2006.
[13] SINDHWANI V, KEERTHI S S. Large scale semi-supervised linear SVMs [C]//International ACM SIGIR Conference on Research and Development in Information Retrieval. Washington, USA: ACM, 2006: 477-484.
[14] SINDHWANI V, KEERTHI S S. Newton methods for fast solution of semi-supervised linear SVMs [EB/OL]. (2016-07-22). http://citeseerx.ist.psu.edu/viewdoc/download.
[15] LI C H, YANG J C, PARK S C. Text categorization algorithms using semantic approaches, corpus-based thesaurus and WordNet [J]. Expert Systems with Applications, 2012, 39: 765-772.
[16] FOX-ROBERTS P, ROSTEN E. Unbiased generative semi-supervised learning [J]. Journal of Machine Learning Research, 2014, 15: 367-443.
[17] SHANG F H, JIAO L C, LIU Y Y, et al. Semi-supervised learning with nuclear norm regularization [J]. Pattern Recognition, 2013, 46(8): 2323-2336.
[18] WANG J, JEBARA T, CHANG S F. Semi-supervised learning using greedy max-cut [J]. Journal of Machine Learning Research, 2013, 14: 729-758.
[19] CHENG S, SHI Y H, QIN Q D. Particle swarm optimization based semi-supervised learning on Chinese text categorization [C]//Proceedings of the 2012 IEEE Congress on Evolutionary Computation. Brisbane, Australia: IEEE, 2012: 1-8.
[20] LENG Y, XU X Y, QI G H. Combining active learning and semi-supervised learning to construct SVM classifier [J]. Knowledge-Based Systems, 2013, 44(1): 121-131.
[21] LI J Q, LIU C C, LIU B, et al. Diversity-aware retrieval of medical records [J]. Computers in Industry, 2015, 69(1): 81-91.
[22] YANG J M, LIU Y N, ZHU X D, et al. A new feature selection based on comprehensive measurement both in inter-category and intra-category for text categorization [J]. Information Processing and Management, 2012, 48(4): 741-754.
[23] BREVE F, ZHAO L, QUILES M, et al. Particle competition and cooperation in networks for semi-supervised learning [J]. IEEE Transactions on Knowledge and Data Engineering, 2011, 24(9): 1686-1698.
[24] LI J Q, WANG F. Semi-supervised learning via mean field methods [J]. Neurocomputing, 2016, 177: 385-393.
