Research on Web Page Classification Method Based on Query Log

doi:10.1007/s12204-017-1899-0

Abstract

Web page classification is an important application in many fields of Internet information retrieval, such as directory classification and vertical search. Query-log-based methods, a lightweight form of Web page classification, avoid crawling Web content and are therefore relatively efficient, but the sparsity of user click data makes it difficult to use that data directly to construct a classifier. To solve this problem, we explore the semantic relations among different queries through word embedding, and propose three improved graph-structure classification algorithms. To reflect the semantic relevance between queries, we first map each user query into a low-dimensional space according to its query vector. Then, we calculate the uniform resource locator (URL) vector according to the relationship between the query and the URL. Finally, we use the improved label propagation algorithm (LPA) and the bipartite graph expansion algorithm to classify the unlabeled Web pages. Experiments show that our methods achieve an F1 value about 20% higher than other Web page classification methods based on query logs.
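The three steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' improved LPA or their bipartite graph expansion algorithm: it assumes that query vectors are already available (for example, as averaged word embeddings), that a URL vector is the click-weighted mean of the vectors of the queries that led to it, and that labels spread over a URL-to-URL graph weighted by cosine similarity using plain clamped label propagation. All names, URLs, vectors and click counts below are hypothetical toy values.

```python
import math

def url_vectors(query_vecs, clicks):
    """Assumed URL vector: click-weighted mean of the vectors of its queries."""
    sums, totals = {}, {}
    for (query, url), count in clicks.items():
        vec = query_vecs[query]
        acc = sums.setdefault(url, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += count * x
        totals[url] = totals.get(url, 0) + count
    return {u: [x / totals[u] for x in acc] for u, acc in sums.items()}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def propagate_labels(vecs, seeds, classes, iters=20):
    """Plain clamped label propagation over a cosine-similarity URL graph."""
    urls = sorted(vecs)
    # Edge weight between two URLs: cosine similarity, clipped at zero.
    weight = {(u, v): max(0.0, cosine(vecs[u], vecs[v]))
              for u in urls for v in urls if u != v}
    # Seed URLs start with all mass on their known class; others start at zero.
    scores = {u: {c: 1.0 if seeds.get(u) == c else 0.0 for c in classes}
              for u in urls}
    for _ in range(iters):
        new_scores = {}
        for u in urls:
            if u in seeds:
                new_scores[u] = scores[u]  # labeled URLs stay clamped
                continue
            total = sum(weight[(u, v)] for v in urls if v != u)
            new_scores[u] = {
                c: (sum(weight[(u, v)] * scores[v][c] for v in urls if v != u) / total
                    if total else 0.0)
                for c in classes
            }
        scores = new_scores
    return {u: max(classes, key=lambda c: scores[u][c]) for u in urls}

# Hypothetical toy data: 2-dimensional query vectors and click counts.
query_vecs = {
    "nba scores": [1.0, 0.1],
    "football results": [0.9, 0.2],
    "stock prices": [0.1, 1.0],
    "bond yields": [0.2, 0.9],
}
clicks = {
    ("nba scores", "sports-a.example"): 5,
    ("nba scores", "sports-b.example"): 1,
    ("football results", "sports-b.example"): 3,
    ("stock prices", "finance-a.example"): 4,
    ("bond yields", "finance-b.example"): 2,
}
seeds = {"sports-a.example": "sports", "finance-a.example": "finance"}

vecs = url_vectors(query_vecs, clicks)
labels = propagate_labels(vecs, seeds, ["sports", "finance"])
```

In this toy setting the two unlabeled URLs take the label of the seed whose queries are semantically closest, even though `finance-b.example` shares no clicked query with any labeled URL. That is the sparsity problem the embedding step is meant to address.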

Key words: Web page classification, word embedding, query log

CLC number: TP 391.1

YE Feiyue (叶飞跃), MA Yixing (马祎星). Research on Web Page Classification Method Based on Query Log[J]. Journal of Shanghai Jiao Tong University (Science), 2018, 23(3): 404-.
