sa ›› 2018, Vol. 23 ›› Issue (3): 404-.doi: 10.1007/s12204-017-1899-0

Research on Web Page Classification Method Based on Query Log

YE Feiyue (叶飞跃), MA Yixing (马祎星)   

  1. (School of Computer Engineering and Science, Shanghai University, Shanghai 201900, China)
  • Online: 2018-05-31 Published: 2018-06-17
  • Contact: MA Yixing (马祎星) E-mail: myx20080227@126.com

Abstract: Web page classification is an important application in many fields of Internet information retrieval, such as directory classification and vertical search. Query-log-based methods, a lightweight form of Web page classification, avoid crawling Web content and are therefore relatively efficient, but the sparsity of user click data makes it difficult to use that data directly to construct a classifier. To solve this problem, we explore the semantic relations among different queries through word embedding and propose three improved graph-structure classification algorithms. First, to reflect the semantic relevance between queries, we map each user query into a low-dimensional space according to its query vector. Then, we calculate the uniform resource locator (URL) vector according to the relationship between the query and the URL. Finally, we use the improved label propagation algorithm (LPA) and the bipartite graph expansion algorithm to classify the unlabeled Web pages. Experiments show that our methods achieve an F1 value about 20% higher than other Web page classification methods based on query logs.

Key words: Web page classification, word embedding, query log
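
The three steps named in the abstract (query vectors, URL vectors, label propagation) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the formulas, so the sketch assumes that a query vector is the mean of its word embeddings, that a URL vector is the click-weighted mean of the vectors of the queries that led to it, and that classification uses plain label propagation over a cosine-similarity graph rather than the improved LPA or the bipartite graph expansion algorithm. All function names, the toy embeddings, and the example URLs are hypothetical.

```python
import numpy as np

def query_vector(query, embeddings, dim):
    """Assumed step 1: mean of the word embeddings of the query's known words."""
    vecs = [embeddings[w] for w in query.split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def url_vectors(clicks, embeddings, dim):
    """Assumed step 2: click-weighted mean of query vectors for each URL.

    clicks: iterable of (query, url, click_count) rows from a query log.
    """
    sums, weights = {}, {}
    for query, url, count in clicks:
        vec = query_vector(query, embeddings, dim)
        sums[url] = sums.get(url, np.zeros(dim)) + count * vec
        weights[url] = weights.get(url, 0) + count
    return {u: sums[u] / weights[u] for u in sums}

def propagate_labels(vectors, seeds, n_classes, iters=50):
    """Step 3 stand-in: plain label propagation with clamped seed labels."""
    urls = sorted(vectors)
    X = np.array([vectors[u] for u in urls])
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.where(norms == 0, 1.0, norms)
    W = np.clip(X @ X.T, 0.0, None)      # non-negative cosine similarities
    np.fill_diagonal(W, 0.0)             # no self-loops
    row_sums = W.sum(axis=1, keepdims=True)
    P = W / np.where(row_sums == 0, 1.0, row_sums)  # row-normalised weights

    Y = np.zeros((len(urls), n_classes))
    seed_idx = [i for i, u in enumerate(urls) if u in seeds]
    for i in seed_idx:
        Y[i, seeds[urls[i]]] = 1.0
    F = Y.copy()
    for _ in range(iters):
        F = P @ F                        # spread label mass to neighbours
        F[seed_idx] = Y[seed_idx]        # keep the labelled URLs fixed
    return {u: int(F[i].argmax()) for i, u in enumerate(urls)}

# Toy data (hypothetical): 2-dimensional embeddings and a four-row click log.
embeddings = {
    "football": np.array([1.0, 0.0]),
    "score": np.array([0.9, 0.1]),
    "league": np.array([0.8, 0.2]),
    "stock": np.array([0.0, 1.0]),
    "price": np.array([0.1, 0.9]),
    "market": np.array([0.2, 0.8]),
}
clicks = [
    ("football score", "sports.example/a", 5),
    ("league score", "sports.example/b", 3),
    ("stock price", "finance.example/a", 4),
    ("market price", "finance.example/b", 2),
]
seeds = {"sports.example/a": 0, "finance.example/a": 1}  # class 0 = sports, 1 = finance
labels = propagate_labels(url_vectors(clicks, embeddings, 2), seeds, n_classes=2)
print(labels)
```

In this toy log the two unlabeled URLs share no query with the labeled ones, so click data alone could not link them; the embedding-based URL vectors supply the similarity that lets the labels spread, which is the sparsity problem the abstract describes.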
