Automated Web Page Content Extraction Method Based on Document Object Model

doi:10.16183/j.cnki.jsjtu.2018.10.027

Abstract

Abstract: Web content extraction has significant engineering and practical value in information retrieval, text analysis, and the processing of web resource data. Extraction is difficult because web pages contain large amounts of irrelevant content and their structures are heterogeneous. To address this, this paper proposes an automated web page content extraction method based on the Document Object Model (DOM). First, useless nodes are filtered out of the DOM generated from the original web page and the model is then compressed, which facilitates subsequent analysis and processing. Next, the web page content is identified by a content extraction method based on text-link density. Finally, noise hyperlinks in the content are identified by node entropy and removed. Experimental results show that, compared with traditional web page content extraction methods, the accuracy and F1 score of the proposed method improve markedly, while recall declines only slightly.

Key words: document object model (DOM), web page content extraction, text density, node entropy
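
The density step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract gives no formulas, so the scoring function below (plain-text characters squared, divided by total characters) is an assumed stand-in, and names such as `TreeBuilder`, `score`, and `extract_main_text` are hypothetical. Node filtering is reduced to skipping `script`, `style`, and `noscript`; DOM compression and the entropy-based removal of noise links are not attempted.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "noscript"}   # assumed "useless" nodes
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # no closing tag


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children = []   # str (text) and Node (element), in document order
        self.text_len = 0    # characters of text in the whole subtree
        self.link_len = 0    # characters of that text located inside <a>


class TreeBuilder(HTMLParser):
    """Builds a small DOM-like tree, dropping SKIP_TAGS subtrees."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag in SKIP_TAGS or self.skip_depth:
            self.skip_depth += 1
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self.skip_depth:
            self.current.children.append(text)


def measure(node, inside_link=False):
    """Fill text_len and link_len for every node, bottom-up."""
    inside_link = inside_link or node.tag == "a"
    for child in node.children:
        if isinstance(child, str):
            node.text_len += len(child)
            node.link_len += len(child) if inside_link else 0
        else:
            measure(child, inside_link)
            node.text_len += child.text_len
            node.link_len += child.link_len


def score(node):
    """Assumed heuristic: reward plain text, penalize a high link share."""
    if node.text_len == 0:
        return 0.0
    plain = node.text_len - node.link_len
    return plain * plain / node.text_len


def walk(node):
    yield node
    for child in node.children:
        if isinstance(child, Node):
            yield from walk(child)


def text_of(node):
    parts = [c if isinstance(c, str) else text_of(c) for c in node.children]
    return " ".join(p for p in parts if p)


def extract_main_text(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    measure(builder.root)
    best = max(walk(builder.root), key=score)  # ties go to the outermost node
    return text_of(best)


SAMPLE_PAGE = """
<html><head><title>t</title><style>p {color: red}</style></head><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<div id="main">
  <p>Web content extraction separates the article body from page chrome.</p>
  <p>Dense text with few links usually marks the main content.</p>
</div>
<div id="footer"><a href="/terms">Terms</a> <a href="/privacy">Privacy</a></div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE_PAGE))
```

On the sample page the navigation and footer blocks consist entirely of link text, so they score zero, and any ancestor that contains them scores lower than the `main` block alone. A production extractor would need the remaining stages named in the abstract, which this sketch leaves out.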

CLC number:

TP 391

LI Tongyu, REN Rui, CAI Hongming, JIANG Lihong. Automated Web Page Content Extraction Method Based on Document Object Model[J]. Journal of Shanghai Jiao Tong University, 2018, 52(10): 1363-1369.

