Automated Web Page Content Extraction Method Based on Document Object Model

doi:10.16183/j.cnki.jsjtu.2018.10.027

Abstract

Web content extraction has considerable engineering and application value in information retrieval, text analysis, and the processing of network resource data. Useless information on web pages and the heterogeneity of page structures make extraction difficult. To address this, the paper proposes an automated web page content extraction method based on the Document Object Model (DOM). First, useless nodes are removed from the DOM generated from the original web page and the model is compressed, which simplifies subsequent processing. Next, the page content is identified from text density and hyperlink density. Finally, noise hyperlinks are identified by node entropy and removed from the content. Experimental results show that, compared with traditional web page content extraction methods, the method clearly improves accuracy and F1 score, with only a slight decline in recall.
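
The density step can be illustrated with a minimal sketch. The abstract does not give the paper's formulas or thresholds, so the definitions here are assumptions: text density is taken as characters per tag in a subtree, and hyperlink density as the share of characters that sit inside `<a>` elements. The names `Node`, `TreeBuilder`, `stats`, and `densities` are hypothetical, as is the choice of which tags count as useless.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "noscript"}             # assumed "useless" nodes
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.own_text = ""  # text directly inside this node, whitespace-trimmed


class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree (illustrative, not a full parser)."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Climb to the nearest open element with this tag, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.own_text += data.strip()


def stats(node, in_link=False):
    """Return (chars, link_chars, tags) for the subtree rooted at node."""
    if node.tag in SKIP_TAGS:
        return 0, 0, 0
    in_link = in_link or node.tag == "a"
    chars = len(node.own_text)
    link_chars = chars if in_link else 0
    tags = 1
    for child in node.children:
        c, l, t = stats(child, in_link)
        chars, link_chars, tags = chars + c, link_chars + l, tags + t
    return chars, link_chars, tags


def densities(node):
    """Assumed definitions: chars per tag, and link chars over all chars."""
    chars, link_chars, tags = stats(node)
    text_density = chars / tags if tags else 0.0
    link_density = link_chars / chars if chars else 0.0
    return text_density, link_density


html = """
<body>
  <div id="nav"><a href="/a">Home</a><a href="/b">News</a></div>
  <div id="main"><p>This is the main article text with many words.</p>
  <p>Another paragraph, with one <a href="/c">link</a>.</p></div>
</body>
"""
builder = TreeBuilder()
builder.feed(html)
body = builder.root.children[0]
nav, main = body.children
nav_td, nav_ld = densities(nav)
main_td, main_ld = densities(main)
print(f"nav:  text density {nav_td:.2f}, link density {nav_ld:.2f}")
print(f"main: text density {main_td:.2f}, link density {main_ld:.2f}")
```

Under these assumed definitions, the navigation block has low text density and a hyperlink density of 1, while the article block has high text density and low hyperlink density, which is the kind of contrast a density-based method can exploit. The entropy-based removal of noise hyperlinks is not sketched here because the abstract does not define node entropy.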

Key words: document object model (DOM), content extraction of web pages, text density, node entropy

CLC Number: TP 391

LI Tongyu, REN Rui, CAI Hongming, JIANG Lihong. Automated Web Page Content Extraction Method Based on Document Object Model[J]. Journal of Shanghai Jiao Tong University, 2018, 52(10): 1363-1369.


