Journal of Shanghai Jiaotong University (Natural Science Edition) ›› 2018, Vol. 52 ›› Issue (10): 1363-1369. doi: 10.16183/j.cnki.jsjtu.2018.10.027
LI Tongyu (李桐宇), REN Rui (任锐), CAI Hongming (蔡鸿明), JIANG Lihong (姜丽红)
Corresponding author:
CAI Hongming, male, professor, doctoral supervisor. Tel.: 021-34205153; E-mail: hmcai@sjtu.edu.cn.
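About the author:
LI Tongyu (1995-), male, from Suining County, Jiangsu Province; master's student. Main research interest: extraction of web page content and semantic tags.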
LI Tongyu,REN Rui,CAI Hongming,JIANG Lihong
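Abstract: Web page content extraction has significant engineering and practical value in information retrieval, text analysis, and the processing of web resource data. Web pages contain large amounts of irrelevant content and are structurally heterogeneous, which makes content extraction difficult. To address this, an automated web page content extraction method based on the Document Object Model (DOM) is proposed. First, after node filtering, the DOM of the page is compressed to simplify subsequent analysis. Then, a content extraction method based on text-link density is proposed to identify the page content. Finally, noisy links within the page content are identified and removed on the basis of node entropy. Experimental results show that, compared with traditional web page content extraction methods, the proposed method clearly improves precision and F1 score, while recall decreases only slightly.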
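The abstract does not give the formulas behind the text-link density step, so the following is only a minimal sketch of the general idea, not the authors' method. It filters out script and style nodes, counts text characters and link-text characters for every DOM subtree, and picks the node that holds the most non-link text relative to its total text. The `score` function, the `SKIP_TAGS` and `VOID_TAGS` lists, and all class and function names are assumptions introduced for illustration. DOM compression and the node-entropy step are not modeled, and well-formed HTML is assumed.

```python
from html.parser import HTMLParser

# Hypothetical filter lists (assumptions, not taken from the paper).
SKIP_TAGS = {"script", "style", "noscript"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text_chars = 0  # text characters in this subtree
        self.link_chars = 0  # text characters inside <a> in this subtree


class DensityParser(HTMLParser):
    """Builds a small tree and accumulates text and link-text counts."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.skip_depth = 0  # > 0 while inside a filtered tag
        self.link_depth = 0  # > 0 while inside <a>

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag in SKIP_TAGS:
            self.skip_depth += 1
            return
        if tag == "a":
            self.link_depth += 1
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        if self.current.parent is not None and self.current.tag == tag:
            self.current = self.current.parent

    def handle_data(self, data):
        if self.skip_depth:
            return
        n = len(data.strip())
        node = self.current
        while node is not None:  # propagate counts to every ancestor
            node.text_chars += n
            if self.link_depth:
                node.link_chars += n
            node = node.parent


def score(node):
    """Illustrative score: non-link text weighted by its share of all text."""
    if node.text_chars == 0:
        return 0.0
    non_link = node.text_chars - node.link_chars
    return non_link * non_link / node.text_chars


def extract_main_node(html):
    """Return the node with the highest score; ties go to the later-visited node."""
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    best, best_score = parser.root, -1.0
    stack = [parser.root]
    while stack:
        node = stack.pop()  # a parent is always visited before its children
        s = score(node)
        if s >= best_score:
            best, best_score = node, s
        stack.extend(node.children)
    return best


if __name__ == "__main__":
    page = (
        '<html><body><div id="nav"><a href="/a">Home</a><a href="/b">News</a></div>'
        '<div id="main"><p>Main article text that is long enough to dominate.</p>'
        '<p>More text with a <a href="/x">link</a> inside.</p></div></body></html>'
    )
    node = extract_main_node(page)
    print(node.tag, node.attrs.get("id"))
```

Weighting the non-link text by its share of the subtree's text penalizes containers that also wrap navigation bars. A plain character count would always favor the root node.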
LI Tongyu, REN Rui, CAI Hongming, JIANG Lihong. Automated Web Page Content Extraction Method Based on Document Object Model[J]. Journal of Shanghai Jiaotong University, 2018, 52(10): 1363-1369.