基于窗口函数和分布式集群的可视化学术搜索系统数据查询优化

罗希意，霍晓阳，傅洛伊

doi:10.16183/j.cnki.jsjtu.2019.08.013

上海交通大学学报 >

2019 , Vol. 53 >Issue 8: 978 - 982

DOI: https://doi.org/10.16183/j.cnki.jsjtu.2019.08.013

学报（中文）

基于窗口函数和分布式集群的可视化学术搜索系统数据查询优化

展开

上海交通大学电子信息与电气工程学院，上海 200240

罗希意（1987-），男，重庆市人，硕士生，主要从事数据库和大数据处理研究.

网络出版日期: 2019-09-10

基金资助

科技部重点研发计划(2018YFB1004703)，上海市科学技术委员会创新行动计划(17511105103，18510761200)

收起

Query Optimization of Data Based on Window Function and Distributed Cluster in Visual Academic Search System

Expand

School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

Online published: 2019-09-10

Fold

摘要

针对在密集分析型查询请求和海量数据的应用场景下传统关系型数据库MySQL性能不佳问题，提出了基于窗口函数(Window Function)的分析型查询优化方法，以分区(Partitioning)方法代替传统的分组(Group by)操作，并提出了基于分布式集群(SQL-on-Hadoop: SparkSQL)计算引擎的海量数据查询优化方法，采用内存列存储优化技术和Spark分布式集群计算以提高查询性能.同时，以典型的分析型SQL查询实例验证了其有效性.结果表明，所提出的查询优化方法能够显著提高查询性能.与传统的关系型数据库MySQL相比，基于SparkSQL的查询优化方法的查询速度大幅提高，从而验证了其用于可视化学术搜索系统AceMap数据查询的正确性.

关键词： 结构化查询语言；窗口函数；分布式计算；查询优化

本文引用格式

罗希意，霍晓阳，傅洛伊 . 基于窗口函数和分布式集群的可视化学术搜索系统数据查询优化[J]. 上海交通大学学报, 2019 , 53(8) : 978 -982 . DOI: 10.16183/j.cnki.jsjtu.2019.08.013

Abstract

In order to address the issue of the poor performance of traditional MySQL database in application scenarios of densely analytical query requests and massive data processing, we proposed an approach based on window functions for analytical SQL query optimization. The approach replaces the fundamental grouping operation by the partitioning operation. In addition, we also designed distributed clusters based method for massive data query optimization, the method utilizes the in-memory columnar storage technology and Spark cluster’s distributed computation to lift the query performance. Meanwhile, the validity of the proposed approaches has been verified by typical analytical SQL queries. Experimental results show that the proposed methods have improved the query performance significantly, as the query optimization based on SparkSQL has reduced the execution time by a wide margin compared to traditional relational database MySQL. These proved the effectiveness when the methods are applied in AceMap, a visual academic search system.

Key words： structured query language (SQL); window functions; distributed computation; query optimization

参考文献

［1］WU Z H, WU J, KHABSA M, et al. Towards building a scholarly big data platform: Challenges, lessons and opportunities［C］//Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries. Piscataway, USA: IEEE Press, 2014: 117-126. ［2］SINHA A, SHEN Z H, SONG Y, et al. An overview of Microsoft academic service (MAS) and applications［C］//Proceedings of the 24th International Conference on World Wide Web. New York, USA: ACM Press, 2015: 243-246. ［3］SCHWARTZ B, ZAITSEV P, TKACHENKO V. High performance MySQL ［M］. 3rd edition. Sebastopol, USA: O’Reilly Media, 2012: 1-31. ［4］LEIS V, KUNDHIKANJANA K, KEMPER A, et al. Efficient processing of window functions in analytical SQL queries［C］//Proceedings of the VLDB Endowment. Hawaii, USA: Morgan Press, 2015, 8(10): 1058-1069. ［5］ARMBRUST M, XIN R, LIAN C, et al. Spark SQL: Relational data processing in spark［C］//Proceedings of the ACM SIGMOD International Conference on Management of Data. New York, USA: ACM Press, 2015: 1383-1394. ［6］GUO C H, WU Z G, HE Z Y, et al. An adaptive data partitioning scheme for accelerating exploratory spark SQL queries［C］//International Conference on Database Systems for Advanced Applications. Suzhou, China: DASFAA Press, 2017: 114-128. ［7］CUZZOCREA A, BELLATRECHE L, SONG I. Data warehousing and OLAP over big data: Current challenges and future research directions［C］//Proceedings of the Sixteenth International Workshop on Data Warehousing and OLAP. New York, USA: ACM Press, 2013: 67-70. ［8］EISENBERG A, MELTON J, KULKARNI K, et al. SQL: 2003 has been published［EB/OL］.［2017-10-10］. https://dl.acm.org/citation.cfm?id=974142. ［9］RIGGS S, CIOLLI G, KROSING H, et al. PostgreSQL 9 Administration cookbook ［M］. 2nd edition. Birmingham, UK: Packt Press, 2015: 1-8. ［10］AGARWAL S, ARMBRUST M, BRADLEY J, et al. Spark documentations-spark overview［EB/OL］. ［2017-10-10］. http://spark.apache.org/committers.html. ［11］WHITE T. Hadoop: The definitive guide［M］. 4th edition. Sebastopol, USA: O’Reilly Media, 2015: 3-17. ［12］ABADI D, BABU S, ZCAN F, et al. Tutorial: SQL-on-hadoop systems［C］//Proceedings of the VLDB Endowment. Hawaii, USA: VLBD Press, 2015, 8(12): 2050-2051. ［13］PLATTNER H, ZEIER A, TINNEFELD C, et al. Available-to-promise on an in-memory column store: EP 2575093 ［P］. 2011-10-10［2013-04-03］.

Options

文章导航

模态框（Modal）标题

摘要

本文引用格式

Abstract

参考文献