针对在密集分析型查询请求和海量数据的应用场景下传统关系型数据库MySQL性能不佳问题,提出了基于窗口函数(Window Function)的分析型查询优化方法,以分区(Partitioning)方法代替传统的分组(Group by)操作,并提出了基于分布式集群(SQL-on-Hadoop: SparkSQL)计算引擎的海量数据查询优化方法,采用内存列存储优化技术和Spark分布式集群计算以提高查询性能.同时,以典型的分析型SQL查询实例验证了其有效性.结果表明,所提出的查询优化方法能够显著提高查询性能.与传统的关系型数据库MySQL相比,基于SparkSQL的查询优化方法的查询速度大幅提高,从而验证了其用于可视化学术搜索系统AceMap数据查询的正确性.
In order to address the issue of the poor performance of traditional MySQL database in application scenarios of densely analytical query requests and massive data processing, we proposed an approach based on window functions for analytical SQL query optimization. The approach replaces the fundamental grouping operation by the partitioning operation. In addition, we also designed distributed clusters based method for massive data query optimization, the method utilizes the in-memory columnar storage technology and Spark cluster’s distributed computation to lift the query performance. Meanwhile, the validity of the proposed approaches has been verified by typical analytical SQL queries. Experimental results show that the proposed methods have improved the query performance significantly, as the query optimization based on SparkSQL has reduced the execution time by a wide margin compared to traditional relational database MySQL. These proved the effectiveness when the methods are applied in AceMap, a visual academic search system.
[1]WU Z H, WU J, KHABSA M, et al. Towards building a scholarly big data platform: Challenges, lessons and opportunities[C]//Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries. Piscataway, USA: IEEE Press, 2014: 117-126.
[2]SINHA A, SHEN Z H, SONG Y, et al. An overview of Microsoft academic service (MAS) and applications[C]//Proceedings of the 24th International Conference on World Wide Web. New York, USA: ACM Press, 2015: 243-246.
[3]SCHWARTZ B, ZAITSEV P, TKACHENKO V. High performance MySQL [M]. 3rd edition. Sebastopol, USA: O’Reilly Media, 2012: 1-31.
[4]LEIS V, KUNDHIKANJANA K, KEMPER A, et al. Efficient processing of window functions in analytical SQL queries[C]//Proceedings of the VLDB Endowment. Hawaii, USA: Morgan Press, 2015, 8(10): 1058-1069.
[5]ARMBRUST M, XIN R, LIAN C, et al. Spark SQL: Relational data processing in spark[C]//Proceedings of the ACM SIGMOD International Conference on Management of Data. New York, USA: ACM Press, 2015: 1383-1394.
[6]GUO C H, WU Z G, HE Z Y, et al. An adaptive data partitioning scheme for accelerating exploratory spark SQL queries[C]//International Conference on Database Systems for Advanced Applications. Suzhou, China: DASFAA Press, 2017: 114-128.
[7]CUZZOCREA A, BELLATRECHE L, SONG I. Data warehousing and OLAP over big data: Current challenges and future research directions[C]//Proceedings of the Sixteenth International Workshop on Data Warehousing and OLAP. New York, USA: ACM Press, 2013: 67-70.
[8]EISENBERG A, MELTON J, KULKARNI K, et al. SQL: 2003 has been published[EB/OL].[2017-10-10]. https://dl.acm.org/citation.cfm?id=974142.
[9]RIGGS S, CIOLLI G, KROSING H, et al. PostgreSQL 9 Administration cookbook [M]. 2nd edition. Birmingham, UK: Packt Press, 2015: 1-8.
[10]AGARWAL S, ARMBRUST M, BRADLEY J, et al. Spark documentations-spark overview[EB/OL]. [2017-10-10]. http://spark.apache.org/committers.html.
[11]WHITE T. Hadoop: The definitive guide[M]. 4th edition. Sebastopol, USA: O’Reilly Media, 2015: 3-17.
[12]ABADI D, BABU S, ZCAN F, et al. Tutorial: SQL-on-hadoop systems[C]//Proceedings of the VLDB Endowment. Hawaii, USA: VLBD Press, 2015, 8(12): 2050-2051.
[13]PLATTNER H, ZEIER A, TINNEFELD C, et al. Available-to-promise on an in-memory column store: EP 2575093 [P]. 2011-10-10[2013-04-03].