Today, DataScience announced the launch of Grunion, a patent-pending query optimization, translation, and federation framework built to help bridge the gap between data science and data engineering teams.
Grunion is the first project from DataScience Labs, our testing ground for experimental data science projects. Grunion doesn't execute queries itself, but instead issues them to databases and other query engines. It extends the data management framework, Apache Calcite — like other important Apache projects such as Drill, Flink, and Storm — in a novel way to create a pluggable and composable infrastructure.
The purpose of Grunion is to streamline data science workflows by lessening reliance on engineering. As data science teams mature, they face a growth in both data volume and sources. Learning the syntax and features of new query languages, managing the complexity and latency of ETL from multiple, evolving sources, and dealing with the ins and outs of MPP systems requires data scientists to either spend less time doing data science or lean heavily on a team of data engineers to operationalize their workflows. Grunion aims to change that paradigm.
We think Grunion’s most compelling integration is as a drop-in turbocharger for Spark’s Catalyst optimizer. Spark’s support for pushing down queries into underlying data sources is limited to predicates and projections. Grunion, however, can push down just about anything into a SQL or NoSQL database that the database supports. Grunion also allows you to take advantage of your existing Spark workflows and machine learning libraries with almost no changes.
Grunion defines four main components — languages, compilers, interpreters, and translators. Here's how it works:
Languages allow Grunion to parse any script into an intermediate representation for optimization and rewriting. Grunion supports scripts written in SQL languages like PostgresSQL, Redshift SQL, MySQL, and SparkSQL; NoSQL languages like Pig Latin; and, most importantly, the Spark DataSet API, by converting Spark’s QueryPlans created by Spark’s internal query optimizer, Catalyst.
Compilers allow Grunion to convert its intermediate representation back into scripts for a given language. For each language, compilers enrich Calcite with a set of tables which enumerate the features the language supports, like functions, joins, window functions and set operators, while providing mappings to make sure the pushed-down queries are semantically equivalent across databases.
Getting the day of week from a timestamp is a straightforward example of a not-so-straightforward implementation across databases that Grunion handles gracefully. In MySQL, for example, DAYOFWEEK(timestamp) starts Sunday at 1, but PostgreSQL EXTRACT(DOW FROM TIMESTAMP) starts Sunday at 0. That’s an off-by-one error that would otherwise come as a surprise when federating queries.
Interpreters allow Grunion to task other systems to execute the scripts generated from the intermediate representation. An interpreter can be anything from a NoSQL database like MongoDB; a relational database like MySQL, PostgreSQL, or Redshift; or a system like Presto, Flink, Drill or Spark. Each interpreter also adds new rules to Calcite’s cost-based optimizer. In this way, we can define rules that let us compose interpreters and create federated query plans that balance pushing down as much as possible into underlying databases with parallelizing subqueries in a system like Spark.
Translators were serendipitous for Grunion. After Grunion’s infrastructure components were defined, it was only natural to take languages and compilers and skip the interpreters all together in order to parse scripts in one language and compile the scripts in another language. This allows a data scientist to take the hundreds of lines of query she wrote in Redshift SQL and simplify the handoff to a data engineer by generating Scala code that uses the Spark DataSets API. A translator can help migrate legacy workflows or even serve as an educational tool for a data scientist who is learning a new query language.
How does Grunion stack up?
This all sounds well and good, but how much does optimization and pushdown support actually help? To demonstrate the impact pushing down more than predicates and projections into underlying databases can have, we decided to use the TPC Benchmark™ DS. TPC-DS is an industry standard for measuring performance in big data systems. Amazon, Cloudera, Databricks, and IBM have all published results comparing their systems using it. For those unfamiliar with it, the benchmark models a mix of interactive exploratory queries and BI workloads for a brick-and-mortar retail company with an e-commerce presence.
Fortunately, Databricks released an extensible project called spark-sql-perf for running the benchmark on Spark on GitHub. We used it to compare Spark’s native query performance on a PostgreSQL database against Grunion’s Spark implementation. All it took was replacing the SparkSession with a GrunionSession. Then, Grunion used a combination of pushing-down queries into PostgreSQL, issuing subqueries for Spark to execute in parallel and automatically defining table partitions for Spark to parallelize within a query to outperform Spark and Postgres natively.
For our benchmark runs, we loaded TPC-DS data at scale of 10GB of data into an Amazon RDS db.m4.xlarge PostgreSQL instance and indexed the data according to suggestions by Tomas Vondral. We then launched a 6-node EMR cluster using m4.xlarge instances (1 master and 5 slaves). On that cluster, we defined tables in a Hive metastore and ran SparkSQL’s ANALYZE TABLE with COMPUTE STATISTICS FOR COLUMNS for each table. Finally, we ran the benchmark as a Spark job on YARN using Databricks' spark-sql-perf project.
We grouped the queries according to the categories Cloudera used to present their results and plotted them. Grunion performed 25 times better for the reporting queries, 12.5 times better for interactive queries, and roughly 3 times better for analytic and common queries. Almost all the queries were entirely pushed down and PostgreSQL hardly broke a sweat. Reducing the data that has to travel back to Spark through aggregations and highly selective joins or parallelizing the underlying subqueries made all the difference. Our results also mean the queries could have been run from a single machine, instead of a cluster, for a fraction of the cost.
Ready to run with Grunion? You can request access here.
Sign up today to receive the latest DataScience content in your inbox.