With Grunion, data scientists can translate queries from one language to another quickly, effectively reducing the need for expensive ETL processes and reliance on engineering
LOS ANGELES, CA--(Marketwired - Mar 14, 2017) - DataScience, Inc., today announced the release of Grunion, a patent-pending query optimization, translation, and federation framework built on top of Apache Calcite and integrated into Apache Spark. Designed to bridge the gap between data science and engineering teams by removing the need to manually translate code from one language to another, Grunion is the first project out of DataScience Labs, the company's testing ground for experimental data science projects.
Grunion limits the need for expensive and slow ETL processes by providing a unified query language and APIs to push down complex query operators, joins, functions, and aggregations into SQL and NoSQL databases. But Grunion's most compelling feature is its ability to integrate with Spark SQL's Catalyst optimizer, essentially turbocharging its capabilities.
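To make "pushing down" concrete, here is a minimal sketch of the general idea, using Python's built-in sqlite3 module as a stand-in database. It does not use Grunion or Spark; the `orders` table, its columns, and both function names are hypothetical and chosen only for illustration. The first function pulls every row out of the database and aggregates on the client, while the second sends the aggregation to the database so that only the summarized rows come back.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a small, made-up orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("west", 10.0), ("west", 5.0), ("east", 7.5)],
    )
    return conn


def totals_without_pushdown(conn):
    """Fetch every row, then aggregate on the client side."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM orders"):
        totals[region] = totals.get(region, 0.0) + amount
    return totals


def totals_with_pushdown(conn):
    """Let the database run the aggregation and return one row per region."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region"
    )
    return dict(rows)


if __name__ == "__main__":
    conn = make_db()
    print(totals_without_pushdown(conn))
    print(totals_with_pushdown(conn))
```

Both functions return the same totals; the difference is how much data leaves the database, which is the cost that pushdown is meant to reduce.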
"Spark's level of support for pushing down queries into data sources is limited," said Jason Slepicka, senior data engineer at DataScience. "With Grunion, you can push down just about anything into a SQL or NoSQL database that the database supports, and at an accelerated speed. We tested Grunion on the TPC Benchmark™ DS, an industry standard for measuring performance in big data systems, and discovered that it can fully push down and parallelize Spark SQL queries against a relational database to achieve execution times 10 to 30 times faster than Spark can achieve alone."
Grunion enhances DataScience's enterprise platform, the DataScience Cloud, where users can deploy models built in their language of choice without rewriting code into a production stack language or PMML. The platform also allows notebooks, models, and other files to be grouped together in the same repository or project, regardless of the language they were written in. Grunion helps facilitate these capabilities with four main components:
- Languages: Grunion can parse any script into an intermediate representation for optimization and rewriting. Grunion supports scripts written in SQL languages like PostgreSQL, Redshift SQL, MySQL, and Spark SQL; NoSQL query languages like MongoDB's; workflow languages like Pig Latin; and, most importantly, the Spark Dataset API.
- Compilers: Grunion converts the intermediate representation into scripts for a given language. Grunion also checks which features each language supports, such as data types, functions, joins, window functions, and set operators, to ensure only supported features are pushed down, while providing mappings to make sure the pushed-down queries are semantically equivalent across databases.
- Interpreters: Grunion can task other systems to execute the scripts generated from an intermediate representation. An interpreter can be anything from a NoSQL database like MongoDB or a relational database like MySQL, PostgreSQL, or Redshift to a system like Presto, Flink, Drill, or Spark.
- Translators: Grunion can parse scripts in one language, like PostgreSQL, and compile them into another language, like the Scala Spark Dataset API, reducing the effort required from engineering to operationalize data science work.
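
The parse, compile, and translate flow described in the components above can be illustrated with a toy sketch. This is not Grunion's implementation or API: the `Query` class, the function names, and the tiny SQL subset it accepts (`SELECT columns FROM table [LIMIT n]`) are all assumptions made for illustration. The generated strings use real PySpark DataFrame calls (`spark.table`, `select`, `limit`) and MongoDB shell syntax (`find`, `limit`). The MongoDB compiler also shows one example of a semantic mapping: MongoDB returns `_id` by default, so the sketch excludes it unless it was explicitly selected.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Query:
    """Toy intermediate representation (hypothetical, not Grunion's)."""
    table: str
    columns: Tuple[str, ...]
    limit: Optional[int] = None


# Accepts only: SELECT col[, col...] FROM table [LIMIT n]
_SELECT = re.compile(
    r"^\s*SELECT\s+(?P<cols>[\w\s,]+?)\s+FROM\s+(?P<table>\w+)"
    r"(?:\s+LIMIT\s+(?P<limit>\d+))?\s*;?\s*$",
    re.IGNORECASE,
)


def parse_sql(text: str) -> Query:
    """'Language' step: parse a script into the intermediate representation."""
    match = _SELECT.match(text)
    if match is None:
        raise ValueError(f"unsupported query: {text!r}")
    columns = tuple(c.strip() for c in match.group("cols").split(","))
    limit = match.group("limit")
    return Query(match.group("table"), columns, int(limit) if limit else None)


def to_pyspark(query: Query) -> str:
    """'Compiler' step: emit PySpark DataFrame API code as a string."""
    cols = ", ".join(f'"{c}"' for c in query.columns)
    code = f'spark.table("{query.table}").select({cols})'
    if query.limit is not None:
        code += f".limit({query.limit})"
    return code


def to_mongo_shell(query: Query) -> str:
    """'Compiler' step: emit a MongoDB shell query as a string."""
    fields = [f"{c}: 1" for c in query.columns]
    if "_id" not in query.columns:
        # MongoDB includes _id unless told otherwise; exclude it so the
        # result has the same columns as the SQL query.
        fields.append("_id: 0")
    code = f"db.{query.table}.find({{}}, {{{', '.join(fields)}}})"
    if query.limit is not None:
        code += f".limit({query.limit})"
    return code


COMPILERS = {"pyspark": to_pyspark, "mongo": to_mongo_shell}


def translate(sql: str, target: str) -> str:
    """'Translator' step: parse in one language, compile into another."""
    return COMPILERS[target](parse_sql(sql))


if __name__ == "__main__":
    sql = "SELECT id, name FROM users LIMIT 5"
    print(translate(sql, "pyspark"))
    print(translate(sql, "mongo"))
```

A real framework would also check which features each target supports before pushing a query down, as the Compilers item describes; the sketch omits that step to stay short.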
"The idea behind Grunion -- and behind the DataScience Cloud as a whole -- is that data scientists need a way to make the work they do valuable to their whole organization, without relying heavily on outside resources like engineering," said DataScience CSO William Merchan. "By releasing Grunion, we're sharing some of those important capabilities with the larger data science community."
To run with Grunion, or to request a demo of the DataScience Cloud, please visit www.datascience.com.
DataScience, Inc. provides the DataScience Cloud, an enterprise data science platform that combines the tools, libraries, and languages data scientists know with the infrastructure and workflows their organizations need. The DataScience Cloud maximizes the way data scientists like to work, so they can solve the right problems, create better analyses, amplify their results, and put more work into production -- all from one place. To learn more, or to request a demo, visit www.DataScience.com.