DataOps, or data operations, is the latest agile operations methodology to spring from the collective consciousness of IT and big data professionals. It focuses on cultivating data management practices and processes that improve the speed and accuracy of analytics, including data access, quality control, automation, integration, and, ultimately, model deployment and management.
At its core, DataOps is about aligning the way you manage your data with the goals you have for that data. If you want to, say, reduce your customer churn rate, you could leverage your customer data to build a recommendation engine that surfaces products relevant to your customers, which would keep them buying longer. But that’s only possible if your data science team has access to the data it needs to build that system and the tools to deploy it. The team must also be able to integrate the engine with your website, continually feed it new data, and monitor its performance. That is an ongoing process, and it will likely include input from your engineering, IT, and business teams.
Who Benefits From DataOps?
In a word, everyone.
Better data management leads to better, and more available, data. More and better data leads to better analysis, which translates into better insights, sounder business strategies, and higher profitability. DataOps strives to foster collaboration between data scientists, engineers, and technologists so that every team is working in sync to use data more effectively and in less time.
Companies that succeed in taking an agile and deliberate approach to data science are four times more likely than their less data-driven peers to see growth that exceeds shareholder expectations. It’s little wonder, then, that companies across the board are making data management changes that support more accessibility and innovation. Many of the disruptors we think of today — Facebook, Netflix, Stitch Fix, and others — have already embraced approaches that fall under the DataOps umbrella.
As Qubole Cofounder and CEO Ashish Thusoo writes in “Creating a Data-Driven Enterprise with DataOps,” an eBook offered by DataScience.com partner O’Reilly Media, “I joined Facebook in August 2007 as part of the data team. ...As was typical in those days, anyone in the company who wanted to get data beyond some small and curated summaries stored in the data warehouse had to come to the data team and make a request. Our data team was excellent, but it could only work so fast: it was a clear bottleneck.”
Facebook eventually democratized its data by introducing Hive, a data warehouse software project that allowed its team members to query large datasets stored in Hadoop with a SQL-like language. The rest, as Thusoo notes, is history.
Where Does DataOps Come From?
DataOps is one of many methodologies born from DevOps, an approach to software development that Gartner predicts will be adopted by 80% of Global Fortune 1000 companies in the next year. The success of DevOps lies in bringing together the two separate groups that make up traditional IT: one that handles development work and one that does operational work. In a DevOps setting, software rollouts are fast and continuous because the entire team is united in detecting and correcting problems as they occur.
DataOps borrows and builds upon this idea, applying it across the entire lifecycle of data. Accordingly, DevOps concepts like continuous integration, delivery, and operations are now being applied to the process of productionizing data science: Data science teams are using version control tools like Git, often hosted on GitHub, to track code changes, and container technology like Docker, often orchestrated with Kubernetes, to create environments for analysis and deploy models. This type of data science-meets-DevOps approach is sometimes referred to as “continuous analytics.”
How Do I Start Implementing DataOps?
As you probably suspected, there’s no one approach to implementing DataOps at your organization. There are, however, a few key areas of focus. Here’s where you should start:
Democratize Your Data
According to Experian Data Quality, 96% of chief data officers believe that business stakeholders are demanding more access to data than ever before, and 53% say a lack of data access is the biggest barrier to driving better decision making. Yet there’s plenty of data out there: by 2020, we’ll have generated 40 zettabytes, or roughly 5,200 GB of data for every person on earth.
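As a quick sanity check on the per-person figure above, the arithmetic can be run in a few lines of Python. This is only a sketch: it assumes decimal units (1 ZB = 10^21 bytes, 1 GB = 10^9 bytes), and the variable names are ours, not from any cited source.

```python
# Assumption: decimal (SI) units, so 1 zettabyte = 10**12 gigabytes.
GB_PER_ZETTABYTE = 10**12

total_gb = 40 * GB_PER_ZETTABYTE   # 40 ZB expressed in GB
per_person_gb = 5_200              # per-person figure quoted in the text

# Population implied by dividing the total by the per-person figure.
implied_population = total_gb / per_person_gb
print(f"Implied population: {implied_population / 1e9:.1f} billion")
```

The two figures are consistent with a world population of a little under 8 billion.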
As Thusoo saw during his time at Facebook, a lack of data access can become a major roadblock to innovation. Self-service data access and the infrastructure to support it are essential. Machine learning and deep learning applications require a constant supply of new data in order to learn and improve; any company that strives to be on the cutting edge needs datasets to be readily available.
Leverage Platforms and Open Source Tools
In a recent piece for Forbes, MapR VP of Technology Strategy Crystal Valentine says this of implementing DataOps: “First, at the tools layer, a DataOps practice requires a data science platform with support for the languages and frameworks beloved by the community (e.g., Python, R, data science notebooks and GitHub).” Also important? Platforms for data movement, orchestration, integration, performance, and more.
Part of being agile is not wasting time building things you don’t have to, or reinventing the wheel when open source tools your team already knows will do the job. Consider your data needs and curate your tech stack accordingly. You can download our white paper, Open Source Tools for Enterprise Data Science, to learn more.
Automate, Automate, Automate
This one comes directly from the DevOps world: To achieve a faster time to value on data-intensive projects, it’s imperative that you automate steps that otherwise demand a lot of manual effort, like quality assurance testing and monitoring of data analytics pipelines.
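To make the idea of automated quality assurance concrete, here is a minimal sketch of data quality checks that could run on every new batch of data instead of being eyeballed by hand. Everything in it is an illustrative assumption: the column names, the sample rows, and the helper functions `no_missing`, `in_range`, and `run_checks` are hypothetical and not part of any particular DataOps tool.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass
class CheckResult:
    name: str
    passed: bool
    failing_rows: List[int]  # indexes of rows that violated the check


def no_missing(column: str) -> Callable[[List[Row]], CheckResult]:
    """Build a check that flags rows where `column` is absent or None."""
    def check(rows: List[Row]) -> CheckResult:
        bad = [i for i, row in enumerate(rows) if row.get(column) is None]
        return CheckResult(f"no_missing({column})", not bad, bad)
    return check


def in_range(column: str, low: float, high: float) -> Callable[[List[Row]], CheckResult]:
    """Build a check that flags present values outside [low, high]."""
    def check(rows: List[Row]) -> CheckResult:
        bad = [
            i for i, row in enumerate(rows)
            if row.get(column) is not None and not (low <= row[column] <= high)
        ]
        return CheckResult(f"in_range({column})", not bad, bad)
    return check


def run_checks(rows: List[Row], checks) -> List[CheckResult]:
    """Run every check against the batch and collect the results."""
    return [check(rows) for check in checks]


# Hypothetical batch of customer records with two deliberate problems.
sample_rows = [
    {"customer_id": 1, "monthly_spend": 42.0},
    {"customer_id": 2, "monthly_spend": None},
    {"customer_id": 3, "monthly_spend": -5.0},
]

results = run_checks(
    sample_rows,
    [no_missing("monthly_spend"), in_range("monthly_spend", 0, 10_000)],
)
for result in results:
    status = "ok" if result.passed else f"FAILED at rows {result.failing_rows}"
    print(f"{result.name}: {status}")
```

In practice, a scheduler or continuous integration job would run checks like these on each load and alert the team when one fails, so that bad data is caught before it reaches a model or a dashboard.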
Enabling self-sufficiency with microservices also plays into this. For instance, giving your data scientists the ability to deploy models as APIs means engineers can call those models from wherever they are needed without rewriting the underlying code, which saves time for both teams.
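One way to picture a model deployed as an API is a small web application that accepts JSON features and returns a JSON prediction. The sketch below uses only Python’s standard library and the WSGI interface. The "model" is a hypothetical fixed-weight churn score standing in for a real trained model, and the feature names, weights, and the `call_app` helper are assumptions made purely for illustration.

```python
import io
import json
from wsgiref.util import setup_testing_defaults

# Hypothetical stand-in for a trained model: fixed weights per feature.
WEIGHTS = {"days_since_last_order": 0.02, "support_tickets": 0.1}


def predict_churn_score(features: dict) -> float:
    """Return a score clamped to [0, 1]; higher means more likely to churn."""
    raw = sum(WEIGHTS[name] * float(features.get(name, 0)) for name in WEIGHTS)
    return max(0.0, min(1.0, raw))


def app(environ, start_response):
    """WSGI application: POST a JSON object of features, receive a JSON score."""
    headers = [("Content-Type", "application/json")]
    if environ["REQUEST_METHOD"] != "POST":
        start_response("405 Method Not Allowed", headers)
        return [json.dumps({"error": "use POST"}).encode("utf-8")]
    length = int(environ.get("CONTENT_LENGTH") or 0)
    try:
        features = json.loads(environ["wsgi.input"].read(length) or b"{}")
        score = predict_churn_score(features)
    except (ValueError, TypeError, AttributeError):
        start_response("400 Bad Request", headers)
        return [json.dumps({"error": "invalid features"}).encode("utf-8")]
    start_response("200 OK", headers)
    return [json.dumps({"churn_score": score}).encode("utf-8")]


def call_app(payload: dict) -> dict:
    """Invoke the WSGI app in-process, roughly as a web server would."""
    body = json.dumps(payload).encode("utf-8")
    environ = {}
    setup_testing_defaults(environ)
    environ.update({
        "REQUEST_METHOD": "POST",
        "CONTENT_LENGTH": str(len(body)),
        "wsgi.input": io.BytesIO(body),
    })
    chunks = app(environ, lambda status, response_headers: None)
    return json.loads(b"".join(chunks))


response = call_app({"days_since_last_order": 10, "support_tickets": 3})
print(response)
```

To try it over HTTP, the same `app` object could be served locally with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`. The point of the design is that engineers integrate against the JSON contract, not the model code, so the data science team can retrain or replace the model without touching the calling application.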
Govern With Care
It’s no coincidence that we’ve recently seen more companies taking a Center of Excellence approach to data science management. Until you’ve established a blueprint for success that addresses the processes, tools, infrastructure, priorities, and key performance indicators data science teams need to take into account, it’s unlikely you’ll get the return on investment you were expecting from data science — or DataOps.
Tellingly, 62% of high performers in this area have a data science development plan and road map in place, compared with only 28% of low performers and 29% of middle-of-the-road companies.
Above all, collaboration is essential to implementing DataOps. The tools and platforms you embrace as part of the DataOps journey should support a larger goal of bringing teams together to use data more effectively.
“Keep in mind that data doesn’t belong to IT, data scientists, or analysts,” Thusoo writes. “It belongs to everyone in the business. So, your tools need to allow all employees to create their own analyses and visualizations and share their discoveries with their colleagues.”