Data science, machine learning (ML) and artificial intelligence (AI) are relatively new endeavors for enterprise-level business. Many companies are batch training as well as batch scoring ML models. Predictions are stored in a database to be retrieved either by applications or humans. However, real-time training on streaming data and near real-time scoring of models in the hot path is something many enterprises aim to do but struggle to achieve.
Those of us who have spent the last few years building enterprise-scale data science and ML/AI capabilities know that there is no real playbook to follow. In many ways, based on the needs and use cases of the business, we’ve invented strategies along the way. In this article, I will give a high-level overview of my experiences building an enterprise-level data science and ML/AI capability from the ground up. Others who’ve taken a similar journey may find similarities as well as differences.
Motivation, Challenges, and Solutions
The impetus for building this enterprise-scale ML/AI capability came about because I had built a machine learning model that could predict an important element of information needed by the business. This model, if productionalized, held the promise of being able to speed up transactions and save significant amounts of money for the business.
The initial challenges to operationalize the model were numerous. I’ll mention six of them below.
First, the model started to decay three hours after it was trained and was virtually useless at the twenty-four-hour mark. This meant that we needed regular retraining of models and the training needed to be done as close to the time the data was produced as possible.
Second, the volume of potential training data was up to 40 GB per hour.
Third, since the results of the model had a time element—meaning, if the model was queried this minute it may give a different response than if queried a minute earlier—the model had to be deployed in the hot path of the transaction to get accurate responses.
Fourth, since I did not know how much compute I would ultimately need at the beginning of the project, I wanted to use the scalability of the cloud to do model training. Given this, I needed to stream large amounts of data to the cloud for data transformation and model training, and then send the trained model back to on-premises for deployment.
Fifth, transactions needed to be returned to the customer in under one second, so scoring the model needed to be extremely fast. Queries needed to be answered in microseconds rather than milliseconds.
Sixth, DevOps and monitoring for ML/AI needed to be created and all cloud monitoring needed to be integrated into the enterprise monitoring system.
Despite the challenges, I started a project to build a real-time AI capability that would allow production applications to query machine learning models during a transaction. The project called for retraining machine learning models in near real-time by streaming live data to a machine learning cluster. To take advantage of elastic and scalable compute in the cloud, the project proposed streaming large volumes of data across the network from the on-premises data center to the cloud.
Two years later, the project is complete, and we have built a solid data science and ML/AI deployment capability as a result. What follows are some key takeaways that may be instructive for others. Based on my experience, a company embarking on an enterprise-scale ML/AI project should pay attention to the following:
The most important thing to realize when starting a large enterprise data science project that deploys AI is that the correct infrastructure needs to be in place or built out, either on-premises or in the cloud. My strategy was to leverage the cloud as much as possible so that we did not have to purchase hardware that might end up underutilized, especially as we optimized our resource usage over time. Nonetheless, given that many of the applications that were going to consume the ML/AI models were on-premises, and that the data originated on-premises, I had to ensure that proper infrastructure was available.
In my case, I utilized Kafka (MirrorMaker) to transfer the data to a cloud-based Kafka cluster, which then streamed the data into a Spark Streaming cluster for transformations, and eventually to an H2O cluster to build ML/AI models. Even though we had a Kafka cluster on premises, I needed to ensure that the cluster would work at the scale of data my project required. The same applied to model deployment. Time-series models are often deployed behind some type of REST API through which the models can be queried. However, if the models are to handle a high number of transactions per second (TPS), one needs to make sure that there is enough compute available to host them; they will need their own compute cluster.
It is also important to find out ahead of time whether internal networks will be able to handle the new load to the model without creating network problems. How models are deployed is also important. Sometimes, it is better not to use a REST service. For example, in one case, the best way to deploy the model was as a dynamic-link library (DLL) hosted on the same server as the application.
Talent (Beyond Data Scientists)
As data scientists, we often don’t think beyond the data and the models themselves. For many data scientists, the goal is to create a high-performing model that predicts with a high degree of accuracy. But in an enterprise data science project, a high-performing model is only step one. Though I cannot overstate the importance of having a team of highly qualified and competent data scientists, without the infrastructure and talent to operationalize the work, the true value of data science may never be unlocked for the business. An enterprise-scale data science effort requires an entire cast of talent.
Big data engineers are needed to help move the data around and create the pipelines to feed the models. Equally important are talented software engineers to create the methods (APIs, DLLs, etc.) through which the models are deployed. Software engineers are also needed to integrate the models with the applications that will consume the predictions from the models.
One of the key things to keep in mind is that while there are many cool and new technologies out there in the field of ML/AI and data science, when deploying an ML/AI or data science solution to production within an enterprise, production support is needed. This is necessary if bug fixes are needed or new features are requested, especially if these AI capabilities become vital elements within the company’s revenue-generating applications. TensorFlow, for example, is a machine learning framework, but as of this writing, you cannot buy enterprise support for it. H2O, on the other hand, is an ML/AI framework that is both open source and offers the option to purchase enterprise support from the same people who built it.
When utilizing the cloud, the choice between infrastructure as a service (IaaS) and platform as a service (PaaS) is also something you need to consider in terms of technology and cost. PaaS is more expensive, so for cases where compute was consistently used, I opted to deploy on a cluster of VMs rather than on a PaaS offering. If you opt for IaaS, it is important to realize that securing and maintaining the cluster becomes your full responsibility.
Data scientists mostly work in the lab. They have their own environments and utilize Jupyter notebooks or RStudio to experiment and build models. In that environment, there are many items that can be handled manually. But when the goal is to operationalize, automate, and push an ML/AI solution to production, numerous additional elements need to be considered.
For example, when building a machine learning model, many of us use a grid search to find the best hyperparameters. Often, however, the best hyperparameters will change over time. It is unreasonable to ask data scientists to check for new hyperparameters for models that are retrained multiple times a day. Grid search, therefore, needs to be automated and run periodically, and the results should be automatically incorporated into the model-building process. Monitoring the performance of models is also vital. If the metrics important to you (AUC, F1, F2, accuracy, etc.) fall below a predetermined acceptable threshold, it might indicate a problem with the data or the model that needs to be addressed.
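The two automation steps above can be sketched in a few lines. This is a minimal, illustrative skeleton rather than our actual pipeline: the `train_fn` and `score_fn` callables, the toy hyperparameter grid, and the metric thresholds are all stand-ins for whatever your real training and monitoring code would supply.

```python
from itertools import product

def grid_search(train_fn, score_fn, param_grid):
    """Evaluate every hyperparameter combination and return the best.
    train_fn and score_fn are supplied by the caller; in a scheduled
    job this would run periodically and feed the winning parameters
    back into the model-building process."""
    best_params, best_score = None, float("-inf")
    for combo in product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), combo))
        model = train_fn(params)
        score = score_fn(model)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def check_model_health(metrics, thresholds):
    """Compare each monitored metric (AUC, F1, etc.) against its
    minimum acceptable value and return the list of breaches."""
    return [name for name, floor in thresholds.items()
            if metrics.get(name, 0.0) < floor]

# Toy demonstration: the "model" is just its parameter dict, and the
# score is a made-up function of the hyperparameters.
grid = {"max_depth": [3, 5, 7], "learning_rate": [0.01, 0.1]}
best, score = grid_search(
    train_fn=lambda p: p,
    score_fn=lambda m: m["learning_rate"] * 10 - abs(m["max_depth"] - 5),
    param_grid=grid,
)
alerts = check_model_health({"auc": 0.62, "f1": 0.80},
                            {"auc": 0.75, "f1": 0.70})
```

In production, `check_model_health` would run after every retrain, with any breach raising an alert rather than returning a list.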
You might also need to determine your own metrics based on the business case. For my project, the standard metrics used to evaluate model performance did not suffice. Therefore, each time we trained a model, we created a scorecard that indicated the thresholds to trust the model given a specific risk tolerance for false negatives or false positives. All of this needed to be automated and configurable.
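A scorecard like the one described can be sketched as a sweep over decision thresholds. This is a simplified, hypothetical version, assuming a binary classifier that emits probability scores; the sample scores and labels are invented for illustration, and the real scorecard would be generated automatically at each retrain.

```python
def scorecard(scores, labels, max_fpr):
    """For each candidate decision threshold, compute false-positive
    and false-negative rates, and return the lowest threshold whose
    FPR stays within the given risk tolerance."""
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    rows = []
    for t in sorted(set(scores)):
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        rows.append({"threshold": t,
                     "fpr": fp / negatives,
                     "fnr": fn / positives})
    ok = [r for r in rows if r["fpr"] <= max_fpr]
    return rows, (min(ok, key=lambda r: r["threshold"]) if ok else None)

# Invented example data: predicted probabilities and true labels.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    1,    0,    1,    0,    0,    0]
card, chosen = scorecard(scores, labels, max_fpr=0.25)
```

The returned `card` is the full scorecard; `chosen` is the most permissive threshold that still satisfies the false-positive tolerance, which is the kind of configurable decision rule we needed to automate.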
Another element that needs to be considered is model type. Often data scientists want to deploy creative feature engineering techniques and use sophisticated modeling methodologies. While these methodologies might create models with greater predictive accuracy, when deploying them to be scored in production, latency might become an issue. In addition, any data transformations that result from feature engineering need to be performed when querying the model, and if they include multiple joins with other data sets, they can add latency when scoring and will require additional engineering work on the model-querying side.
There are also vital considerations in terms of scalability when dealing with large volumes of data and productionalizing ML/AI solutions. In my project, we decided to use H2O for machine learning and model building, but for data transformations we used Spark Streaming. The Sparkling Water library can be used to transfer data from a Spark DataFrame to an H2OFrame. The problem we ran into on very large datasets was the time the transfer took and the additional RAM it required, because converting a Spark DataFrame to an H2OFrame duplicates the data in memory. This created a bottleneck. The workaround was to save the data from Spark to HDFS in Parquet format and then read the Parquet files into H2O, thus avoiding any in-memory copying. This is something that may not be needed on smaller data.
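The Parquet workaround looks roughly like the following. This is an illustrative sketch only, not runnable outside a live Spark and H2O environment; the HDFS path and the `transformed_df` DataFrame are assumed to come from your own streaming job.

```python
# Sketch: hand data from Spark to H2O via Parquet on HDFS instead of
# converting in memory with Sparkling Water's asH2OFrame (which
# duplicates the data in RAM). Assumes running Spark and H2O clusters;
# `transformed_df` is the already-transformed Spark DataFrame.
import h2o

staging_path = "hdfs:///tmp/training_batch.parquet"  # example path

# In the Spark job: write the transformed DataFrame to Parquet.
transformed_df.write.mode("overwrite").parquet(staging_path)

# In the H2O job: read the Parquet files directly into an H2OFrame.
training_frame = h2o.import_file(staging_path)
```

The trade-off is an extra hop through HDFS, but on very large batches the avoided in-memory duplication more than paid for it in our case.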
In addition, deploying models as a REST service with a high TPS requirement becomes tricky. To solve this, we containerized the models using Docker and used a blue-green deployment strategy to swap out old models as new ones arrived. However, we still found bottlenecks related to I/O; Amdahl's Law came into effect, and time was spent overcoming some of those issues. When building an AI deployment capability at enterprise scale, data scientists need to work closely with engineers, and the engineers need to be able to build solutions that not only work but also scale properly.
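The essence of the blue-green swap is that a new model version is loaded and warmed while the old one keeps serving, then promoted atomically. The sketch below is a toy in-process analogue of that pattern (in our case the swap happened at the Docker container and load-balancer level, not inside one process); the lambda "models" and version tags are invented for illustration.

```python
import threading

class BlueGreenModelHost:
    """Minimal in-process analogue of a blue-green deployment: the
    live slot serves traffic while a new model is loaded into the
    idle slot, then an atomic swap promotes it with no downtime."""

    def __init__(self, initial_model):
        self._lock = threading.Lock()
        self._live = initial_model      # "blue": currently serving
        self._staged = None             # "green": next version

    def stage(self, new_model):
        # Load/warm the new model while the old one keeps serving.
        self._staged = new_model

    def promote(self):
        with self._lock:                # swap is atomic to callers
            if self._staged is not None:
                self._live, self._staged = self._staged, None

    def score(self, features):
        with self._lock:
            model = self._live
        return model(features)

# Toy "models" are plain functions tagged with a version string.
host = BlueGreenModelHost(lambda x: ("v1", x * 2))
before = host.score(3)
host.stage(lambda x: ("v2", x * 3))
host.promote()
after = host.score(3)
```

Requests issued before `promote()` are answered by the old model and requests after it by the new one, with no window in which scoring is unavailable.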
In any professional technology organization, DevOps best practices should be implemented. This means ensuring that all code, including data scientists’ experimental code, is in source control. The ability to automate build and deploy for all elements of the data science project, from the pipelines to model building to model deployment, is also vital.
Monitoring is also a huge part of DevOps practice. The more one can monitor, the better off one is when troubleshooting. My goal was to monitor all aspects of our project. As mentioned, some of the models we created were time-dependent, meaning that the time difference between now and a future event significantly influences the predictions made by the model. One day, we found that our models were not performing properly. After investigating, we found that as data volumes increased, another team had added servers to enhance our ability to handle the larger volume. As downstream consumers of the data, we would normally not have been affected. Except that the system time on the new servers was not configured properly, and therefore the timestamps were off by a few hours. This negatively influenced the performance of our models.
Because of this, we started to analyze the timestamps from our on-premises servers compared to the cloud server in our Spark streaming job, and if there was a negative difference, an alert would be created. So in addition to monitoring Kafka and Spark metrics, server health, CPU usage, and model performance metrics, there are numerous other data-specific metrics that need to be monitored. Setting up that kind of monitoring system using technologies such as InfluxDB, Graphite, and Grafana was a project in and of itself.
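The timestamp check can be reduced to a simple skew detector. This is a hedged, simplified sketch: the real check ran inside the Spark streaming job against incoming event timestamps, while here the event times, the reference clock, and the tolerance are all illustrative values.

```python
from datetime import datetime, timezone, timedelta

def detect_clock_skew(event_timestamps, reference_time, max_skew_seconds=60):
    """Flag events whose timestamps disagree with the reference clock
    by more than the allowed skew. In our case, events stamped hours
    ahead of the cloud clock revealed misconfigured upstream servers."""
    alerts = []
    for ts in event_timestamps:
        skew = (ts - reference_time).total_seconds()
        if abs(skew) > max_skew_seconds:
            alerts.append((ts, skew))
    return alerts

# Illustrative data: two normally lagged events and one from a server
# whose system clock is three hours ahead.
now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
events = [
    now - timedelta(seconds=5),    # normal streaming lag
    now + timedelta(hours=3),      # server clock hours ahead
    now - timedelta(seconds=30),   # normal streaming lag
]
alerts = detect_clock_skew(events, now)
```

In a real deployment the alert list would feed the alerting stack (e.g., a Grafana alert on a skew metric) instead of being returned to the caller.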
AI and ML are still relatively new and, in some cases, are perceived to be unproven, especially in the eyes of some old-school engineers. Many do not understand that data scientists use held-out test data to validate their models, and believe that until the models get into production, the technology is unproven. So as a data scientist in the enterprise, your role is to be the Explainer-in-Chief of how ML/AI works and of the impact this technology is having, and will have, on the world.
It is so important to be specific and show how AI and ML can and will impact your business. Celebrate each win along the way, even if it is only a limited or partial success. Bring others along with you and explain what your results mean. Never be siloed; rather, allow colleagues from other areas of the company to feel pride in, and a part of, the work and successes that you and your team achieve.
The Future of Enterprise-Level Data Science
Despite the current challenges and relative infancy of this space, in time, the deployment of AI predictions will be ubiquitous within enterprises. In some companies, especially those that are AI-first, this is already happening. Operationalized machine learning will be used for all reasonably complex dilemmas, not only to assist executive decision-making, but also, and perhaps chiefly, to develop enterprise-scale software applications. Currently, if/else statements and complex functions encoded with heuristics and business domain knowledge are used. In the future, machine learning models will be deployed from the outset, enabling enterprise-scale applications to become truly intelligent.
This is the journey we are on. There are, however, still many challenges to overcome as many enterprises struggle with basic data-related issues. Automated AI deployment is still a few years off for most enterprises, but it is something that all enterprises that are serious about doing business in the 21st century must master—it is simply a do-or-die proposition. Those of us working in the space are pioneering systems that will inevitably become standard operating procedure for all software development and system design. In order to get there, however, there is a huge amount of exciting, challenging, and pioneering work left to be done.