For information technology (IT) teams managing the many applications required by large enterprise companies today, container runtime and orchestration tools like those offered by Docker, Kubernetes, and Cloud Foundry offer a great developer experience while helping to standardize security and optimize server utilization. Forrester Principal Analyst Dave Bartoletti reports that while only 10% of enterprise companies are currently using containers in production, up to a third are testing them — in many cases, for data science-related applications.
While traditional virtual machines (VMs) or bare-metal servers (physical servers without a virtualization layer) can be used for data science work, setting up or making changes to these systems can be cumbersome. Each time a new server is procured, IT must install the tools necessary for data science work, such as RStudio, Jupyter, or various R and Python packages. And if a data scientist’s analysis requires extra computing power, configuration management tools or advanced automation are needed to keep the process of providing additional resources from becoming unwieldy.
Containers offer isolation with less overhead than VMs or physical servers, enabling four to six times as many server application instances as traditional virtual machines on same-sized hardware. Even better, once IT has created an “image” of a container (a template that can be used to launch an environment), data scientists across the organization can use that image to create new environments as needed.
For these reasons and many others, it’s little wonder that IT teams managing the work of dozens or even hundreds of data scientists are turning to containers to ease the deployment of data science environments and models. But how does an IT team go about efficiently creating the images data scientists need and then managing the resulting environments? Every organization approaches these challenges differently, but below are a few best practices to keep in mind as you use containers to scale your data science efforts.
Managing Resources: How to Never Run Out of Pie
Let’s pretend that, collectively, your group of servers represents a pie. Each time a data scientist launches a new environment, they take a slice out of that pie. Because containers are so versatile, these slices can vary in size, but some are going to be unavailable much longer than others. That model powering the product recommender system on your website? It’s going to be hogging a slice until you decide to take it out of production, which could be six months from now. Keeping some of the pie in the tin might be more complicated than you anticipated.
For IT managers, this is all part of a delicate balancing act. One of the many benefits of using containers is that your data scientists can launch them as needed, but that means any number can be launched at any time. A Jupyter or RStudio container for interactive analysis could live for days, weeks, or even months between shutdowns. Scheduled or ad hoc scripts could run for seconds, minutes, or hours, depending on the scope of the work being done. Deployed models could run for months. The majority of companies are deploying code weekly, multiple times per week, or multiple times per day (Amazon is deploying code every 11.7 seconds!), so keeping these systems running smoothly requires the ability to do two things well: monitor resources and resize your capacity accordingly.
To do these things, you need a framework or platform with a cluster manager that tracks resources like memory, CPU, and storage. As your pie disappears, you can use the cluster manager to enlarge it by adding servers, giving your data scientists more pie to slice. While the goal is to never run out of pie, it’s also important to remember that pie costs money. If you’re working in a cloud like Amazon Web Services or Microsoft Azure, you could be wasting between 30% and 45% of the resources you’re paying for. However, IT managers can drastically reduce that percentage by creating presets for common container sizes and keeping the available capacity optimized for anticipated demand.
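To make the pie analogy concrete, here is a minimal Python sketch of the bookkeeping a cluster manager does. It is a toy model, not the API of any real cluster manager: the Cluster class, the preset names, and the CPU and memory figures are all assumptions made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical presets for common container sizes: (CPU cores, memory in GiB).
# The numbers are illustrative only.
PRESETS = {
    "small": (1, 4),
    "medium": (4, 16),
    "large": (16, 64),
}


@dataclass
class Cluster:
    """Toy stand-in for a cluster manager's view of total and used capacity."""
    total_cpu: int
    total_mem_gib: int
    used_cpu: int = 0
    used_mem_gib: int = 0

    def can_fit(self, preset: str) -> bool:
        """Return True if a container of this preset size still fits."""
        cpu, mem = PRESETS[preset]
        return (self.used_cpu + cpu <= self.total_cpu
                and self.used_mem_gib + mem <= self.total_mem_gib)

    def launch(self, preset: str) -> None:
        """Reserve a slice of the pie, or fail if none is left."""
        if not self.can_fit(preset):
            raise RuntimeError(f"not enough capacity for a {preset} container")
        cpu, mem = PRESETS[preset]
        self.used_cpu += cpu
        self.used_mem_gib += mem

    def add_server(self, cpu: int, mem_gib: int) -> None:
        """Enlarge the pie by adding one server's worth of capacity."""
        self.total_cpu += cpu
        self.total_mem_gib += mem_gib

    def utilization(self) -> float:
        """Fraction in use of whichever resource is closer to exhausted."""
        return max(self.used_cpu / self.total_cpu,
                   self.used_mem_gib / self.total_mem_gib)


cluster = Cluster(total_cpu=32, total_mem_gib=128)
cluster.launch("large")
cluster.launch("large")
if not cluster.can_fit("small"):
    cluster.add_server(cpu=16, mem_gib=64)  # the pie is gone, so add a server
cluster.launch("small")
```

Fixed presets keep slice sizes predictable, which makes it easier to compare utilization against what you are paying for and to decide when adding or removing a server is worthwhile.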
Image Building: Separating the Nice to Haves from the Must Haves
If IT is in charge of creating and maintaining a library of Docker images for one or more data science teams, there are two scenarios in particular that can lead to headaches. One is the creation of a single, bloated base image containing everything every data scientist could possibly need. The other is the creation of an unmanageably large library of smaller images for bespoke tasks or individual scientists.
Both of these approaches are extreme and introduce unnecessary burdens on IT. To avoid these pitfalls, IT can work with data science teams to determine what really belongs in a base image — broadly useful tools or packages that your team is using daily, and especially ones that take a long time to install. Surveying your data scientists about the tools they use regularly is a good place to start; from there, you can determine what really needs to be in an environment every time it launches. Datasets that are very large, are updated regularly, or are minimally reusable should not be part of that list. However, including small datasets that are rarely changed but commonly used can provide a productivity boost for end users.
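One way to turn survey answers into a shortlist is sketched below in Python. The function name, the thresholds, and the sample data are all hypothetical; the rule it applies (keep what most people use, plus slow-to-install packages with a meaningful following) is just one reasonable reading of the guidance above.

```python
from collections import Counter


def pick_base_packages(survey, install_minutes,
                       broad_share=0.5, slow_minutes=10.0, slow_share=0.25):
    """Shortlist packages for a base image from survey responses.

    survey maps each respondent to the packages they use regularly.
    install_minutes maps a package to its rough install time in minutes.
    All thresholds are illustrative assumptions, not recommendations.
    """
    respondents = len(survey)
    if respondents == 0:
        return []
    # Count each package once per respondent, even if listed twice.
    counts = Counter(pkg for pkgs in survey.values() for pkg in set(pkgs))
    chosen = []
    for pkg, count in counts.items():
        share = count / respondents
        is_slow = install_minutes.get(pkg, 0.0) >= slow_minutes
        # Keep broadly used packages, and slow installs with enough users.
        if share >= broad_share or (is_slow and share >= slow_share):
            chosen.append(pkg)
    return sorted(chosen)


# Made-up survey data for illustration.
survey = {
    "ana": ["pandas", "tensorflow"],
    "ben": ["pandas", "ggplot2"],
    "cy": ["pandas"],
    "di": ["pandas", "prophet"],
}
install_minutes = {"tensorflow": 15.0, "prophet": 3.0}
shortlist = pick_base_packages(survey, install_minutes)
```

Anything that misses the shortlist can still be installed at runtime, so the thresholds only decide who waits for an install, not who gets to use a tool.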
But how many base images should you really have? That depends on your org structure. It might make sense to map your base images to different teams — perhaps the data scientists working with your marketing department use a different set of tools than the ones building the models that assess whether potential customers qualify for a credit card, for instance. It might also make sense to map base images to major language families (such as Python 2, Python 3, or R).
Keep in mind, just because a certain tool didn’t make it into the base image doesn’t mean data scientists can’t use it — it’s possible for them to install packages at runtime or in a notebook long after a container has been launched (they’ll have to wait for those packages to install, of course). You can also choose to distinguish between base images that contain the building blocks of analyses at your company and the other, more specific, images that depend on them. In our platform, we call those user environments, and they generally contain the things a user needs to launch a session, schedule a script, or publish an application, among other actions.
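The relationship between a base image and a more specific image that depends on it can be sketched as a small Python helper that writes a Dockerfile. This is an illustration only and not how any particular platform builds its environments; the function name, registry host, image name, and package list are hypothetical.

```python
from typing import Sequence


def render_user_environment(base_image: str, pip_packages: Sequence[str]) -> str:
    """Return Dockerfile text for an image layered on top of a base image."""
    # FROM makes the new image inherit everything in the base image.
    lines = [f"FROM {base_image}"]
    if pip_packages:
        # Install the extras in a single layer; sorting keeps output stable.
        lines.append("RUN pip install --no-cache-dir "
                     + " ".join(sorted(pip_packages)))
    return "\n".join(lines) + "\n"


# Hypothetical base image reference and extra packages.
dockerfile = render_user_environment(
    "registry.example.com/ds-base-python3:2024-01-15",
    ["xgboost", "lightgbm"],
)
print(dockerfile)
```

Because the specific image only adds a layer on top of the base, the slow, broadly useful installs are built once and shared, while each team's extras stay small and quick to rebuild.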
Future Proofing: Preparing for Changing Data Science Tool Needs
What happens if data science work at your organization changes, and suddenly a package data scientists have been installing at runtime proves to be worth adding to your base image? Can you just update the image everyone is using? How will that affect the work that was created in the previous version of your environment?
First, you shouldn’t update your base image in place. One of the value adds of using containers is that they keep data scientists from writing code on their local machines that might not run in an environment containing different package versions or tools; changing the underlying image would disrupt that system. You also can’t just delete the existing base image and start over; that would mean that everything built on that base image would have to be rescheduled or moved.
You can, however, use versioning best practices when creating a new image. In Docker, you can “tag” images so your users know which ones to use. For example, you could tag your images with a name and a date-stamped version, or even “Deep Learning Environment: latest” if you really want to get the point across. It’s also good practice to add tags to your previous environments indicating they are out of date. You’ll still have to keep them around, but your data science team will know which image to download from your registry.
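As a rough sketch of name-and-date tagging, the Python helper below turns a display name and a build date into a Docker-style image reference. Docker repository names must be lowercase, and tags may contain only letters, digits, underscores, periods, and hyphens (up to 128 characters), so the display name is normalized first. The function name and the sample date are assumptions for illustration.

```python
import re
from datetime import date

# Docker tag rules: letters, digits, underscores, periods, and hyphens;
# no leading period or hyphen; at most 128 characters.
_TAG_RE = re.compile(r"^[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}$")


def image_reference(display_name: str, build_date: date) -> str:
    """Build a 'repository:tag' string from a display name and a date."""
    # Lowercase the name and collapse anything else into single hyphens.
    repo = re.sub(r"[^a-z0-9]+", "-", display_name.lower()).strip("-")
    tag = build_date.isoformat()  # ISO dates sort chronologically as text
    if not repo or not _TAG_RE.match(tag):
        raise ValueError(f"cannot build an image reference from {display_name!r}")
    return f"{repo}:{tag}"


# Hypothetical build date.
ref = image_reference("Deep Learning Environment", date(2024, 1, 15))
print(ref)  # deep-learning-environment:2024-01-15
```

A date-based tag never changes meaning once published, which is what lets older work keep pointing at the exact image it was built on.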
Using a Platform With Environment Management Features
How you manage containers at your organization will also largely depend on the tools you’re using. The DataScience.com Platform has an environment management feature built in that allows IT teams to create and manage images and resource settings. Data scientists on the platform can then launch containers from those images with the click of a button.
Ready to start standardizing data science work across your organization? Request a demo of our platform today.