Most companies today start by building data processing and data science platforms using cloud providers. This gives teams speed and flexibility when starting out. Over time, however, the cost of scaling a data pipeline in the cloud starts to become a burden on the business, and the software architecture and the hardware it runs on need to adapt to the new scale. In this blog, I will walk you through the Pure Storage data science team’s journey through the process of scaling, and the benefits we reaped from moving our workflows to the hybrid cloud.

To set the context, I work within Pure1, the cloud-based monitoring, fault management, and forward planning service for Pure Storage appliances that sit in customer data centers. These appliances send encrypted metadata, including logs, performance statistics, and alerts, to Pure1 servers via the Internet every few seconds. Currently, systems send tens of terabytes per day to our servers. Our job, as the data science team, is to get insights from this data and help our users. To get there, we built a fairly standard data processing pipeline, outlined in the steps below:


1. Ingest

Connected systems constantly send performance, status, and alert information to Pure1 phonehome servers (public cloud virtual machines).

2. Archive

As data ages and is accessed less frequently, it moves to a lower storage tier to reduce cost.

3. Analyze

Automated scripts continually scan installed base data looking for conditions that match the “fingerprints” of known issues so that problems can be forestalled before they arise. Developers also analyze the data to evaluate software performance and reliability.

4. Ad Hoc Analysis

Engineers perform ad hoc analyses of individual systems to diagnose and remediate problems.

5. Extract, Transform, Load (ETL)

Spark-based log processing pipelines insert log data into a structured database for querying.

6. Warehouse

A cloud data warehousing service is used to query the structured data.

7. Machine Learning

An XGBoost cluster uses time series data from the entire installed base to develop and refine models that assist customers with forward planning. Pure1 users can utilize the models to predict when their systems might require hardware upgrades to provide additional I/O performance and/or storage capacity.
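The fingerprint scanning described in step 3 can be pictured with a minimal sketch. Everything specific here is an assumption made for illustration: the log line format, the issue IDs, and the regular expressions are hypothetical and are not the actual Pure1 fingerprints or scanning code.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    """A known issue described as a regex over one log line (hypothetical format)."""
    issue_id: str
    pattern: re.Pattern


# Hypothetical fingerprints, invented for this example only.
FINGERPRINTS = [
    Fingerprint("ISSUE-001", re.compile(r"latency_ms=\d{4,}")),       # latency of 1000 ms or more
    Fingerprint("ISSUE-002", re.compile(r"fan\d+ status=degraded")),  # any degraded fan
]


def scan(lines, fingerprints=FINGERPRINTS):
    """Yield (issue_id, line) for every log line that matches a fingerprint."""
    for line in lines:
        for fp in fingerprints:
            if fp.pattern.search(line):
                yield fp.issue_id, line


# Made-up sample log lines.
sample_logs = [
    "array=a1 latency_ms=12",
    "array=a2 latency_ms=1500",
    "array=a3 fan2 status=degraded",
]
matches = list(scan(sample_logs))
```

Each scan like this reads the stored data again, which is one of the reasons the same items end up being retrieved many times.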

Trouble in Paradise


As the Pure Storage installed base grew, the cost of storing Pure1 data in the public cloud grew faster than the base itself. At one point, an analysis of storage cost determined that about half the cost was for storage of active data (the most recent 30 days of logs), 20% was for lower-tier inactive data, and the remaining 30% was due to retrieval of stored data. On average, each stored item was being retrieved (read) 8 times by the analysis, ETL, and machine learning processes. This pattern, known as read amplification, is typical of machine learning and analytic systems.
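To make the breakdown concrete, here is a small sketch that works only with the shares quoted above. It assumes, as a simplification of my own, that retrieval cost scales linearly with the number of read passes, and it ignores the cost of whatever storage replaces the cloud tier. The function names are hypothetical.

```python
# Cost shares from the analysis above, as fractions of total storage spend.
COST_SHARES = {
    "active_storage": 0.50,
    "inactive_storage": 0.20,
    "retrieval": 0.30,
}
READ_AMPLIFICATION = 8  # average reads per stored item, from the text


def retrieval_share_per_read(shares=COST_SHARES, reads=READ_AMPLIFICATION):
    """Fraction of total spend attributable to a single read pass.

    Assumes retrieval cost is linear in the number of reads (a simplification).
    """
    return shares["retrieval"] / reads


def remaining_spend(retrieval_reduction, shares=COST_SHARES):
    """Fraction of the original spend left after cutting retrieval cost.

    retrieval_reduction is between 0.0 (no change) and 1.0 (retrieval is free).
    """
    if not 0.0 <= retrieval_reduction <= 1.0:
        raise ValueError("retrieval_reduction must be between 0 and 1")
    return 1.0 - shares["retrieval"] * retrieval_reduction


per_read = retrieval_share_per_read()
best_case = remaining_spend(1.0)
```

Under these assumptions, the retrieval share is the only part of the bill that grows with every additional consumer of the data, which is why it was the natural target.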

Since we weren’t willing to reduce how much we store, we set out to reduce the cost of the I/O to the data that we store, and perhaps also to increase performance.

A Hybrid Cloud

To lower cloud I/O cost, we decided to use a hybrid cloud approach. The cloud service we use offers a direct-connect service that provides multiple 10Gb/s links from its data centers into managed colocation data centers run by its partners. This means that you can set up a rack of hardware in one of these colos and have multiple 10Gb/s links into the public cloud data centers.


The reason this makes sense from a cost perspective is that once your data is stored on a storage device you own, reading it incurs no per-access charge. This helps immensely, since there are 8x more reads than writes on average.

Further, public cloud providers usually charge for data leaving their data centers but not for data coming in, so cloud VMs can read from storage sitting in your data center for free. We were also hoping to get an increase in performance, since we were using faster storage than the public cloud object storage provides.

To test these hypotheses, we set up a storage device in one of the partner colocation data centers and benchmarked it with our large-scale, Spark-based distributed grep workload running on a cluster in the public cloud. We used a FlashBlade, but you can use any system that speaks your service’s object protocol.

1. Virtual Compute, Colo Storage


The first architecture we tried was to have the compute in a cloud-computing platform and the storage in our data center. The configuration worked as expected, and the public cloud object storage retrieval charges did, in fact, decrease. What was mildly surprising, however, was that application performance was approximately the same as that of the original configuration in the cloud, despite the better storage.

The team determined that the gating factor for application performance was the network between the colo and the public cloud data center. I/O requests from virtual machines in the cloud to the storage system incurred a latency of between 5 and 20 milliseconds. The latency is due primarily to distance-related transmission time, but the public cloud protocol stack also appeared to have some effect.
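A rough model shows why a 5 to 20 millisecond round trip can cancel out the benefit of faster storage. The sketch below assumes each request is synchronous and pays one full round trip plus its transfer time. The 1 MiB request size is a hypothetical value chosen for illustration, not a measurement from our workload, and the function names are my own.

```python
import math


def per_stream_throughput_bps(request_bytes, rtt_s, link_bps):
    """Throughput of one synchronous request stream, in bits per second.

    Each request pays one round trip (rtt_s) plus the time to move its payload.
    """
    payload_bits = request_bytes * 8
    transfer_s = payload_bits / link_bps
    return payload_bits / (rtt_s + transfer_s)


def streams_to_fill_link(request_bytes, rtt_s, link_bps):
    """Number of concurrent request streams needed to keep the link busy."""
    return math.ceil(link_bps / per_stream_throughput_bps(request_bytes, rtt_s, link_bps))


LINK_BPS = 10e9            # one 10Gb/s link, as in the text
REQUEST_BYTES = 1_048_576  # hypothetical 1 MiB object read

streams_at_5ms = streams_to_fill_link(REQUEST_BYTES, 0.005, LINK_BPS)
streams_at_20ms = streams_to_fill_link(REQUEST_BYTES, 0.020, LINK_BPS)
```

In this model, the longer the round trip, the more requests have to be in flight just to keep a single link full, so a workload with limited I/O concurrency ends up bounded by the network rather than by the storage system.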

2. Local Compute, Local Storage


While we had hit our cost reduction goals, we weren’t really getting a good ROI for the investment made in the storage system. To see how much more we could get out of this setup, we took the further step of installing compute servers at the colo and running the same Spark-based stack on the bare metal hardware. This setup gave us a cool 2x performance jump over the basic hybrid case.

It became clear, however, that even with local processing, communication between applications and storage was still limiting performance. The next step, therefore, was to replace the 10Gb/s switch at the colo with one capable of 40Gb/s to 100Gb/s. Higher-speed links between local applications and storage improved performance dramatically (green bars), in some cases by an order of magnitude.



Machine learning and data science workloads are read-heavy and write-light (in our case, 8 reads for every write). At scale, the I/O cost of running them in the cloud can be very high.

In implementing the hybrid cloud, our team’s goals were to (a) reduce active data retrieval cost, and (b) improve the performance of time-critical pipeline applications. After a breakeven period of roughly one year, cost will be lower, and as the graph indicates, the local processing hybrid architecture has improved the performance of our data processing pipelines.
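For readers who want to run the same breakeven estimate on their own numbers, here is a minimal sketch. The function name and the figures in the example are placeholders chosen only so that the result lands at twelve months. They are not our actual costs.

```python
import math


def breakeven_months(upfront_cost, monthly_cloud_cost, monthly_hybrid_cost):
    """Months until cumulative savings cover the upfront hardware cost.

    Returns None if the hybrid setup is not cheaper per month.
    """
    monthly_savings = monthly_cloud_cost - monthly_hybrid_cost
    if monthly_savings <= 0:
        return None
    return math.ceil(upfront_cost / monthly_savings)


# Placeholder values in arbitrary cost units, for illustration only.
example = breakeven_months(upfront_cost=120, monthly_cloud_cost=30, monthly_hybrid_cost=20)
```

The monthly hybrid cost should include colo rent, direct-connect fees, and the cloud compute that remains, otherwise the estimate will look better than it really is.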


Farhan Abrol

Farhan Abrol is a product lead on Pure1, leading the machine learning initiatives. He works on hardware modeling, anomaly detection, and automatic workload optimization, among other things. Previously, he built highly performant distributed systems and databases. He graduated from Princeton University with a degree in Computer Science where his thesis and publications centered around Deterministic Annealing for Stochastic Variational Inference.