Like most companies transitioning from a small-startup stage to a large tech presence, Postmates has encountered challenges as it scales its existing data infrastructure to meet its rapidly growing needs. As a data engineer at Postmates, I've seen firsthand that our initial implementation and designs — while suitable at a smaller scale — just aren't as effective at handling the volume of data we now collect.
That's why, in the past six months, the data infrastructure team has done quite a bit of architectural reworking. The changes we've made have been vital to improving data integrity, capacity, and access as we continue to grow as a company. This is crucial when scaling to meet the dynamic needs of a three-sided marketplace — that is, matching customers with the small businesses in their cities and with the Postmates who bring products to their doorsteps. While this is all still a work in progress, we are happy to share some of our learnings so far.
The Initial Data Architecture
Before we started working to scale Postmates' data infrastructure, the architecture was a relatively straightforward data streaming solution. It included several apps that wrote event data to Amazon Kinesis Streams, so that different internal Postmates teams could read the event data and save it to Redshift tables. The solution also included ETL processes, orchestrated by Airflow, that created backups of the databases and moved copies into Redshift. Third-party business intelligence tools were built on top of Redshift, which ran complicated aggregation queries for ad-hoc analysis, reporting, and other internal needs. This approach, however, was fraught with issues.
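To make the producer side of this pipeline concrete, here is a minimal sketch of an app wrapping event data in an envelope and writing it to a Kinesis stream. The envelope fields, function names, and stream name are illustrative assumptions, not Postmates' actual schema.

```python
import json
import time
import uuid


def build_event(event_type: str, payload: dict) -> dict:
    """Wrap app data in the envelope our consumers expect.
    Field names here are illustrative, not an actual production schema."""
    return {
        "event_id": str(uuid.uuid4()),
        "event_type": event_type,
        "emitted_at": int(time.time()),
        "payload": payload,
    }


def put_event(stream_name: str, event: dict) -> None:
    """Write one event to a Kinesis stream (requires AWS credentials)."""
    import boto3  # deferred so the module loads without AWS dependencies

    boto3.client("kinesis").put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["event_id"],  # spreads records across shards
    )


# Example: an order-created event, ready for put_event("events", event)
event = build_event("order_created", {"order_id": 1234, "market": "sf"})
```

Downstream, consumers would read these records off the stream and load them into Redshift tables, typically by batching to S3 and issuing a COPY.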
On the back end, our Airflow 1.7 implementation ran on a single server, with a growing number of DAGs scheduled to run daily. This led to rising CPU and memory demands, and a fragile deploy process that required downtime. In addition, Redshift was being treated as a catch-all solution for every data need, leading to long queueing times, slow queries, and growing storage costs, among other issues.
We also lacked comprehensive monitoring and insight beyond what AWS and some internal StatsD metrics offered. This hampered our ability to predict demand growth, adequately assess the current health of the system, or even receive proper alerts in some failure modes. As more internal teams and apps came to rely on the system, it became clear that it was time to start redesigning for scalability.
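For a sense of what those internal StatsD metrics look like on the wire, here is a minimal standard-library sketch of emitting a counter over UDP. The metric name is made up; 8125 is the conventional StatsD port.

```python
import socket


def statsd_send(metric: str, value: int, metric_type: str = "c",
                host: str = "127.0.0.1", port: int = 8125) -> str:
    """Emit a StatsD metric over UDP and return the wire-format line.
    Uses the plain StatsD text protocol: <name>:<value>|<type>.
    UDP is fire-and-forget, so this never blocks the calling app."""
    line = f"{metric}:{value}|{metric_type}"
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(line.encode("ascii"), (host, port))
    sock.close()
    return line


line = statsd_send("pipeline.events_ingested", 1)  # -> "pipeline.events_ingested:1|c"
```

In practice you would use an established StatsD client library rather than raw sockets; the point is only that each metric is a single, cheap UDP datagram, which is why StatsD alone gives limited visibility into systemic health.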
Transitioning Postmates' Data Architecture
As you would expect, it was important to support both the existing and new data architecture during the transition. By focusing on a phased rollout, we could implement upgrades piece-by-piece, avoiding a disruptive switchover to a new architecture that would take significant time and resources to build. Two of the bigger changes are highlighted and elaborated upon below.
We began by overhauling our Airflow system, as it was a vital component of our architecture that was at risk of being overwhelmed (and there were no good backup options available). Improving the Airflow system directly translated to higher data availability and accuracy for internal and external customers. We also made a number of architectural changes: transitioning from Airflow 1.7 to 1.8, simplifying local development with Docker containers, horizontally scaling Airflow with Celery, and establishing zero-downtime deployment jobs, all while writing extensive documentation on supported data patterns.
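For readers unfamiliar with Celery-backed Airflow, the switch from a single server boils down to a few configuration changes: point the executor at Celery and give the workers a shared broker and result backend, so tasks can fan out across machines. The fragment below is a hedged sketch using the 1.8-era config keys; the broker and backend URLs and the concurrency value are placeholders, not our production settings.

```ini
[core]
# Run tasks on distributed Celery workers instead of the local scheduler host
executor = CeleryExecutor

[celery]
# Placeholder URLs: any broker/result backend Celery supports will do
broker_url = redis://redis:6379/0
celery_result_backend = db+postgresql://airflow:changeme@postgres/airflow
# Tasks each worker process runs concurrently (tune to the worker's CPU/memory)
celeryd_concurrency = 16
```

With this in place, adding capacity becomes a matter of starting another `airflow worker` process on a new machine, which is what makes zero-downtime deploys and horizontal scaling practical.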
We then turned our attention to our Redshift cluster, which was another major risk to the overall scalability of the system. Our Redshift cluster is essential to all business intelligence needs at Postmates, and helps drive internal planning and financial projections. Mitigating potential problems involved reworking the security and access of the cluster, adding extensive monitoring, and reconfiguring the queuing setup, among other initiatives. In the future, we'll implement a longer-term, data lake-inspired solution, in which raw data will be placed on S3 and computed data on Redshift. But, for the time being, these shorter-term fixes have extended the life of the cluster.
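As one example of what reconfiguring the queuing setup involves: Redshift's workload management (WLM) is driven by a JSON parameter that splits the cluster into queues with their own concurrency and memory allocations. The sketch below builds and applies such a configuration via boto3; the queue names, concurrency levels, and memory splits are illustrative guesses, not our production values.

```python
import json

# Illustrative WLM queues: separating ETL writes from BI reads means a heavy
# ad-hoc aggregation cannot starve the load jobs (all values are made up).
QUEUES = [
    {"query_group": ["etl"], "query_concurrency": 5, "memory_percent_to_use": 50},
    {"query_group": ["bi"], "query_concurrency": 10, "memory_percent_to_use": 40},
    {"query_concurrency": 5},  # default queue for everything else
]


def wlm_parameter(queues: list) -> dict:
    """Build the parameter dict Redshift expects for WLM configuration."""
    return {
        "ParameterName": "wlm_json_configuration",
        "ParameterValue": json.dumps(queues),
    }


def apply_wlm(parameter_group: str, queues: list) -> None:
    """Apply the WLM config to a cluster parameter group (needs AWS creds)."""
    import boto3  # deferred so the module loads without AWS dependencies

    boto3.client("redshift").modify_cluster_parameter_group(
        ParameterGroupName=parameter_group,
        Parameters=[wlm_parameter(queues)],
    )


param = wlm_parameter(QUEUES)
```

The same separation of concerns motivates the data-lake direction: once raw data lives on S3, the cluster's queues only have to serve computed, query-shaped tables.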
Takeaways So Far
While some of our bigger architectural changes still lie ahead, we have already learned quite a bit. One thing we learned was not to discount the value of making smaller, incremental changes. It is important to take time to gather the necessary resources and staff. Having a strong cross-functional team that works across product, engineering, and data infrastructure to implement change is essential for meeting company goals. The changes we have already made in Airflow and Redshift have given us the bandwidth and time to focus on re-architecting other components, and are aligned with the overall goals of the system we are trying to build.
We have also recognized the value of communicating data access patterns and expectations with our internal users. Without solid communication, teams will either inappropriately use the data architecture you have, or, even worse, will start duplicating architecture independently, believing they have no alternatives. Much of our effort has been spent educating internal users on the strengths and weaknesses of our systems, and this has been a very worthwhile effort.
Last but not least, it's important to balance firefighting with long-term investment. While outages need to take priority, don't fall into the trap of focusing primarily on fixing very small, niche problems. Put fires out as they happen, and then start building out a longer-term strategy that could prevent such fires from starting in the first place.
As a result of our preliminary efforts, we have been able to move a number of key reports to reliable Airflow processes. And we can now handle a greater level of event data and query usage on our Redshift cluster, while still supporting our growing internal user base.