A Guide to Writing Faster, More Readable Data Analysis Code
The for loop. You've probably been writing them since the first day of your programming career. They work well for pulling things out of arrays, aside from the occasional indexing error at the end of the list. Python even provides sweet syntactic sugar such as for .. in and list comprehensions to make loops easier. But there's a problem with for loops, especially when iterating through the large data structures used in typical data science applications: they're often very slow. This notebook demonstrates alternatives to loops that offer performance and readability improvements of multiple orders of magnitude. It compares native Python loop performance to NumPy and pandas vectorized operations and provides recipes for performing efficient aggregation and transformation with pandas.
Benchmarking loop performance
In the following cells, the running times of several methods of summing a list of values are compared, including a simple for loop, Python's built-in sum function, and the sum method of the NumPy ndarray.
from __future__ import print_function, division

import random
import timeit
from collections import defaultdict

import numpy as np
import pandas as pd
import scipy
import statsmodels.api as sm
from IPython.display import display
# Generate some random numbers
N = 1000000
num_list = [random.randint(0, 1000) for i in range(N)]
num_arr = np.array(num_list)
num_series = pd.Series(num_list)
%%timeit
# Standard for loop accumulation
total = 0
for n in num_list:
    total += n
10 loops, best of 3: 23.3 ms per loop
# Reduce function
%timeit reduce(lambda x, y: x + y, num_list)
10 loops, best of 3: 61.2 ms per loop
The Python standard library's sum function is nearly 5 times as fast as the for loop.
# Built-in Python sum function
%timeit sum(num_list)
100 loops, best of 3: 4.93 ms per loop
In this trivial example, the ndarray sum method call is roughly 45 times faster than the for loop and about an order of magnitude faster than the built-in Python sum function.
# ndarray sum method
%timeit num_arr.sum()
1000 loops, best of 3: 516 µs per loop
The sum method provided by the pandas Series object runs faster than the standard library's sum but almost an order of magnitude slower than the ndarray's sum.
# pandas Series sum method
%timeit num_series.sum()
100 loops, best of 3: 3.57 ms per loop
Simply converting from the pandas representation to a NumPy representation via the Series.values attribute yields an almost full order of magnitude performance improvement in the summing operation.
# Convert pandas series to ndarray and sum
%timeit num_series.values.sum()
1000 loops, best of 3: 516 µs per loop
To summarize performance at summing a list: the NumPy ndarray sum > the pandas Series sum > the standard library sum > the for loop > the standard library reduce. Why is NumPy so much faster than the Python standard library? The ndarray object is of fixed size and all of its elements are of the same datatype, which allows aggregations such as summing to be performed by pre-compiled C code. The ndarray's sum method and the pandas Series' sum method are examples of vectorized operations, a standard component of array programming. In addition to the performance boost noted above for both the ndarray and the Series, vectorized code is often more readable: the sum call on the ndarray is a single line rather than the three lines of the loop example. The introduction to NumPy provides additional detail on the benefits of ndarray operations over loops. Now that the advantages of vectorized operations have been established, how can vectorization be extended beyond this trivial summing example? One solution that delivers both speed and readability boosts is the pandas GroupBy object.
Introducing the pandas GroupBy object
A common data analysis task is combining aggregation with grouping along one or more dimensions. For example, in addition to total sales we might be interested in seeing sales broken out by city and state. Assuming a typical tabular representation of sales data, one way to accomplish this task is to loop through the dataset, adding the desired dimensions (e.g. state) to a dictionary as necessary and incrementing the sales value. The complexity of storing and accessing this aggregated data in nested dictionary structures increases as additional dimensions are considered. The GroupBy object in pandas allows us to perform efficient vectorized aggregation instead. Let's start with a count across one dimension using the Titanic dataset.
# Import the titanic dataset from Rdatasets with statsmodels
titanic = sm.datasets.get_rdataset("titanic", "COUNT").data
# Examine the output
titanic.groupby('class')['survived'].count()
class
1st class    325
2nd class    285
3rd class    706
Name: survived, dtype: int64
The result of calling the groupby method along with the count aggregation is a pandas Series containing the number of passengers in each class (count tallies every non-null value in the survived column). The values of the grouping column become the index of the resulting aggregation. The groupby method segments the DataFrame into groups; iterating over the result yields tuples, whose contents we can examine with a loop.
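Because the group labels become the index, each group's aggregate is directly addressable by label. A minimal sketch with invented data:

```python
import pandas as pd

# Hypothetical toy data (values invented for illustration)
df = pd.DataFrame({
    'city':  ['Austin', 'Austin', 'Dallas', 'Dallas', 'Dallas'],
    'sales': [100, 200, 50, 75, 25],
})

# Aggregate: sum of sales per city
sales_by_city = df.groupby('city')['sales'].sum()

# The grouping values become the index, so groups are label-addressable
print(sales_by_city.loc['Austin'])  # 300
print(sales_by_city.loc['Dallas'])  # 150
```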
# Loop through groupby, unpacking tuple and taking first 5 rows
for grp, df in titanic.groupby('class'):
    print('Group: %s' % grp)
    print('DataFrame description: \n%s\n' % df.head())
    print()
Group: 1st class
DataFrame description: 
       class     age  sex survived
0  1st class  adults  man      yes
1  1st class  adults  man      yes
2  1st class  adults  man      yes
3  1st class  adults  man      yes
4  1st class  adults  man      yes

Group: 2nd class
DataFrame description: 
         class     age  sex survived
325  2nd class  adults  man      yes
326  2nd class  adults  man      yes
327  2nd class  adults  man      yes
328  2nd class  adults  man      yes
329  2nd class  adults  man      yes

Group: 3rd class
DataFrame description: 
         class     age  sex survived
610  3rd class  adults  man      yes
611  3rd class  adults  man      yes
612  3rd class  adults  man      yes
613  3rd class  adults  man      yes
614  3rd class  adults  man      yes
Iterating through the results of the groupby call shows that each group tuple consists of the name of the group and a DataFrame containing the rows belonging to that group. We see in the output above that the first 5 rows of the 3rd class group all have the string '3rd class' in the class column. We can split the vectorized class count into two separate lines to further illustrate the process.
# First create the group
grp = titanic.groupby('class')

# Count the rows in each group using a DataFrame-style syntax
grp['survived'].count()
class
1st class    325
2nd class    285
3rd class    706
Name: survived, dtype: int64
We select the survived column on the GroupBy object the same way we select a column from a DataFrame. Let's compare the performance of the vectorized count to a loop implementation.
# Vectorized implementation
%timeit titanic.groupby('class')['survived'].count()
1000 loops, best of 3: 417 µs per loop
%%timeit
# For loop implementation: row[1] is the class column, row[4] is survived
survived = defaultdict(int)
for row in titanic.itertuples():
    if row[4] == 'yes':
        survived[row[1]] += 1
1000 loops, best of 3: 410 µs per loop
In terms of performance, the for loop version and the vectorized version are about the same, but the single-line vectorized implementation is much more concise and readable. The groupby syntax is also more descriptive: the count aggregation appended to the groupby call clearly states the operation being performed, while the loop version is much less obvious. To aggregate on multiple levels, we simply provide additional column labels in a list to the groupby call.
# Vectorized implementation with multiple grouping levels
%timeit titanic.groupby(['class','sex'])['survived'].count()
1000 loops, best of 3: 750 µs per loop
As before, the count applied to the class and sex grouping results in a Series of counts. Because we specified multiple levels of grouping, the Series is indexed by a MultiIndex.
# Examine output
titanic.groupby(['class','sex'])['survived'].count()
class      sex  
1st class  man      180
           women    145
2nd class  man      179
           women    106
3rd class  man      510
           women    196
Name: survived, dtype: int64
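Individual cells of a MultiIndexed result are addressed with a tuple of labels, and unstack() pivots the inner level into columns for a tabular view. A small sketch with invented data:

```python
import pandas as pd

# Hypothetical two-level grouping (data invented for illustration)
df = pd.DataFrame({
    'cls':      ['1st', '1st', '2nd', '2nd', '2nd'],
    'sex':      ['man', 'women', 'man', 'man', 'women'],
    'survived': ['yes', 'yes', 'no', 'yes', 'no'],
})

counts = df.groupby(['cls', 'sex'])['survived'].count()

# Address a single cell with a tuple of group labels
print(counts.loc[('2nd', 'man')])  # 2

# Pivot the inner index level into columns
print(counts.unstack())
```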
Another example with standard deviation
In the introductory example we didn't realize a significant performance benefit because our desired aggregation required only a single traversal of the DataFrame. Let's try computing the standard deviation per group using the GroupBy std method and using a loop. We'll use the Midwest demographics dataset included with ggplot2.
# Load the midwest dataset from ggplot2
midwest = sm.datasets.get_rdataset("midwest", "ggplot2").data
# Get standard deviations of population density by state
%timeit midwest.groupby('state')['popdensity'].std()
1000 loops, best of 3: 305 µs per loop
# Examine the output
popdensity_stds_grp = midwest.groupby('state')['popdensity'].std()
popdensity_stds_grp
state
IL    9608.283297
IN    4195.924314
MI    7489.934415
OH    8084.947549
WI    7618.036983
Name: popdensity, dtype: float64
%%timeit
from math import sqrt

# Collect the population densities for each state
popdensity_states = defaultdict(list)
for row in midwest.itertuples():
    popdensity_states[row.state].append(row.popdensity)

# Now calculate the sample standard deviation per state
popdensity_stds = defaultdict(int)
for state, densities in popdensity_states.iteritems():
    state_mean = sum(densities) * 1.0 / len(densities)
    state_variances = [(density - state_mean)**2 for density in densities]
    state_stdev = sqrt(sum(state_variances) * 1.0 / (len(state_variances) - 1))
    popdensity_stds[state] = state_stdev
100 loops, best of 3: 2.05 ms per loop
Computing the standard deviation via vectorized operations is nearly an order of magnitude faster. The vectorized version is still only a single line and an order of magnitude more concise than the loop version. Now that we've further established both performance and code readability as motivation for implementing vector operations, let's turn our attention to methods of vectorizing existing loops using pandas DataFrame methods.
Vectorized operations with pandas
In the previous examples using the groupby method, we specified one or more columns to segment into groups and then aggregated. When attempting to remove for loops from your code, a good initial strategy is to identify all the dimensions you're interested in segmenting by. Next, determine the type of output desired: is the goal to aggregate or to transform? Aggregation reduces the data, while transformation preserves the original shape and structure of the data. Let's consider the Cars93 dataset, which offers a number of factor variables, to demonstrate the different scenarios.
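The aggregate-versus-transform distinction comes down to output shape, which a minimal sketch with invented data makes concrete:

```python
import pandas as pd

# Hypothetical data illustrating the shape difference (values invented)
df = pd.DataFrame({
    'grp': ['a', 'a', 'b', 'b', 'b'],
    'val': [1, 2, 3, 4, 5],
})

# Aggregation reduces to one row per group
agg_result = df.groupby('grp')['val'].mean()

# Transformation returns one row per input row
trans_result = df.groupby('grp')['val'].transform('mean')

print(len(agg_result))    # 2 (number of groups)
print(len(trans_result))  # 5 (original shape preserved)
```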
# Load the Cars93 dataset from MASS
cars = sm.datasets.get_rdataset("Cars93", "MASS").data
Aggregation
Our examples have all been simple aggregation applications, reducing all the observations within a group to a single row. We've been using built-in aggregation functions such as count and std, but the agg method of the GroupBy object allows us to pass in any function. Let's say we were interested in finding the difference between the price of the most expensive car for each manufacturer and the average price of all cars in the dataset. We'll parse each piece of the natural language for clues on implementing it in vectorized form.
- price of the most expensive car for each manufacturer: Group by manufacturer, find the maximum price
- average price for all cars in the dataset: Find the mean price for all cars
- finding the difference: Subtract the mean price of all cars from the group maxes
We can implement this custom aggregation by passing a function to the agg method of the GroupBy object. As a result of the aggregation, we'll get back one row for each distinct entry in the field(s) by which we are grouping.
%%timeit
# Subtract overall mean from group max
mean_price = cars['Price'].mean()
cars.groupby('Manufacturer')['Price'].agg(lambda grp: grp.max() - mean_price)
1000 loops, best of 3: 1.34 ms per loop
Let's verify our results on cars manufactured by Cadillac. We see below that computing the difference between the max and the overall mean for Cadillac matches the result from the aggregated Series.
mean_price = cars['Price'].mean()
max_price_cadillac = cars[cars.Manufacturer == 'Cadillac']['Price'].max()
car_max_mean = cars.groupby('Manufacturer')['Price'].agg(lambda grp: grp.max() - mean_price)

print('Mean price for all cars: %s' % np.around(mean_price, decimals=2))
print('Max price for Cadillac: %s' % max_price_cadillac)
print('Difference between max and mean: %s\n' % np.around(max_price_cadillac - mean_price, decimals=2))
print('Cars DataFrame agg result: %s\n' % car_max_mean.loc['Cadillac'])
print('Sample output from aggregated dataframe:')
print(car_max_mean.head())
Mean price for all cars: 19.51
Max price for Cadillac: 40.1
Difference between max and mean: 20.59

Cars DataFrame agg result: 20.5903225806

Sample output from aggregated dataframe:
Manufacturer
Acura       14.390323
Audi        18.190323
BMW         10.490323
Buick        6.790323
Cadillac    20.590323
Name: Price, dtype: float64
%%timeit
mean_price = cars['Price'].mean()
manuf_max_prices = dict()
manuf_max_mean_diffs = dict()
for row in cars.itertuples():
    manufacturer = row.Manufacturer
    price = row.Price
    if manufacturer in manuf_max_prices:
        if price > manuf_max_prices[manufacturer]:
            manuf_max_prices[manufacturer] = price
            manuf_max_mean_diffs[manufacturer] = price - mean_price
    else:
        manuf_max_prices[manufacturer] = price
        manuf_max_mean_diffs[manufacturer] = price - mean_price
1000 loops, best of 3: 1.48 ms per loop
We saw above that groupby produces a new DataFrame for each group. We then select the Price column from each group, and the resulting Series is passed to the agg function for each group: we call the max method on each group's Series and subtract the overall mean price of the cars DataFrame. We again see the benefits of the vectorized implementation in terms of brevity and readability. Performance is roughly equivalent between the two implementations, but the pandas implementation needs nearly an order of magnitude fewer lines of code: 2 lines compared to 13 in the loop implementation.
Transformation
In addition to aggregating data by groups, we may also want to perform an operation at the observation (row) level that incorporates some group-level attribute. As an example from the cars dataset, for each car type we might want to find the difference in MPG between each model and the average of all models of that type. These kinds of operations are called transformations, as they yield new information while preserving the existing structure of the data. Let's consider this fuel economy example. Once again, examining the language will help us determine how to set up the task as a vector operation.
- for each type: We are grouping by type
- find the difference in MPG for each model: We want to compute the difference for each model rather than aggregate up to the type level
- from the average of all models of that type: We want the difference to be from the within-group average
# groupby with transform
mpg_mean_diff = cars.groupby('Type')['MPG.city'].transform(lambda x: x - x.mean())

# Test for equality in length (number of rows)
mpg_mean_diff.shape[0] == cars.shape[0]
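The same demeaning pattern can be verified on a tiny sketch with invented values, where the expected differences are easy to compute by hand:

```python
import pandas as pd

# Hypothetical MPG-style data (values invented for illustration)
df = pd.DataFrame({
    'type': ['Small', 'Small', 'Large', 'Large'],
    'mpg':  [30, 34, 18, 22],
})

# Difference of each row from its group mean (Small mean 32, Large mean 20)
diff = df.groupby('type')['mpg'].transform(lambda x: x - x.mean())
print(diff.tolist())  # [-2.0, 2.0, -2.0, 2.0]
```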
Filtration
In addition to aggregation and transformation, filtration can be applied to groups to select only those records belonging to groups that meet specified criteria. The syntax is similar to standard boolean indexing of DataFrames. Let's say we wanted to select only models produced by manufacturers with more than 5 models represented in the data.
- select only models: We are selecting individual rows, not aggregating
- produced by manufacturers with more than 5 models: We want to apply the criteria to groups consisting of each manufacturer's models, taking the count
We use the filter method, which takes a function returning a boolean for each group. We see that only three manufacturers out of the original 32 had more than five models.
# Select all rows where the group length is greater than 5
cars_manuf_grt_five = cars.groupby('Manufacturer').filter(lambda x: len(x) > 5)

print('Number of manufacturers in cars dataset: %s' % cars['Manufacturer'].nunique())
print('Number of manufacturers with more than five models: %s' % cars_manuf_grt_five['Manufacturer'].nunique())
Number of manufacturers in cars dataset: 32
Number of manufacturers with more than five models: 3
Referencing multiple columns
In addition to operating on a single column of data, we may want to incorporate data from multiple columns in our grouping aggregations and transformations. In the example below, using data on refugees appealing a rejection of refugee status by the Canadian government, we compute the appeals approved by an independent rater but rejected by a judge as a percentage of total appeals by country. In this case we want to count rows where the judge's decision was 'no' and the rater's decision was 'yes'. To accomplish this, we group by country and use the standard label notation for referencing columns in a DataFrame (e.g. df['label']) to find the number of rows matching our criteria, which we divide by the total number of rows in the group.
# Load greene dataset
greene = sm.datasets.get_rdataset("Greene", "car").data

# Find percentage of 'no' judge and 'yes' rater decisions by country.
# Note we use apply rather than agg: agg passes each column Series
# separately, while apply passes the whole group DataFrame, which lets
# us reference multiple columns.
app_grp = greene.groupby('nation').apply(
    lambda x: len(x[(x['decision'] == 'no') & (x['rater'] == 'yes')]) / len(x))
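The multi-column pattern can be sketched on a small invented dataset, where the expected percentages are easy to check by hand:

```python
import pandas as pd

# Hypothetical appeals-style data (values invented for illustration)
df = pd.DataFrame({
    'nation':   ['A', 'A', 'A', 'B', 'B'],
    'decision': ['no', 'no', 'yes', 'no', 'yes'],
    'rater':    ['yes', 'no', 'yes', 'yes', 'no'],
})

# apply passes the whole group DataFrame, so multiple columns are addressable
pct = df.groupby('nation').apply(
    lambda g: len(g[(g['decision'] == 'no') & (g['rater'] == 'yes')]) / len(g))

# Nation A: 1 of 3 rows match; nation B: 1 of 2 rows match
print(round(pct['A'], 3), pct['B'])
```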
Referencing the index
In some cases we might want to reference the column by which we are grouping inside the applied function. For example, we might have the mean calories of each manufacturer's cereal stored in a dictionary and want to subtract that value from the calorie count of each individual variety. When iterating through a GroupBy object, the grouping value is available via the name property of each group. The cell below demonstrates using the name property to retrieve a value from a dictionary within a transform call.
# Load the cereal dataset
cereal = sm.datasets.get_rdataset("UScereal", "MASS").data

# Dictionary containing average calorie counts by manufacturer
cereal_mfr_dict = cereal.groupby('mfr')['calories'].mean().to_dict()

# Subtract the manufacturer's average calorie count from each cereal variety
cal_mean_diffs = cereal.groupby('mfr')['calories'].transform(lambda x: x - cereal_mfr_dict[x.name])
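The same name-property lookup can be sketched with a tiny invented dataset, where the group means are easy to verify:

```python
import pandas as pd

# Hypothetical cereal-style data (values invented for illustration)
df = pd.DataFrame({
    'mfr':      ['G', 'G', 'K'],
    'calories': [100, 120, 90],
})

# Precomputed group means stored in a plain dict: {'G': 110.0, 'K': 90.0}
means = df.groupby('mfr')['calories'].mean().to_dict()

# Inside transform, each group Series exposes its grouping value via .name
diffs = df.groupby('mfr')['calories'].transform(lambda x: x - means[x.name])
print(diffs.tolist())  # [-10.0, 10.0, 0.0]
```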
Summary
We saw in this notebook the usefulness of vectorization from both performance and readability perspectives. DataFrame methods for aggregation and grouping are typically faster to run and to write than equivalent standard library implementations with loops. Where performance is a serious consideration, NumPy ndarray methods offer as much as an order of magnitude increase in speed over DataFrame methods and the standard library.