A Guide to Writing Faster, More Readable Data Analysis Code

 

Introduction


Consider the for loop. You've probably been writing for loops since the first day of your programming career. They work well enough for pulling items out of arrays, aside from the occasional off-by-one error at the end of the list, and Python even provides sweet syntactic sugar such as for .. in and list comprehensions to make them easier to write. But for loops have a problem, especially when iterating through the large data structures typical of data science applications: they're often very slow. This notebook demonstrates alternatives to loops that can be an order of magnitude or more faster while also being more readable. It compares native Python loop performance to NumPy and pandas vectorized operations and provides recipes for performing efficient aggregation and transformation with pandas.
 

Benchmarking loop performance


In the following cells, the running times of several methods of summing a list of values are compared: a simple for loop, the standard library reduce function, Python's built-in sum function, the NumPy ndarray sum method, and the pandas Series sum method.
 
In [1]:
from __future__ import print_function, division

import random
import timeit
from collections import defaultdict
import numpy as np
import pandas as pd
import scipy
import statsmodels.api as sm

from IPython.display import display
In [2]:
# Generate some random numbers
N = 1000000
num_list = [random.randint(0,1000) for i in range(N)]
num_arr = np.array(num_list)
num_series = pd.Series(num_list)
In [3]:
%%timeit

# Standard for loop accumulation
total = 0
for n in num_list:
    total += n
 
10 loops, best of 3: 23.3 ms per loop
In [4]:
# Reduce function
%timeit reduce(lambda x, y: x+y, num_list)
 
10 loops, best of 3: 61.2 ms per loop
 
The built-in Python sum function is roughly five times as fast as the for loop.
 
In [5]:
# Built-in Python sum function
%timeit sum(num_list)
 
100 loops, best of 3: 4.93 ms per loop
 
In this trivial example, the ndarray sum method call is more than an order of magnitude faster than the for loop (roughly 45x) and about an order of magnitude faster than the built-in Python sum function.
 
In [6]:
# ndarray sum method
%timeit num_arr.sum()
 
1000 loops, best of 3: 516 µs per loop
 
The sum method provided by the pandas Series object runs slightly faster than the built-in sum (about 1.4 ms less per loop) but almost an order of magnitude slower than the ndarray's sum.
 
In [7]:
# pandas Series sum method
%timeit num_series.sum()
 
100 loops, best of 3: 3.57 ms per loop
 
Simply converting from the pandas representation to a NumPy representation via the Series.values field yields an almost full order of magnitude performance improvement in the sum function.
 
In [8]:
# Convert pandas series to ndarray and sum
%timeit num_series.values.sum()
 
1000 loops, best of 3: 516 µs per loop
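The gap isn't specific to sum; any element-wise operation shows the same pattern, because the work moves from the Python interpreter into compiled array code. As a quick illustrative sketch (hypothetical follow-up cells, timings not recorded here):

# Multiply every element by 2: interpreted comprehension vs. vectorized arithmetic
%timeit [n * 2 for n in num_list]
%timeit num_arr * 2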
 
To summarize in terms of best performance at summing a list: NumPy ndarray sum > pandas Series sum > standard library sum > for loop > standard library reduce. Why is NumPy so much faster than the Python standard library? The ndarray object is of fixed size and all elements are the same datatype. This allows aggregations such as summing to be performed by pre-compiled C code. The ndarray's sum method and the pandas Series' sum method are examples of vectorized operations, a standard component of array programming.

In addition to the performance boost noted above for both the ndarray and the Series, vectorized code is often more readable: the sum call on the ndarray is a single line rather than the 3 lines of the loop example. The introduction to NumPy provides additional detail on the benefits of using ndarray operations as opposed to loops.

Now that the advantages of vectorized operations have been clearly established, how can vectorization be extended beyond the trivial summing example? One solution that delivers both speed and readability boosts is the pandas GroupBy object.

Introducing the pandas GroupBy object


A common data analysis task is combining aggregation with grouping along one or more dimensions. For example, in addition to total sales we might be interested in seeing sales broken out by city and state. Assuming a typical tabular representation of sales data, one way to accomplish this task is to loop through the dataset, adding the desired dimensions (e.g. state) to a dictionary as necessary and incrementing the sales value. The complexity of storing and accessing this aggregated data in nested dictionary structures increases as additional dimensions are considered. The GroupBy object in pandas lets us perform the same kind of aggregation efficiently and with far less bookkeeping; a minimal sketch of the two approaches follows, and then we'll compare a grouped count on the Titanic dataset.
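As a sketch, assume a small hypothetical sales DataFrame with state, city, and sales columns (this toy data is not used elsewhere in the notebook):

# Hypothetical sales data for illustration only
sales = pd.DataFrame({
    'state': ['CA', 'CA', 'NY', 'NY'],
    'city':  ['LA', 'SF', 'NYC', 'NYC'],
    'sales': [100, 250, 300, 175],
})

# Loop version: nested dictionaries keyed by state, then city
totals = defaultdict(lambda: defaultdict(int))
for state, city, amount in zip(sales['state'], sales['city'], sales['sales']):
    totals[state][city] += amount

# Vectorized version: one line, returns a Series indexed by (state, city)
sales.groupby(['state', 'city'])['sales'].sum()

The groupby version also scales more gracefully: adding a dimension means adding a label to the list rather than another layer of nested dictionaries.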
In [9]:
# Import the titanic dataset from Rdatasets with statsmodels
titanic = sm.datasets.get_rdataset("titanic","COUNT").data
In [10]:
# Examine the output
titanic.groupby('class')['survived'].count()
Out[10]:
class
1st class    325
2nd class    285
3rd class    706
Name: survived, dtype: int64
 
The result of calling the groupby function along with the count function is a pandas Series indexed by passenger class. Note that count tallies the non-null entries of the survived column in each group, so these values are the total passengers per class rather than the number of survivors. The values of the grouping column become the index of the resulting aggregation.
 
The groupby method segments the DataFrame into groups; iterating over the resulting GroupBy object yields tuples. We can examine the contents of these tuples with a loop.
 
In [11]:
# Loop through groupby, unpacking tuple and taking first 5 rows
for grp, df in titanic.groupby('class'):
    print('Group: %s' % grp)
    print('DataFrame description: \n%s\n' % df.head())
    print()
 
Group: 1st class
DataFrame description: 
       class     age  sex survived
0  1st class  adults  man      yes
1  1st class  adults  man      yes
2  1st class  adults  man      yes
3  1st class  adults  man      yes
4  1st class  adults  man      yes


Group: 2nd class
DataFrame description: 
         class     age  sex survived
325  2nd class  adults  man      yes
326  2nd class  adults  man      yes
327  2nd class  adults  man      yes
328  2nd class  adults  man      yes
329  2nd class  adults  man      yes


Group: 3rd class
DataFrame description: 
         class     age  sex survived
610  3rd class  adults  man      yes
611  3rd class  adults  man      yes
612  3rd class  adults  man      yes
613  3rd class  adults  man      yes
614  3rd class  adults  man      yes
 
Iterating through the results of the groupby method call shows us that each group tuple consists of the name of the group and a DataFrame containing the rows corresponding to that group. We see in the output above that the first 5 rows of the 3rd class group all have the string '3rd class' in the class column. We can split the vectorized class survival count into two separate lines to further illustrate the process.
 
In [12]:
# First create the group
grp = titanic.groupby('class')
# Count the number of survivors using a DataFrame-style syntax
grp['survived'].count()
Out[12]:
class
1st class    325
2nd class    285
3rd class    706
Name: survived, dtype: int64
 
We select the survived column on the GroupBy object the same way we select a column from a DataFrame. Let's compare the performance of the vectorized count to a loop implementation. (Strictly speaking, count tallies every non-null row while the loop below increments only on 'yes' values, but the comparison still illustrates the relative cost of the two approaches.)
 
In [13]:
# Vectorized implementation
%timeit titanic.groupby('class')['survived'].count()
 
1000 loops, best of 3: 417 µs per loop
In [14]:
%%timeit
# For loop implementation
survived = defaultdict(int)
for row in titanic.itertuples():
    if row[4] == 'yes':
        survived[row[1]] += 1
 
1000 loops, best of 3: 410 µs per loop
 
In terms of performance, the for loop version and the vectorized version are about the same. The single-line vectorized implementation, however, is much more concise and readable than the loop implementation. The groupby syntax is also more descriptive: the count aggregation function appended to the groupby call clearly states the operation being performed, while the loop version is much less obvious. To aggregate on multiple levels we simply provide additional column labels in a list to the groupby function.
 
In [15]:
%timeit titanic.groupby(['class','sex'])['survived'].count()
 
1000 loops, best of 3: 750 µs per loop
 
As before, applying the count function to the class and sex grouping results in a Series of counts indexed by class and sex. Because we specified multiple levels of grouping, the Series is indexed by a MultiIndex.
 
In [16]:
# Examine output
titanic.groupby(['class','sex'])['survived'].count()
Out[16]:
class      sex  
1st class  man      180
           women    145
2nd class  man      179
           women    106
3rd class  man      510
           women    196
Name: survived, dtype: int64
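The MultiIndexed Series can be queried by label or reshaped into a two-way table. A brief sketch of both (output omitted):

counts = titanic.groupby(['class', 'sex'])['survived'].count()

# Look up a single (class, sex) combination by its MultiIndex label
counts.loc[('1st class', 'women')]

# Pivot the inner index level into columns to get a class-by-sex table
counts.unstack('sex')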
 

Another example with standard deviation

In the introductory example we didn't realize a significant performance benefit because our desired aggregation only required a single traversal of the DataFrame. Let's try computing the standard deviation of population density by state, first with the GroupBy std method and then with a loop. We'll use the midwest demographics dataset included with ggplot2.
 
In [17]:
# Load the midwest dataset from ggplot2
midwest = sm.datasets.get_rdataset("midwest","ggplot2").data
In [18]:
# Get standard deviations of population density by state
%timeit midwest.groupby('state')['popdensity'].std()
 
1000 loops, best of 3: 305 µs per loop
In [19]:
# Examine the output
popdensity_stds_grp = midwest.groupby('state')['popdensity'].std()
popdensity_stds_grp
Out[19]:
state
IL    9608.283297
IN    4195.924314
MI    7489.934415
OH    8084.947549
WI    7618.036983
Name: popdensity, dtype: float64
In [20]:

%%timeit
from math import sqrt

# Collect the population densities for each state
popdensity_states = defaultdict(list)
for row in midwest.itertuples():
    state = row[3]
    popdensity = row[6]
    popdensity_states[state].append(popdensity)

# Now calculate standard deviation
popdensity_stds = defaultdict(int)
for state, densities in popdensity_states.items():
    state_mean = sum(densities) * 1.0 / len(densities)
    state_variances = [(density - state_mean)**2 for density in densities]
    state_stdev = sqrt(sum(state_variances) * 1.0 / (len(state_variances) - 1))
    
    popdensity_stds[state] = state_stdev
 
100 loops, best of 3: 2.05 ms per loop
 
Computing the standard deviation via vectorized operations is nearly an order of magnitude faster. The vectorized version is still only a single line and an order of magnitude more concise than the loop version. Now that we've further established both performance and code readability as motivation for implementing vector operations, let's turn our attention to methods of vectorizing existing loops using pandas DataFrame methods.

Vectorized operations with pandas


In the previous examples using the groupby method, we specified one or more columns that we wanted to segment into groups and then aggregate. When attempting to remove for loops from your code, a good initial strategy is to identify all the dimensions you're interested in segmenting by. Next, determine the type of output desired: is the goal to aggregate or to transform? Aggregation reduces the data to one row per group, while transformation preserves the original shape and structure of the data; a toy sketch of the difference follows below. We'll then use the Cars93 dataset, which offers a number of factor variables, to demonstrate different scenarios.
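Here is that sketch (illustrative data only, not used elsewhere in the notebook): aggregation returns one row per group, while transformation returns one value per original row.

# Two groups, four rows
toy = pd.DataFrame({'grp': ['a', 'a', 'b', 'b'], 'val': [1, 2, 3, 4]})

toy.groupby('grp')['val'].sum()              # aggregation: 2 rows, one per group
toy.groupby('grp')['val'].transform('sum')   # transformation: 4 rows, original shape preserved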
 
In [21]:
cars = sm.datasets.get_rdataset("Cars93","MASS").data
 

Aggregation

Our examples have all been simple aggregation applications, reducing all the observations within a group to a single row. We've been using built-in DataFrame aggregation functions such as sum and std, but groupby allows us to pass in any function. Let's say we were interested in finding the difference between the price of the most expensive car for each manufacturer and the average price of all cars in the dataset. We'll parse each piece of the natural language description for clues on implementing it in vectorized form.
  • price of the most expensive car for each manufacturer: Group by manufacturer, find the maximum price
  • average price for all cars in the dataset: Find the mean price for all cars
  • finding the difference: Subtract the mean price of all cars from the group maxes
We'll pass an anonymous function to the agg method of the GroupBy object. As a result of the aggregation, we'll get back one row for each distinct entry in the field(s) by which we are grouping.
 
In [22]:
%%timeit
# Subtract overall mean from group max
mean_price = cars['Price'].mean()
cars.groupby('Manufacturer')['Price'].agg(lambda grp: grp.max() - mean_price)
 
1000 loops, best of 3: 1.34 ms per loop
 
Let's verify our results on cars manufactured by Cadillac. We see below that computing the difference between the maximum Cadillac price and the overall mean price matches the corresponding entry in the aggregated result.
 
In [23]:
mean_price = cars['Price'].mean()
max_price_cadillac = cars[cars.Manufacturer == 'Cadillac']['Price'].max()
car_max_mean = cars.groupby('Manufacturer')['Price'].agg(lambda grp: grp.max() - mean_price)

print('Mean price for all cars: %s' % np.around(mean_price, decimals=2))
print('Max price for Cadillac: %s' % max_price_cadillac)
print('Difference between max and mean: %s\n' % np.around(max_price_cadillac - mean_price, decimals=2))
print('Cars DataFrame agg result: %s\n' % car_max_mean.loc['Cadillac'])
print('Sample output from aggregated dataframe:')
print(car_max_mean.head())
 
Mean price for all cars: 19.51
Max price for Cadillac: 40.1
Difference between max and mean: 20.59

Cars DataFrame agg result: 20.5903225806

Sample output from aggregated dataframe:
Manufacturer
Acura       14.390323
Audi        18.190323
BMW         10.490323
Buick        6.790323
Cadillac    20.590323
Name: Price, dtype: float64
In [24]:
%%timeit
mean_price = cars['Price'].mean()
manuf_max_prices = dict()
manuf_max_mean_diffs = dict()

for row in cars.itertuples():
    
    manufacturer = row[1]
    price = row[5]
    
    if manufacturer in manuf_max_prices:
        if price > manuf_max_prices[manufacturer]:
            manuf_max_prices[manufacturer] = price
            manuf_max_mean_diffs[manufacturer] = price - mean_price
    else:
        manuf_max_prices[manufacturer] = price
        manuf_max_mean_diffs[manufacturer] = price - mean_price
 
1000 loops, best of 3: 1.48 ms per loop
We saw above that groupby produces a new DataFrame for each group. When we select the Price column on the GroupBy object and call agg, each group's prices are passed to our lambda as a Series; we call its max method and subtract the overall mean price of the cars DataFrame. We again see the benefits of the vectorized implementation in terms of brevity and readability. Performance is roughly equivalent between the two implementations, but the pandas implementation has nearly an order of magnitude fewer lines of code: 2 lines compared to 13 in the loop implementation.
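A related recipe worth noting (not benchmarked here): agg also accepts a list of functions, returning one column per function, which can replace several separate loop accumulators in a single pass.

# One row per manufacturer, one column per aggregation function
cars.groupby('Manufacturer')['Price'].agg(['min', 'max', 'mean'])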
 

Transformation

In addition to aggregating data by groups, we also may want to perform an operation at the observation (row) level that incorporates some group-level attribute. As an example from the cars dataset, for each car type we might want to find the difference in MPG for each model from the average of all models of that type. These kinds of operations are called transformations, as they yield new information while preserving the existing structure of the data. Let's consider the previously mentioned fuel economy example. Once again, examining the language will help us determine how to set up the task as a vector operation.
  • for each type: We are grouping by type
  • find the difference in MPG for each model: We want to compute the difference for each model rather than aggregate up to the type level
  • from the average of all models of that type: We want the difference to be from the within-group average
From the natural language description of the task, we want to group by type and transform the MPG of each model by subtracting the type's mean MPG. In both this example and the previous aggregation example, we group by the column mentioned in the "for each" component. We can accomplish this with a single line using pandas and verify that the number of rows returned by the transformation matches the number of rows in the original data.
 
In [25]:
# groupby with transform
mpg_mean_diff = cars.groupby('Type')['MPG.city'].transform(lambda x: x - x.mean())

# Test for equality in length (number of rows)
mpg_mean_diff.shape[0] == cars.shape[0]
Out[25]:
True
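As with the aggregation example, we can spot-check a single row. The sketch below (output not shown) compares the first transformed value with a manual calculation for that row's Type:

# Manual check: first row's city MPG minus the mean city MPG of its Type
first_row = cars.iloc[0]
type_mean = cars[cars['Type'] == first_row['Type']]['MPG.city'].mean()
manual_diff = first_row['MPG.city'] - type_mean

# Should agree (up to floating point) with the first transformed value
np.isclose(manual_diff, mpg_mean_diff.iloc[0])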
 

Filtration

In addition to aggregation and transformation, filtration can also be applied to groups to select only those records belonging to groups that meet the specified criteria. The syntax is similar to standard boolean indexing of DataFrames. Let's say we wanted to select only models produced by manufacturers with more than 5 models represented in the data.
  • select only models: We are selecting individual rows, not aggregating
  • produced by manufacturers with more than 5 models: We want to apply the criteria to groups consisting of each manufacturer's models, taking the count
As before, this operation requires just a single line of pandas. We group by manufacturer and then pass the filter method a lambda that returns a boolean for each group; rows belonging to groups where the lambda returns False are dropped. We see that only three manufacturers out of the original 32 had more than five models.
 
In [26]:
# Select all rows where the group length is greater than 5
cars_manuf_grt_five = cars.groupby('Manufacturer').filter(lambda x: len(x) > 5)

print('Number of manufacturers in cars dataset: %s' % cars['Manufacturer'].nunique())
print('Number of manufacturers with more than five models: %s' % cars_manuf_grt_five['Manufacturer'].nunique())
 
Number of manufacturers in cars dataset: 32
Number of manufacturers with more than five models: 3

Vectorization Recipes

Referencing multiple columns

In addition to operating on a single column of data, we may want to incorporate data from multiple columns in our grouping aggregations and transformations. In the example below, using data on refugees appealing a rejection of refugee status by the Canadian government, we compute the number of appeals approved by an independent rater but rejected by a judge as a percentage of total appeals for each country. In this case we want to count rows where the judge's decision was 'no' and the rater's decision was 'yes'. To accomplish this, we group by country and use the standard label notation for referencing columns in a DataFrame (e.g. df['label']) to find the number of rows that match our criteria, which we then divide by the total number of rows in the group.
 
In [27]:
# Load greene dataset
greene = sm.datasets.get_rdataset("Greene","car").data

# Find the percentage of appeals per nation where the judge said 'no' but the rater said 'yes'
app_grp = greene.groupby('nation').apply(
    lambda x: len(x[(x['decision'] == 'no') & (x['rater'] == 'yes')]) / len(x))
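An equivalent formulation, sketched below, builds a boolean marker for the judge-no/rater-yes rows and groups it by nation; because the mean of a boolean Series is the fraction of True values, it yields the same per-country percentages.

# Boolean marker for appeals the rater approved but the judge rejected
no_judge_yes_rater = (greene['decision'] == 'no') & (greene['rater'] == 'yes')

# Mean of the boolean marker per nation = proportion of such appeals
no_judge_yes_rater.groupby(greene['nation']).mean()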
 

Referencing the index

In some cases we might want to reference the value of the grouping column inside the applied function. For example, we might have the mean calorie count for each cereal manufacturer stored in a dictionary and want to subtract that value from the calorie count of each individual variety. Inside a function passed to transform (or apply), the group key is available via the name property of the group. The cell below demonstrates usage of the name property to retrieve a value from a dictionary within a transform.
 
In [28]:
# Load the cereal dataset
cereal = sm.datasets.get_rdataset("UScereal","MASS").data

# Dictionary containing average calorie counts by manufacturer
cereal_mfr_dict = cereal.groupby('mfr')['calories'].mean().to_dict()

# Subtract the manufacturer's average calorie count from each cereal variety
cal_mean_diffs = cereal.groupby('mfr')['calories'].transform(lambda x: x - cereal_mfr_dict[x.name])
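When the group-level value is simply an aggregate of the same column, the dictionary lookup isn't strictly necessary; transform can broadcast the group mean directly, while the name-based lookup above remains useful when the per-group values come from elsewhere. A sketch of the shortcut:

# Equivalent result without the intermediate dictionary
cal_mean_diffs_alt = cereal['calories'] - cereal.groupby('mfr')['calories'].transform('mean')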
 

Summary

We saw in this notebook the usefulness of vectorization from both performance and readability perspectives. pandas methods for grouping and aggregation are typically faster to run and to write than equivalent standard library implementations built on loops. Where performance is a serious consideration, NumPy ndarray methods offer speed increases of as much as an order of magnitude over DataFrame methods and the standard library.
 