# A Guide to Writing Faster, More Readable Data Analysis Code

## Introduction

Consider the `for` loop. You've probably been writing them since the first day of your programming career. It works well enough for pulling items out of arrays, aside from the occasional off-by-one error at the end of the list. Python even gives sweet syntactic sugar such as `for .. in` and list comprehensions to make loops easier. But there's a problem with `for` loops, especially when iterating through the large data structures used in typical data science applications: they're often very slow. This notebook demonstrates alternatives to loops that offer performance and readability improvements of multiple orders of magnitude. It compares native Python loop performance to NumPy and pandas vectorized operations and provides recipes for performing efficient aggregation and transformation with pandas.

## Benchmarking loop performance

In the following cells, we compare the running time of several methods of summing a list of values, including a simple `for` loop, Python's built-in `sum` function, and the `sum` method of the NumPy `ndarray`.

In [1]:

```
from __future__ import print_function, division
import random
import timeit
from collections import defaultdict
from functools import reduce  # reduce is not a builtin in Python 3
import numpy as np
import pandas as pd
import scipy
import statsmodels.api as sm
from IPython.display import display
```

In [2]:

```
# Generate some random numbers
N = 1000000
num_list = [random.randint(0,1000) for i in range(N)]
num_arr = np.array(num_list)
num_series = pd.Series(num_list)
```

In [3]:

```
%%timeit
# Standard for loop accumulation
total = 0
for n in num_list:
    total += n
```

`10 loops, best of 3: 23.3 ms per loop`

In [4]:

```
# Reduce function
%timeit reduce(lambda x, y: x+y, num_list)
```

`10 loops, best of 3: 61.2 ms per loop`

The Python standard library `sum` function is about 5 times as fast as the for loop.

In [5]:

```
# Built-in Python sum function
%timeit sum(num_list)
```

`100 loops, best of 3: 4.93 ms per loop`

In this trivial example, the ndarray `sum` method call is more than an order of magnitude faster than the for loop and about an order of magnitude faster than the Python `sum` function.

In [6]:

```
# ndarray sum method
%timeit num_arr.sum()
```

`1000 loops, best of 3: 516 µs per loop`

The `sum` method provided by the pandas Series object runs about 1.4 ms faster than the standard library's `sum` and almost an order of magnitude slower than the ndarray's `sum`.

In [7]:

```
# pandas Series sum method
%timeit num_series.sum()
```

```
100 loops, best of 3: 3.57 ms per loop
```

Simply converting from the pandas representation to a NumPy representation via the `Series.values` attribute yields almost a full order of magnitude of performance improvement in the `sum` call.

In [8]:

```
# Convert pandas series to ndarray and sum
%timeit num_series.values.sum()
```

```
1000 loops, best of 3: 516 µs per loop
```

To summarize best performance at summing a list: NumPy ndarray `sum` > pandas Series `sum` > standard library `sum` > for loop > standard library `reduce`. Why is NumPy so much faster than the Python standard library? The ndarray object is of fixed size and all of its elements are the same datatype, which allows aggregations such as summing to be performed by pre-compiled C code. The ndarray's `sum` method and the pandas Series' `sum` method are examples of vectorized operations, a standard component of array programming. In addition to the performance boost noted above for both the ndarray and the Series, vectorized code is often more readable: the `sum` call on the ndarray is a single line rather than the 3 lines of the loop example. The introduction to NumPy provides additional detail on the benefits of using ndarray operations rather than loops. Now that the advantages of vectorized operations have been established, how can vectorization be extended beyond this trivial summing example? One solution that delivers both speed and readability gains is the pandas `GroupBy` object.

## Introducing the pandas GroupBy object

A common data analysis task is combining aggregation with grouping by one or more dimensions. For example, in addition to total sales we might be interested in seeing sales broken out by city and state. Assuming a typical tabular representation of sales data, one way to accomplish this is to loop through the dataset, adding the desired dimensions (e.g. state) to a dictionary as necessary and incrementing the sales value. The complexity of storing and accessing this aggregated data in nested dictionary structures grows as additional dimensions are considered. The `GroupBy` object in pandas lets us perform efficient vectorized aggregation instead. Let's compare a sum across one dimension using the Titanic dataset.
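
To make the contrast concrete, here is a small sketch (with made-up sales numbers, not part of the notebook's datasets) of the dictionary-based loop next to its one-line `groupby` equivalent:

```python
import pandas as pd

# Hypothetical sales data for illustration only
sales = pd.DataFrame({
    'state':  ['NY', 'NY', 'CA', 'CA'],
    'city':   ['NYC', 'Albany', 'LA', 'LA'],
    'amount': [100, 50, 75, 25],
})

# Loop approach: nested dictionaries keyed by state, then city
totals = {}
for row in sales.itertuples():
    totals.setdefault(row.state, {})
    totals[row.state][row.city] = totals[row.state].get(row.city, 0) + row.amount

# Vectorized approach: one groupby call replaces all the bookkeeping above
totals_grp = sales.groupby(['state', 'city'])['amount'].sum()
```

Adding a third dimension to the loop version means another layer of nesting, while the `groupby` version only needs one more label in the list.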

In [9]:

```
# Import the titanic dataset from Rdatasets with statsmodels
titanic = sm.datasets.get_rdataset("titanic","COUNT").data
```

In [10]:

```
# Examine the output
titanic.groupby('class')['survived'].count()
```

Out[10]:

```
class
1st class 325
2nd class 285
3rd class 706
Name: survived, dtype: int64
```

The result of calling the `groupby` method followed by the `count` aggregation is a pandas `Series` containing the count of non-null `survived` entries — that is, the number of passengers — in each class. The values of the grouping column become the index of the resulting aggregation. The `groupby` method segments the DataFrame into groups, each represented by a (name, DataFrame) tuple. We can examine the contents of these tuples with a loop.

In [11]:

```
# Loop through groupby, unpacking tuple and taking first 5 rows
for grp, df in titanic.groupby('class'):
    print('Group: %s' % grp)
    print('DataFrame description: \n%s\n' % df.head())
    print()
```

```
Group: 1st class
DataFrame description:
class age sex survived
0 1st class adults man yes
1 1st class adults man yes
2 1st class adults man yes
3 1st class adults man yes
4 1st class adults man yes
Group: 2nd class
DataFrame description:
class age sex survived
325 2nd class adults man yes
326 2nd class adults man yes
327 2nd class adults man yes
328 2nd class adults man yes
329 2nd class adults man yes
Group: 3rd class
DataFrame description:
class age sex survived
610 3rd class adults man yes
611 3rd class adults man yes
612 3rd class adults man yes
613 3rd class adults man yes
614 3rd class adults man yes
```

Iterating through the results of the `groupby` method call shows us that each group tuple consists of the name of the group and a DataFrame containing the rows belonging to that group. We see in the output above that the first 5 rows of the 3rd class group all have the string '3rd class' in the `class` column. We can split the vectorized class count into two separate lines to further illustrate the process.

In [12]:

```
# First create the group
grp = titanic.groupby('class')
# Count the number of survivors using a DataFrame-style syntax
grp['survived'].count()
```

Out[12]:

```
class
1st class 325
2nd class 285
3rd class 706
Name: survived, dtype: int64
```

We select the `survived` column on the `GroupBy` object the same way we select a column from a DataFrame. Let's compare the performance of the vectorized count to a loop implementation.

In [13]:

```
# Vectorized implementation
%timeit titanic.groupby('class')['survived'].count()
```

`1000 loops, best of 3: 417 µs per loop`

In [14]:

```
%%timeit
# For loop implementation
survived = defaultdict(int)
for row in titanic.itertuples():
    if row[4] == 'yes':
        survived[row[1]] += 1
```

`1000 loops, best of 3: 410 µs per loop`

In terms of performance, the for loop version and the vectorized version are about the same, but the single-line vectorized implementation is much more concise and readable than the loop. The `groupby` syntax is also more descriptive: the `count` aggregation appended to the `groupby` call clearly states the operation being performed, while the loop version is much less obvious. To aggregate on multiple levels, we simply pass a list of column labels to the `groupby` method.

In [15]:

```
%timeit titanic.groupby(['class','sex'])['survived'].count()
```

`1000 loops, best of 3: 750 µs per loop`

As before, applying the `count` function to the class and sex grouping yields a Series of counts indexed by class and sex. Because we specified multiple levels of grouping, the Series is indexed by a MultiIndex.

In [16]:

```
# Examine output
titanic.groupby(['class','sex'])['survived'].count()
```

Out[16]:

```
class sex
1st class man 180
women 145
2nd class man 179
women 106
3rd class man 510
women 196
Name: survived, dtype: int64
```
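
A convenient way to work with a MultiIndexed result like this is to select with tuples or pivot a level into columns with `unstack`. A small sketch using a hand-built Series with the same shape and values as the output above:

```python
import pandas as pd

# Series mirroring the (class, sex) survivor-count output shown above
counts = pd.Series(
    [180, 145, 179, 106, 510, 196],
    index=pd.MultiIndex.from_product(
        [['1st class', '2nd class', '3rd class'], ['man', 'women']],
        names=['class', 'sex']),
    name='survived')

# Select a single (class, sex) combination with a tuple
first_class_women = counts.loc[('1st class', 'women')]

# Pivot the inner 'sex' level into columns for a tabular view
table = counts.unstack('sex')
```

`unstack` returns a DataFrame with one row per class and one column per sex, which is often easier to read and plot than the stacked Series.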

### Another example with standard deviation

In the introductory example we didn't realize a significant performance benefit because our desired aggregation only required a single traversal of the DataFrame. Let's try computing the standard deviation using the grouped Series' `std` method and using a loop. We'll use the Midwest demographics dataset included with ggplot2.

In [17]:

```
# Load the midwest dataset from ggplot2
midwest = sm.datasets.get_rdataset("midwest","ggplot2").data
```

In [18]:

```
# Get standard deviations of population density by state
%timeit midwest.groupby('state')['popdensity'].std()
```

`1000 loops, best of 3: 305 µs per loop`

In [19]:

```
# Examine the output
popdensity_stds_grp = midwest.groupby('state')['popdensity'].std()
popdensity_stds_grp
```

Out[19]:

```
state
IL 9608.283297
IN 4195.924314
MI 7489.934415
OH 8084.947549
WI 7618.036983
Name: popdensity, dtype: float64
```

In [20]:

```
%%timeit
from math import sqrt
# For each state population densities
popdensity_states = defaultdict(list)
for row in midwest.itertuples():
    state = row[3]
    popdensity = row[6]
    popdensity_states[state].append(popdensity)
# Now calculate standard deviation
popdensity_stds = defaultdict(int)
for state, densities in popdensity_states.items():
    state_mean = sum(densities) * 1.0 / len(densities)
    state_variances = [(density - state_mean)**2 for density in densities]
    state_stdev = sqrt(sum(state_variances) * 1.0 / (len(state_variances) - 1))
    popdensity_stds[state] = state_stdev
```

`100 loops, best of 3: 2.05 ms per loop`

Computing the standard deviation via vectorized operations is nearly an order of magnitude faster. The vectorized version is still only a single line and an order of magnitude more concise than the loop version. Now that we've further established both performance and code readability as motivation for implementing vector operations, let's turn our attention to methods of vectorizing existing loops using pandas DataFrame methods.

## Vectorized operations with pandas

In the previous examples using the `groupby` method, we specified one or more columns to segment into groups and then aggregate. When attempting to remove for loops from your code, a good initial strategy is to identify all the dimensions you're interested in segmenting by. Next, determine the type of output desired: is the goal to aggregate or to transform? Aggregation reduces the data, while transformation preserves its original shape and structure. Let's consider the Cars93 dataset, which offers a number of factor variables to demonstrate different scenarios.
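
The distinction between the two output shapes can be seen on a tiny illustrative frame (made-up data, not the cars dataset): aggregation returns one row per group, while transformation returns one value per original row.

```python
import pandas as pd

# Toy frame to contrast aggregation and transformation output shapes
df = pd.DataFrame({'grp': ['a', 'a', 'b'], 'val': [1.0, 3.0, 5.0]})

# Aggregation reduces each group to a single row, indexed by group
agg_result = df.groupby('grp')['val'].mean()

# Transformation returns one value per original row, on the original index
trans_result = df.groupby('grp')['val'].transform('mean')
```
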
In [21]:

```
cars = sm.datasets.get_rdataset("Cars93","MASS").data
```

### Aggregation

Our examples have all been simple *aggregation* applications, reducing all the observations within a group to a single row. We've been using the built-in DataFrame aggregation functions including `sum` and `std`, but `groupby` allows us to pass in any function. Let's say we were interested in finding the difference between the price of the most expensive car for each manufacturer and the average price of all cars in the dataset. We'll parse each piece of the natural language for clues on implementing it in vectorized form.

- *price of the most expensive car for each manufacturer*: Group by manufacturer and find the maximum price
- *average price for all cars in the dataset*: Find the mean price across all cars
- *finding the difference*: Subtract the mean price of all cars from the group maxes

We can implement this with the `agg` method of the `GroupBy` object. As a result of the aggregation function, we get back one row for each distinct entry in the field(s) by which we are grouping.
In [22]:

```
%%timeit
# Subtract overall mean from group max
mean_price = cars['Price'].mean()
cars.groupby('Manufacturer')['Price'].agg(lambda grp: grp.max() - mean_price)
```

```
1000 loops, best of 3: 1.34 ms per loop
```

Let's verify our results on cars manufactured by Cadillac. We see below that computing the difference between mean and max for Cadillac matches the result from the aggregated DataFrame.

In [23]:

```
mean_price = cars['Price'].mean()
max_price_cadillac = cars[cars.Manufacturer == 'Cadillac']['Price'].max()
car_max_mean = cars.groupby('Manufacturer')['Price'].agg(lambda grp: grp.max() - mean_price)
print('Mean price for all cars: %s' % np.around(mean_price, decimals=2))
print('Max price for Cadillac: %s' % max_price_cadillac)
print('Difference between max and mean: %s\n' % np.around(max_price_cadillac - mean_price, decimals=2))
print('Cars DataFrame agg result: %s\n' % car_max_mean.loc['Cadillac'])
print('Sample output from aggregated dataframe:')
print(car_max_mean.head())
```

```
Mean price for all cars: 19.51
Max price for Cadillac: 40.1
Difference between max and mean: 20.59
Cars DataFrame agg result: 20.5903225806
Sample output from aggregated dataframe:
Manufacturer
Acura 14.390323
Audi 18.190323
BMW 10.490323
Buick 6.790323
Cadillac 20.590323
Name: Price, dtype: float64
```

In [24]:

```
%%timeit
mean_price = cars['Price'].mean()
manuf_max_prices = dict()
manuf_max_mean_diffs = dict()
for row in cars.itertuples():
    manufacturer = row[1]
    price = row[5]
    if manufacturer in manuf_max_prices:
        if price > manuf_max_prices[manufacturer]:
            manuf_max_prices[manufacturer] = price
            manuf_max_mean_diffs[manufacturer] = price - mean_price
    else:
        manuf_max_prices[manufacturer] = price
        manuf_max_mean_diffs[manufacturer] = price - mean_price
```

`1000 loops, best of 3: 1.48 ms per loop`

We saw above that `groupby` produces a new DataFrame for each group. We then select the `Price` column from each group, and the resulting Series is passed to the function given to `agg`. We call the `max` method on each group's Series and subtract the overall mean price of the `cars` DataFrame. We again see the benefits of the vectorized implementation in terms of brevity and readability: performance is roughly equivalent between the two implementations, but the pandas version needs nearly an order of magnitude fewer lines of code, at 2 lines compared to 13 in the loop implementation.

### Transformation

In addition to aggregating data by groups, we may also want to perform an operation at the observation (row) level that incorporates some group-level attribute. As an example from the cars dataset, for each car type we might want to find the difference in MPG between each model and the average of all models of that type. These kinds of operations are called *transformations*, as they yield new information while preserving the existing structure of the data. Let's consider the fuel economy example just mentioned. Once again, examining the language will help us set up the task as a vector operation.

- *for each type*: We are grouping by type
- *find the difference in MPG for each model*: We want to compute a difference for each model rather than aggregate up to the type level
- *from the average of all models of that type*: We want the difference to be from the within-group average

In [25]:

```
# groupby with transform
mpg_mean_diff = cars.groupby('Type')['MPG.city'].transform(lambda x: x - x.mean())
# Test for equality in length (number of rows)
mpg_mean_diff.shape[0] == cars.shape[0]
```

Out[25]:

`True`
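
Because the transform result is aligned on the original index, a common follow-up is to attach it as a new column. A minimal sketch with made-up stand-in data (the notebook itself uses the full cars dataset):

```python
import pandas as pd

# Made-up stand-in for the cars data, with Type and MPG.city columns only
cars_demo = pd.DataFrame({
    'Type':     ['Small', 'Small', 'Large'],
    'MPG.city': [30, 34, 18],
})

# Difference from each row's within-group mean, aligned on the original index
cars_demo['MPG.diff'] = (cars_demo.groupby('Type')['MPG.city']
                         .transform(lambda x: x - x.mean()))
```
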

### Filtration

In addition to aggregation and transformation, filtration can be applied to groups to select only those records belonging to groups that meet specified criteria. The syntax is similar to standard boolean indexing of DataFrames. Let's say we wanted to select only models produced by manufacturers with more than 5 models represented in the data.

- *select only models*: We are selecting individual rows, not aggregating
- *produced by manufacturers with more than 5 models*: We apply the criteria to groups consisting of each manufacturer's models, taking the count

We can implement this with the `filter` method, passing a function that returns a boolean for each group. We see that only three manufacturers out of the original 32 had more than five models.
In [26]:

```
# Select all rows where the group length is greater than 5
cars_manuf_grt_five = cars.groupby('Manufacturer').filter(lambda x: len(x) > 5)
print('Number of manufacturers in cars dataset: %s' % cars['Manufacturer'].nunique())
print('Number of manufacturers with more than five models: %s' % cars_manuf_grt_five['Manufacturer'].nunique())
```

```
Number of manufacturers in cars dataset: 32
Number of manufacturers with more than five models: 3
```

## Vectorization Recipes

### Referencing multiple columns

In addition to operating on a single column of data, we may want to incorporate data from multiple columns in our grouping aggregations and transformations. In the example below, using data on refugees appealing a rejection of refugee status by the Canadian government, we compute the appeals approved by an independent rater but rejected by a judge as a percentage of total appeals for each country. In this case we want to count rows where the judge's decision was 'no' and the rater's decision was 'yes'. To accomplish this, we group by country and use the standard label notation for referencing columns in a DataFrame (e.g. `df['label']`) to find the number of rows that match our criteria, which we divide by the total number of rows in the group.
In [27]:

```
# Load greene dataset
greene = sm.datasets.get_rdataset("Greene","car").data
# Find percentage of no judge and yes rater decisions
# apply (rather than agg) lets the function see multiple columns of each group
app_grp = greene.groupby('nation').apply(lambda x: len(x[(x['decision'] == 'no') & (x['rater'] == 'yes')]) / len(x))
```

### Referencing the index

In some cases we might want to reference the value of the grouping column inside the applied function. For example, we might have the mean calories of each manufacturer's cereal stored in a dictionary and want to subtract this value from the calorie count of an individual variety. When iterating through a `GroupBy` object, the grouping value is available via each group's `name` attribute. The cell below demonstrates usage of the `name` attribute to retrieve a value from a dictionary within a `transform`.
In [28]:

```
# Load the cereal dataset
cereal = sm.datasets.get_rdataset("UScereal","MASS").data
# Dictionary containing average calorie counts by manufacturer
cereal_mfr_dict = cereal.groupby('mfr')['calories'].mean().to_dict()
# Subtract the manufacturer's average calorie count from each cereal variety
cal_mean_diffs = cereal.groupby('mfr')['calories'].transform(lambda x: x - cereal_mfr_dict[x.name])
```
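
As an aside, for this particular computation the dictionary isn't strictly required: the group mean is available inside the `transform` itself. A small sketch on made-up data showing the two forms agree:

```python
import pandas as pd

# Toy cereal-like data for illustration
cereal_demo = pd.DataFrame({
    'mfr':      ['G', 'G', 'K'],
    'calories': [110, 130, 90],
})

# Dictionary-based version, as in the cell above
mfr_means = cereal_demo.groupby('mfr')['calories'].mean().to_dict()
via_dict = cereal_demo.groupby('mfr')['calories'].transform(
    lambda x: x - mfr_means[x.name])

# Equivalent direct version: subtract the group mean inside the transform
direct = cereal_demo.groupby('mfr')['calories'].transform(lambda x: x - x.mean())
```

The dictionary pattern remains useful when the per-group value comes from somewhere other than the grouped data itself.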

## Summary

We saw in this notebook the usefulness of vectorization from both performance and readability perspectives. DataFrame methods for aggregation and grouping are typically faster to run and faster to write than equivalent standard library implementations with loops. For instances where performance is a serious consideration, NumPy `ndarray` methods offer as much as an order of magnitude increase in speed over DataFrame methods and the standard library.