# The Data Science Behind Binge Buying and the Hot Hand Effect

Customer lifetime value (CLV) models often rely on a behaviorial framework that characterizes the purchasing behaviors of customers. For example, in the Pareto/NBD and BG/NBD transactional models, the customer purchasing pattern is assumed to be somewhat random with respect to time. In contrast, other models like the Pareto/GGG are more flexible and customers can purchase with some amount of regularity (e.g. I purchase groceries every week).

In this post, we will take a look at a simple diagnostic that was developed in the context of CLV modeling that allows industries to identify customers who are “bingers”. The definition of what constitutes a binging behavior may vary depending on the context, but it's generally characterized by periods of unusually high consumption/use followed by more normal episodes of consumption/use.

Let’s illustrate the binge behavior with a simple example.

In the example above, we see two “purchasing trajectories” made by customer A and B. The horizontal line represents time and each “X” represents a purchase made by a customer. Both customers have been active for the same length of time and both have made the same number of purchases. Yet Customer B behaves very differently from Customer A. Customer A makes purchases somewhat randomly in time. Customer B, however, tends to make purchases closely spaced in time -- that's your shopping spree! A is your average customer. B is your “binger”. The likelihood that B will make a purchase in a small amount of time after having made one is much higher than A.

This binging behavior goes beyond simple purchases. A very relevant example is Netflix “binge watching” which afflicts many of us, and Netflix is not afraid to tell us to get off the couch. Gamers can also spend hours playing their favorite video games, sometimes with dire consequences. A particular application we’ll discuss in this post is how to identify “binge scorers” in the National Hockey League.

Why should you care? Whether or not a customer tends to consume content or purchase in a “binge” fashion is a very important behavioral trait. Identifying the customers who exhibit binging tendencies has applications in retail, gaming, media, among other industries. It can be used to segment your customers and is an important predictor in CLV models. But how do you go about identifying these bingers?

## The Clumpiness Index Hp: Automatically Identify your “Bingers”

In a recent paper by Zhang et al. (2015), the authors described an entropic metric of clumpiness Hp. You can think of “clumpiness” as synonymous with “binging” behavior, or a sudden burst of activity that is unlikely to occur simply by chance. The authors highlighted several desirable properties of their “clumpiness index” or “clumpiness metric”, which are listed below:

• the metric should reach a minimum when all events considered are equally spaced in time
• the metric should reach a maximum when all the events are gathered together in time
• shifting the event times by a small amount in time should only change the index by a small amount
• as the events considered are moving closer or further apart on a time axis, the measure should increase or decrease. That’s to ensure convergence. I would also add that this last bullet point can also enforce a monotonically increasing index as events are moved closer together in time.

The authors defined an index that satisfies all four constraints listed above. Their index is an individual-level entropy metric:

where N represents the duration of the period over which events are recorded (e.g. N = 30 days), n is the number of events of interest (e.g. number of purchases, sessions, goals, etc.), Δti = ti − ti−1 where ti is the time of the ith event.

This metric is particularly useful because one can easily measure the “degree” of clumpiness. A high value of Hp may not necessarily mean that the customer is a “binger”. To really assess the degree of clumpiness, one has to compare Hp obtained for a particular customer with the distribution of Hp one would obtain under the null hypothesis (actions taken occur randomly in time). This can be easily done by simulations as we will see in the example below. One can compare the p values of different individuals with different values of N and n and rank these individuals based on their “binging” tendencies.

We will now illustrate an application of “binge/clumpiness” index Hp by looking at goals scored by individual players in the National Hockey League (NHL).

## Example of Application: What NHL players are “binge” scorers?

During hockey games, you often hear sports analysts say that a player is “on fire”, “on a hot streak” or has a “hot hand”. This is generally in reference to a player who seems to be scoring a lot of goals in a short number of games. Conversely, you may also hear that a player is in a scoring "drought", meaning that the player has not scored in a while.

In fact the hot hand effect, which has been defined as the phenomenon when the performance of a player is significantly better than could be expected on the basis of a player's overall record, has been widely studied in sports analytics in general, dating back from the 80’s (e.g. Gilovich et al. 1985; Tversky and Gilovich 1989). As pointed out by Zhang et al., many of these early "hot hand" tests have low statistical power. Zhang et al. demonstrated that that their entropy-based clumpiness index Hp has higher statistical power than many of these early metrics, making it a better approach to confidently identify the real "streakers".

The underlying assumption made by these sports analysts is that this “hot streak” effect is not occurring by chance, and that the player or the team is “doing something” to facilitate that behavior (success breeds success). In statistical terms, one can say that this behavior cannot be accounted for by chance alone. I wanted to dig a bit deeper and identify which players are scoring goals in a very unusual fashion.

To illustrate how one can automatically identify streakers using the Hp metric, I scraped player-level scoring data from hockey-reference.com to assess how many players, and which players specifically, are truly the ones who score goals in a “hot streak” fashion in the span of a season (2015-16). This example is similar to the purchasing behavior we discussed previously, except that in this case, the observation period is each game where a given player is in uniform and each event is a game with at least one goal scored by the player of interest.

We’ll now go through each step in the process of identifying those “hot streakers/bingers". The entire workflow was implemented and run on the DataScience.com Platform.

# All import statements here
import os
import datetime
import urllib2
import re

import numpy as np
import scipy as sc
import pandas as pd
import random
from random import randint
import math
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
from scipy.interpolate import interp2d
from bs4 import BeautifulSoup
from IPython.display import Image, display
from IPython.core.display import HTML

# Matplotlib parameters:
plt.rcParams['figure.figsize'] = (15, 5)
plt.rcParams['mathtext.fontset']='custom'
%matplotlib inline

## Scraping Data From hockey-reference.com

We first started by scraping data from the site hockey-reference.com, a leading provider of all NHL-related statistics.

In the cell below, I defined a few utility functions to scrape the site hockey-reference.com. I focus only on the 2015-16 season. The parse_player_record() will give us a data frame for each player, as well as a scoring record for each game they played in the 2015-16 season.

def find_team_links(years):
""" This function finds the links for all the teams in the
league on the website hockey-reference.com.

Parameters
----------

years (list): list of years you want to extract the teams for.
Years are really NHL seasons.

Returns
-------

Two lists (east teams, west teams) of url suffixes (strings).

"""

base_year = 'http://www.hockey-reference.com/leagues/NHL_'

# Loop over the NHL Seasons:
for y in years :
year_url = base_year + str(y) + '.html'
response = urllib2.urlopen(year_url)
soup = BeautifulSoup(html_data)

# Divisions changed names after 1993 season:
if int(y) > 1993 :
east, west = soup.find(id="all_standings_EAS"), soup.find(id="all_standings_WES")
else :
east, west = soup.find(id="div_CAM_standings"), soup.find(id="div_WAL_standings")

east_teams, west_teams = [], []
east_teams.append(str(x.get('href')))
west_teams.append(str(x.get('href')))

return east_teams, west_teams

def find_team_players(team_url_suffixes):
"""For each team this function finds all the players who played
for that team. This function returns the links of all these
players in a list.

Parameters
----------

team_url_suffixes (list): list of team url suffixes

Returns
-------

A list of all the players links who played for each team.

"""

base_team = 'http://www.hockey-reference.com'

soup = BeautifulSoup(team_html)

players = soup.find(id='all_roster')

for player in players.find_all('a'):

def parse_player_record(player_url_suffix, year,
positions_considered=['C', 'R', 'L', 'D']):
""" This function parses the player records. It extracts the
player name, position (either C:center, L:left wing,
R:right wing, D:defenseman) and a dataframe that contains the
scoring record of that player for the year.

Parameters
----------

player_url_suffix (str): part of the url that contains the
player formation. The output of find_team_players().

year (str): The NHL season of interest

positions_considered (list): List of positions considered. By default,
we consider all positions (C, R, L, D).

Returns
-------

player_name (string), player_position (string), dataframe of scoring record

"""
# Let's get the full url for each player:
base_player = 'http://www.hockey-reference.com'
player_url_suffix = player_url_suffix[:-5]
gamelog_suffix = '/gamelog/'+ str(year) +'/'
player_link = base_player + player_url_suffix + gamelog_suffix

soup = BeautifulSoup(player_html)

# Find the player's position:
position_pattern = re.compile("Position: .")
pposition = soup.get_text()
matches = re.findall(position_pattern, pposition)
target_positions = ['Position: '+val for val in positions_considered]
player_position = str(matches[0])

# Find the player's name:
title = soup.title.string
id_year = title.find(str(int(year-1)))
player_name = title[0:id_year]
# encode name in utf-8. Important step.
player_name = str(player_name.encode('utf-8'))

# Get the player's scoring record:
if player_position in target_positions :
# Game date:
dates = soup.find_all(attrs={"data-stat": "date_game"})
# Rank of the game:
rank = soup.find_all(attrs={"data-stat": "ranker"})
# Goals scored in the game:
goals = soup.find_all(attrs={"data-stat": "goals"})

dates2 = [str(x.string) for x in dates if str(x.string) !='Date']
rank2 = [int(str(x.string)) for x in rank if str(x.string)!='Rk' ]
goals2 = [int(str(x.string)) for x in goals if str(x.string)!='G']

df = pd.DataFrame({'position':str(matches[0]),
'date':pd.to_datetime(dates2),'game_Rk':rank2, 'goals':goals2})
df['has_goal'] = df['goals'].apply(lambda x: x>=1 )

return player_name, player_position, df
else :
return player_name, player_position, pd.DataFrame()

Here’s an example of such data frame for a particular player. game_Rk is the game rank, goals indicate the number of goals scored in that game, position is the player's position, and has_goal indicates whether or not the player scored at least one goal during the game.

# Extract the scoring record for Alex Galchenyuk from the
player_name, player_position, goals_df = parse_player_record('/players/g/galchal01.html',2016)

print('Player name = {}'.format(player_name))

goals_df.head(10)
http://www.hockey-reference.com/players/g/galchal01/gamelog/2016/
Player name = Alex Galchenyuk 
date game_Rk goals position has_goal
2015-10-07 1 1 Position: C True
2015-10-10 2 0 Position: C False
2015-10-11 3 0 Position: C False
2015-10-13 4 0 Position: C False
2015-10-15 5 0 Position: C False
2015-10-17 6 0 Position: C False
2015-10-20 7 0 Position: C False
2015-10-23 8 0 Position: C False
2015-10-24 9 0 Position: C False
2015-10-27 10 0 Position: C False

In the cell below, I define the function hp() that computes Hp at the player level, and the function z() that computes the critical value of Hp given values of N and n.

Given that the critical value z needs to be calculated for all possible combinations of N and n it could be quite an expensive computational operation. To alleviate the task a bit, I provided a simple interpolation function over a grid of z values. I am not using it in this post, but such interpolation could be useful to speed up the estimation of the critical values. I will leave it to the reader to figure out how good the interpolation is ;) .

def hp_value(N, n, scoring_record):
"""This function computes the value of hp.

Parameters
----------

N (int): Number of games played
n (int): Number of goals scored
scoring_record (list): a list of all the game ranks in which the player
scored at least one goal. That list has to be in ascending order.

Returns
-------

value of H_P (float) as defined by Zhang et al. (2015)
"""
scoring_record = np.insert(scoring_record, 0, 0)
scoring_record_shifted = np.roll(scoring_record, -1)
scoring_record_shifted[-1] = N + 1

# Compute number of games between each game with at least one goal
# by the player:
delta_game = np.asarray(scoring_record_shifted - scoring_record)

# Compute the product log x * x of H_p
logx_x = np.log( delta_game / (N + 1.0) ) * delta_game / (N + 1.0)

sum_logx_x = np.sum(logx_x[np.isfinite(logx_x)])

# Hp:
Hp = 1.0 + sum_logx_x / np.log(n + 1)

return Hp

def hp(df, has_goal_column='has_goal',
game_rk_column='game_Rk'):
"""Compute the entropic clumpiness index H_p (see Zhang et al. 2015)
at the player level. df is a dataframe for a particular player.

Parameters
----------

df (dataframe): player-level dataframe. With at least two columns:
- game_rk_column: game rank (1 to .. N games played)
- has_goal_column: a boolean column that says whether or not
a goal has been scored by that player.

has_goal_column (str): name of the column with the has_goal boolean value.

game_rk_column (str): name of the column containing the game rank

Returns
-------
Returns three values:
N (int): number of games played, n (int) number of games with at least one goal,
Hp (float): entropic clumpiness index H_p for that particular player.
"""

# Number of games played:
N = len(df)
# Number of games with at least one goal:
n = len(df[df[has_goal_column]])

# We'll limit the computation of the H_p statistics to players who scored
# more than 8 goals last season.
if n >= 8:
# Find games with goals:
grk_with_goal = df[df[has_goal_column]][game_rk_column].values

# Compute H_p value:
Hp = hp_value(N, n, grk_with_goal)

return N, n, Hp
else:
return N, n, np.nan

def z(nb_games_with_goals, nb_games, nb_realizations=10000):
""" This function returns the Z value, i.e. the clumpiness.
critical value. This function generates random realizations
of the scoring history under the null hypothesis. The function
returns a dictionary with a series of critical values corresponding
to different percentiles.

Parameters
----------

nb_games_with_goals (int): Number of games with at least one goal.

nb_games (int): Number of games played

nb_realizations (int): Number of simulations to run. By default,
nb_realizations=1000.

Returns
-------

dict with percentiles as keys and critical clumpiness values as values.
Percentiles 5, 10, 90, 95, 98, 99, 99.5 and 99.9 are calculated.
"""

N = nb_games
n = nb_games_with_goals
Hp_vals = []

# Generate simulations:
for x in range(nb_realizations):
grk_with_goal = random.sample(range(int(nb_games + 1))[1:],
int(nb_games_with_goals))
grk_with_goal = np.sort(grk_with_goal)

# Compute Hp:
Hp = hp_value(N, n, grk_with_goal)
Hp_vals.append(Hp)

# Extract values corresponding to each percentile here:
perc = [5, 10, 90, 95, 98, 99, 99.5, 99.9]
perc_vals = np.percentile(Hp_vals,q=perc)

return {p:pv for p, pv in zip(perc, perc_vals)}

def interpolate_z_grid(goals=[1, 3, 5, 7, 9, 11, 13, 15],
games=[5, 10, 15, 20, 25, 30, 35, 40, 50, \
60, 65, 68, 70, 72, 75, 77, 80, 82],
nb_sims=1000, alpha=95):
"""This function returns a 2-D interpolation function of the critical
values generated by the function z() above.

This functions will generate z values at all the possible combinations of
goals and games specified by the user (Note that : games >= goals).

Parameters
----------

goals (list of integers): list of (number of games with at least one goal)
values you want z() to be evaluated for.

games (list of integers): same as goals but with the number of games played

nb_sims (int): Number of simulations per datapoint (call of z() function)

alpha (int): Percentile for critical values of z() function.

Returns
-------

fg, gs_arr, ga_arr

fg: 2-D interpolation function. You can call this function fg(n,N)
on any player statistics and it will return the critical z value.

gs_arr: array of goals used to evaluate z()
ga_arr: array of ga_arr used to evaluate z()
"""
gs_arr=[]
ga_arr=[]
z_arr=[]

for gs in goals :
for ga in games :
if gs < ga :
zp = z(gs, ga, nb_realizations=nb_sims)
gs_arr.append(gs)
ga_arr.append(ga)
z_arr.append(zp[alpha])

fg = interp2d(gs_arr, ga_arr, z_arr, kind='linear')

return fg, gs_arr, ga_arr

## Evaluating Hp and Comparing to Critical Values

In the cell below, we compute Hp, and the critical value for forward Alex Galchenyuk of the Montreal Canadiens:

# Let's compute the value of H_p for Alex Galchenyuk
N, n, Hp = hp(goals_df)
print('(N, n, Hp) = ({}, {} ,{}) '.format(N, n, Hp))

# ...and compare it with the critical value of Hp for the same values
# of N and n:
critical_values = z(n, N)

# dictionary of percentiles and associated critical values:
print(critical_values)
[ 1 11 22 23 24 25 35 38 41 50 52 56 61 62 64 66 67 68 70 74 80 82]
(N,n,Hp) = (82,22,0.111806003198)
{99.5: 0.16758449482852886, 98: 0.14632658975439244, 99: 0.15601626665506121, 5: 0.054961803694521486,
99.9: 0.18685682277569601, 10: 0.061413408763393088, 90: 0.12101946081361747, 95: 0.13299412838401914}

We can see in the cell above that the Hp value is below the 95th percentile. We can't rule out the possibility that the purchasing pattern is random. So we can't confidently say that this player is a streaker.

Now let's automate the process above and repeat for all players who played during the 2015-16 NHL season.

# Get all the teams links for a particular season (year is year end of season)
# Warning: This can take a while to run.

# Let's find all the team urls:

summary_list = []

# Go over each team:

# Find all the players belonging to that team:

# Loop over all players who played in the 2015-2016 season:
player_name, position, goals_df = parse_player_record(player_link, 2016)
N, n = 0, 0
if len(goals_df)> 0:
N,n,Hp = hp(goals_df, has_goal_column='has_goal', game_rk_column='game_Rk')

'player_name', 'position',
'N', 'n', 'Hp']
# Let's save a copy to disk just in case ...
summary_df.to_csv('players_2016_stats.csv')

## Identify the Streakers

In the cell below, I listed which players are above a detection threshold of 95, 98, and 99th percentile:

# Eliminate all players who don't have a value of Hp. Note that we only
# compute Hp for players with at least 8 goals in the season.

# This cell will take a while to run. the calls to the function z() is the
# bottleneck here. I'd recommend using the function fg that is returned
# from interpolate_z_grid() if you want to run this cell on all NHL
# players. (see interpolate_z_grid() above)
#
# Here's an example of how you would use it:
#
# # Create the grid of values:
# z_interp, gs, ga = interpolate_z_grid(nb_sims=1000, alpha=95)
# # Get the 95th percentile H_p value for someone who scored 9
# goals in 46 games.
# z_interp(9,46)
# # The function above can replace calls to z(n,N)
#
filter_to_apply = (pd.isnull(summary_df['Hp']))

summary_df = summary_df[~filter_to_apply]

# H_p value above the 95,98,and 99th percentile
# of the null distribution:
summary_df['sig_99'] = summary_df.apply(lambda row: row['Hp'] >= z(row['n'],row['N'])[99], axis=1)
summary_df['sig_98'] = summary_df.apply(lambda row: row['Hp'] >= z(row['n'],row['N'])[98], axis=1)
summary_df['sig_95'] = summary_df.apply(lambda row: row['Hp'] >= z(row['n'],row['N'])[95], axis=1)

Who are the streakers? Well, let's take a conservative cut and look at the players who are above the 99th percentile of the null distribution. It is interesting to see that two key Buffalo Sabres players, Matt Moulson and Ryan O'Reilly, are found on this list.

# Let's look at players who are above the 99th percentile.
streakers = summary_df[summary_df['sig_99']]

streakers
Out[23]:
team player_link player_name position N n Hp sig_99 sig_98 sig_95
663 /teams/OTT/2016.html /players/t/turriky01.html Kyle Turris Position: C 57 12 0.273827 True True True
734 /teams/BUF/2016.html /players/m/moulsma01.html Matt Moulson Position: L 81 8 0.356719 True True True
737 /teams/BUF/2016.html /players/o/oreilry01.html Ryan O'Reilly Position: C 71 19 0.178175 True True True
1066 /teams/CBJ/2016.html /players/s/saadbr01.html Brandon Saad Position: L 78 26 0.136983 True True True

Do these results make sense? One way to validate the legitimacy of these results is to look at the scoring record of these streakers. In the cell below, we showcase the scoring record of each one of these streakers using a simple raster plot. Each vertical blue line represents a game in which the player scored at least one goal. Since players miss games due to various reasons (e.g injuries, benched, etc), it's not a surprise to see game_Rk not going all the way to 82 games, which is the duration of an entire NHL season.

#Let's display the scoring record of each player using
# a simple raster plot

for index,row in streakers.iterrows():
print('')
print('')
print(player_name)

goals_df.plot(x='game_Rk',y='has_goal',kind='bar',figsize=(16,0.5),color='blue',legend=None)
plt.yticks([],[])
plt.show()
http://www.hockey-reference.com/players/t/turriky01/gamelog/2016/
Kyle Turris 

http://www.hockey-reference.com/players/m/moulsma01/gamelog/2016/
Matt Moulson  

http://www.hockey-reference.com/players/o/oreilry01/gamelog/2016/
Ryan O'Reilly  

http://www.hockey-reference.com/players/s/saadbr01/gamelog/2016/Brandon Saad


Clearly these scoring patterns are far from random. A natural consequence of being a player experiencing a hot streak is that you will also experience extended periods with no goals. One of the most striking examples of this behavior is Ryan O'Reilly, who had a drought of 24 games (42-65) before scoring in 3 of the last 6 games he played last season. Another great example is Brandon Saad, who had 3 droughts of 9 games or more during the season. These players all experienced very hot streaks followed by prolonged droughts.

## Conclusion

I presented a simple metric, Hp, to automatically identify, with high confidence, the "bingers" or "streakers" in a dataset. The metric is intuitive and robust sets of simulations can be constructed to uncover the range of Hp values that are consistent with the null hypothesis.

As for the hockey players findings, it is interesting to note that contrary to what you may hear on TV, we can confidently say that less than about ten players last season were "streakers". Two key players of the Buffalo Sabres were "streakers" last season.

As you can see, there are definite parallels between the behaviors of players who fall into the "hot hand" phenomenon and consumers who exhibit binge buying behavior. Once you get into the habit of identifying these patterns amongst customers in your company, you will have the ability to leverage opportunities. This is an important aspect we are discussing in our Customer Lifetime Value Playbook available through the DataScience.com Platform.

Identifying high-probability binge buyers will enable your organization to maximize revenue by implementing effective upselling campaigns and leveraging targeted coupons and email blasts at the most opportune times.

Want access to more content around CLV like this one? Request a demo today of the DataScience.com Platform.