This tutorial provides a step-by-step guide to predicting churn using Python. A boosting algorithm is trained on historical user data to make predictions. This type of pipeline is a basic predictive technique that can serve as a foundation for more complex models.

What is Churn and Why Does it Matter?

Churn is defined slightly differently by each organization and product. Generally, customers who stop using a product or service for a given period of time are referred to as churners. As a result, churn rate is one of the most important Key Performance Indicators (KPIs) for a product or service. A full customer lifecycle analysis requires looking at retention rates in order to better understand the health of the business or product.

In the gaming industry, churn comes in different flavors and at different speeds. For instance, in games where players must be engaged on a day-to-day basis, a player who doesn't log in within 24 hours may be considered a churner. On the other hand, in games where players aren't necessarily playing every day, the time frame that defines churn is much longer. It is important for a predictive pipeline to be robust enough to handle such variance.

From a machine learning perspective, churn can be formulated as a binary classification problem. Although there are other approaches to churn prediction (for example, survival analysis), the most common solution is to label “churners” over a specific period of time as one class and users who stay engaged with the product as the complementary class.  
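The binary labeling described above can be sketched in a few lines. The dates, user IDs, and the 30-day window below are illustrative assumptions, not values from the tutorial's dataset:

```python
from datetime import date, timedelta

# Hypothetical last-activity dates per user (user_id -> last seen).
last_active = {1: date(2024, 1, 2), 2: date(2024, 2, 20), 3: date(2024, 3, 1)}

# A user is labeled a churner (1) if inactive longer than the churn
# window relative to a chosen snapshot date; otherwise engaged (0).
snapshot = date(2024, 3, 5)
churn_window = timedelta(days=30)
labels = {uid: int(snapshot - seen > churn_window) for uid, seen in last_active.items()}
print(labels)  # {1: 1, 2: 0, 3: 0}
```

Both the snapshot date and the window length are product decisions, which is why the churn definition must be revisited as the product evolves.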

Data Attributes and Labels 

The initial ingredient for building any predictive pipeline is data. For churn specifically, historical data is captured and stored in a data warehouse, depending on the application domain. The process of churn definition and establishing data hooks to capture relevant events is highly iterative. It is very important to keep this in mind as the initial churn definition, with its associated data hooks, may not be applicable or relevant anymore as a product or a service matures. That’s why it’s essential for data scientists to not only monitor the performance of the predictive pipeline over time but also to pay close attention to the alignment of churn definition with the product’s changes as they might affect who the churners are.

The specific attributes used in a churn model are highly domain dependent. However, broadly speaking, the most common attributes capture user behavior with regards to engagement level with a product or service. This can be thought of as the number of times that a user logs into her/his account in a week or the amount of time that a user spends on a portal. In short, frequency and intensity of usage/engagement are among the strongest signals to predict churn.
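As a minimal sketch of turning frequency and intensity of usage into features, the snippet below aggregates a toy session log (the user IDs and session durations are invented for illustration):

```python
from collections import defaultdict

# Hypothetical session log over one week: (user_id, session_minutes) events.
sessions = [(1, 30), (1, 45), (2, 10), (1, 20), (2, 5)]

# Frequency: number of sessions; intensity: total minutes spent.
freq = defaultdict(int)
minutes = defaultdict(int)
for uid, dur in sessions:
    freq[uid] += 1
    minutes[uid] += dur

features = {uid: (freq[uid], minutes[uid]) for uid in freq}
print(features)  # {1: (3, 95), 2: (2, 15)}
```

In practice these aggregations would run in the data warehouse, but the per-user (frequency, intensity) pairs are exactly the kind of rows fed to the model later on.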

Feature Engineering

This is where the domain knowledge of a data scientist comes into play! Feature engineering is the set of transformations and data manipulations that expose the same underlying information in new forms; philosophically speaking, it presents another perspective on a single truth. For instance, in the churn definition we pointed out that user behavior is captured on a temporal basis, i.e. whether the user leaves a website, game, or service over a period of time. What if one changes this temporal basis to something more specific? For instance, level-based churn corresponds to players who never level up in a game. This makes the churn definition much more robust, since the time it takes players to level up at the beginning of a game is usually much shorter than at the end-game levels.
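A level-based churn label can be derived from a progression log. The player names and level sequences below are made up for illustration:

```python
# Hypothetical level progression log: player -> levels reached, in order.
progress = {
    "p1": [1, 2, 3],   # leveled up twice
    "p2": [1],         # never leveled up -> level-based churner
    "p3": [1, 1, 2],
}

# Level-based churn: flag players whose recorded progression never
# advanced past their starting level.
level_churn = {p: int(max(levels) == levels[0]) for p, levels in progress.items()}
print(level_churn)  # {'p1': 0, 'p2': 1, 'p3': 0}
```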

Feature Importance  

One of the key purposes of churn prediction is to find out which factors increase churn risk. The tree below is a simple demonstration of how different features (in this case, three features: 'received promotion,' 'years with firm,' and 'partner changed job') can determine employee churn in an organization. Such transparency in machine learning models can be helpful when product managers are looking for a rule of thumb to base their decisions on.


Tree-based machine learning models, including the boosting model discussed in this article, make it easy to visualize feature importance. For instance, the code snippet below shows how a single tree from a simple xgboost model is visualized using xgboost's plot_tree function.

# plot decision tree
from numpy import loadtxt
from xgboost import XGBClassifier
from xgboost import plot_tree
import matplotlib.pyplot as plt
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
y = dataset[:,8]
# fit model on training data
model = XGBClassifier()
model.fit(X, y)
# plot single tree
plot_tree(model)
plt.show()

Boosting Models

Boosting models (including the XGBoost model used in this tutorial) are ensembles of weak learners, in this case decision trees. Each weak learner only needs to perform slightly better than random guessing; the ensemble of them forms a strong learner. We train our XGBoost classifier by feeding in the previously discussed user behavioral data as X_train and the churn labels as y_train.

from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
# fit model on training data
model = XGBClassifier()
model.fit(X_train, y_train)
# predict churn probabilities for the test data
y_pred = model.predict_proba(X_test)[:, 1]
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Note that accuracy is used in the snippet above; as discussed in the ROC Curves section below, we want to leverage the power of ROC curves to avoid being misled when there is an unbalanced representation of labels in our data.

from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test,y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
print("ROC AUC ", roc_auc)

In this case, we are printing the AUC, or Area Under the Curve, of the ROC curve, which is a common way to evaluate binary classifier performance.

Stratified Sampling

The most common approach to training machine learning models is to randomly sample the data into various cross-validation sets. This approach has proven very productive as long as the data labels are relatively balanced. However, when random sampling is applied to an unbalanced dataset, the imbalance carries over into the samples. As a result, the predictive model has a hard time finding patterns that distinguish the two classes and setting boundaries between them, simply because one class is underrepresented. Stratified sampling can help overcome this problem: the label of each data point is taken into account during sampling, so the class proportions are preserved across the cross-validation sets. If the imbalance itself needs to be leveled out, the scikit-learn-compatible imbalanced-learn package can be used.

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)
for train_index, test_index in sss.split(X, y):
    # each split preserves the 50/50 class balance in both halves
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
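When the imbalance itself must be leveled out rather than preserved, the simplest remedy is random oversampling of the minority class. The sketch below implements the idea with plain NumPy on a toy array; imbalanced-learn's RandomOverSampler offers the same technique as a reusable transformer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unbalanced dataset: 8 negatives, 2 positives.
X = np.arange(10).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 2)

# Random oversampling: resample the minority class with replacement
# until both classes have equal counts.
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=(y == 0).sum() - minority.size, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
print(np.bincount(y_bal))  # [8 8]
```

Note that oversampling should be applied only to the training folds, never to the test set, or the evaluation will be optimistic.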

The Curse of Accuracy with Unbalanced Datasets

In most churn problems, the number of churners far exceeds the number of users who continue to stay in the game. This leaves the labeled dataset unbalanced in the number of samples from each class. A caveat when learning patterns from unbalanced datasets concerns the predictive model's performance metrics. In many instances, accuracy is the first metric looked at to judge a model's performance. Unfortunately, on unbalanced datasets this measure can be misleading. For instance, imagine a dataset where 99% of users are labeled as churners. By just flipping a one-sided coin that always predicts the majority label, one can build a "model" that is 99% accurate! Strange, right!?
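The accuracy paradox above is easy to demonstrate with the toy 99/1 split from the text:

```python
# 99 churners (label 1) and 1 retained user (label 0): a constant
# "everyone churns" predictor is 99% accurate while learning nothing.
y_true = [1] * 99 + [0]
y_pred = [1] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.99

# The same predictor never identifies the minority class at all,
# which is what a ROC curve or contingency table would expose.
minority_recall = sum(t == p == 0 for t, p in zip(y_true, y_pred)) / 1
print(minority_recall)  # 0.0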

There are multiple approaches to dealing with this issue. To start, one can look for ways of gathering and preprocessing the data to better balance the classes. Most machine learning models in their naive form assume relatively balanced classes, so it is important to understand the underlying assumptions of any specific model, whether about the distribution of attributes within classes or the relative distribution of class labels. Earlier in this tutorial, we assumed the data source was fixed and dealt with unbalanced labels through stratified sampling and boosting models.

ROC Curves 

Receiver Operating Characteristic (ROC) curves are a data scientist's best friend and sit at the top of the toolbox. The ROC curve is a fundamental tool for diagnostic test evaluation: the true positive rate (sensitivity) is plotted as a function of the false positive rate (1 - specificity) for different cut-off points of a parameter.


Once we have the right set of metrics to measure the performance of our predictive model, we can dive right into building it. As the saying goes, if you can measure it, you can manage it! 

For a model that is performing well, the Area Under the Curve (AUC) of the ROC curve will be higher than for a model that is performing poorly. From the figure below, the behavior and performance of different models can be interpreted at a glance from their ROC curves.


Contingency/Confusion Matrix

Another powerful tool that can be used alongside ROC curves is a contingency table (or confusion matrix, or contingency matrix…you get the idea! We'll be referring to it as a contingency table in this article for consistency's sake). A contingency table clearly shows the model's performance broken down into true positives, true negatives, false positives, and false negatives. Depending on the application domain, a predictive pipeline may be able to tolerate a fair number of false negatives (users who were predicted to stay engaged but left). In other cases, it might be more critical to capture as many churners as possible at the cost of more false positives (users who were predicted to churn but stayed engaged).

The tradeoff between false negatives and false positives is a long-standing balancing act. There is no general prescription, as the right tradeoff is highly dependent on the problem domain. For instance, take smoke detectors. Since human lives are at stake, one wouldn't mind a false alarm every now and then as long as smoke is correctly detected at the right times (high false positive, low false negative). On the other hand, for something like traffic tickets, where every ticket is costly to the driver, one would only want an alert when the model is highly certain that a detection has happened, such as a driver running a red light (low false positive, high false negative). In short, depending on the situation, one might choose to optimize for one at the expense of the other.
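For a probabilistic classifier, this tradeoff is controlled by the decision threshold applied to the predicted churn probabilities. The scores and labels below are toy values chosen to illustrate the effect:

```python
# Toy churn scores and true labels (1 = churner).
scores = [0.1, 0.4, 0.45, 0.6, 0.9]
y_true = [0, 0, 1, 1, 1]

def fp_fn(threshold):
    """Count false positives and false negatives at a given threshold."""
    preds = [int(s >= threshold) for s in scores]
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, y_true))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, y_true))
    return fp, fn

print(fp_fn(0.3))  # (1, 0): aggressive, catches every churner, one false alarm
print(fp_fn(0.5))  # (0, 1): conservative, no false alarms, misses one churner
```

Sweeping this threshold across its range is exactly what traces out the ROC curve discussed above.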

from sklearn import metrics
import matplotlib.pyplot as plt

def printContingencyTable(y_test, y_pred, imagename):
    confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
    display = metrics.ConfusionMatrixDisplay(confusion_matrix)
    display.plot()
    plt.title('Contingency table')
    plt.savefig(imagename)

Production Pipeline

Once an initial model prototype surpasses the desired performance thresholds, it is ready for production. An example production pipeline can be thought of as the following:

  • Data ingestion from a data warehouse (e.g. Redshift)
  • Model training and updating (e.g. on EC2)
  • Testing and reporting the most recent model performance to stakeholders
  • Loading the flagged users (potential churners) back into the database

Note that the actions after step 4 were not discussed in this tutorial. What should be done with churn-tagged users remains an open question for product owners. A scientific approach is to run A/B tests on various segments of churners and optimize for product and experience changes that increase user retention.
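The four pipeline stages can be sketched as plain functions wired together; every name, the in-memory "warehouse" dict, and the placeholder metrics below are illustrative stand-ins for real services such as Redshift and EC2, not an actual implementation:

```python
# A minimal sketch of the four-stage pipeline with stand-in components.
def ingest(warehouse):
    # stage 1: pull fresh behavioral rows from the warehouse
    return warehouse["events"]

def train(rows):
    # stage 2: stand-in for refitting the classifier on fresh data
    return {"threshold": 0.5}

def evaluate(model, rows):
    # stage 3: placeholder metric reported to stakeholders
    return {"auc": 0.5}

def flag_churners(model, rows, warehouse):
    # stage 4: write predicted churners back to the database
    flagged = [r["user_id"] for r in rows if r["score"] >= model["threshold"]]
    warehouse["flagged"] = flagged
    return flagged

warehouse = {"events": [{"user_id": 1, "score": 0.8}, {"user_id": 2, "score": 0.2}]}
rows = ingest(warehouse)
model = train(rows)
report = evaluate(model, rows)
print(flag_churners(model, rows, warehouse))  # [1]
```

In a real deployment each stage would be a scheduled job with its own monitoring, but the data flow between stages is the same.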


Seyed Sajjadi

Seyed Sajjadi is a data scientist at Electronic Arts (EA). His general research interest lies primarily in the theory and application of artificial intelligence, specifically in cognitive architectures, machine learning, computer vision, and robotics. With a demonstrated history of working in academia and industry, Seyed has published multiple publications in peer-reviewed journals and conferences. He has designed and deployed various machine learning pipelines to help EA better understand and predict player behaviors. These include a lifetime value model for evaluating cohort quality of players, a market basket model and a recommendation engine to create a more personalized experience for players, and clustering and segmentation analyses to better understand user engagement with live services.
