Defining the Churn Event


The first step in building a churn model is to clearly define the churn event: the criterion that indicates a customer has churned. The definition depends on how your business works. For instance, if you know that most customers who go 60 days without a purchase don’t return, you can define churn as 60 days of inactivity and use unique user IDs and purchase timestamps to pinpoint which customers have churned.
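As a minimal sketch of the 60-day example above, the following uses only the Python standard library. The purchase log, the function name `churned_users`, and the `as_of` snapshot date are all hypothetical and chosen for illustration; the 60-day window is the one from the example, not a recommended default.

```python
from datetime import date, timedelta

# Hypothetical purchase log: (user_id, purchase_date) pairs.
purchases = [
    ("u1", date(2023, 1, 5)),
    ("u1", date(2023, 3, 20)),
    ("u2", date(2023, 1, 10)),
    ("u3", date(2023, 2, 27)),
]


def churned_users(purchases, as_of, inactivity_days=60):
    """Return the user IDs whose latest purchase is more than
    `inactivity_days` days before the snapshot date `as_of`."""
    last_purchase = {}
    for user_id, when in purchases:
        # Keep only the most recent purchase per user.
        if user_id not in last_purchase or when > last_purchase[user_id]:
            last_purchase[user_id] = when
    cutoff = as_of - timedelta(days=inactivity_days)
    return {u for u, when in last_purchase.items() if when < cutoff}


churned = churned_users(purchases, as_of=date(2023, 4, 1))
print(churned)
```

Here the label depends on the snapshot date as much as on the window: the same customer can be active at one snapshot and churned at a later one, so both should be recorded alongside the target variable.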

Actually defining the churn event, however, has its challenges. Sometimes, customer status is already explicitly defined in your data schema. In other cases, a customer might cancel and reactivate service as a way to save money or delay a delivery, so it’s not always clear whether the customer churned or not. So, if you’re presented with a database of timestamps of cancellations, activations, and renewals, how can you properly define a churn event? In this piece, we walk through the process of defining the churn target variable.

Problem Setting

Modeling churn can be highly dependent on your business model and data schema. Start by carefully considering the data limitations or rules that would make the target variable more meaningful. Below are some of the items that can have a big impact on the model building process:

  • Handling of customer renewals
  • Data filters
  • Segmentation
  • Types of churn
  • Types of cancellation events

Churn Event: Accounting for Renewals

In many cases, you must define the churn event based on the absence of some event, or on a period of inactivity that lasts longer than some time horizon t. For example, customers may use cancellation and reactivation as a way to skip renewal opportunities and save money. In a case like this, a cancellation alone does not signal churn, so we need to choose the time horizon t. We will cover two distinct methods for choosing the time horizon for our churn target, both of which are heuristic methods that learn the optimal churn cutoff from historical customer data.

Method One, as described below, is suitable for a contractual setting in which renewals occur frequently. This method relies on the velocity of renewals as a function of time to define the time threshold for churn.
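The details of Method One are not reproduced in this section, so the following is only one possible reading of "velocity of renewals as a function of time," not the author's procedure. It assumes a hypothetical list of observed gaps (in days) between a lapse and the eventual renewal, and picks the smallest t by which a chosen share of those renewals had already happened. The function name `renewal_cutoff` and the `coverage` parameter are illustrative assumptions.

```python
def renewal_cutoff(days_to_renewal, coverage=0.9):
    """Smallest t (in days) by which at least `coverage` of the
    observed renewals had occurred. Illustrative heuristic only."""
    if not days_to_renewal:
        raise ValueError("need at least one observed renewal")
    ordered = sorted(days_to_renewal)
    for count, t in enumerate(ordered, start=1):
        # count / len(ordered) is the cumulative share of renewals by day t.
        if count / len(ordered) >= coverage:
            return t


# Hypothetical gaps between lapse and renewal, in days.
gaps = [1, 2, 2, 3, 5, 7, 8, 10, 14, 45]
t = renewal_cutoff(gaps, coverage=0.9)
print(t)
```

Once renewals have slowed to a trickle past t, a customer who is still inactive is unlikely to come back, which is the intuition behind treating t as the churn threshold.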

Method Two is based on optimizing precision and recall, together with cross-validation. For example, if the threshold t is too low, then many of the users we label as churned will actually return, which creates a lot of false positives and lowers precision. If the chosen value of t is too large, then we'd say that very few customers have actually churned, so the model recall will be low. The value that optimizes this tradeoff can be determined by finding the t that maximizes the F1 score of a classifier...
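The precision and recall tradeoff can be sketched as a simple threshold search. This is a simplified illustration under stated assumptions, not the classifier from the full notebook: each customer is a hypothetical pair of (days inactive at a snapshot, whether they returned later), the "classifier" is just the rule "inactive for at least t days," and cross-validation is omitted.

```python
def f1_for_threshold(customers, t):
    """F1 score of the rule 'churned if inactive for at least t days'.
    Ground truth: a customer churned if they never returned later."""
    tp = fp = fn = 0
    for days_inactive, returned_later in customers:
        predicted_churn = days_inactive >= t
        actually_churned = not returned_later
        if predicted_churn and actually_churned:
            tp += 1
        elif predicted_churn:
            fp += 1  # labeled churned, but the customer came back
        elif actually_churned:
            fn += 1  # churned, but t was too large to catch it
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(customers, candidates):
    """Candidate t with the highest F1 score (first one wins ties)."""
    return max(candidates, key=lambda t: f1_for_threshold(customers, t))


# Hypothetical data: (days inactive at snapshot, returned later?)
customers = [
    (5, True), (10, True), (20, True), (35, True), (40, False),
    (50, True), (65, False), (70, False), (90, False), (120, False),
]
candidates = [15, 30, 45, 60, 75, 90]
best_t = best_threshold(customers, candidates)
print(best_t)
```

On this toy data, small values of t pick up returning customers as false positives and large values miss real churners, so the F1 score peaks at an intermediate threshold. In practice the search would be repeated across cross-validation folds rather than run once on a single sample.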
