In a competitive world, the key to business success is to understand enough about your customers' behavior and preferences so that you can provide a personalized service to both your prospective and existing customer base. Using customer behavior analytics techniques, you can predict how a customer will respond to a business situation, thereby giving you the information you need to approach them. These techniques can be broadly grouped into two buckets:

  1. Regression problems: The response to this type of problem is a numerical value; for example, based on customer demographics and past purchasing patterns, how much will the customer spend in a particular instance?
  2. Classification problems: The response to this type of problem is categorical; for example, based on the customer demographics and his or her past experience with the product/service, will the customer stay or leave the product/service?

Predicting whether a customer will stop using your product or service is an important component of customer behavior analytics called churn prediction. In this post, I will analyze two aspects of churn prediction:

  1. For the given scenario, which factors are mainly responsible for customer churn?
  2. Given the data pertaining to a new customer, how do you predict whether the customer will churn or stay, and what are the associated probabilities?

The Business Problem: Predicting Churn at a Telecom Service Provider

The telecom business is challenged by frequent customer churn due to several factors related to service and customer demographics. The dataset we'll use in our analysis includes a list of service-related factors about existing customers and information about whether they have stayed or left the service provider. Our objective is to understand which of the factors contribute most to customer churn and to predict which customers will potentially churn based on service-related factors.

About the Dataset

The dataset used for the analysis can be downloaded here. It consists of information for 5,000 customers and includes independent variables such as account length, number of voicemail messages, total daytime charge, total evening charge, total night charge, total international charge, and number of customer service calls. 

The dependent variable in the dataset is whether the customer churned or not, which is indicated by a 1 for "yes" and 0 for "no." 

What is Discriminant Analysis?

In order to uncover which variables are responsible for churn and predict whether a customer will churn or not, we will use discriminant analysis. 

Discriminant analysis is a segmentation tool. It segments groups in a way as to achieve maximum separation between them. This technique makes use of the information provided by the X variables to achieve the clearest possible separation between two groups (in our case, the two groups are customers who stay and customers who churn). Below is our formula: 

D= b0 + b1X1 + b2X2 + .. bnXn 

Here, D is the discriminant score, b is the discriminant coefficient, and X1 and X2 are independent variables. 

The discriminant coefficient is estimated by maximizing the ratio of the variation between the classes of customers and the variation within the classes. In other words, points belonging to the same class should be close together, while also being far away from the other clusters.

Discriminant Analysis: A High-Level Approach 

Below, I've broken down the steps we will be following in order to predict customer churn, starting with data preparation and ending with validating the accuracy of our model. 

download.png

Let's get started.

Formulating the Problem and Preparing the Data

To begin, we need to read the CSV file in R and convert the target variables into categorical variables. For the purpose of discriminant analysis, the Y variable has to be a categorical variable (meaning it is one of a limited number of possible values).


## Setting the working directory and reading the source file

setwd("C:/Users/Sowmya CR/Google Drive/datascience_blog")

data=read.csv("churn2.csv")

 

## Converting the target variable to categorical variable

data$churn=factor(data$churn)

str(data)

    ## 'data.frame':    5000 obs. of  8 variables:
##  $ account_length               : int  128 107 137 84 75 118 121 147 117 141 ...
##  $ number_vmail_messages        : int  25 26 0 0 0 0 24 0 0 37 ...
##  $ total_day_charge             : num  45.1 27.5 41.4 50.9 28.3 ...
##  $ total_eve_charge             : num  16.78 16.62 10.3 5.26 12.61 ...
##  $ total_night_charge           : num  11.01 11.45 7.32 8.86 8.41 ...
##  $ total_intl_charge            : num  2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
##  $ number_customer_service_calls: int  1 1 0 2 3 0 3 0 1 0 ...
##  $ churn                        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

Now that we've converted our target variables, let's find the baseline churn rate for the dataset:


    prop.table((table(data$churn)))

## 
##      0      1 
## 0.8586 0.1414

As you can see here, the baseline churn rate is 14.14%, which is an indicator that the dataset is unbalanced — there are more 0s than 1s. Subsequently, this will have an impact on how we interpret model performance measures.

Exploratory Data Analysis 

Using ggplot2, I will explore our dataset and determine the impact our variables have on churn, starting with account length. 


    ggplot(data, aes(x=churn, y=account_length, fill=churn)) + 
    geom_boxplot()

discriminant-analysis-3.png

As you can see in our box plot, account length does not seem to have an influence on customer churn. Now, let's look at voicemails: 


    ggplot(data, aes(x=churn, y=number_vmail_messages, fill=churn)) + 
geom_boxplot()

 discriminant-analysis-4.png

The number of voicemail messages seems to be lower for customers who churn compared to those who don't.


    ggplot(data, aes(x=churn, y=total_day_charge, fill=churn)) + 
    geom_boxplot()

discriminant-analysis-5.png 

Generally, tariffs for day, evening, night, and international calls are higher for customers who churn compared to those who don't. This could indicate that customers who churn are not happy with the amount of money they are paying for their plan. 


    ggplot(data, aes(x=churn, y=number_customer_service_calls, fill=churn)) + 
    geom_boxplot()

 download-5.png

Here, we see that the number of customer service calls made to customers who churn is relatively high. This indicates that customers who have churned have tried contacting customer service — but might have not received a satisfactory resolution to their issue.

Using the assumptions we've derived from our exploratory data analysis, we will dive deeper into our variables using discriminant analysis.

Since the data have variables that are in various dimensions, it is advisable to scale them so the range of each variable does not have an influence on the discriminant coefficients. After the dataset is scaled, we will split the dataset into training and test datasets. We will build the model based on the training dataset and test the model performance using the test dataset.


    ## Scale X variables
x <- subset(data,select=c(1:7))
scaled_x=scale(x)
data1=cbind(data[8],scaled_x)


## Split into training and test sets
library(caTools)

    set.seed(123)
split = sample.split(data1$churn, SplitRatio = 0.7)
traindata = subset(data1, split == TRUE)
testdata = subset(data1, split == FALSE)

#### Check if distribution of partition data is correct Testing dataset
prop.table((table(traindata$churn)))

    ## 
##         0         1 
## 0.8585714 0.1414286

    prop.table((table(testdata$churn)))

    ## 
##         0         1 
## 0.8586667 0.1413333

 

Step One: Test the Significance of the Discriminant Function Using MANOVA

To test the significance of the discriminant function, we will use multivariate analysis of variance (MANOVA). Here's the approach we will take:

    1. The null hypothesis of MANOVA is that all the means of the independent variables are equal, which implies that the independent variables are not differentiators of the group.
    2. The alternative hypothesis is that at least one independent variable has a different mean or, in other words, a significant differentiator. 

head(traindata)

    ##    churn account_length number_vmail_messages total_day_charge
## 1      0      0.6988716             1.2730178       1.57391660
## 3      0      0.9256029            -0.5724919       1.17116913
## 6      0      0.4469479            -0.5724919       0.80007390
## 7      0      0.5225250             1.1991974       0.70293426
## 9      0      0.4217555            -0.5724919       0.07862111
## 10     0      1.0263724             2.1588624       1.45276492
##    total_eve_charge total_night_charge total_intl_charge
## 1       -0.06384268         0.87619875       -0.09549925
## 3       -1.57192653        -0.74666192        0.69590136
## 6        0.39463343         0.07136541       -1.43685621
## 7        2.92439754         0.24288727       -0.99420841
## 9        2.98723437         0.30445922       -0.56497419
## 10       0.42256091         2.49466143        0.33373498
##    number_customer_service_calls
## 1                      -0.436632
## 3                      -1.202116
## 6                      -1.202116
## 7                       1.094336
## 9                      -0.436632
## 10                     -1.202116


    X1=cbind(as.matrix(traindata[,2:8]))
Y1=as.vector(traindata[,1])
Manova=manova(X1~Y1)
summary(Manova, test = "Wilks")

    ##             Df   Wilks approx F num Df den Df    Pr(>F)    
## Y1           1 0.89534   58.315      7   3492 < 2.2e-16 ***
## Residuals 3498                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As you see above, the Wilks' lambda for MANOVA is closer to 1, indicating that the extent of discrimination in the model is relatively low. But the p-value is highly significant, indicating that the null hypothesis cannot be accepted. This implies that the discriminant model is highly significant.

Step 2: Develop the Fisher Discriminant Function 

Now we need to identify a combination of features that separates our customers that are likely to churn from those who are not. I'll be using the packages DiscriMiner and MASS to do so. 


library(DiscriMiner)
library(MASS)

    discPower(X1,Y1)

    ##                               correl_ratio wilks_lambda F_statistic
## account_length                 0.000069076    0.9999309   0.2416445
## number_vmail_messages          0.008786183    0.9912138  31.0064957
## total_day_charge               0.036972676    0.9630273 134.2956924
## total_eve_charge               0.005896454    0.9941035  20.7481365
## total_night_charge             0.002381859    0.9976181   8.3516345
## total_intl_charge              0.004819239    0.9951808  16.9393317
## number_customer_service_calls  0.043692063    0.9563079 159.8175953
##                                    p_value
## account_length                6.230517e-01
## number_vmail_messages         2.765485e-08
## total_day_charge              0.000000e+00
## total_eve_charge              5.416984e-06
## total_night_charge            3.877117e-03
## total_intl_charge             3.948399e-05
## number_customer_service_calls 0.000000e+00

    ##Fischer discriminant functions
desDA(X1,Y1)

Since there are only two values for the Y variable, there is only one discriminant function DF1. The coefficients of the discriminant function can be seen below:


    ## 
## Descriptive Discriminant Analysis
## ---------------------------------
## $power      discriminant power
## $values     table of eigenvalues
## $discrivar  discriminant variables
## $discor     correlations
## $scores     discriminant scores
## ---------------------------------
## 
## $power
##                                cor_ratio        wilks_lamb     
## account_length                   0.00006907600F    0.99993092400
## number_vmail_messages            0.00878618268F    0.99121381732
## total_day_charge                 0.03697267618    0.96302732382
## total_eve_charge                 0.00589645399F    0.99410354601
## total_night_charge               0.00238185881F    0.99761814119
## total_intl_charge                0.00481923872F    0.99518076128
## number_customer_service_calls    0.04369206260    0.95630793740
##                                F_statistic      p_values       
## account_length                   0.24164453981    0.62305169936
## number_vmail_messages           31.00649574143F    0.00000002765F
## total_day_charge               134.29569242558    0.0000000000F0
## total_eve_charge                20.74813648496    0.00000541698F
## total_night_charge               8.35163453174    0.00387711739F
## total_intl_charge               16.93933172621    0.00003948399F
## number_customer_service_calls  159.81759534952    0.0000000000F0
## 
## 
## $values
##      value    proportion  accumulated
## DF1    0.103  100.000     100.000    
## 
## 
## $discrivar
##                                     DF1
## constant                        0.02478
## account_length                  0.05173
## number_vmail_messages          -0.30702
## total_day_charge                0.62922
## total_eve_charge                0.28724
## total_night_charge              0.16000
## total_intl_charge               0.25025
## number_customer_service_calls   0.70106
## 
## 
## $discor
##                             DF1
## account_length          0.02569
## number_vmail_messages  -0.28974
## total_day_charge        0.59435
## total_eve_charge        0.23736
## total_night_charge      0.15086
## total_intl_charge       0.21458
## 
## 
## $scores
##           z1
## 1    0.45226
## 3   -0.25423
## 6   -0.35048
## 7    1.52316
## 9    0.73109
## 10   0.09042
## ...

If we sort the X variables by descending order of the coefficients, we are able to understand the influence of each X variable on differentiating the Y variable:

constant 0.02478 
number_customer_service_calls 0.70106 
total_day_charge 0.62922 
total_eve_charge 0.28724 
total_intl_charge 0.25025 
total_night_charge 0.16 
account_length 0.05173 
number_vmail_messages -0.30702

In terms of magnitude, the number of customer service calls has the most impact, whereas the account length has the least impact. Number of voice mail messages has a negative sign indicating that it has a negative impact on churn. In other words, as the number of voice mail messages increases, the probability of churn decreases.

Step 3: Differentiation of Individual Independent Variables and Wilks' Lamba

The p-value is not significant for account length, but it highly significant for the other X variables. This indicates that all X variables except account length are excellent predictors in terms of differentiating our customer groups.

Step 4: Correlation Between Discriminant Function and Independent Variable 

The relative importance of the X variables can be determined by the correlation ratio of each of the variables. The number of customer service calls emerges as the most significant variable in its ability to differentiate the groups.

Step 5: Classify Records Based on Discriminant Analysis of X Variables and Predict Y Variables for the Test Set

Since the discriminant model is significant, we will use it to classify records as belonging to either customers who will churn or those who will not churn depending on the X variables. We will use the lda() function in R to classify records based on value of X variables and predict the class and probability for the test set.


    ##    churn account_length number_vmail_messages total_day_charge
## 1      0      0.6988716             1.2730178       1.57391660
## 3      0      0.9256029            -0.5724919       1.17116913
## 6      0      0.4469479            -0.5724919       0.80007390
## 7      0      0.5225250             1.1991974       0.70293426
## 9      0      0.4217555            -0.5724919       0.07862111
## 10     0      1.0263724             2.1588624       1.45276492
##    total_eve_charge total_night_charge total_intl_charge
## 1       -0.06384268         0.87619875       -0.09549925
## 3       -1.57192653        -0.74666192        0.69590136
## 6        0.39463343         0.07136541       -1.43685621
## 7        2.92439754         0.24288727       -0.99420841
## 9        2.98723437         0.30445922       -0.56497419
## 10       0.42256091         2.49466143        0.33373498
##    number_customer_service_calls
## 1                      -0.436632
## 3                      -1.202116
## 6                      -1.202116
## 7                       1.094336
## 9                      -0.436632
## 10                     -1.202116

    sublda=lda(churn~.,data = traindata)
sublda

    ## Call:
## lda(churn ~ ., data = traindata)
## 
## Prior probabilities of groups:
##         0         1 
## 0.8585714 0.1414286 
## 
## Group means:
##   account_length number_vmail_messages total_day_charge total_eve_charge
## 0   -0.002297854             0.0520458       -0.0936634      -0.03051636
## 1    0.021527668            -0.2178557        0.4578240       0.18870353
##   total_night_charge total_intl_charge number_customer_service_calls
## 0        -0.02835861       -0.03201133                   -0.09579425
## 1         0.11297312        0.16979135                    0.49432007
## 
## Coefficients of linear discriminants:
##                                       LD1
## account_length                 0.05172731
## number_vmail_messages         -0.30702403
## total_day_charge               0.62921658
## total_eve_charge               0.28723614
## total_night_charge             0.15999506
## total_intl_charge              0.25025370
## number_customer_service_calls  0.70105887

Let us review the output from the lda() function:

download-6.png

The difference in group means is highest for number of customer service calls and lowest for account length. This gives us insight into which factors contribute most to the discrimination between the groups.

download-7.png

The coefficients also give a similar pattern and throw light on which X variables contribute most to group separation.

Step 6: Visualizing the Groups

Now, let's visualize what our non-churning customers look like compared to our customers who will most likely churn. Remember, 0 indicates customers who will not churn and 1 indicates customers who will. 


    plot(sublda, dimen = 1, type = "b")

 download-8.png

The groups created by discriminant analysis can be seen in the graphs, and are in sync with the Wilks lambda value of 0.89 that we got from our MANOVA test. These graphs are a good indicator that although the model is significant, our two groups are not completely separated. There is some overlap.

Step 7: Make Predictions on the Test Set 

Using our test set, let's predict whether a customer will churn or stay based on the X variables of the test set. We will apply the discriminant model that we built using the training set to make predictions about the test set. The objective of this is to measure how the model performs on a new set of data.


    lda.pred=predict(sublda, newdata = testdata)

library(hmeasure)

class.lda=lda.pred$class
true.class<-testdata[,1]
lda.counts <- misclassCounts(class.lda,true.class)
lda.counts$conf.matrix

    ##          pred.1 pred.0
## actual.1     27    185
## actual.0     30   1258

    print(lda.counts$metrics,digits=3)

    ##      ER  Sens  Spec Precision Recall   TPR    FPR     F Youden
## 1 0.143 0.127 0.977     0.474  0.127 0.127 0.0233 0.201  0.104

 

Step 8: Evaluate Model Performance Measures 

The accuracy of the model is 1-Error rate = 1-0.1433 = 85.67%

Accuracy indicates how many correct predictions are made by the model. The model has a fairly good accuracy. However, since the dataset is unbalanced, accuracy alone may not be the sole indicator that the model is a robust model. Hence we will look at few other model performance measures.

Sensitivity also called the true positive rate is defined as the proportion of actual positives that are correctly identified by the model. The sensitivity of the model is 12.7% which is very low. For a churn prediction model, it is important that the model picks up positives as positives. It is important to make an accurate prediction of customers who will churn which is given by sensitivity.

We will now vary the threshold of the model from the default 50% to other values to decide on a optimum balance for sensitivity and specificity.


    ## classifying with a default threshold of 0.5
lda.pred$posterior[1:3,]

    ##           0          1
## 2 0.9302405 0.06975946
## 4 0.8165096 0.18349036
## 5 0.8499270 0.15007298

    scores.lda <- lda.pred$posterior[,2]
all((scores.lda > 0.5)== (class.lda=="1"))

    ## [1] TRUE

The model, by default, uses a 50% threshold to classify records as 0 or 1.


  
## threshold of 30%
lda.counts.T03 <- misclassCounts(scores.lda>0.3,true.class)
lda.counts.T03$conf.matrix

    ##          pred.1 pred.0
## actual.1     82    130
## actual.0    106   1182

    ## threshold of 20%
lda.counts.T02 <- misclassCounts(scores.lda>0.2,true.class)
lda.counts.T02$conf.matrix

    ##          pred.1 pred.0
## actual.1    134     78
## actual.0    237   1051

    ## threshold of 17%
lda.counts.T017 <- misclassCounts(scores.lda>0.17,true.class)
lda.counts.T017$conf.matrix

    ##          pred.1 pred.0
## actual.1    151     61
## actual.0    302    986

    ## threshold of 16%
lda.counts.T016 <- misclassCounts(scores.lda>0.16,true.class)
lda.counts.T016$conf.matrix

    ##          pred.1 pred.0
## actual.1    157     55
## actual.0    329    959

    ## threshold of 15%
lda.counts.T015 <- misclassCounts(scores.lda>0.15,true.class)
lda.counts.T015$conf.matrix

    ##          pred.1 pred.0
## actual.1    163     49
## actual.0    359    929

    ## threshold of 10%
lda.counts.T01 <- misclassCounts(scores.lda>0.1,true.class)
lda.counts.T01$conf.matrix

    ##          pred.1 pred.0
## actual.1    182     30
## actual.0    604    684

Now, let's compare the values of sensitivity and specificity for three threshold values.


    ##          ER      Sens      Spec
## 1 0.1573333 0.3867925 0.9177019

    lda.counts.T02$metrics[c('ER', 'Sens','Spec')]

    ##     ER      Sens      Spec
## 1 0.21 0.6320755 0.8159938

    lda.counts.T017$metrics[c('ER', 'Sens','Spec')]

    ##      ER      Sens     Spec
## 1 0.242 0.7122642 0.765528

    lda.counts.T016$metrics[c('ER', 'Sens','Spec')]

    ##      ER     Sens      Spec
## 1 0.256 0.740566 0.7445652

    lda.counts.T015$metrics[c('ER', 'Sens','Spec')]

    ##      ER      Sens      Spec
## 1 0.272 0.7688679 0.7212733

    lda.counts.T01$metrics[c('ER', 'Sens','Spec')]

    ##          ER      Sens      Spec
## 1 0.4226667 0.8584906 0.5310559

When we plot the accuracy, sensitivity, and specificity for various threshold values, the three lines intersect at a particular point. The threshold corresponding to this point indicates the optimum threshold for the model.

discriminant-analysis-10.png

From the chart above, it is clear that the optimum threshold for the model is 16%. When we make predictions at this threshold, the accuracy, sensitivity, and specificity are 74%.

Insights From the Model and Business Recommendations

Based on the discriminant coefficients and the correl_ratio provided by the model, an increase in the below variables increases the probability of customer churn:

  1. Number of customer service calls
  2. Total day charge
  3. Total evening charge
  4. Total international charge
  5. Total night charge

Additionally, an increase in number of voice mail messages decreases the probability of customer churn.

These insights from the discriminant model can help the business formulate strategies to reduce customer churn. Here's what I would recommend to the business based on what we've learned: First, customer issues should be resolved within the first or second call, as repeated calls to customer service causes customer churn. Second, there should be an organized escalation procedure for issues not resolved within two calls. Lastly, the provider should offer more attractive plans that reduce the cost of day, evening, and international calls based on usage.

Sowmya Vivek
Sowmya Vivek

Sowmya Vivek is an independent consultant with a strong passion for analytics, machine learning, and change management. She has extensive experience in various senior-level positions across business intelligence, workflow optimization, content management, and e-learning content development.

Related Content