The general goal of machine learning is to build models that can learn from data without being explicitly programmed. Among the many subdomains of machine learning, the one that usually gets the most attention is what is known as supervised learning. It is the most accessible, especially for people new to the field, and provides a great introduction to the wider world of machine learning. The 'supervised' in supervised learning refers to the fact that each sample within the data being used to build the system contains an associated label. The goal is to build a model that can accurately predict the value of the label when presented with new data. More formally, if the data set contains features, denoted x, and labels, denoted Y, the supervised learning model takes the form 

Y = f(x)

Where the label is assumed to be some general function of the input features. This function is general in the sense that it can be linear or non-linear, parametric or non-parametric, etc.

Outline

Here's a broad outline of what we're going to cover:

  • The two main types of supervised learning: regression and classification
  • How to choose an appropriate model
  • The general tradeoff between model accuracy and interpretability
  • A regression example using the Boston Housing dataset
  • A classification example using the UCI ML Breast Cancer Wisconsin dataset 

Regression Versus Classification

Supervised learning problems can be divided into two primary types, regression and classification. In regression problems, the labels are quantitative, or continuous in nature. Examples include:

  • Income in dollars
  • Weight in pounds
  • Distance in miles

In classification problems, the labels are qualitative, or categorical in nature, and can be grouped into two or more classes. Examples include:

  • Binary labels (Yes/No or 0/1)
  • Different brands of a product (A, B, C)
  • The weather on a given day (rainy, sunny, overcast)

In both cases, the features (x's) are different variables that we assume are related to the label in some way. For regression, if the label represents income, the features could be job title, years of experience, location, level of education, etc. For classification, if the label represents whether or not a passenger survived the sinking of the Titanic, the features could be age, gender, cabin class, etc. The exact form of the relationship between the features and label will depend on the type of model used. Regardless of the type of problem, the goal is to predict the value of the labels with an acceptable level of accuracy. The way to measure accuracy depends on whether the problem involves regression or classification, and the definition of an acceptable level of accuracy depends on the specific domain. 

Choosing an Appropriate Model

Within the areas of regression and classification, there are a wide variety of models to choose from. Choosing an appropriate model depends on a number of factors, including:

  • The size of the data, as some models perform better on larger or smaller data sets
  • The distribution of the data, as some models assume the features within a dataset follow a specific statistical distribution
  • The relationship between the features and labels (linear or non-linear, additive or multiplicative, etc.)
  • The format of the data
    • Structured data, such as a comma delimited text file, and whether the features are quantitative or qualitative
    • Unstructured data such as audio, video, or image files
  • The primary goal of the analysis, which is typically either prediction or inference

Model Accuracy Versus Interpretability

The last bullet hints at an important distinction between different supervised learning models, and that is the general tradeoff between accuracy and interpretability. Here, interpretability refers to the ability to see how a model arrived at a particular answer, or at a higher level, why the model made the decisions it did. This tradeoff can be viewed in terms of the overall flexibility of a model. Models that are less flexible tend to be less accurate, as they assume a somewhat rigid form of f(x), and can only produce a small range of estimates. Most real world phenomena do not follow such an explicit form, and thus the model will not be able to completely capture the underlying relationship between the features and label. However, because they are somewhat rigid in nature, these models provide a higher level of interpretability. Models that are more flexible tend to be more accurate, as they do not make explicit assumptions about the form of f(x), and can fit a wider variety of shapes to the data. Because they are more flexible, however, they often provide a lower level of interpretability.

Since this post is meant to serve as an introduction to supervised learning, our focus will be on interpretability when choosing a suitable model. 

Examples Using Scikit-Learn With Python

Now that we have a general idea about what supervised learning is, it's time for some examples to solidify the concepts that have been introduced so far. Both regression and classification examples will be given, both will be done in Python 2.7, and both will use the scikit-learn and pandas packages. Scikit-learn is a free machine-learning library that contains all of the functions we'll need for the examples, and pandas provides flexible data structures designed to make working with relational datasets easy. Finally, both examples will use datasets that come bundled with scikit-learn, so there is no need to visit an external source.

Scikit-learn: http://scikit-learn.org/stable/index.html

Pandas: https://pandas.pydata.org/ 

Regression Example

Our regression example will use the Boston Housing Prices dataset. Our goal is to predict the median price of a house in a suburb of the city given a set of features pertaining to the suburb. Because our goal is interpretability, we'll use linear regression as our model of choice. Despite being one of the oldest supervised learning methods, it is still useful, and quite widely used. In addition, understanding linear regression is essential to understanding more complex models like neural networks.

If we have a label Y and features X1 through Xp, the linear regression model is of the form

Y = β0 + β1X1 + β2X2 + ... + βpXp

Here, the β terms are unknown coefficients that will be determined by our specific data set. As a quick aside, a linear regression model assumes a linear relationship between the label and the coefficients of the features. This distinction is important because it is often wrongly assumed that the linear relationship is between the label and the features themselves. However, it is perfectly acceptable, and often helpful, to use non-linear features such as X1X2 or X1², if it improves the model. The resulting model is still linear, and all of the general rules regarding linear regression models apply.
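As a concrete illustration of that aside, here is a minimal sketch (not used in the example below) of how such non-linear terms could be generated with scikit-learn's PolynomialFeatures. The array X here is a hypothetical feature matrix with two columns standing in for X1 and X2.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A hypothetical feature matrix with two columns (X1 and X2)
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# degree=2 adds the squared terms (X1², X2²) and the interaction term X1X2;
# include_bias=False omits the constant column of ones
poly = PolynomialFeatures(degree=2, include_bias=False)
X_expanded = poly.fit_transform(X)

# Columns: X1, X2, X1², X1X2, X2²
print(X_expanded)

The expanded matrix can then be fed to an ordinary linear regression, and the resulting model remains linear in its coefficients.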

Before we import our data, there are two questions we need to address.

How are the β's determined?

The coefficients selected are those that minimize a quantity known as the residual sum of squares, or RSS. If we denote a true label as Y, a predicted label as Ŷ, and have a total of n samples, the RSS is defined as

RSS = (Y1 - Ŷ1)² + (Y2 - Ŷ2)² + ... + (Yn - Ŷn)²

  

From the above equation, the minimum RSS is achieved when the differences between the true and predicted labels are as small as possible. The selected β values will be those that produce the smallest overall gap between the true and predicted labels.

How do we measure the accuracy of our model?

There are many ways to measure the accuracy of a linear regression model. We're going to use what's known as the root mean squared error (RMSE), which is given by the equation 

 

 
RMSE = √(RSS / n)


The RMSE can be thought of as the square root of the 'average' squared error per sample. One advantage of using RMSE is that it is in the same units as the label. As with RSS, smaller values are better, but there isn't a cutoff for what's considered a 'good' value. Such a threshold depends on the specifics of the problem.
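To make the two formulas concrete, here is a small sketch that computes both quantities directly with NumPy, using hypothetical true and predicted labels.

import numpy as np

# Hypothetical true and predicted labels for n = 4 samples
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

# Residual sum of squares: the sum of squared differences
rss = np.sum((y_true - y_pred) ** 2)

# Root mean squared error: the square root of the average squared difference
rmse = np.sqrt(rss / len(y_true))

print('RSS  = %.3f' % rss)    # 1.750
print('RMSE = %.3f' % rmse)   # 0.661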

Boston Housing Data

Now that we've defined our model, let's import the dataset.

from sklearn.datasets import load_boston
boston = load_boston()
 

Before we start to explore the data, let's turn it into a pandas data frame, which is a table-like data structure with labeled rows and columns. We'll label the columns using the 'feature_names' property of the dataset.

import pandas as pd
boston_data = pd.DataFrame(boston.data, columns=boston.feature_names)
 

We can use the shape attribute to see the size of the data frame.

boston_data.shape
 
(506, 13)
 

Shape lists rows, then columns. The way to interpret this is that each row represents a different suburb in the greater Boston area, and there are 506 suburbs in the dataset. Each column represents a different feature, and there are 13 features for each suburb.

We can look at the first few rows in the data frame using the head() function.

boston_data.head()
 
  CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33
 

As noted earlier, there are 13 features for each suburb. Some of the features are:

  • CRIM - Per capita crime rate by town
  • INDUS - Proportion of non-retail business acres per town
  • NOX - Nitric oxides concentration (parts per 10 million)
  • AGE - Proportion of houses built before 1940
  • PTRATIO - Pupil-teacher ratio by town
 

Note that the median price is not one of the features. It is actually stored separately, so let's go ahead and add it to the data set. We can add a new column to our data frame using the syntax below, and note that the price is given in thousands of dollars, so we'll convert it to dollars.

boston_data['PRICE'] = boston.target * 1000
 

Now that we have all of our data in the data frame, we can view some basic statistics using the describe() function.

boston_data.describe().transpose()
 
  count mean std min 25% 50% 75% max
CRIM 506.0 3.593761 8.596783 0.00632 0.082045 0.25651 3.647423 88.9762
ZN 506.0 11.363636 23.322453 0.00000 0.000000 0.00000 12.500000 100.0000
INDUS 506.0 11.136779 6.860353 0.46000 5.190000 9.69000 18.100000 27.7400
CHAS 506.0 0.069170 0.253994 0.00000 0.000000 0.00000 0.000000 1.0000
NOX 506.0 0.554695 0.115878 0.38500 0.449000 0.53800 0.624000 0.8710
RM 506.0 6.284634 0.702617 3.56100 5.885500 6.20850 6.623500 8.7800
AGE 506.0 68.574901 28.148861 2.90000 45.025000 77.50000 94.075000 100.0000
DIS 506.0 3.795043 2.105710 1.12960 2.100175 3.20745 5.188425 12.1265
RAD 506.0 9.549407 8.707259 1.00000 4.000000 5.00000 24.000000 24.0000
TAX 506.0 408.237154 168.537116 187.00000 279.000000 330.00000 666.000000 711.0000
PTRATIO 506.0 18.455534 2.164946 12.60000 17.400000 19.05000 20.200000 22.0000
B 506.0 356.674032 91.294864 0.32000 375.377500 391.44000 396.225000 396.9000
LSTAT 506.0 12.653063 7.141062 1.73000 6.950000 11.36000 16.955000 37.9700
PRICE 506.0 22532.806324 9197.104087 5000.00000 17025.000000 21200.00000 25000.000000 50000.0000
 

Note that many of the features have different scales. This is important to recognize because many machine learning models are sensitive to the relative scaling of each feature, and it is often necessary to rescale the features to the same range. The most common ways to do this are to normalize each feature so that it ranges from 0 to 1, or standardize each feature so that it has zero mean and a standard deviation of one. For our example, the final result will be the same whether we scale or not, but it will make the coefficients more interpretable if we do.
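As a brief illustration of those two options (not something we apply to the full dataset just yet), the sketch below rescales a single feature, RM, using scikit-learn's MinMaxScaler and StandardScaler.

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Work with a single feature: the average number of rooms per home
rm = boston_data[['RM']]

# Normalization: rescale so the minimum maps to 0 and the maximum maps to 1
rm_normalized = MinMaxScaler().fit_transform(rm)

# Standardization: rescale to zero mean and unit standard deviation
rm_standardized = StandardScaler().fit_transform(rm)

print('Normalized range: %.2f to %.2f' % (rm_normalized.min(), rm_normalized.max()))
print('Standardized mean: %.2f, std: %.2f' % (rm_standardized.mean(), rm_standardized.std()))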

Training Data Versus Test Data

Before we scale our data, we need to address one of the most important parts of supervised learning. We mentioned earlier that our goal is to predict the median house price using the data set, but we didn't say how we were going to go about doing that. The way we're going to do it is to split our data set into two groups, one for training our model, and one for testing it. It's important to set aside some data for testing because we need to get a sense of how our model will perform on data it has never seen before, which is what it would do if it were used in a real production environment. Because our model has already seen the training data, it would not be a good idea to predict prices using that same data. We would expect the model to perform well, and that would give us an over-optimistic estimate of our model's performance ability. The real test is to use data that is new, and that's the purpose of keeping a separate set of data specifically for testing. We want to keep our test data pristine, so we'll split it away from the training data before we do any scaling.

 

The first thing to do is split the data back apart into features (X) and labels (y). Then, we can use the 'train_test_split' function from scikit-learn to randomly split our data into training and testing sets. Note that this split should always be random, in case the data is ordered in some way. A common split is to allocate 70-80% for training, and the rest for testing. Also, because the split is random, we are highly likely to generate training and testing sets that both capture the same underlying relationship between the features and labels.

X = boston_data.iloc[:,:-1]
y = boston_data['PRICE']
from sklearn.model_selection import train_test_split

# Split the data into 80% training and 20% testing. 
# The random_state allows us to make the same random split every time.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=327)

print('Training data size: (%i,%i)' % X_train.shape)
print('Testing data size: (%i,%i)' % X_test.shape) 
 
Training data size: (404,13)
Testing data size: (102,13) 

Scaling the Features

Now we can use the 'StandardScaler' function from scikit-learn to scale the training data so that each feature has a mean of zero and unit standard deviation. We'll apply this same scale to the test data. Note that the test data should never be scaled using its own data (think about a scenario where you had to predict the price of a single suburb, how would you scale a single sample?).



from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
 

As a check, let's print the mean and standard deviation of the training data.

print('Training set mean by feature:')
print(X_train.mean(axis=0))
print('Training set standard deviation by feature:')
print(X_train.std(axis=0))
 
Training set mean by feature:
[ -5.93584587e-17  -4.17707673e-17   7.03507659e-17  -4.83661516e-17
  -1.73678453e-16  -3.03387678e-16  -3.15479216e-16   0.00000000e+00
   9.45338417e-17  -4.39692287e-17   3.36364600e-16  -3.14379985e-16
   2.02258452e-16]
Training set standard deviation by feature:
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
 

As expected, they are equal to zero, and one, respectively.

Training Our Model

It's finally time to build our linear regression model using the training data. This is quite simple, and just involves creating a LinearRegression model object and one call to its 'fit' method.



from sklearn.linear_model import LinearRegression

regression_model = LinearRegression()
regression_model.fit(X_train,y_train)
 
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False) 

Interpreting The Coefficients

It was mentioned above that the linear regression model assumes that the median home price is a linear combination of the various features, with coefficients determined by calling the 'fit' method. The LinearRegression model object stores those values for us, so let's take a look.



intercept = regression_model.intercept_
coef = pd.DataFrame(regression_model.coef_, index=boston.feature_names, columns=['Coefficients'])
print('Intercept = %f\n' % intercept)
print(coef)
 
Intercept = 22452.475248

         Coefficients
CRIM      -797.292486
ZN        1076.530798
INDUS     -300.966686
CHAS       694.329854
NOX      -1729.254032
RM        2761.795061
AGE       -403.233095
DIS      -3223.486941
RAD       2720.184752
TAX      -1947.419925
PTRATIO  -1916.685459
B         1092.865681
LSTAT    -3325.234011
 

There is a lot to be learned by studying these coefficients. First, there's the intercept term (β0), which is the predicted median home price when every feature is set equal to its mean value (zero for every feature after standardization); this works out to the mean home price among all suburbs in the training data set.

Another important detail is the sign of the coefficients. A positive coefficient means that the median home price increases as the corresponding feature increases. On the other hand, a negative coefficient means that the median home price decreases as the corresponding feature increases. As a first order check, let's see if some of these values make sense:

  • CRIM - An increase in the crime rate corresponds to a decrease in median home price
  • RM - An increase in the average number of rooms per home corresponds to an increase in median home price
  • AGE - An increase in the proportion of houses built before 1940 corresponds to a decrease in median home price
  • RAD - An increase in accessibility to radial highways corresponds to a increase in median home price
  • PTRATIO - An increase in the pupil-teacher ratio (meaning more students in each class) corresponds to a decrease in median home price

All of these trends make intuitive sense. In addition, because we scaled our data, each coefficient can be interpreted as the average effect on the median price of a one unit increase in the corresponding scaled feature (i.e., a one standard deviation increase in the original feature) while holding all other features fixed. In that sense, we can see that factors like the number of rooms per home (RM) and access to highways (RAD) have the largest positive effect on median home price, while factors like the distance to local employment centers (DIS) and percent of the population that qualifies as 'lower status' (LSTAT) have the largest negative effect.
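As a quick way to see that ranking at a glance, we can sort the coefficient data frame created above by absolute value; a minimal sketch:

# Rank the features by the magnitude of their effect on the predicted price,
# keeping the sign so the direction of each effect is still visible
ranked = coef.reindex(coef['Coefficients'].abs().sort_values(ascending=False).index)
print(ranked)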

Testing the Model on New Data

Now that we've built our model, we can check its performance on the test data set we set aside earlier. We do that by using the 'predict' function within the LinearRegression model class. In addition, let's compute the RMSE on the test data using the formula shown earlier. Not surprisingly, scikit-learn has a built in function for that as well. In the code below, 'y_pred' contains the predicted home prices, and 'y_test' contains the true values (labels) from the test data set.



from sklearn.metrics import mean_squared_error
import numpy as np

y_pred = regression_model.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print('Test RMSE: %f' % test_rmse)
 
Test RMSE: 4752.805517
 

This value means that, on average, the error in the predicted median price is approximately $4,800. Given that the home prices range from $5,000 to $50,000, this is a non-trivial difference. Two possible reasons for this difference include:

  • The relationship between the features and response is not perfectly linear (this is most certainly true).
  • Some of the features we included were not actually correlated with the median price. Adding additional complexity without improving the model can lead to what is known as overfitting, where the model performs well on the training data but does not generalize well to new data.

There are other potential sources of error, but those have to do with the specific assumptions regarding linear regression models (such as multicollinearity between features, the presence of heteroscedasticity, and the distribution of the residuals between the predicted and actual prices) and are beyond the scope of this tutorial. However, there is one plot we can make which will give us a sense of how well our model fit the data, and that is a plot of the predicted versus actual home prices. The red line has a slope of one, and represents the line where the predicted price would be identical to the actual price.

import matplotlib.pyplot as plt

plt.scatter(y_test,y_pred)
plt.plot([0,50000],[0,50000],'r',lw=2)
plt.xlabel('Actual Price (Dollars)')
plt.ylabel('Predicted Price (Dollars)')
plt.show()
 
 

If our model had a test RMSE of zero, we would expect every blue dot to land perfectly on the red line. This is not the case of course, and this plot tells us that our model tends to under predict home prices at the lower and higher ends of the price range, while prices in the middle are somewhat equally distributed above and below the perfect fit line.

Overall, our model did a satisfactory job of predicting the median home price, especially for a first effort. Plus, we learned which features are most influential, and which contribute to an increase or decrease in median home price, which can be just as valuable as being able to predict the price itself.

Classification Example

Our classification example will use the UCI ML Breast Cancer Wisconsin dataset, and our goal is to predict whether a mass is benign or malignant given a set of features based on a digital image of the mass. Because our goal is interpretability, we'll use logistic regression as our model of choice. As was the case with linear regression, despite being one of the older supervised learning methods, it is still useful, and quite widely used.

Given that the name of this model is similar to linear regression, you'd be right to think that there is some similarity between the two. In this case, rather than predicting a quantitative output, we are predicting a qualitative one, specifically whether a mass of cells is benign or malignant. If we designate each one of these labels as a class, with values of 0 for malignant and 1 for benign, what we'd really like is for our model to return a probability of each mass belonging to the benign class. That is, we'd like it to output 

p(X) = Pr(Y = 1 | X)

 

Where the right hand side is the conditional probability that the value of the label is equal to 1 (i.e., benign) given the particular features of the sample.

If we once again have p features, we can try to use the linear regression model from the previous example, in which case we end up with

p(X) = β0 + β1X1 + β2X2 + ... + βpXp

  

The problem here is that we need our estimates to be valid probabilities (i.e., between 0 and 1), but as we saw, the right hand side outputs continuous values over a wide range. What we need is a function that will always return values between 0 and 1, and the logistic function does just that. The logistic function is defined as

f(x) = 1 / (1 + e^(-x))

  

A plot of it is shown below:


x = np.linspace(-20,20,100)
y = 1/(1+np.exp(-x))
plt.plot(x,y)
plt.xlabel('x')
plt.ylabel('y')
plt.title('The Logistic Function')
plt.show()
 
 

This function is perfect for us, as large negative values get mapped to zero, and large positive values get mapped to one. Given that, the logistic regression model is given as

p(X) = 1 / (1 + e^(-(β0 + β1X1 + ... + βpXp)))

  

As in the previous example, the β terms are unknown coefficients that will be determined by our specific data set.

The default threshold for classification is p(X) = 0.5, and note that in the graph above, y, or p(X), equals 0.5 when x equals 0. This means that when the linear combination β0 + β1X1 + ... + βpXp is greater than 0 (corresponding to p(X) > 0.5), the mass is classified as benign (class 1). When it's less than zero (corresponding to p(X) < 0.5), the mass is classified as malignant (class 0). This default threshold may not always be appropriate; in this case, for example, you may want to flag a mass as malignant even when its estimated probability of being malignant is below 0.5. For this example, however, we're going to stick with the default.

Before we import our data, let's address the same two questions we asked in the regression example.

How are the β's determined?

In this case, the coefficients are determined using a method called maximum likelihood. Although the details are beyond the scope of this tutorial, the method works by finding the values of the β's such that the output of the model is close to zero for all malignant class examples, and close to one for all benign class examples. The β's are chosen such that they maximize what is known as the likelihood function.

How do we measure the accuracy of our model?

It was relatively straightforward to measure the accuracy of our model in the regression example. For classification, things get a bit more complicated. There are many different ways to measure the accuracy of a classifier, and which metric to use depends on the specific problem. The most basic approach is to measure the overall accuracy, which is simply the fraction of correct classifications (the error rate is just one minus this quantity)

 

Accuracy = (1/n) Σ I(ŷi = yi)

  

Here, n again refers to the number of samples in our data set. The function inside the summation simply counts the number of samples for which the class was correctly predicted. Dividing by the number of samples converts this count into a fraction, which can be interpreted as the accuracy of the classifier.
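In code, this metric is just the fraction of matching predictions; a minimal sketch using hypothetical label arrays:

import numpy as np

# Hypothetical true and predicted classes for n = 5 samples
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

# The comparison yields True/False per sample; the mean is the fraction correct
accuracy = np.mean(y_true == y_pred)
print('Accuracy = %.2f' % accuracy)   # 0.80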

Another way to measure accuracy requires a more specific definition of correct and incorrect predictions. Consider the following terms:

  • A true positive classification is one where we correctly predicted that a sample belonged to the positive class (in this case, we'll call the malignant class positive).
  • A true negative classification is one where we correctly predicted that a sample belonged to the negative class (in this case, we'll call the benign class negative).
  • A false positive classification is one where we incorrectly predicted that a sample belonged to the positive class (in this case, we said the mass was malignant when it was actually benign).
  • A false negative classification is one where we incorrectly predicted that a sample belonged to the negative class (in this case, we said the mass was benign when it was actually malignant).

Depending on the problem, you may be more concerned with tracking the number of false positives or false negatives, rather than the overall accuracy. The accuracy metric assumes that true positive and true negative classifications are equally important. In many cases, including fraud detection and cancer diagnoses, false negatives are much more dangerous than false positives.

With these new terms defined, we can compute what is known as the confusion matrix for our classifier. For a binary classifier such as the one we're going to create, the confusion matrix lists the total count of each of the four types of classifications after a set of predictions has been made. From there, a variety of metrics can be calculated depending on the problem.

UCI ML Breast Cancer Wisconsin Data

Now that we've defined our model, let's import the dataset.

from sklearn.datasets import load_breast_cancer
breast_cancer_data = load_breast_cancer()
 

As before, let's turn the data set into a pandas data frame, and label the columns using the 'feature_names' property of the dataset.

bc = pd.DataFrame(breast_cancer_data.data)
bc.columns = breast_cancer_data.feature_names
 

We can use the shape attribute to see the size of the data frame.

bc.shape
 
(569, 30)
 

For this data set, each row represents a different digital image of a mass, and there are 569 total images in the dataset. Each column represents a different feature, and there are 30 features for each mass. We can look at the first few rows in the data frame using the head() function.

bc.head()
 
  mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows × 30 columns

 

As noted earlier, there are 30 features for each mass. These features relate to the description of the mass based on the digital image. Some of the features describe characteristics like:

  • Radius
  • Texture
  • Perimeter
  • Area
  • Smoothness
  • Symmetry
 

As before, the labels (class) are not one of the features. We can add a new column to our data frame using the same method as before.

bc['class'] = breast_cancer_data.target
 

Let's take a look at the class counts, which correspond to the number of benign and malignant masses. We can do this using the 'value_counts' function within pandas.

pd.value_counts(bc['class'])
 
1    357
0    212
Name: class, dtype: int64
 

There are 212 malignant masses (class 0), and 357 benign masses (class 1) in our data set. We need to be careful when calculating our confusion matrix, in terms of which class is considered positive and negative, but we'll address that when the time comes.

Let's look at some basic statistics, using the describe() function as before.

bc.describe().transpose()
 
  count mean std min 25% 50% 75% max
mean radius 569.0 14.127292 3.524049 6.981000 11.700000 13.370000 15.780000 28.11000
mean texture 569.0 19.289649 4.301036 9.710000 16.170000 18.840000 21.800000 39.28000
mean perimeter 569.0 91.969033 24.298981 43.790000 75.170000 86.240000 104.100000 188.50000
mean area 569.0 654.889104 351.914129 143.500000 420.300000 551.100000 782.700000 2501.00000
mean smoothness 569.0 0.096360 0.014064 0.052630 0.086370 0.095870 0.105300 0.16340
mean compactness 569.0 0.104341 0.052813 0.019380 0.064920 0.092630 0.130400 0.34540
mean concavity 569.0 0.088799 0.079720 0.000000 0.029560 0.061540 0.130700 0.42680
mean concave points 569.0 0.048919 0.038803 0.000000 0.020310 0.033500 0.074000 0.20120
mean symmetry 569.0 0.181162 0.027414 0.106000 0.161900 0.179200 0.195700 0.30400
mean fractal dimension 569.0 0.062798 0.007060 0.049960 0.057700 0.061540 0.066120 0.09744
radius error 569.0 0.405172 0.277313 0.111500 0.232400 0.324200 0.478900 2.87300
texture error 569.0 1.216853 0.551648 0.360200 0.833900 1.108000 1.474000 4.88500
perimeter error 569.0 2.866059 2.021855 0.757000 1.606000 2.287000 3.357000 21.98000
area error 569.0 40.337079 45.491006 6.802000 17.850000 24.530000 45.190000 542.20000
smoothness error 569.0 0.007041 0.003003 0.001713 0.005169 0.006380 0.008146 0.03113
compactness error 569.0 0.025478 0.017908 0.002252 0.013080 0.020450 0.032450 0.13540
concavity error 569.0 0.031894 0.030186 0.000000 0.015090 0.025890 0.042050 0.39600
concave points error 569.0 0.011796 0.006170 0.000000 0.007638 0.010930 0.014710 0.05279
symmetry error 569.0 0.020542 0.008266 0.007882 0.015160 0.018730 0.023480 0.07895
fractal dimension error 569.0 0.003795 0.002646 0.000895 0.002248 0.003187 0.004558 0.02984
worst radius 569.0 16.269190 4.833242 7.930000 13.010000 14.970000 18.790000 36.04000
worst texture 569.0 25.677223 6.146258 12.020000 21.080000 25.410000 29.720000 49.54000
worst perimeter 569.0 107.261213 33.602542 50.410000 84.110000 97.660000 125.400000 251.20000
worst area 569.0 880.583128 569.356993 185.200000 515.300000 686.500000 1084.000000 4254.00000
worst smoothness 569.0 0.132369 0.022832 0.071170 0.116600 0.131300 0.146000 0.22260
worst compactness 569.0 0.254265 0.157336 0.027290 0.147200 0.211900 0.339100 1.05800
worst concavity 569.0 0.272188 0.208624 0.000000 0.114500 0.226700 0.382900 1.25200
worst concave points 569.0 0.114606 0.065732 0.000000 0.064930 0.099930 0.161400 0.29100
worst symmetry 569.0 0.290076 0.061867 0.156500 0.250400 0.282200 0.317900 0.66380
worst fractal dimension 569.0 0.083946 0.018061 0.055040 0.071460 0.080040 0.092080 0.20750
class 569.0 0.627417 0.483918 0.000000 0.000000 1.000000 1.000000 1.00000
 

As with the regression example, many of the features have different scales. This time, it is very important that we scale our features. The reason has to do with the specifics of the logistic regression model in scikit-learn: the model performs regularization by default, which helps control overfitting (described during the regression section) but also makes the fit sensitive to the relative scales of the features and their coefficients.
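As a quick aside, the strength of that default regularization is controlled by the C parameter of LogisticRegression; smaller values of C mean stronger regularization, and the default (which we'll use below) is C=1.0. A minimal sketch, for reference only:

from sklearn.linear_model import LogisticRegression

# Smaller C = stronger regularization; larger C = weaker regularization.
# The model we train below uses the default, C=1.0.
weakly_regularized_model = LogisticRegression(C=100.0)

Before we scale the features, let's briefly discuss classifier decision boundaries.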

Classifier Decision Boundaries

As discussed above, the ultimate goal in classification is to correctly predict which class each sample belongs to. This is equivalent to defining a geometric boundary where samples are classified depending on which side of the boundary they fall. This can be made more clear using an example from our data set. Consider the figure below, which plots the 'mean radius' feature with each sample colored by class (benign samples are yellow, malignant are purple).

x = range(len(bc['mean radius']))
y = bc['mean radius']
plt.scatter(x,y,c=bc['class'])
plt.xlabel('sample')
plt.ylabel('mean radius')
plt.show()

[Plot: mean radius for each sample, colored by class]

If we were trying to classify a sample using just this feature, a good boundary could be a value of 12.5 for the mean radius. Any sample with a mean radius less than 12.5 is classified as benign, and any sample with a mean radius greater than 12.5 is classified as malignant. It's not a perfect classification, but it's a start. Since we're using not one but 30 features, our classifier will create an analogous boundary in higher dimensional space to separate the samples. It could be that not all features are useful in separating the classes, which is something we could investigate once our model has been built.
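To make that concrete, here is a small sketch, using the bc data frame defined above and the rough 12.5 cutoff suggested by the plot, that classifies every mass from its mean radius alone and checks how often this one-feature rule agrees with the true class.

# Classify using a single feature: benign (1) if mean radius < 12.5, else malignant (0)
simple_pred = (bc['mean radius'] < 12.5).astype(int)

# Fraction of samples where this one-feature rule matches the true class
simple_accuracy = (simple_pred == bc['class']).mean()
print('One-feature rule accuracy: %.2f%%' % (simple_accuracy * 100))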

Training Data Versus Test Data

As before, we need to split the data back apart into features (X) and labels (y). Then, we can use the 'train_test_split' function from scikit-learn to randomly split our data into training and testing sets.

X = bc.iloc[:,:-1]
y = bc['class']
# Split the data into 80% training and 20% testing. 
# The random_state allows us to make the same random split every time.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=327)

print('Training data size: (%i,%i)' % X_train.shape)
print('Testing data size: (%i,%i)' % X_test.shape) 
 
Training data size: (455,30)
Testing data size: (114,30) 

Scaling the Features

We'll scale our training data once again using the 'StandardScaler' function so that each feature has a mean of zero and unit standard deviation. We'll apply this same scale to the test data.

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
 

Let's double check to make sure the scaling worked as intended.

print('Training set mean by feature:')
print(X_train.mean(axis=0))
print('Training set standard deviation by feature:')
print(X_train.std(axis=0))
 
Training set mean by feature:
[ -2.75628116e-15   4.59705534e-16  -5.40227204e-16   6.52957542e-16
   6.25189766e-15  -3.13644105e-15  -2.56205313e-16  -1.11607915e-15
  -3.54783358e-15   3.99192279e-15   8.92387507e-16  -4.62389589e-16
  -1.18725237e-15  -9.41005515e-16   1.29859493e-15   1.24088773e-15
   7.30795156e-16  -1.70315532e-16  -2.79580998e-15  -3.99436284e-16
  -1.29859493e-15   4.74785046e-15  -8.00824608e-16  -5.39495188e-16
   4.93427033e-15   1.86383265e-15  -9.58451877e-16  -1.65923441e-17
  -2.49739179e-15  -1.38546073e-15]
Training set standard deviation by feature:
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.] 

Training Our Model

It's finally time to build our logistic regression model using the training data. Similar to the linear regression model, this just involves creating a LogisticRegression model object and one call to its 'fit' method.

from sklearn.linear_model import LogisticRegression

regression_model = LogisticRegression()
regression_model.fit(X_train,y_train)
 
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False) 

Interpreting The Coefficients

The LogisticRegression model object also stores the coefficient values for us, so let's take a look.

intercept = regression_model.intercept_
coef = pd.DataFrame(regression_model.coef_.transpose(), 
                    index=breast_cancer_data.feature_names, 
                    columns=['Coefficients'])
print('Intercept = %f\n' % intercept)
print(coef)
 
Intercept = 0.493176

                         Coefficients
mean radius                 -0.620034
mean texture                -0.504767
mean perimeter              -0.564625
mean area                   -0.646708
mean smoothness             -0.131364
mean compactness             0.460890
mean concavity              -0.800368
mean concave points         -0.771655
mean symmetry                0.168892
mean fractal dimension       0.553119
radius error                -1.144201
texture error                0.337015
perimeter error             -0.569919
area error                  -0.946128
smoothness error            -0.405995
compactness error            0.610680
concavity error             -0.154262
concave points error         0.078581
symmetry error               0.249923
fractal dimension error      0.731013
worst radius                -1.150628
worst texture               -1.192783
worst perimeter             -0.880301
worst area                  -1.086891
worst smoothness            -0.994301
worst compactness           -0.007784
worst concavity             -0.752380
worst concave points        -0.752560
worst symmetry              -0.703783
worst fractal dimension     -0.449378
 

We have to be a bit more careful when interpreting the coefficients of a logistic regression model. We want to be able to articulate the effect of each coefficient on the output, and recall that our model outputs a probability of the form

 
p(X) = 1 / (1 + e^(-(β0 + β1X1 + ... + βpXp)))

 

In this form, it is hard to specify the effect of each coefficient on the output probability. What we really want is an expression involving only the coefficients and features on the right hand side. With a bit of manipulation, we can express the equation for our model as 

 
p(X) / (1 - p(X)) = e^(β0 + β1X1 + ... + βpXp)

  

The quantity on the left hand side is called the odds, and is defined as the ratio between the probability of a given event occurring and not occurring. Unlike probability, odds can range from 0 to infinity. Events with higher probabilities have higher odds, and vice versa. If we take the log of the above equation, we arrive at the following

log( p(X) / (1 - p(X)) ) = β0 + β1X1 + ... + βpXp

 

The expression on the left hand side is called the log-odds, and this is what we were after, as the right side involves only the coefficients and features. Now, we see how to properly interpret the effect of each coefficient on the output. We can say that a one-unit increase in a given feature (while holding all other features fixed) increases the log-odds by an amount equal to its particular coefficient.
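Equivalently, exponentiating a coefficient gives the multiplicative change in the odds of a mass being benign for a one unit increase in the corresponding (scaled) feature. A quick sketch using the coef data frame created above:

import numpy as np

# e^(coefficient) is the factor by which the odds of being benign change
# for a one standard deviation increase in the corresponding scaled feature
odds_ratios = np.exp(coef['Coefficients'])
print(odds_ratios.sort_values(ascending=False))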

Although the actual change in p(X) caused by a one-unit increase in a particular feature is harder to quantify, we can still interpret the sign of the coefficient in the same manner as before. That is, a positive coefficient is associated with increasing the value of p(X) as X increases, and likewise a negative coefficient is associated with decreasing the value of p(X) as X increases. Recall that since malignant masses are coded as 0, and benign as 1, an increase in p(X) corresponds with a higher chance of the mass being benign. Given that, let's take a closer look at some of our coefficients:

  • Most coefficients related to measures of size (radius, perimeter, area) are negative, indicating that an increase is related to an increased risk of a mass being classified as malignant. This suggests that larger masses may be more often malignant, which makes sense.
  • The feature corresponding to compactness is positive, indicating that an increase in compactness (i.e., a smaller mass) is related to a decreased risk of a mass being classified as malignant. This also makes sense. 

Testing the Model on New Data

Now that we've built our model, we can check its performance on the test data set we set aside earlier. As before, we will do that by using the 'predict' function. In addition, let's compute the accuracy on the test data using the formula shown earlier. Not surprisingly, scikit-learn has a built in function for that as well. In the code below, 'y_pred' contains the predicted class, and 'y_test' contains the true class (labels) from the test data set.



from sklearn.metrics import accuracy_score

y_pred = regression_model.predict(X_test)
test_acc = accuracy_score(y_test,y_pred)*100

print('The test set accuracy is %4.2f%%' % test_acc)
 
The test set accuracy is 96.49%
 

This is an impressive result, over 95% accuracy using a relatively simple model with the default parameters. Recall our discussion earlier regarding the different ways of measuring accuracy in a classification model. We introduced something known as a confusion matrix, so let's go ahead and print it below using the 'confusion_matrix' function from scikit-learn's 'metrics' module.

from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(y_test, y_pred, labels=[1,0])
print(conf_matrix)
 
[[66  1]
 [ 3 44]]
 

The confusion matrix is interpreted as follows:

  • The upper left term contains the number of true negatives in the test set. A true negative is where both the predicted and actual labels were negative (benign). In our case, there were 66 true negatives.
  • The lower right term contains the number of true positives in the test set. A true positive is where both the predicted and actual labels were positive (malignant). In our case, there were 44 true positives.
  • The lower left term contains the number of false negatives in the test set. A false negative is where the predicted label was negative (benign), but the actual label was positive (malignant). In our case, there were 3 false negatives, which are quite dangerous in this context.
  • The upper right term contains the number of false positives in the test set. A false positive is where the predicted label was positive (malignant), but the actual label was negative (benign). In our case, there was only 1 false positive. While stressful for a patient (until repeated tests can be performed), false positives are far less dangerous than false negatives.

Let's assign the various components to variables and calculate a few more metrics.

# True negatives
TN = conf_matrix[0][0]
# True positives
TP = conf_matrix[1][1]
# False negatives
FN = conf_matrix[1][0]
# False positives
FP = conf_matrix[0][1]
 

The sensitivity, also known as recall, or true positive rate (TPR), is given as 

TPR = TP / (TP + FN)

A way to interpret the sensitivity is that, out of all positive results (a combination of true positives and false negatives), how many did we correctly predict? In our case, the sensitivity is equal to

TPR = float(TP)/(TP+FN)
print('TPR = %4.2f%%' % (TPR*100))
 
TPR = 93.62%
 

The specificity, also known as the true negative rate (TNR), is given as

 
TNR = TN / (TN + FP)
 

A way to interpret the specificity is that, out of all negative results (a combination of true negatives and false positives), how many did we correctly predict? In our case, the specificity is equal to



TNR = float(TN)/(TN+FP)
print('TNR = %4.2f%%' % (TNR*100))
 
TNR = 98.51%
 

The precision, or positive predictive value (PPV), is given as

PPV = TP / (TP + FP)

   

A way to interpret the precision is that, out of all the results we said were positive (a combination of true positives and false positives), how many did we correctly predict? In our case, the precision is equal to



PPV = float(TP)/(TP+FP)
print('PPV = %4.2f%%' % (PPV*100))
 
PPV = 97.78%
 

Finally, the negative predictive value (NPV), is given as 

NPV = TN / (TN + FN)
 

A way to interpret the NPV is that, out of all the results we said were negative (a combination of true negatives and false negatives), how many did we correctly predict? In our case, the NPV is equal to



NPV = float(TN)/(TN+FN)
print('NPV = %4.2f%%' % (NPV*100))
 
NPV = 95.65%
 

As you can see, calculating the accuracy of the model doesn't always tell the whole story. Because we had three false negatives, our TPR and NPV were lower than our accuracy. On the other hand, because we only had one false positive, our TNR and PPV were higher. Some tweaking of the model could help adjust these values; for example, we may want to further reduce the number of false negatives. This could be done by adjusting the parameters of the logistic regression model (we used the defaults), or by adjusting the decision threshold itself (recall that the default probability threshold of 0.5 may not be appropriate in this case).
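As a rough sketch of that last idea, scikit-learn's predict_proba method returns the estimated p(X) for each test sample, so we could apply a stricter threshold ourselves; the 0.9 used here is just an illustrative value, not a recommendation.

from sklearn.metrics import confusion_matrix

# Estimated probability of the benign class (class 1) for each test sample
benign_prob = regression_model.predict_proba(X_test)[:, 1]

# Only classify a mass as benign when the model is at least 90% confident;
# everything else is flagged as malignant for follow-up
threshold = 0.9
y_pred_strict = (benign_prob >= threshold).astype(int)

print(confusion_matrix(y_test, y_pred_strict, labels=[1,0]))

Reducing the number of false negatives in this way generally comes at the cost of more false positives, so the right threshold depends on how the model will be used.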

Conclusion

Hopefully this tutorial served as a good introduction to supervised learning with Python. We were introduced to the scikit-learn package, which provides a lot of very powerful machine learning related functionality, and is a great place to start experimenting. We also saw the pandas package, which supports flexible data structures and is designed to make working with datasets easy. Along the way, we learned a bit of theory regarding model selection, some tradeoffs to consider when using different models, the importance of keeping separate training and testing sets, and why it's usually a good idea to scale your data. In addition, we got to see examples of both regression and classification problems, ways to evaluate the performance of each, and how the results could be improved.

About the Author

Greg DeVore is an engineer at Boeing, where he spends his time developing computational, data analysis, and data visualization software to aid in the airplane design process. In addition to masters’ degrees in aerospace engineering and applied mathematics, he has completed programs in data science and machine learning through the University of Washington.