The general goal of machine learning is to build models that can learn from data without being explicitly programmed. Among the many subdomains of machine learning, the one that usually gets the most attention is what is known as supervised learning. It is the most accessible, especially for people new to the field, and provides a great introduction to the wider world of machine learning. The 'supervised' in supervised learning refers to the fact that each sample within the data being used to build the system contains an associated label. The goal is to build a model that can accurately predict the value of the label when presented with new data. More formally, if the data set contains features, denoted *x*, and labels, denoted *Y*, the supervised learning model takes the form

*Y* = *f*(*x*)

Where the label is assumed to be some general function of the input features. This function is general in the sense that it can be linear or non-linear, parametric or non-parametric, etc.

## Outline

Here's a broad outline of what we're going to cover:

- The two main types of supervised learning: regression and classification
- How to choose an appropriate model
- The general tradeoff between model accuracy and interpretability
- A regression example using the Boston Housing dataset
- A classification example using the UCI ML Breast Cancer Wisconsin dataset

## Regression Versus Classification

Supervised learning problems can be divided into two primary types, regression and classification. In regression problems, the labels are quantitative, or continuous in nature. Examples include:

- Income in dollars
- Weight in pounds
- Distance in miles

In classification problems, the labels are qualitative, or categorical in nature, and can be grouped into two or more classes. Examples include:

- Binary labels (Yes/No or 0/1)
- Different brands of a product (A, B, C)
- The weather on a given day (rainy, sunny, overcast)

In both cases, the features (*x*'s) are different variables that we assume are related to the label in some way. For regression, if the label represents income, the features could be job title, years of experience, location, level of education, etc. For classification, if the label represents whether or not a passenger survived the sinking of the Titanic, the features could be age, gender, cabin class, etc. The exact form of the relationship between the features and label will depend on the type of model used. Regardless of the type of problem, the goal is to predict the value of the labels with an acceptable level of accuracy. The way to measure accuracy depends on whether the problem involves regression or classification, and the definition of an acceptable level of accuracy depends on the specific domain.

## Choosing an Appropriate Model

Within the areas of regression and classification, there are a wide variety of models to choose from. Choosing an appropriate model depends on a number of factors, including:

- The size of the data, as some models perform better on larger or smaller data sets
- The distribution of the data, as some models assume the features within a dataset follow a specific statistical distribution
- The relationship between the features and labels (linear or non-linear, additive or multiplicative, etc.)
- The format of the data:
  - Structured data, such as a comma delimited text file, and whether the features are quantitative or qualitative
  - Unstructured data, such as audio, video, or image files
- The primary goal of the analysis, which is typically either prediction or inference

## Model Accuracy Versus Interpretability

The last bullet hints at an important distinction between different supervised learning models, and that is the general tradeoff between accuracy and interpretability. Here, interpretability refers to the ability to see how a model arrived at a particular answer, or at a higher level, why the model made the decisions it did. This tradeoff can be viewed in terms of the overall flexibility of a model. Models that are less flexible tend to be less accurate, as they assume a somewhat rigid form of f(x), and can only produce a small range of estimates. Most real world phenomena do not follow such an explicit form, and thus the model will not be able to completely capture the underlying relationship between the features and label. However, because they are somewhat rigid in nature, these models provide a higher level of interpretability. Models that are more flexible tend to be more accurate, as they do not make explicit assumptions about the form of f(x), and can fit a wider variety of shapes to the data. Because they are more flexible, however, they often provide a lower level of interpretability.

Since this post is meant to serve as an introduction to supervised learning, our focus will be on interpretability when choosing a suitable model.

## Examples Using Scikit-Learn With Python

Now that we have a general idea about what supervised learning is, it's time for some examples to solidify the concepts that have been introduced so far. Both regression and classification examples will be given, both will be done in Python 2.7, and both will use the scikit-learn and pandas packages. Scikit-learn is a free machine-learning library that contains all of the functions we'll need for the examples, and pandas provides flexible data structures designed to make working with relational datasets easy. Finally, both examples will use datasets that come bundled with scikit-learn, so there is no need to visit an external source.

Scikit-learn: http://scikit-learn.org/stable/index.html

Pandas: https://pandas.pydata.org/

## Regression Example

Our regression example will use the Boston Housing Prices dataset. Our goal is to predict the median price of a house in a suburb of the city given a set of features pertaining to the suburb. Because our goal is interpretability, we'll use linear regression as our model of choice. Despite being one of the oldest supervised learning methods, it is still useful, and quite widely used. In addition, understanding linear regression is essential to understanding more complex models like neural networks.

If we have a label *Y* and features *X*_{1} through *X*_{p}, the linear regression model is of the form

*Y* = *β*_{0} + *β*_{1}*X*_{1} + *β*_{2}*X*_{2} + ... + *β*_{p}*X*_{p}

Here, the *β* terms are unknown coefficients that will be determined by our specific data set. As a quick aside, a linear regression model assumes a linear relationship between the label and the *coefficients* of the features. This distinction is important because it is often wrongly assumed that the linear relationship must be between the label and the features themselves. In fact, it is perfectly acceptable, and often helpful, to use non-linear features such as *X*_{1}*X*_{2} or *X*_{1}^{2}, if it improves the model. The resulting model is still linear in its coefficients, and all of the general rules regarding linear regression models apply.
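As a quick sketch of this idea (the toy data and the use of scikit-learn's `PolynomialFeatures` here are purely illustrative, not part of the Boston example that follows), we can expand two features with squared and interaction terms and still fit an ordinary linear model:

```
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Two original features, X1 and X2, for five illustrative samples.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
# A label that depends non-linearly on the features.
y = 2.0 + 3.0 * X[:, 0] ** 2 + 0.5 * X[:, 0] * X[:, 1]

# Expand to [X1, X2, X1^2, X1*X2, X2^2]. The model remains linear
# in its coefficients, so LinearRegression applies unchanged.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_expanded = poly.fit_transform(X)

model = LinearRegression().fit(X_expanded, y)
print(X_expanded.shape)  # five samples, five expanded features
```

Because the label here is exactly a linear combination of the expanded features, the linear model can fit it perfectly even though the relationship with the raw features is non-linear.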

Before we import our data, there are two questions we need to address.

### How are the *β*'s determined?

The coefficients selected are those that minimize a quantity known as the *residual sum of squares*, or RSS. If we denote a true label as *Y*, a predicted label as *Ŷ*, and have a total of *n* samples, the RSS is defined as
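RSS = Σ_{i=1}^{n} (*Y*_{i} - *Ŷ*_{i})^{2}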

From the above equation, the minimum RSS is clearly achieved when the differences between the true and predicted labels are as small as possible. The selected *β* values will be those that achieve the smallest overall delta between the true and predicted labels.

### How do we measure the accuracy of our model?

There are many ways to measure the accuracy of a linear regression model. We're going to use what's known as the root mean squared error (RMSE), which is given by the equation
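RMSE = √( (1/*n*) Σ_{i=1}^{n} (*Y*_{i} - *Ŷ*_{i})^{2} )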

The RMSE can be thought of as the square root of the 'average' RSS per sample. One advantage of using RMSE is that it is in the same units as the label. As with RSS, smaller values are better, but there isn't a cutoff for what's considered a 'good' value. Such a threshold depends on the specifics of the problem.

```
from sklearn.datasets import load_boston
boston = load_boston()
```

Before we start to explore the data, let's turn it into a pandas data frame, which is a table-like data structure with labeled rows and columns. We'll label the columns using the 'feature_names' property of the dataset.

```
import pandas as pd
boston_data = pd.DataFrame(boston.data, columns=boston.feature_names)
```

We can use the shape attribute to see the size of the data frame.

`boston_data.shape`

(506, 13)

Shape lists rows, then columns. The way to interpret this is that each row represents a different suburb in the greater Boston area, and there are 506 suburbs in the dataset. Each column represents a different feature, and there are 13 features for each suburb.

We can look at the first few rows in the data frame using the head() function.

`boston_data.head()`

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |

As noted earlier, there are 13 features for each suburb. Some of the features are:

- CRIM - Per capita crime rate by town
- INDUS - Proportion of non-retail business acres per town
- NOX - Nitric oxides concentration (parts per 10 million)
- AGE - Proportion of houses built before 1940
- PTRATIO - Pupil-teacher ratio by town

Note that the median price is not one of the features. It is actually stored separately, so let's go ahead and add it to the data set. We can add a new column to our data frame using the syntax below, and note that the price is given in thousands of dollars, so we'll convert it to dollars.

`boston_data['PRICE'] = boston.target * 1000`

Now that we have all of our data in the data frame, we can view some basic statistics using the describe() function.

`boston_data.describe().transpose()`

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CRIM | 506.0 | 3.593761 | 8.596783 | 0.00632 | 0.082045 | 0.25651 | 3.647423 | 88.9762 |
| ZN | 506.0 | 11.363636 | 23.322453 | 0.00000 | 0.000000 | 0.00000 | 12.500000 | 100.0000 |
| INDUS | 506.0 | 11.136779 | 6.860353 | 0.46000 | 5.190000 | 9.69000 | 18.100000 | 27.7400 |
| CHAS | 506.0 | 0.069170 | 0.253994 | 0.00000 | 0.000000 | 0.00000 | 0.000000 | 1.0000 |
| NOX | 506.0 | 0.554695 | 0.115878 | 0.38500 | 0.449000 | 0.53800 | 0.624000 | 0.8710 |
| RM | 506.0 | 6.284634 | 0.702617 | 3.56100 | 5.885500 | 6.20850 | 6.623500 | 8.7800 |
| AGE | 506.0 | 68.574901 | 28.148861 | 2.90000 | 45.025000 | 77.50000 | 94.075000 | 100.0000 |
| DIS | 506.0 | 3.795043 | 2.105710 | 1.12960 | 2.100175 | 3.20745 | 5.188425 | 12.1265 |
| RAD | 506.0 | 9.549407 | 8.707259 | 1.00000 | 4.000000 | 5.00000 | 24.000000 | 24.0000 |
| TAX | 506.0 | 408.237154 | 168.537116 | 187.00000 | 279.000000 | 330.00000 | 666.000000 | 711.0000 |
| PTRATIO | 506.0 | 18.455534 | 2.164946 | 12.60000 | 17.400000 | 19.05000 | 20.200000 | 22.0000 |
| B | 506.0 | 356.674032 | 91.294864 | 0.32000 | 375.377500 | 391.44000 | 396.225000 | 396.9000 |
| LSTAT | 506.0 | 12.653063 | 7.141062 | 1.73000 | 6.950000 | 11.36000 | 16.955000 | 37.9700 |
| PRICE | 506.0 | 22532.806324 | 9197.104087 | 5000.00000 | 17025.000000 | 21200.00000 | 25000.000000 | 50000.0000 |

Note that many of the features have different scales. This is important to recognize because many machine learning models are sensitive to the relative scaling of each feature, and it is often necessary to rescale the features to the same range. The most common ways to do this are to normalize each feature so that it ranges from 0 to 1, or standardize each feature so that it has zero mean and a standard deviation of one. For our example, the final result will be the same whether we scale or not, but it will make the coefficients more interpretable if we do.
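As a minimal sketch of the two approaches (using a made-up one-column array; `MinMaxScaler` and `StandardScaler` are scikit-learn's implementations of normalization and standardization, respectively):

```
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A toy feature column with values on a large scale.
X = np.array([[100.0], [200.0], [300.0], [400.0], [500.0]])

# Normalization: rescale so the feature ranges from 0 to 1.
X_norm = MinMaxScaler().fit_transform(X)

# Standardization: rescale to zero mean and unit standard deviation.
X_std = StandardScaler().fit_transform(X)

print(X_norm.min(), X_norm.max())  # 0.0 1.0
print(X_std.mean(), X_std.std())   # ~0.0 1.0
```

Which of the two to use depends on the model; we'll use standardization below.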

### Training Data Versus Test Data

Before we scale our data, we need to address one of the most important parts of supervised learning. We mentioned earlier that our goal is to predict the median house price using the data set, but we didn't say how we were going to go about doing that. The way we're going to do it is to split our data set into two groups, one for training our model, and one for testing it. It's important to set aside some data for testing because we need to get a sense of how our model will perform on data it has never seen before, which is what it would do if it were used in a real production environment. Because our model has already seen the training data, it would not be a good idea to predict prices using that same data. We would expect the model to perform well, and that would give us an over-optimistic estimate of our model's performance ability. The real test is to use data that is new, and that's the purpose of keeping a separate set of data specifically for testing. We want to keep our test data pristine, so we'll split it away from the training data before we do any scaling.

The first thing to do is split the data back apart into features (*X*) and labels (*y*). Then, we can use the 'train_test_split' function from scikit-learn to randomly split our data into training and testing sets. Note that this split should always be random, in case the data is ordered in some way. A common split is to allocate 70-80% for training, and the rest for testing. Also, because the split is random, we are highly likely to generate training and testing sets that both capture the same underlying relationship between the features and labels.

```
X = boston_data.iloc[:,:-1]
y = boston_data['PRICE']
```

```
from sklearn.model_selection import train_test_split
# Split the data into 80% training and 20% testing.
# The random_state allows us to make the same random split every time.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=327)
print('Training data size: (%i,%i)' % X_train.shape)
print('Testing data size: (%i,%i)' % X_test.shape)
```

```
Training data size: (404,13)
Testing data size: (102,13)
```

### Scaling the Features

Now we can use the 'StandardScaler' function from scikit-learn to scale the training data so that each feature has a mean of zero and unit standard deviation. We'll apply this same scale to the test data. Note that the test data should never be scaled using its own data (think about a scenario where you had to predict the price of a single suburb, how would you scale a single sample?).

```
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

As a check, let's print the mean and standard deviation of the training data.

```
print('Training set mean by feature:')
print(X_train.mean(axis=0))
print('Training set standard deviation by feature:')
print(X_train.std(axis=0))
```

```
Training set mean by feature:
[ -5.93584587e-17  -4.17707673e-17   7.03507659e-17  -4.83661516e-17
  -1.73678453e-16  -3.03387678e-16  -3.15479216e-16   0.00000000e+00
   9.45338417e-17  -4.39692287e-17   3.36364600e-16  -3.14379985e-16
   2.02258452e-16]
Training set standard deviation by feature:
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
```

As expected, they are equal to zero and one, respectively. With the data scaled, we can now fit a linear regression model to the training data using the 'LinearRegression' class from scikit-learn.

```
from sklearn.linear_model import LinearRegression
regression_model = LinearRegression()
regression_model.fit(X_train,y_train)
```

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### Interpreting The Coefficients

It was mentioned above that the linear regression model assumes that the median home price is a linear combination of the various features, with coefficients determined by calling the 'fit' method. The LinearRegression model object stores those values for us, so let's take a look.

```
intercept = regression_model.intercept_
coef = pd.DataFrame(regression_model.coef_, index=boston.feature_names, columns=['Coefficients'])
print('Intercept = %f\n' % intercept)
print(coef)
```

Intercept = 22452.475248

| | Coefficients |
|---|---|
| CRIM | -797.292486 |
| ZN | 1076.530798 |
| INDUS | -300.966686 |
| CHAS | 694.329854 |
| NOX | -1729.254032 |
| RM | 2761.795061 |
| AGE | -403.233095 |
| DIS | -3223.486941 |
| RAD | 2720.184752 |
| TAX | -1947.419925 |
| PTRATIO | -1916.685459 |
| B | 1092.865681 |
| LSTAT | -3325.234011 |

There is a lot to be learned by studying these coefficients. First, there's the intercept term (*β*_{0}), which is equal to the mean home price among all suburbs in the training data set when all of the features are set equal to their mean values (which are all zero after standardization).

Another important detail is the sign of the coefficients. A positive coefficient means that the median home price increases as the corresponding feature increases. On the other hand, a negative coefficient means that the median home price decreases as the corresponding feature increases. As a first order check, let's see if some of these values make sense:

- CRIM - An increase in the crime rate corresponds to a decrease in median home price
- RM - An increase in the average number of rooms per home corresponds to an increase in median home price
- AGE - An increase in the proportion of houses built before 1940 corresponds to a decrease in median home price
- RAD - An increase in accessibility to radial highways corresponds to an increase in median home price
- PTRATIO - An increase in the pupil-teacher ratio (meaning more students in each class) corresponds to a decrease in median home price

All of these trends make intuitive sense. In addition, because we scaled our data, each coefficient can be interpreted as the average effect on the median price given a one unit increase in the corresponding feature while holding all other features fixed. In that sense, we can see that factors like the number of rooms per home (RM) and access to highways (RAD) have the largest positive effect on median home price, while factors like the distance to local employment centers (DIS) and percent of the population that qualifies as 'lower status' (LSTAT) have the largest negative effect.

### Testing the Model on New Data

Now that we've built our model, we can check its performance on the test data set we set aside earlier. We do that by using the 'predict' function within the LinearRegression model class. In addition, let's compute the RMSE on the test data using the formula shown earlier. Not surprisingly, scikit-learn has a built in function for that as well. In the code below, 'y_pred' contains the predicted home prices, and 'y_test' contains the true values (labels) from the test data set.

```
from sklearn.metrics import mean_squared_error
import numpy as np
y_pred = regression_model.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print('Test RMSE: %f' % test_rmse)
```

Test RMSE: 4752.805517

This value means that, on average, the error in the predicted median price is approximately $4,800. Given that the home prices range from $5,000 to $50,000, this is a non-trivial difference. Two possible reasons for this difference include:

- The relationship between the features and response is not perfectly linear (this is most certainly true).
- Some of the features we included were not actually correlated with the median price. Adding additional complexity without improving the model can lead to what is known as overfitting, where the model performs well on the training data but does not generalize well to new data.

There are other potential sources of error, but those have to do with the specific assumptions regarding linear regression models (such as multicollinearity between features, the presence of heteroscedasticity, and the distribution of the residuals between the predicted and actual prices) and are beyond the scope of this tutorial. However, there is one plot we can make which will give us a sense of how well our model fit the data, and that is a plot of the predicted versus actual home prices. The red line has a slope of one, and represents the line where the predicted price would be identical to the actual price.

```
import matplotlib.pyplot as plt
plt.scatter(y_test, y_pred)
# Line of perfect prediction: predicted price equals actual price.
plt.plot([0, 50000], [0, 50000], 'r', lw=2)
plt.xlabel('Actual Price (Dollars)')
plt.ylabel('Predicted Price (Dollars)')
plt.show()
```

If our model had a test RMSE of zero, we would expect every blue dot to land perfectly on the red line. This is not the case of course, and this plot tells us that our model tends to under predict home prices at the lower and higher ends of the price range, while prices in the middle are somewhat equally distributed above and below the perfect fit line.

Overall, our model did a satisfactory job of predicting the median home price, especially for a first effort. Plus, we learned which features are most influential, and which contribute to an increase or decrease in median home price, which can be just as valuable as being able to predict the price itself.

## Classification Example

Our classification example will use the UCI ML Breast Cancer Wisconsin dataset, and our goal is to predict whether a mass is benign or malignant given a set of features based on a digital image of the mass. Because our goal is interpretability, we'll use logistic regression as our model of choice. As was the case with linear regression, despite being one of the older supervised learning methods, it is still useful, and quite widely used.

Given that the name of this model is similar to linear regression, you'd be right to think that there is some similarity between the two. In this case, rather than predicting a quantitative output, we are predicting a qualitative one, specifically whether a mass of cells is benign or malignant. If we designate each one of these labels as a class, with values of 0 for malignant and 1 for benign, what we'd really like is for our model to return a probability of each mass belonging to the benign class. That is, we'd like it to output
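*p*(*X*) = Pr(*Y* = 1 | *X*)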

Where the right hand side is the conditional probability that the value of the label is equal to 1 (i.e., benign) given the particular features of the sample.

If we once again have *p* features, we can try to use the linear regression model from the previous example, in which case we end up with
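*p*(*X*) = *β*_{0} + *β*_{1}*X*_{1} + *β*_{2}*X*_{2} + ... + *β*_{p}*X*_{p}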

The problem here is that we need our estimates to be valid probabilities (i.e., between 0 and 1), but as we saw, the right hand side outputs continuous values over a wide range. What we need is a function that will always return values between 0 and 1, and the logistic function does just that. The logistic function is defined as
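*f*(*x*) = 1 / (1 + *e*^{-*x*})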

A plot of it is shown below:

```
x = np.linspace(-20,20,100)
y = 1/(1+np.exp(-x))
plt.plot(x,y)
plt.xlabel('x')
plt.ylabel('y')
plt.title('The Logistic Function')
plt.show()
```

This function is perfect for us, as large negative values get mapped to zero, and large positive values get mapped to one. Given that, the logistic regression model is given as
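*p*(*X*) = 1 / (1 + *e*^{-(*β*_{0} + *β*_{1}*X*_{1} + ... + *β*_{p}*X*_{p})})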

As in the previous example, the *β* terms are unknown coefficients that will be determined by our specific data set.

The default threshold for classification is p(X) = 0.5; note that in the graph above, y, or p(X), equals 0.5 when x equals 0. This means that when the expression inside the exponent of the equation above is greater than 0 (corresponding to p(X) > 0.5), the mass is classified as benign (class 1), and when it's less than zero (corresponding to p(X) < 0.5), the mass is classified as malignant (class 0). This default threshold may not always be appropriate; in a case like this one, for example, you might want to classify masses as malignant using a lower threshold. For this example, however, we're going to stick with the default.
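As a sketch of how a different threshold could be applied (the one-feature toy data, the fitted classifier `clf`, and the 0.3 cutoff here are all illustrative assumptions, not part of the breast cancer example below):

```
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative toy data: one feature, binary labels.
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# Probability of class 1 for each sample.
probs = clf.predict_proba(X)[:, 1]

# Default threshold of 0.5 (what predict() uses)...
default_preds = (probs > 0.5).astype(int)

# ...versus a lower threshold that flags class 1 sooner.
low_threshold_preds = (probs > 0.3).astype(int)

print(default_preds)
print(low_threshold_preds)
```

Lowering the threshold can only add positive (class 1) predictions, which trades false negatives for false positives.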

Before we import our data, let's address the same two questions we asked in the regression example.

### How are the *β*'s determined?

In this case, the coefficients are determined using a method called *maximum likelihood*. Although the details are beyond the scope of this tutorial, the method works by finding the values of the *β*'s such that the output of the model is close to zero for all malignant class examples, and close to one for all benign class examples. The *β*'s are chosen such that they maximize what is known as the *likelihood function*.
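Concretely, for *n* training samples with labels *y*_{i} ∈ {0, 1}, the likelihood function being maximized is

ℓ(*β*) = Π_{i: y_{i}=1} *p*(*x*_{i}) × Π_{i: y_{i}=0} (1 - *p*(*x*_{i}))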

### How do we measure the accuracy of our model?

It was relatively straightforward to determine the accuracy of our model in the regression example. For classification, things get a bit more complicated. There are many different ways to measure the accuracy of a classifier, and which metric to use depends on the specific problem. The most basic approach is to measure the overall accuracy (the complement of the error rate), which is simply the fraction of correct classifications
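(1/*n*) Σ_{i=1}^{n} *I*(*y*_{i} = *ŷ*_{i})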

Here, *n* again refers to the number of samples in our data set. The function inside the summation simply counts the number of samples for which the class was correctly predicted. Dividing by the number of samples converts this count into a fraction, which can be interpreted as the accuracy of the classifier.

Another way to measure accuracy requires a more specific definition of correct and incorrect predictions. Consider the following terms:

- A true positive classification is one where we correctly predicted that a sample belonged to the positive class (in this case, we'll call the malignant class positive).
- A true negative classification is one where we correctly predicted that a sample belonged to the negative class (in this case, we'll call the benign class negative).
- A false positive classification is one where we incorrectly predicted that a sample belonged to the positive class (in this case, we said the mass was malignant when it was actually benign).
- A false negative classification is one where we incorrectly predicted that a sample belonged to the negative class (in this case, we said the mass was benign when it was actually malignant).

Depending on the problem, you may be more concerned with tracking the number of false positives or false negatives, rather than the overall accuracy. The accuracy metric assumes that true positive and true negative classifications are equally important. In many cases, including fraud detection and cancer diagnoses, false negatives are much more dangerous than false positives.

With these new terms defined, we can compute what is known as the confusion matrix for our classifier. For a binary classifier such as the one we're going to create, the confusion matrix lists the total count of each of the four types of classifications after a set of predictions has been made. From there, a variety of metrics can be calculated depending on the problem.
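A minimal sketch of computing a confusion matrix with scikit-learn's `confusion_matrix` function (the label vectors here are made up for illustration; note that scikit-learn orders the rows and columns by class value, so with classes 0 and 1 the layout is true class down the rows and predicted class across the columns):

```
from sklearn.metrics import confusion_matrix

# Illustrative true and predicted class labels (0 or 1).
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# With labels ordered [0, 1], the matrix is laid out as
# [[true 0 predicted 0, true 0 predicted 1],
#  [true 1 predicted 0, true 1 predicted 1]]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
```

Which cells count as 'true positive' and 'false negative' depends on which class you designate as positive, which is exactly the bookkeeping we'll need to do for the breast cancer data.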

```
from sklearn.datasets import load_breast_cancer
breast_cancer_data = load_breast_cancer()
```

As before, let's turn the data set into a pandas data frame, and label the columns using the 'feature_names' property of the dataset.

```
bc = pd.DataFrame(breast_cancer_data.data)
bc.columns = breast_cancer_data.feature_names
```

We can use the shape attribute to see the size of the data frame.

`bc.shape`

(569, 30)

For this data set, each row represents a different digital image of a mass, and there are 569 total images in the dataset. Each column represents a different feature, and there are 30 features for each mass. We can look at the first few rows in the data frame using the head() function.

`bc.head()`

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | ... | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |

5 rows × 30 columns

As noted earlier, there are 30 features for each mass. These features relate to the description of the mass based on the digital image. Some of the features describe characteristics like:

- Radius
- Texture
- Perimeter
- Area
- Smoothness
- Symmetry

As before, the labels (class) are not one of the features. We can add a new column to our data frame using the same method as before.

`bc['class'] = breast_cancer_data.target`

Let's take a look at the class counts, which correspond to the number of benign and malignant masses. We can do this using the 'value_counts' function within pandas.

`pd.value_counts(bc['class'])`

```
1    357
0    212
Name: class, dtype: int64
```

There are 212 malignant masses (class 0), and 357 benign masses (class 1) in our data set. We need to be careful when calculating our confusion matrix, in terms of which class is considered positive and negative, but we'll address that when the time comes.

Let's look at some basic statistics, using the describe() function as before.

`bc.describe().transpose()`

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| mean radius | 569.0 | 14.127292 | 3.524049 | 6.981000 | 11.700000 | 13.370000 | 15.780000 | 28.11000 |
| mean texture | 569.0 | 19.289649 | 4.301036 | 9.710000 | 16.170000 | 18.840000 | 21.800000 | 39.28000 |
| mean perimeter | 569.0 | 91.969033 | 24.298981 | 43.790000 | 75.170000 | 86.240000 | 104.100000 | 188.50000 |
| mean area | 569.0 | 654.889104 | 351.914129 | 143.500000 | 420.300000 | 551.100000 | 782.700000 | 2501.00000 |
| mean smoothness | 569.0 | 0.096360 | 0.014064 | 0.052630 | 0.086370 | 0.095870 | 0.105300 | 0.16340 |
| mean compactness | 569.0 | 0.104341 | 0.052813 | 0.019380 | 0.064920 | 0.092630 | 0.130400 | 0.34540 |
| mean concavity | 569.0 | 0.088799 | 0.079720 | 0.000000 | 0.029560 | 0.061540 | 0.130700 | 0.42680 |
| mean concave points | 569.0 | 0.048919 | 0.038803 | 0.000000 | 0.020310 | 0.033500 | 0.074000 | 0.20120 |
| mean symmetry | 569.0 | 0.181162 | 0.027414 | 0.106000 | 0.161900 | 0.179200 | 0.195700 | 0.30400 |
| mean fractal dimension | 569.0 | 0.062798 | 0.007060 | 0.049960 | 0.057700 | 0.061540 | 0.066120 | 0.09744 |
| radius error | 569.0 | 0.405172 | 0.277313 | 0.111500 | 0.232400 | 0.324200 | 0.478900 | 2.87300 |
| texture error | 569.0 | 1.216853 | 0.551648 | 0.360200 | 0.833900 | 1.108000 | 1.474000 | 4.88500 |
| perimeter error | 569.0 | 2.866059 | 2.021855 | 0.757000 | 1.606000 | 2.287000 | 3.357000 | 21.98000 |
| area error | 569.0 | 40.337079 | 45.491006 | 6.802000 | 17.850000 | 24.530000 | 45.190000 | 542.20000 |
| smoothness error | 569.0 | 0.007041 | 0.003003 | 0.001713 | 0.005169 | 0.006380 | 0.008146 | 0.03113 |
| compactness error | 569.0 | 0.025478 | 0.017908 | 0.002252 | 0.013080 | 0.020450 | 0.032450 | 0.13540 |
| concavity error | 569.0 | 0.031894 | 0.030186 | 0.000000 | 0.015090 | 0.025890 | 0.042050 | 0.39600 |
| concave points error | 569.0 | 0.011796 | 0.006170 | 0.000000 | 0.007638 | 0.010930 | 0.014710 | 0.05279 |
| symmetry error | 569.0 | 0.020542 | 0.008266 | 0.007882 | 0.015160 | 0.018730 | 0.023480 | 0.07895 |
| fractal dimension error | 569.0 | 0.003795 | 0.002646 | 0.000895 | 0.002248 | 0.003187 | 0.004558 | 0.02984 |
| worst radius | 569.0 | 16.269190 | 4.833242 | 7.930000 | 13.010000 | 14.970000 | 18.790000 | 36.04000 |
| worst texture | 569.0 | 25.677223 | 6.146258 | 12.020000 | 21.080000 | 25.410000 | 29.720000 | 49.54000 |
| worst perimeter | 569.0 | 107.261213 | 33.602542 | 50.410000 | 84.110000 | 97.660000 | 125.400000 | 251.20000 |
| worst area | 569.0 | 880.583128 | 569.356993 | 185.200000 | 515.300000 | 686.500000 | 1084.000000 | 4254.00000 |
| worst smoothness | 569.0 | 0.132369 | 0.022832 | 0.071170 | 0.116600 | 0.131300 | 0.146000 | 0.22260 |
| worst compactness | 569.0 | 0.254265 | 0.157336 | 0.027290 | 0.147200 | 0.211900 | 0.339100 | 1.05800 |
| worst concavity | 569.0 | 0.272188 | 0.208624 | 0.000000 | 0.114500 | 0.226700 | 0.382900 | 1.25200 |
| worst concave points | 569.0 | 0.114606 | 0.065732 | 0.000000 | 0.064930 | 0.099930 | 0.161400 | 0.29100 |
| worst symmetry | 569.0 | 0.290076 | 0.061867 | 0.156500 | 0.250400 | 0.282200 | 0.317900 | 0.66380 |
| worst fractal dimension | 569.0 | 0.083946 | 0.018061 | 0.055040 | 0.071460 | 0.080040 | 0.092080 | 0.20750 |
| class | 569.0 | 0.627417 | 0.483918 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.00000 |

As with the regression example, many of the features have different scales. This time, it is very important that we scale our features. The reason has to do with the specifics of the logistic regression model in scikit-learn: the model performs regularization by default, which helps control overfitting (described in the regression section) by penalizing large coefficients. Because that penalty depends on the magnitudes of the coefficients, features measured on very different scales would be penalized inconsistently unless everything is standardized first. Before we scale the features, let's briefly discuss classifier decision boundaries.

### Classifier Decision Boundaries

As discussed above, the ultimate goal in classification is to correctly predict which class each sample belongs to. This is equivalent to defining a geometric boundary where samples are classified depending on which side of the boundary they fall. This can be made more clear using an example from our data set. Consider the figure below, which plots the 'mean radius' feature with each sample colored by class (benign samples are yellow, malignant are purple).

```
# Plot the 'mean radius' of every sample, colored by class
# (purple = malignant (0), yellow = benign (1)).
import matplotlib.pyplot as plt

x = range(len(bc['mean radius']))
y = bc['mean radius']
plt.scatter(x, y, c=bc['class'])
plt.xlabel('sample')
plt.ylabel('mean radius')
plt.show()
```

If we were trying to classify a sample using just this feature, a good boundary could be a value of 12.5 for the mean radius. Any sample with a mean radius less than 12.5 is classified as benign, and any sample with a mean radius greater than 12.5 is classified as malignant. It's not a perfect classification, but it's a start. Since we're using not one but 30 features, our classifier will create an analogous boundary in higher-dimensional space to separate the samples. It could be that not all features are useful in separating the classes, which is something we could investigate once our model has been built.
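We can put a rough number on how well that one-feature rule does. Here is a quick sketch, loading the data set directly from scikit-learn; the 12.5 cutoff is just the value eyeballed from the plot, not an optimized threshold.

```
# Score the simple one-feature rule: predict benign (1) when mean radius
# is below 12.5, malignant (0) otherwise. The 12.5 cutoff is eyeballed
# from the figure, not tuned.
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
mean_radius = data.data[:, 0]                 # 'mean radius' is the first feature
rule_pred = (mean_radius < 12.5).astype(int)  # 1 = benign, 0 = malignant
rule_acc = (rule_pred == data.target).mean()
print('One-feature rule accuracy: %.1f%%' % (rule_acc * 100))
```

The rule does noticeably better than chance but well short of what the full 30-feature model achieves below, which is exactly why we feed the classifier all the features.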

```
# Features are every column except the last; the label is the 'class' column.
X = bc.iloc[:,:-1]
y = bc['class']
```

```
# Split the data into 80% training and 20% testing.
# The random_state allows us to make the same random split every time.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=327)
print('Training data size: (%i,%i)' % X_train.shape)
print('Testing data size: (%i,%i)' % X_test.shape)
```


Training data size: (455,30)

Testing data size: (114,30)

```
# Standardize each feature to zero mean and unit variance. The scaler is
# fit on the training data only, then applied to both sets.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

Let's double check to make sure the scaling worked as intended.

```
print('Training set mean by feature:')
print(X_train.mean(axis=0))
print('Training set standard deviation by feature:')
print(X_train.std(axis=0))
```

Training set mean by feature: [ -2.75628116e-15 4.59705534e-16 -5.40227204e-16 6.52957542e-16 6.25189766e-15 -3.13644105e-15 -2.56205313e-16 -1.11607915e-15 -3.54783358e-15 3.99192279e-15 8.92387507e-16 -4.62389589e-16 -1.18725237e-15 -9.41005515e-16 1.29859493e-15 1.24088773e-15 7.30795156e-16 -1.70315532e-16 -2.79580998e-15 -3.99436284e-16 -1.29859493e-15 4.74785046e-15 -8.00824608e-16 -5.39495188e-16 4.93427033e-15 1.86383265e-15 -9.58451877e-16 -1.65923441e-17 -2.49739179e-15 -1.38546073e-15]

Training set standard deviation by feature: [ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]

Every feature now has a mean of effectively zero (on the order of 1e-15, i.e., floating-point noise) and a standard deviation of one, so the scaling worked as intended.
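Before fitting, it's worth sanity-checking the earlier claim that scaling matters for a regularized model. The sketch below fits the same logistic regression with and without standardization and compares test accuracy; it uses its own illustrative split (`random_state=0`), independent of the one in the main example.

```
# Fit the same regularized logistic regression on raw and on standardized
# features and compare test accuracy. The split here (random_state=0) is
# illustrative, not the split used in the main example.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

raw_model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)

scaler = StandardScaler()
scaled_model = LogisticRegression(max_iter=5000).fit(scaler.fit_transform(X_tr), y_tr)

print('Unscaled accuracy: %.4f' % raw_model.score(X_te, y_te))
print('Scaled accuracy:   %.4f' % scaled_model.score(scaler.transform(X_te), y_te))
```

Beyond accuracy, the unscaled fit also takes far more solver iterations to converge, which is another practical reason to standardize.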

```
from sklearn.linear_model import LogisticRegression
regression_model = LogisticRegression()
regression_model.fit(X_train,y_train)
```

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

```
intercept = regression_model.intercept_
coef = pd.DataFrame(regression_model.coef_.transpose(),
                    index=breast_cancer_data.feature_names,
                    columns=['Coefficients'])
print('Intercept = %f\n' % intercept)
print(coef)
```

Intercept = 0.493176

| | Coefficients |
|---|---|
| mean radius | -0.620034 |
| mean texture | -0.504767 |
| mean perimeter | -0.564625 |
| mean area | -0.646708 |
| mean smoothness | -0.131364 |
| mean compactness | 0.460890 |
| mean concavity | -0.800368 |
| mean concave points | -0.771655 |
| mean symmetry | 0.168892 |
| mean fractal dimension | 0.553119 |
| radius error | -1.144201 |
| texture error | 0.337015 |
| perimeter error | -0.569919 |
| area error | -0.946128 |
| smoothness error | -0.405995 |
| compactness error | 0.610680 |
| concavity error | -0.154262 |
| concave points error | 0.078581 |
| symmetry error | 0.249923 |
| fractal dimension error | 0.731013 |
| worst radius | -1.150628 |
| worst texture | -1.192783 |
| worst perimeter | -0.880301 |
| worst area | -1.086891 |
| worst smoothness | -0.994301 |
| worst compactness | -0.007784 |
| worst concavity | -0.752380 |
| worst concave points | -0.752560 |
| worst symmetry | -0.703783 |
| worst fractal dimension | -0.449378 |

We have to be a bit more careful when interpreting the coefficients of a logistic regression model. We want to be able to articulate the effect of each coefficient on the output, and recall that our model outputs a probability of the form

*p*(X) = e^(β₀ + β₁X₁ + ⋯ + β₃₀X₃₀) / (1 + e^(β₀ + β₁X₁ + ⋯ + β₃₀X₃₀))

In this form, it is hard to specify the effect of each coefficient on the output probability. What we really want is an expression involving only the coefficients and features on the right hand side. With a bit of manipulation, we can express the equation for our model as

*p*(X) / (1 − *p*(X)) = e^(β₀ + β₁X₁ + ⋯ + β₃₀X₃₀)

The quantity on the left hand side is called the odds, and is defined as the ratio between the probability of a given event occurring and not occurring. Unlike probability, odds can range from 0 to infinity. Events with higher probabilities have higher odds, and vice versa. If we take the log of the above equation, we arrive at the following

log[*p*(X) / (1 − *p*(X))] = β₀ + β₁X₁ + ⋯ + β₃₀X₃₀

The expression on the left hand side is called the *log-odds*, and this is what we were after, as the right side involves only the coefficients and features. Now, we see how to properly interpret the effect of each coefficient on the output. We can say that a one-unit increase in a given feature (while holding all other features fixed) increases the log-odds by an amount equal to its particular coefficient.
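We can verify this relationship numerically: in scikit-learn, `decision_function` returns exactly this linear combination, i.e., the log-odds. The sketch below refits an illustrative model on the full scaled data set rather than reusing the split from the main example.

```
# Check that the log-odds equals intercept + sum(coefficient * feature),
# and that the predicted probability is the logistic transform of it.
# This refits an illustrative model on the full (scaled) data set.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
Xs = StandardScaler().fit_transform(X)
model = LogisticRegression(max_iter=5000).fit(Xs, y)

log_odds = model.decision_function(Xs)                   # sklearn's linear term
manual = Xs @ model.coef_.ravel() + model.intercept_[0]  # beta_0 + sum(beta_i * x_i)
print('Max log-odds difference:', np.abs(log_odds - manual).max())

p = 1.0 / (1.0 + np.exp(-log_odds))                      # logistic transform
print('Max probability difference:',
      np.abs(p - model.predict_proba(Xs)[:, 1]).max())
```

Both differences are at floating-point level, confirming that the model's probability really is the logistic transform of the linear log-odds expression above.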

Although the actual change in p(X) caused by a one-unit increase in a particular feature is harder to quantify, we can still interpret the sign of the coefficient in the same manner as before. That is, a positive coefficient is associated with an increase in the value of p(X) as the feature increases, and likewise a negative coefficient is associated with a decrease in the value of p(X) as the feature increases. Recall that since malignant masses are coded as 0, and benign as 1, an increase in p(X) corresponds to a higher chance of the mass being benign. Given that, let's take a closer look at some of our coefficients:

- Most coefficients related to measures of size (radius, perimeter, area) are negative, indicating that an increase is associated with an increased risk of a mass being classified as malignant. This suggests that larger masses may more often be malignant, which makes sense.
- The coefficient for mean compactness is positive, indicating that an increase in mean compactness is associated with a decreased risk of a mass being classified as malignant (though note that the coefficient for worst compactness is slightly negative).

### Testing the Model on New Data

Now that we've built our model, we can check its performance on the test data set we set aside earlier. As before, we will do that by using the 'predict' function. In addition, let's compute the accuracy on the test data using the formula shown earlier. Not surprisingly, scikit-learn has a built-in function for that as well. In the code below, 'y_pred' contains the predicted class, and 'y_test' contains the true class (labels) from the test data set.

```
from sklearn.metrics import accuracy_score
y_pred = regression_model.predict(X_test)
test_acc = accuracy_score(y_test,y_pred)*100
print('The test set accuracy is %4.2f%%' % test_acc)
```

The test set accuracy is 96.49%

This is an impressive result, over 95% accuracy using a relatively simple model with the default parameters. Recall our discussion earlier regarding the different ways of measuring accuracy in a classification model. We introduced something known as a confusion matrix, so let's go ahead and print it below using the 'confusion_matrix' function from scikit-learn's 'metrics' module.

```
from sklearn.metrics import confusion_matrix
# Benign (1) is the negative class here, so list it first to place the
# true negatives in the upper left of the matrix.
conf_matrix = confusion_matrix(y_test, y_pred, labels=[1,0])
print(conf_matrix)
```

[[66  1]
 [ 3 44]]

The confusion matrix is interpreted as follows:

- The upper left term contains the number of true negatives in the test set. A true negative is where both the predicted and actual label was negative (benign). In our case, there were 66 true negatives.
- The lower right term contains the number of true positives in the test set. A true positive is where both the predicted and actual label was positive (malignant). In our case, there were 44 true positives.
- The lower left term contains the number of false negatives in the test set. A false negative is where the predicted label was negative (benign), but the actual label was positive (malignant). In our case, there were 3 false negatives, which are quite dangerous in this context.
- The upper right term contains the number of false positives in the test set. A false positive is where the predicted label was positive (malignant), but the actual label was negative (benign). In our case, there was only 1 false positive. While stressful for a patient (until repeated tests can be performed), false positives are far less dangerous than false negatives.

Let's assign the various components to variables and calculate a few more metrics.

```
# With labels=[1,0], row/column 0 corresponds to benign (negative class)
# and row/column 1 to malignant (positive class).
# True negatives (benign predicted as benign)
TN = conf_matrix[0][0]
# True positives (malignant predicted as malignant)
TP = conf_matrix[1][1]
# False negatives (malignant predicted as benign)
FN = conf_matrix[1][0]
# False positives (benign predicted as malignant)
FP = conf_matrix[0][1]
```

The sensitivity, also known as recall, or true positive rate (TPR), is given as

TPR = TP / (TP + FN)

A way to interpret the sensitivity is that, out of all positive results (a combination of true positive and false negatives), how many did we correctly predict? In our case, the sensitivity is equal to

```
TPR = float(TP)/(TP+FN)
print('TPR = %4.2f%%' % (TPR*100))
```

TPR = 93.62%

The specificity, also known as the true negative rate (TNR), is given as

TNR = TN / (TN + FP)

A way to interpret the specificity is that, out of all negative results (a combination of true negatives and false positives), how many did we correctly predict? In our case, the specificity is equal to

```
TNR = float(TN)/(TN+FP)
print('TNR = %4.2f%%' % (TNR*100))
```

TNR = 98.51%

The precision, or positive predictive value (PPV), is given as

PPV = TP / (TP + FP)

A way to interpret the precision is that, out of all the results we said were positive (a combination of true positives and false positives), how many did we correctly predict? In our case, the precision is equal to

```
PPV = float(TP)/(TP+FP)
print('PPV = %4.2f%%' % (PPV*100))
```

PPV = 97.78%

Finally, the negative predictive value (NPV) is given as

NPV = TN / (TN + FN)

A way to interpret the NPV is that, out of all the results we said were negative (a combination of true negatives and false negatives), how many did we correctly predict? In our case, the NPV is equal to

```
NPV = float(TN)/(TN+FN)
print('NPV = %4.2f%%' % (NPV*100))
```

NPV = 95.65%
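These by-hand calculations can be cross-checked against scikit-learn's built-in metrics, `recall_score` and `precision_score`, with malignant (class 0) passed as `pos_label`. The sketch below refits the model from scratch; exact numbers can differ slightly from the run above depending on the scikit-learn version and solver.

```
# Cross-check the hand-computed metrics with scikit-learn's built-ins.
# Malignant (class 0) is the positive class here, hence pos_label=0.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, precision_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=327)
scaler = StandardScaler()
model = LogisticRegression(max_iter=1000).fit(scaler.fit_transform(X_tr), y_tr)
y_pred = model.predict(scaler.transform(X_te))

print('Sensitivity (TPR): %4.2f%%' % (recall_score(y_te, y_pred, pos_label=0) * 100))
print('Precision (PPV):   %4.2f%%' % (precision_score(y_te, y_pred, pos_label=0) * 100))
# Specificity is simply recall with the benign class treated as positive:
print('Specificity (TNR): %4.2f%%' % (recall_score(y_te, y_pred, pos_label=1) * 100))
```

Being explicit about `pos_label` matters: with the default (`pos_label=1`), `recall_score` would report recall for the benign class, which is our specificity, not our sensitivity.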

As you can see, calculating the accuracy of the model doesn't always tell the whole story. Because we had three false negatives, our TPR and NPV were lower than our accuracy. On the other hand, because we only had one false positive, our TNR and PPV were higher. Some tweaking of the model could help adjust these values; for example, we may want to decrease the number of false negatives. This could be done by adjusting the parameters of the logistic regression model (we used the defaults) or the decision threshold itself (recall that the default probability threshold of 0.5 may not be appropriate in this case).
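As a sketch of that last idea: instead of `predict` (which applies a 0.5 cutoff to the benign probability), we can threshold `predict_proba` ourselves and only call a mass benign when the model is quite confident. The 0.8 cutoff below is purely illustrative, not a tuned value.

```
# Require p(benign) >= 0.8 (instead of the default 0.5) before predicting
# benign. This can only reduce false negatives (malignant called benign),
# at the cost of possibly more false positives. 0.8 is illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=327)
scaler = StandardScaler()
model = LogisticRegression(max_iter=1000).fit(scaler.fit_transform(X_tr), y_tr)

X_te_s = scaler.transform(X_te)
p_benign = model.predict_proba(X_te_s)[:, 1]
y_default = model.predict(X_te_s)             # implicit 0.5 threshold
y_strict = (p_benign >= 0.8).astype(int)      # stricter bar for benign

cm_default = confusion_matrix(y_te, y_default, labels=[1, 0])
cm_strict = confusion_matrix(y_te, y_strict, labels=[1, 0])
print('Default threshold (0.5):\n', cm_default)
print('Stricter threshold (0.8):\n', cm_strict)
```

Raising the bar for a benign call shrinks the set of benign predictions, so the false-negative count (lower left entry) can only stay the same or drop, while false positives may rise. Picking the threshold is then a domain decision about which error is worse.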

## Conclusion

Hopefully this tutorial served as a good introduction to supervised learning with Python. We were introduced to the scikit-learn package, which provides a lot of very powerful machine learning related functionality, and is a great place to start experimenting. We also saw the pandas package, which supports flexible data structures and is designed to make working with datasets easy. Along the way, we learned a bit of theory regarding model selection, some tradeoffs to consider when using different models, the importance of keeping separate training and testing sets, and why it's usually a good idea to scale your data. In addition, we got to see examples of both regression and classification problems, ways to evaluate the performance of each, and how the results could be improved.