Let's load the data into a `pandas` dataframe using `urlopen` from the `urllib.request` module. Instead of downloading a `csv` file, I grabbed the data straight from the UCI Machine Learning Repository with an HTTP request, a method inspired by Python tutorials from the University of California, Santa Barbara's data science course.

```
# Loading data and cleaning dataset
import pandas as pd
from urllib.request import urlopen

UCI_data_URL = ('https://archive.ics.uci.edu/ml/machine-learning-databases'
                '/breast-cancer-wisconsin/wdbc.data')
```

I recommend that you keep a static file for your data set as well.

Now, create a list with the appropriate names and set them as the data frame's column names. Then load the data into a pandas data frame.

```
names = ['id_number', 'diagnosis', 'radius_mean',
         'texture_mean', 'perimeter_mean', 'area_mean',
         'smoothness_mean', 'compactness_mean',
         'concavity_mean', 'concave_points_mean',
         'symmetry_mean', 'fractal_dimension_mean',
         'radius_se', 'texture_se', 'perimeter_se',
         'area_se', 'smoothness_se', 'compactness_se',
         'concavity_se', 'concave_points_se',
         'symmetry_se', 'fractal_dimension_se',
         'radius_worst', 'texture_worst',
         'perimeter_worst', 'area_worst',
         'smoothness_worst', 'compactness_worst',
         'concavity_worst', 'concave_points_worst',
         'symmetry_worst', 'fractal_dimension_worst']
dx = ['Benign', 'Malignant']
```

`breast_cancer = pd.read_csv(urlopen(UCI_data_URL), names=names)`
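Following the earlier advice about keeping a static file, here is a minimal sketch of saving and reloading a local copy. It uses a small hypothetical stand-in frame (so it doesn't re-download the full data set) and an assumed filename `wdbc_local.csv`:

```python
import pandas as pd

# Small stand-in frame (hypothetical values) in place of the full UCI download
breast_cancer = pd.DataFrame({'id_number': [8510426, 8510653],
                              'diagnosis': ['B', 'B'],
                              'radius_mean': [13.54, 13.08]})

# Write a static copy so later runs don't depend on the UCI server
breast_cancer.to_csv('wdbc_local.csv', index=False)

# Reload from disk instead of urlopen
local_copy = pd.read_csv('wdbc_local.csv')
```

On subsequent runs you can read `wdbc_local.csv` directly and skip the network call entirely.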

## Cleaning

You'll need to do some minor cleaning, such as setting the `id_number` as the data frame index and converting the `diagnosis` column to the standard binary 1/0 representation using the `map()` function.

```
# Setting 'id_number' as our index
breast_cancer.set_index(['id_number'], inplace=True)
# Convert to binary to help later on with models and plots
breast_cancer['diagnosis'] = breast_cancer['diagnosis'].map({'M': 1, 'B': 0})
```
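To illustrate what `map()` is doing here, a quick sketch on a toy series. One design note: any label not present in the mapping dictionary becomes `NaN`, which is worth checking for if you're unsure of the raw labels.

```python
import pandas as pd

# Toy diagnosis labels (hypothetical)
s = pd.Series(['M', 'B', 'B', 'M'])

# map() replaces each label via the dictionary; unmapped labels become NaN
mapped = s.map({'M': 1, 'B': 0})
```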

## Missing Values

Given the context of the data set, I know there is no missing data. Even so, let's run a for loop to check whether any columns contain missing values. For any column with missing values, the loop prints the column name and the total number of missing values in that column.

```
for col in breast_cancer:
    if breast_cancer[col].isnull().values.ravel().sum() == 0:
        pass
    else:
        print(col)
        print(breast_cancer[col].isnull().values.ravel().sum())

print('Sanity check! No missing values found!')
```
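For what it's worth, pandas can produce the same per-column missing-value counts without an explicit loop. A minimal sketch on a toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame with a single missing value (hypothetical)
df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [2.0, 3.0, 4.0]})

# isnull().sum() counts missing values per column in one vectorized call
missing_per_column = df.isnull().sum()
```

This returns a Series indexed by column name, so `missing_per_column['a']` gives the count for column `a`.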

Sanity check! No missing values found! Next, let's slice off the feature names (dropping `id_number` and `diagnosis`) for later use in the random forest model, where the `id_number` won't be relevant.

```
# For later use in CART models
names_index = names[2:]
```

Let's preview the data set using the `head()` function, which returns the first five rows of our data frame.

`breast_cancer.head()`

Next, let's get the dimensions of the data set. The first value is the number of patients and the second value is the number of features. It's also important to print the data types of your columns, as they are often an indicator of missing data and provide context for additional data cleaning.

```
print("Here's the dimensions of our data frame:\n",
      breast_cancer.shape)
print("Here's the data types of our columns:\n",
      breast_cancer.dtypes)
```

Here are the dimensions of our data frame: (569, 31)

Here are the data types of our columns:

```
diagnosis                    int64
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave_points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave_points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave_points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
dtype: object
```

## Class Imbalance

The distribution of diagnoses is important because it speaks to class imbalance within machine learning and data mining applications. Class imbalance is a term used to describe when a target class within a data set is outnumbered by another target class (or classes). This can create misleading accuracy metrics, known as an accuracy paradox. To make sure our target classes aren't imbalanced, create a function that will output the distribution of the target classes.

**Note:** If your data set suffers from class imbalance, I suggest reading up on upsampling and downsampling.
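As a rough sketch of what upsampling looks like (on a small hypothetical imbalanced frame): the minority class is resampled with replacement until its count matches the majority class. `random_state` is set only for reproducibility.

```python
import pandas as pd

# Hypothetical imbalanced frame: 4 benign (0) vs. 2 malignant (1)
df = pd.DataFrame({'diagnosis': [0, 0, 0, 0, 1, 1],
                   'radius_mean': [12.0, 13.1, 11.8, 12.5, 18.2, 19.0]})

majority = df[df['diagnosis'] == 0]
minority = df[df['diagnosis'] == 1]

# Upsample: draw minority rows with replacement to match the majority count
minority_upsampled = minority.sample(n=len(majority), replace=True,
                                     random_state=42)
balanced = pd.concat([majority, minority_upsampled])
```

Downsampling is the mirror image: sample the majority class (without replacement) down to the minority count. Either way, resample only the training split, never the test set.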

```
def print_dx_perc(data_frame, col):
    """Print the class distribution for our data set."""
    dx_vals = data_frame[col].value_counts()
    dx_vals = dx_vals.reset_index()
    # Lambda converting a single count into a percentage of the total
    f = lambda x, y: 100 * (x / sum(y))
    # Relies on the global `dx` label list defined earlier
    for i in range(0, len(dx)):
        print('{0} accounts for {1:.2f}% of the diagnosis class'
              .format(dx[i], f(dx_vals[col].iloc[i], dx_vals[col])))
```

`print_dx_perc(breast_cancer, 'diagnosis')`

Benign results account for 62.74% of the diagnosis class. Malignant results account for 37.26% of the diagnosis class. Fortunately, this data set does not suffer from class imbalance.

Next, we will employ a function that gives us standard descriptive statistics for each feature, including the mean, standard deviation, minimum, maximum, and quartiles.

`breast_cancer.describe()`

You can see from the maximum row of the output that our features vary widely in scale; this will be important as we consider classification models. Standardization is a requirement for many classification models and should be handled during pre-processing. Some models (like neural networks) can perform poorly if pre-processing isn't considered, so the `describe()` output is a good indicator of whether standardization is needed. Fortunately, random forest does not require any pre-processing. (For handling categorical data, see sklearn's Encoding Categorical Data.)
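To make the standardization point concrete, here is a minimal z-score sketch on toy columns with very different scales (scikit-learn's `StandardScaler` does the equivalent inside a pipeline):

```python
import pandas as pd

# Toy feature columns on very different scales (hypothetical values)
X = pd.DataFrame({'area_mean': [500.0, 1000.0, 1500.0],
                  'smoothness_mean': [0.08, 0.10, 0.12]})

# Z-score standardization: subtract each column's mean, divide by its
# (sample) standard deviation, giving zero mean and unit variance per column
X_std = (X - X.mean()) / X.std()
```

After this transformation both columns are on the same scale, so distance- or gradient-based models no longer let `area_mean` dominate simply because its raw values are larger.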