Prerequisites

Experience with the topic: Novice

Professional experience: None required

Programming experience: Novice level experience with Python

Python version: 3

Objectives

By the end of this tutorial, readers will learn about the following:

Introduction

Random forest is an ensemble machine learning algorithm used for classification and regression problems. It applies the technique of bagging (bootstrap aggregating) to decision tree learners. Random forest is popular for several reasons (it was the most popular machine learning algorithm amongst Kagglers until XGBoost took over):

  1. Ensemble learning reduces overfitting of data
  2. Bootstrapping enables random forest to work well on relatively small datasets
  3. Predictors can be trained in parallel
  4. Decision tree learning enables automatic feature selection

To understand how random forest works, it is important to understand the techniques that the algorithm is built on. I explain them below.

Bagging

Bagging is a method of generating new datasets from existing data by creating samples of the existing data with replacement. This means there could be repeated values in each of the newly created datasets.
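
The resampling step can be sketched with NumPy as follows. This is only an illustration: the ten-element array stands in for the rows of a small dataset, and the seed and number of samples are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

data = np.arange(10)  # stand-in for the row indices of a small dataset

# Draw three bootstrap samples: same size as the original, sampled with replacement
bootstrap_samples = [rng.choice(data, size=data.size, replace=True) for _ in range(3)]

for i, sample in enumerate(bootstrap_samples):
    # Rows never drawn into a sample are "out-of-bag" for that sample
    out_of_bag = np.setdiff1d(data, sample)
    print(f"sample {i}: {np.sort(sample)}  out-of-bag: {out_of_bag}")
```

Each sample typically repeats some rows and leaves others out, so every tree in the forest sees a slightly different version of the data.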

Bagging is much of what makes random forest popular: adding more trees does not make the model overfit. This is because it averages many low-bias, high-variance predictors, which reduces the variance without a meaningful increase in bias. Consequently, random forests can often achieve high accuracy with a lower risk of overfitting than a single decision tree. Also, since multiple versions of the dataset are generated, it is possible to work with relatively small datasets.
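
The variance-reduction argument can be illustrated with a small simulation. It is a simplified sketch, not a random forest: I assume 25 unbiased predictors whose errors are independent, whereas the trees in a real forest are correlated, so the reduction in practice is smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 5.0
n_trials, n_predictors = 10_000, 25

# Each "predictor" is unbiased but noisy (standard deviation 2.0)
predictions = true_value + rng.normal(0.0, 2.0, size=(n_trials, n_predictors))

single = predictions[:, 0]            # one predictor on its own
averaged = predictions.mean(axis=1)   # average of all 25 predictors

print("variance of a single predictor:", single.var())
print("variance of the average:", averaged.var())
print("mean of the average (bias check):", averaged.mean())
```

The average stays centred on the true value while its variance is far lower than that of any single predictor.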

Decision Tree Learning

A decision tree learning algorithm can be used for classification or regression problems to help predict an outcome based on input variables. Decision trees are made of:

  • A root: The feature that best splits the dataset, placed at the top of the tree. It is selected by calculating the Gini index or information gain for all the features. There is only one root.
  • Nodes: Splitting points for decisions.
  • Branches: Split description.
  • Leaves: Final-level nodes that cannot be further split.
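
To make the Gini index mentioned above concrete, here is a minimal sketch of how the impurity of a candidate split can be scored. The helper names gini_impurity and weighted_gini are my own, and the labels are toy values rather than rows from the tutorial dataset.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

# Toy labels: a split that separates the classes well vs. one that does not
good_split = (["yes", "yes", "yes"], ["no", "no", "no"])
poor_split = (["yes", "no", "yes"], ["no", "yes", "no"])

print(weighted_gini(*good_split))
print(weighted_gini(*poor_split))
```

A lower weighted impurity means a better split, so the feature whose split scores lowest becomes the root.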

Below is a decision tree based on the data that will be used in this tutorial. As we will see later when we build the random forest model, question A5 is the strongest feature in the dataset. This is confirmed by the decision tree in the image:

[Image: decision tree built from the tutorial dataset]

Random forest is an ensemble decision tree algorithm because the final prediction combines the predictions of many individual decision trees: in a regression problem, it is the average of the trees' predictions; in classification, it is the most frequent prediction (a majority vote). So, the algorithm combines many decision trees like the one shown in the image above to arrive at a final prediction.
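
Here is a small sketch of that aggregation step. The per-tree predictions are hypothetical numbers I made up for five trees and three observations.

```python
import numpy as np

# Hypothetical outputs of five trees (rows) for three observations (columns)
regression_preds = np.array([
    [2.0, 3.5, 1.0],
    [2.4, 3.1, 0.8],
    [1.8, 3.3, 1.2],
    [2.2, 3.6, 0.9],
    [2.1, 3.0, 1.1],
])
classification_preds = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [1, 0, 0],
])

# Regression: average the trees' predictions for each observation
forest_regression = regression_preds.mean(axis=0)

# Classification: take the most frequent class among the trees (majority vote)
forest_classification = np.array(
    [np.bincount(column).argmax() for column in classification_preds.T]
)

print(forest_regression)
print(forest_classification)
```

Note that scikit-learn's random forest classifier averages the trees' predicted class probabilities rather than counting hard votes, which usually, but not always, gives the same answer.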

When Should You Use Random Forest?

Random forest is a good option for regression and is best known for its performance in classification problems. Furthermore, it is a relatively easy model to build and doesn’t require much hyperparameter tuning, because the main hyperparameters are the number of trees in the forest and the number of features considered at each split.
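
In scikit-learn these two hyperparameters are called n_estimators and max_features. The sketch below compares a few settings with cross-validation. It uses synthetic data from make_classification purely for illustration, and the grid of values is an arbitrary choice of mine.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, used here only to illustrate the two hyperparameters
X, y = make_classification(n_samples=200, n_features=13, random_state=0)

results = {}
for n_estimators in (10, 100):
    for max_features in ("sqrt", None):  # None means every feature is considered at each split
        model = RandomForestClassifier(
            n_estimators=n_estimators,
            max_features=max_features,
            random_state=0,
        )
        # Mean accuracy over 5 cross-validation folds
        results[(n_estimators, max_features)] = cross_val_score(model, X, y, cv=5).mean()

for setting, score in results.items():
    print(setting, round(score, 3))
```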

Business Uses

Random forest has numerous business use cases for classification and regression problems. Some of the business use cases are:

Step 1: Load Python packages

from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.model_selection import train_test_split
from scipy.io import arff
import pandas as pd

Step 2: Pre-Process the data

For this tutorial, we will be using the ‘Autistic Spectrum Disorder Screening Data for Adolescent Data Set’ from the University of California, Irvine.

Let’s load the dataset and begin exploring the parameters.

data = arff.loadarff('/Autism-Adolescent-Data.arff')
df = pd.DataFrame(data[0])
list(df)
df.head()

[Screenshots: first five rows of the dataset]

The above image is a snapshot of the first five rows of the data. The answers to the questions are already encoded as 0 or 1, but other columns still contain text. For this project, we will build the model on the 10 questions asked in the survey, along with gender, whether the child ever had jaundice, and whether anyone in the family has a learning disorder. The final column is the result of the survey: whether the adolescent has autism or not.

This is a relatively small dataset, so random forest is a good fit because it uses bagging, which I explained earlier in this tutorial. Before we build the model, we need to make some changes to the data to get it ready.

Let’s begin by lowercasing the text and encoding the categorical variables as numbers. Let’s make the following changes:

  • Lowercase all the text
  • All ‘yes’ = 1
  • All ‘no’ = 0
  • Female = 1
  • Male = 0

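# loadarff returns nominal values as bytes in Python 3, so decode them to text first
df = df.apply(lambda x: x.str.decode('utf-8') if x.dtype == object else x)
# Lowercase all the text, then map yes/no and f/m to 1/0
df = df.apply(lambda x: x.astype(str).str.lower())
df = df.replace('yes', 1)
df = df.replace('no', 0)
df = df.replace('f', 1)
df = df.replace('m', 0)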
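
To see what these replacements do, here is a sketch on a small hypothetical frame. The three rows are made up, only a few of the dataset's columns are included, and I use a single mapping dictionary instead of chained replace calls so that every column ends up as an integer.

```python
import pandas as pd

# Hypothetical rows shaped like the raw ARFF data (nominal values arrive as bytes)
toy = pd.DataFrame({
    "A1_Score": [b"1", b"0", b"1"],
    "gender": [b"f", b"m", b"f"],
    "jundice": [b"no", b"yes", b"no"],
    "Class/ASD": [b"YES", b"NO", b"NO"],
})

# Decode bytes to text and lowercase it
toy = toy.apply(lambda col: col.str.decode("utf-8").str.lower())

# Map each text value to a number with one explicit dictionary
mapping = {"yes": 1, "no": 0, "f": 1, "m": 0, "1": 1, "0": 0}
toy = toy.apply(lambda col: col.map(mapping)).astype(int)

print(toy)
print(toy.dtypes)
```

Checking the dtypes after encoding is a quick way to confirm that no text values slipped through.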

Step 3: Subset the data

Our new dataset should only have the variables that we will be using to build the model.


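# 'jundice' and 'austim' are spelled this way in the dataset's column names
xVar = list(df.loc[:,'A1_Score':'A10_Score']) + ['gender', 'jundice', 'austim']
# Column 20 is the final column (the survey result); cast it to int so it is treated as a class label
yVar = df.iloc[:,20].astype(int)
df2 = df[xVar]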

Step 4: Split the data into train and test sets

We will build the model on the training set and check the accuracy of the model by using it on the testing set.


X_train, X_test, y_train, y_test = train_test_split(df2, yVar, test_size=0.2)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(83, 13) (83,)
(21, 13) (21,)

In the code above, the data is split so that 80% of the observations fall in the training set and 20% are used for testing the model. Our resulting training set has 83 observations and the testing set has 21 observations.
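
With a dataset this small, the split can vary a lot from run to run. As an optional variation, the sketch below fixes random_state and stratifies on the target. The data here is synthetic: I only matched the 104 rows and 13 features, and the class balance is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 104 observations, 13 binary features, imbalanced binary target
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(104, 13))
y = np.array([1] * 63 + [0] * 41)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=0,  # makes the split repeatable
    stratify=y,      # keeps the class ratio similar in both sets
)

print(X_train.shape, X_test.shape)
print("share of positives:", y_train.mean(), y_test.mean())
```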

Step 5: Build a Random Forest Classifier

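Next, let’s build the random forest classifier and fit the model to the data. We will keep the default hyperparameters, setting only n_jobs=2 (to train with two parallel jobs) and random_state=0 (to make the results repeatable). The printed parameters below come from the scikit-learn version used when this tutorial was written; newer versions use different defaults, such as 100 trees instead of 10.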


clf = RandomForestClassifier(n_jobs=2, random_state=0)

clf.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=2, oob_score=False, random_state=0,
            verbose=0, warm_start=False)
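
Because each tree is trained on a bootstrap sample, the rows left out of that sample can act as a built-in validation set. The sketch below is not part of the workflow above: it uses synthetic data and scikit-learn's oob_score option to show how an out-of-bag accuracy estimate can be read off a fitted forest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with the same number of rows and features as the tutorial dataset
X, y = make_classification(n_samples=104, n_features=13, random_state=0)

clf_oob = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,   # score each row using only the trees that did not see it
    random_state=0,
)
clf_oob.fit(X, y)

print("out-of-bag accuracy:", clf_oob.oob_score_)
```

This can be handy on small datasets, where holding out a separate test set leaves very few rows for training.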

Step 6: Predict

Our model is only as good as its predictions, so let’s use it to predict Autism in the test set.


preds = clf.predict(X_test)

Step 7: Check the Accuracy of the Model

Now, to check the accuracy of the model, we will compare the predictions against the actual test set values. A confusion matrix is one of the methods used to evaluate a classification model.


pd.crosstab(y_test, preds, rownames=['Actual Result'], colnames=['Predicted Result'])

[Screenshot: confusion matrix of actual vs. predicted results]

As we can see, the model did pretty well! It misclassified only one observation.
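
The same check can be done with scikit-learn's metrics module, which also gives a single accuracy number. The labels below are hypothetical values for illustration, not the tutorial's test set.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for illustration; 1 = positive screening result, 0 = negative
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

# Rows are actual classes, columns are predicted classes (ordered 0, 1)
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

print(cm)
print("accuracy:", acc)
```

Here one positive case is predicted as negative, so seven of the eight predictions are correct.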

Step 8: Check Feature Importance

As a final step, and to tie in the random forest classifier with the decision tree image above, let’s look at the importance of all the features in this dataset.


list(zip(X_train, clf.feature_importances_))

[Screenshot: feature importances]

And here it is! Just like the decision tree diagram above, question A5 is the most important feature in our dataset!
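
When there are many features, a sorted view is easier to read than a list of pairs. Here is a minimal sketch using a pandas Series. The data is synthetic and the feature names are placeholders I invented.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names
X, y = make_classification(n_samples=104, n_features=5, n_informative=2, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pair each importance with its column name and sort from most to least important
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances sum to 1, so each value can be read as that feature's share of the total.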

Conclusions

As is clear from this tutorial, random forest is a very easy model to build and can produce quite impressive results. Of course, the data used in this tutorial is less complicated and requires less pre-processing than a bigger dataset would. However, this post should provide a good first step in helping you understand how random forest can work well with limited data.

Sahiba Chopra
Author

Sahiba is a data scientist with 4 years of experience. She has used data to build solutions for companies across numerous industries including renewable energy, entertainment and microfinance. Sahiba is currently leading the data science team at a microfinance company in Mumbai, India.