# Prerequisites

Previous knowledge of PyTorch or transfer learning is not required, but the reader should be familiar with basic concepts of deep learning and the neural network, as well as some basic terminology of convolutional neural networks.

To follow along, the reader should have Anaconda for Python 3.x and Jupyter Notebook installed. A commodity Nvidia GPU (GTX 1050 Ti, or a flavor of 1060,1070 or 1080) either locally or rented from the cloud would be preferable. If you work with CPU only, it may take significantly more time (in order of days) to achieve the same results.

The complete code that is used in this tutorial is available on GitHub here. The classes that I have shown in the cells below are also available in a separate folder called mylib.

# Introduction to PyTorch

PyTorch is a relatively new deep learning framework from Facebook that is quickly gaining popularity in the research as well as developer community. Its primary merits are flexibility and more control over the run-time behavior (sometimes called dynamic behavior) of a neural network. Another key advantage is its use of common Python programming patterns and practices, unlike TensorFlow which defines its own special kind of syntax and programming style on top of Python that makes it somewhat harder to learn for newcomers.

At the time of this writing, PyTorch 1.0 has just been released as beta which is a major upgrade in terms of model deployment in the real world. It introduces a Just-In-Time (JIT) graph compiler through a mechanism called Torch Script that makes it more efficient to deploy a model for prediction. However, in this tutorial we shall be using version 0.4.1 which was the latest version before this major upgrade.

# Introduction to Transfer Learning

When we take a model created and trained elsewhere on a similar problem that we are trying to solve, and reuse its architecture and (possibly) its weights in our setting, we are applying transfer learning. It means that somebody trained a neural network model, on most likely a very large dataset, and put that pre-trained model in a model repository. We take that model and modify it a little bit to adapt it to our use case, thus transferring the learning achieved by that model previously to our application, without having to retrain it from scratch. This not only saves time but also transfers the ”knowledge” of the model to our case, which usually results in achieving very high accuracy.

Essentially, we are building on other people’s work who make it available for the greater good. It’s a great step towards democratization of deep learning and artificial intelligence in general. Transfer learning is a highly effective technique used throughout the world by deep learning practitioners today. It is most effective when the use case is well understood and the data is ”fixed.” For example, image classification and object detection, which are based on just pixels, or natural language processing (NLP) text corpuses, which are words out of a large vocabulary.

It may not be that effective for structured or tabular data used in business settings, e.g. data collected from databases and files because one company’s data may be quite different in structure and semantics from another. However, even that is changing now with a recent trend in the use of categorical embeddings just like word embeddings used in NLP. Such embeddings allow us to transfer the learning achieved through data of one organization for a specific domain (e.g. predicting retail sales) to similar problems of others in the same domain.

# Image Classification Use Cases

Image classification is the core building block of several complex applications such as object detection, image captioning, face recognition, and image segmentation, to name a few. Features extracted from images during classification can be effectively used in several use cases and applications related to computer vision.

In this tutorial, we provide a step-by-step guide to applying transfer learning in PyTorch on an image classification problem. The problem is to automatically classify objects present in images into categories, e.g. bird, plane, dog, cat, etc.

# The Dataset

We will be using the CIFAR-10 dataset, which consists of 60,000 images categorized into 10 classes. Each image is of size 28x28 pixels. This is one of the more difficult datasets for classification because the images are small and somewhat blurry (low resolution). Some of the available benchmarks for this dataset are given here.

In 2014, there was a Kaggle competition on the CIFAR-10 dataset and the results are available here.

# Objectives

At the end of this tutorial, the reader should be able to:

• Create an API (set of classes and utility functions) with PyTorch to preprocess and prepare any image dataset for training, evaluation, and prediction.
• Construct and use an API to effectively apply transfer learning in PyTorch on an image dataset for classification.
• Acquire some tips and tricks to achieve very high accuracy on CIFAR-10 using three different, freely available pre-trained models by combining them effectively to achieve higher accuracy than the individual models.
• Know how to create classes for deep learning tasks with PyTorch and use them as components in other applications.

# State-of-the-Art Results

While preparing this tutorial, my accuracy (94.7%) ended up in third place on both the benchmark site as well as Kaggle scoring (of course through late submission). This was achieved in less than 2 hours of training altogether (all three models combined) on a commodity Nvidia GTX 1070 GPU.

I will show you some simple tips and tricks to increase the accuracy of your models with transfer learning and ensemble different models together to achieve even higher accuracy in most applications.

# Create a PyTorch Dataset for CIFAR-10

In [1]:

import torch ## for pytorch
import torchvision ## for transfer learnhing models and many other vision related classes
from torch import nn ## Core Neural Network Model classes in Pytorch
from torch import optim ## Contains several Pytorch optimizer classes
import torch.nn.functional as F ## Contains several utilily functions provided by Pytorch

from torchvision import datasets, transforms, models ## Many Computer Vision related classes
## for datasets and transformations etc.
from torch.utils.data import * ## Contains several utilily functions for dataset manipulation
from PIL import Image
import numpy as np

In [ ]:

## The following imports contain classes and functions that we develop throughout this tutorial. They have
## been explained throughout this tutorial.

from mylib.utils import *
from mylib.model import *
from mylib.cv_model import *
from mylib.fc import *
from mylib.chkpoint import *
from mylib.cv_data import *

## The following two lines are for reloading any imported files if they are modified while
## our Jupyter Notebook is running
%autoreload 2

In [ ]:

train_dataset = datasets.CIFAR10('Cifar10', train=True,

test_dataset = datasets.CIFAR10('Cifar10', train=False,
download=True)

This gives us two dataset objects that are of torchvision.datasets.cifar.CIFAR10 type. This is a subclass of the PyTorch dataset class, which is the main class to generically represent any dataset. This particular class represents the CIFAR-10 data stored in its internal data structure. Later, these objects shall be passed to a PyTorch Dataloader objects (explained later) for processing the images.

We can verify the lengths (number of images) of both datasets

In [3]:

len(train_dataset),len(test_dataset)

Out[3]: (50000, 10000)

As you can see above, we have 50,000 and 10,000 images in training and test sets respectively.

## A Quick Refresher of Tensors

Tensors are just a way of representing n-dimensional data objects of a single type (integers or float, etc.) in a generic way. For example:

• A single value (integer or float) is a 0-dimensional tensor
• An array with N elements is a one-dimensional tensor
• A matrix with M rows and N columns is a 2-dimensional tensor (MxN)
• An MxN image with three RGB (Red, Green, Blue) color channels represented by three matrices is a three dimensional tensor (3 x M x N)

The image tensors are contained in the field train_data within the dataset object. Let's look at the shape of one of the tensors representing an image

In [5]:

train_dataset.train_data[0].shape

Out[5]:

(32, 32, 3)

This tells us that our images are of 32 x 32 in size with 3 color channels.

Let's look at some of the images using matplotlib.plyplot module

In [6]:

%matplotlib inline
import matplotlib.pyplot as plt
plt.imshow(train_dataset.train_data[100])

Out[6]:

<matplotlib.image.AxesImage at 0x7f71447fa7f0>

This looks like a ship. As you can observe, the images are rather blurry and quite low resolution (32 x 32).

# Preprocess the Dataset and Prepare It for Training

Understand the concept of DataLoader and the PyTorch DataLoader API. Split the images into train, validation, and test sets. Create PyTorch DataLoaders to feed images while training, validation, and prediction. Use PyTorch API to define transforms for preprocessing the dataset for more effective training. Use PyTorch API to convert all images to PyTorch tensors. Normalize the dataset using mean and standard deviation of images.

PyTorch DataLoaders are objects that act as Python generators. They supply data in chunks or batches while training and validation. We can instantiate DataLoader objects and pass our datasets to them. DataLoaders store the dataset objects internally.

When the application asks for the next batch of data, a DataLoader uses its stored dataset as a Python iterator to get the next element (row or image in our case) of data. Then it aggregates a batch worth of data and returns it to the application.

The following is an example of calling the DataLoader constructor:

In [4]:

num_train = len(train_dataset)
indices = list(range(num_train))
num_workers=0)

In [5]:

len(train_loader)

Out[5]:

1000

Here we are creating a DataLoader object for our training dataset with a batch size of 50. The sampler parameter specifies the strategy with which we want to sample data while constructing batches.

We have different samplers available in torch.utils.data.sampler. The explanation is straightforward. You can read about them in the Pytorch Documentation here.

The num_workers argument specifies how many processes (or cores) we want to use while loading our data. This provides parallelism while loading large datasets. Default is 0 which means load all data in main process.

DataLoader reports its length in number of batches. Since we created this DataLoader with a batch size of 50 and we had 50,000 images in our train dataset, we have the length of dataloader = 1000 batches.

### Splitting Data

Now let's write a function to split our datasets into train, validation, and test sets, and create their corresponding DataLoaders.

In [9]:

def split_image_data(train_data,
test_data=None,
batch_size=20,
num_workers=0,
valid_size=0.2,
sampler=SubsetRandomSampler):

num_train = len(train_data)
indices = list(range(num_train))
np.random.shuffle(indices)
split = int(np.floor(valid_size * num_train))
train_idx, valid_idx = indices[split:], indices[:split]
train_sampler = sampler(train_idx)
valid_sampler = sampler(valid_idx)
if test_data is not None:
num_workers=num_workers)
else:
train_idx, test_idx = train_idx[split:],train_idx[:split]
train_sampler = sampler(train_idx)
test_sampler = sampler(test_idx)

batch_size=batch_size,
sampler=test_sampler,
num_workers=num_workers)

batch_size=batch_size,
sampler=train_sampler,
num_workers=num_workers)

batch_size=batch_size,
sampler=valid_sampler,
num_workers=num_workers)

return train_loader,valid_loader,test_loader

In the above function, test_data can be None in which case it splits train data into train, test, and validation. If test_data is given, it just splits the train set further into train and validation and creates a separate DataLoader from the test set. The function also uses RandomSubsetSampler to shuffle the train and validation set indices. Let's call this function to obtain our DataLoaders.

In [10]:

trainloader,validloader,testloader = split_image_data(train_dataset,test_dataset,batch_size=50)

len(trainloader),len(testloader),len(validloader)

Out[10]:

(800, 200, 200)

And we have a nice split with 800 batches in our train set and 200 each in our validation and test sets respectively.

### Preprocessing and Transforming the Dataset

Before we move on to defining our network and start training, we need to preprocess our datasets. Specifically, we need to perform the following steps:

• Resize the images to an appropriate size for our models
• Perform some basic and most common data augmentation
• Convert the image data to PyTorch Tensors
• Normalize the image data

### Why Do We Want to Resize Images?

Most of our transfer learning models require data to be of at least 224x224 size. The reason for this limitation is that these models are designed with a large number of convolution and pooling layers, finally followed by a fully connected (linear) layer at the end to generate the classification output. By the time the input image reaches the final layer, it has been reduced drastically in size due to the way convolutions and pooling are defined. If the input image was already too small (like 32x32 CIFAR-10 images in our case), it would be too small for the network to produce any significant output. Therefore, these models sort of restrict us to input an image >=224x224.

Please note that we wouldn't have needed resizing if our images were already > 224x224, such as in the case of ImageNet, or if we were to use our own CNN architecture that did not reduce the image size too much while passing it through layers. Resizing smaller images to larger ones (as in our case) creates artifacts that we don't (ideally) want our model to learn. Since our CIFAR-10 images are really small and the transfer learning models we are using have this requirement, we are obliged to resize.

For datasets with larger images, our GPU or CPU memory constraints may become a factor. Therefore, we combine downsizing with increased batch sizes (until we hit the batch size limit) to optimize the model performance and balance the effects of downsizing.

### Data Augmentation

Data augmentation is a common deep learning technique where we modify images on the fly while training the neural network to see additional images flipped or rotated at different axes and angles. This usually results in better training performance since the network sees multiple views of the same image and has a better chance of identifying its class when minimizing the loss function.

Note that the augmented images are not added to the dataset. Rather, they are created during batch generation, so the actual images seen during training will increase even though you don't see the number of images in the datasets increasing. The length and other functions that count the number of images will still give the same answer. We use the two common augmentations below:

• RandomHorizontalFlip that flips some of the images around the vertical axis with a probability p that defaults to 0.5 meaning that 50% of the images shall be flipped.
• RandomRotation at a specific degree (10 in our case below) that rotates some of them randomly at an angle of 10 degree again with a probability of p which defaults to 0.5.

In [ ]:

from torchvision import transforms
train_transform = transforms.Compose([transforms.Resize(224),
transforms.RandomHorizontalFlip(),
transforms.RandomRotation(10),
])

In [ ]:

train_dataset = datasets.CIFAR10('Cifar10',download=False,transform=train_transform)

### Data Normalization

In data normalization, we statistically normalize the pixel values in our images. This mostly results in better training performance and faster convergence. A common way to perform normalization is to subtract the mean of pixel values of the whole dataset from each pixel, and then divide by the standard deviation of the pixels of the whole dataset.

The most common way in transfer learning is to use the mean and std values of the dataset that the original transfer learning model was trained on. However, it may be a good strategy for cases where we don't want to retrain any part of the original model.

If our dataset is large and we want to retrain whole or part of the original model, then we would be better off normalizing with the mean and standard deviation of the dataset in question (CIFAR-10, in our case). However, in most transfer learning tutorials, you'll find that the mean and std values for ImageNet are used.

Below, I give you two functions to calculate the mean and std of a dataset:

1. "calculate_img_stats_avg" is based on DataLoader and calculates means and stds of each batch of data as it is retrieved from the dataset object, and finally takes the average of the accumulated means and std values. Although this gives us an approximation of the actual values, it is reasonable to use for large datasets that won't fit into memory at the same time. This code has been adapted from the PyTorch forum.

2. "calculate_img_stats_full" calculates the actual mean and std of the whole dataset by working on it at once. This gives more accurate values, but will most likely run out of memory for large datasets. For CIFAR-10, this function requires 28GB of RAM. My machine has 32GB but it falls short and I am unable to run this function. This code has been adapted from the book "Deep Learning with PyTorch" by Eli Stevens and Luca Antiga, Manning Publications.

You can try to run the second function on your specific dataset and if you run into memory issues, then revert to the first one for a good approximation. In case of CIFAR-10, however, many people have calculated the mean and std of the dataset and the values are well known, like ImageNet. We are using those values in the code that follows. I did not try with the approximate values given by the first function but you are welcome to do so .

In [ ]:

from torchvision import transforms

transform = transforms.Compose([transforms.ToTensor()])



We first create a dataset from full data and then a DataLoader to feed the data in batches of size 50 to our loop. Note that for DataLoader to work, the images have to be converted to a tensor, so that is the only transform we are using.

The function below is a straightforward implementation that calculates the mean and std of each batch and adds them to their cumulative sums, dividing in the end by the total number of batches to get the averages.

In [4]:

def calculate_img_stats_avg(loader):
mean = 0.
std = 0.
nb_samples = 0.
batch_samples = imgs.size(0)
imgs = imgs.view(batch_samples, imgs.size(1), -1)
mean += imgs.mean(2).sum(0)
std += imgs.std(2).sum(0)
nb_samples += batch_samples

mean /= nb_samples
std /= nb_samples
return mean,std


In [5]:

calculate_img_stats_avg(loader)

Out[5]:

(tensor([0.4914, 0.4822, 0.4465]), tensor([0.2023, 0.1994, 0.2010]))

In [ ]:

def calculate_img_stats_full(dataset):
imgs_ = torch.stack([img for img,_ in dataset],dim=3)
imgs_ = imgs_.view(3,-1)
imgs_mean = imgs_.mean(dim=1)
imgs_std = imgs_.std(dim=1)
return imgs_mean,imgs_std

In [ ]:

calculate_img_stats_full(dataset)

The torch.stack function above stacks the data along the given dimension (3, in our case). The view operation views the tensor as a 3 x (product of all other dimensions) which basically flattens while keeping the first dimension as 3. The best way to visualize what is going on in an obscure function like this is to copy isolate the statements and feed them some dummy tensors to see what's going on. I leave it for you as an exercise. Values below have been taken from the same book (referred above from which the code has been taken):

In [12]:

cifar10_mean = [0.4915, 0.4823, 0.4468]
cifar10_std  = [0.2470, 0.2435, 0.2616]

Now we can create our datasets again from scratch with all the transformations, augmentations, and normalization applied—splitting them into train and test, and obtaining the final DataLoaders. Note that we also define our batch size = 50.

In [13]:

batch_size = 50
train_transform = transforms.Compose([transforms.Resize((224,224)),
transforms.RandomHorizontalFlip(),
transforms.RandomRotation(10),
transforms.ToTensor(),
transforms.Normalize(cifar10_mean, cifar10_std)
])

test_transform = transforms.Compose([transforms.Resize((224,224)),
transforms.ToTensor(),
transforms.Normalize(cifar10_mean, cifar10_std)
])

train_data = datasets.CIFAR10('Cifar10', train=True,
test_data = datasets.CIFAR10('Cifar10', train=False,



toTensor() converts a numpy array to a PyTorch Tensor (all our images are constructed as numpy arrays by the dataset class when read from disk). normalize() is a transform that normalizes according to the passed values of means and STD of each channel as separate lists or tuples.

Out[13]:

(800, 200, 200)

### Data Augmentation is (Mostly) Applied to Train Set Only

We usually don't apply data augmentation to a test set because we want the test data to remain as close to real data as possible, otherwise there's a chance that we may overestimate performance. For example, our model may have misclassified a test image but is correct for its flipped and rotated versions. This may increase the overall accuracy but is misleading.

Having said that, there is a technique called test-time augmentation (TTA) where we augment test data and average out the predictions after showing the trained model all the (augmented) variations of an image with the original one while testing. This may result in better accuracy sometimes. We are not going to use it in this tutorial but you can find out more here.

# Create a Base Class for Building a Basic Neural Network

Now that we have our DataLoaders all prepared, we are ready to define our neural network and train it. In order to define a neural network, the best way is to define classes that isolate and abstract out functionality common to all types of networks like training loops, validation, evaluation, prediction, setting different hyperparameters, etc.

We also need to define classes that implement specific types of networks, such as those specialized for transfer learning or tailor-made for a fully connected operation, etc. Keeping this in mind, we will create three main classes:

• A base class representing a neural network derived from PyTorch's core nn.Module class which is the foundation of any neural network in PyTorch
• A class derived from our base class that implements functionality specific to transfer learning
• A class derived from our base class that implements functionality specific to fully connected networks

Let's build our base class called network step by step.

In [ ]:

class Network(nn.Module):
def __init__(self,device=None):
super().__init__()
if device is not None:
self.device = device
else:
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def forward(self,x):
pass


Note that the forward method is called by nn.Module's "call" method. So the object of our class can become a "callable" and when it is called, the forward method shall be automatically invoked. Please refer to any good Python tutorial if you want to know more about callables.

## Train Method

Next we add the train method. To train any neural network, there are a few common tasks that need to be performed in each iteration of the training loop. The following outline of the training loop is the logic of the inner part of the loop that performs actual training in each epoch. This part of the code goes through each batch. It basically defines a single epoch (single pass through the whole dataset):

• Get the next batch of data
• Move the Tensors of the batch to the device (GPU or CPU)
• Zero out the gradients of all weights
• Call the forward function to send the inputs through the network
• Pass the outputs obtained to the criterion (loss function) to compare them against the labels (targets) and calculate the loss
• Update all the weights according to the gradients and the learning rate
• Update the overall loss within this epoch

These steps are common to all frameworks and neural network types. The following code in train_ method performs these steps.

In [ ]:

class Network(nn.Module):
...
self.train()
t0 = time.time()
batches = 0
running_loss = 0
batches += 1
#t1 = time.time()

inputs, labels = inputs.to(self.device), labels.to(self.device)
outputs = self.forward(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
loss = loss.item()
#print('training this batch took {:.3f} seconds'.format(time.time() - t1))
running_loss += loss
if batches % print_every == 0:
print(f"{time.asctime()}.."
f"Time Elapsed = {time.time()-t0:.3f}.."
f"Average Training loss: {running_loss/(batches):.3f}.. "
f"Batch Training loss: {loss:.3f}.. "
)
t0 = time.time()


self.train() is a built-in PyTorch method in the base class (nn.Module) that sets a flag on the model object indicating that training is in progress. This flag is used by several PyTorch modules that behave differently during training and validation/testing, e.g. dropout, batch normalization, etc. Criterion is the loss function that calculates the difference between the output of the network and the actual labels. loss.backward() performs backpropagation, calculating the gradients throughout the network following the complete graph of connected tensors. optimizer.step performs one step of the optimizer algorithm after the loss function has executed and the new gradients are available. item() method gives a scalar value. It is used for tensors that return a single value (loss is a floating point numerical value in this case).

### Loss Functions

Note that PyTorch comes with many built-in loss functions for common cases like classification and regression, etc. Here we are passing the loss function to train_ as an argument. Some common loss functions used in classification are CrossEntropy loss, Negative Likelihood Log Loss (NLLLoss) and Binary-CrossEntropy). We will discuss more about loss function when we discuss the fully connected class later in this tutorial.

### Optimizer Module

Optimizer module applies gradient descent or its variant and performs weight updates with gradients and learning rates. Optimizers come in several flavors with different algorithms and are found in torch.optim module. Examples include Stochastic Gradient Descent (SGD), Adam, AdaDelta, etc.

## Validate Method

The purpose of the validate method, which applies the model to the validation set for evaluation, is to periodically assess how we are doing in terms of training. If you are familiar with machine learning concepts, you most likely know about bias (underfitting) and variance (overfitting). If our loss on validation set is significantly and consistently higher than the loss on training set, we are overfitting. This basically means our model will not generalize well enough on any other dataset because we are too tightly bound to the training set.

The idea here is to evaluate the model on the validation set after every few epochs (a good default is after every epoch), measure the loss, and print it out to see if we are overfitting. The difference between validate method and train is that in validation we don't need to back-propagate, calculate the gradients, apply gradient descend, and update the weights. All we need is to pass the validation data set batch by batch through our model and evaluate the loss using the loss function. As our model gets better after some epochs, we should see our validation loss going down.

One additional thing we also want to do in validation is to calculate the accuracy of our classification. This is simply the percentage of how many times we are correct in our prediction: 100 x (number of correctly predicted classes/dataset size). However, it would be better if we also calculate class-wise accuracy, i.e. for each individual class we calculate how many of that class we got right versus the total number of images we have of that class.

As shown below, we will write a utility function to calculate class-wise accuracies. This may come in handy when we do predictions on our test set or any other set of images.

In [ ]:

from collections import defaultdict

def update_classwise_accuracies(preds,labels,class_correct,class_totals):

correct = np.squeeze(preds.eq(labels.data.view_as(preds)))
for i in range(labels.shape[0]):
label = labels.data[i].item()
class_correct[label] += correct[i].item()
class_totals[label] += 1

class Network(nn.Module):
...

running_loss = 0.
accuracy = 0
class_correct = defaultdict(int)
class_totals = defaultdict(int)
self.eval()
inputs, labels = inputs.to(self.device), labels.to(self.device)
outputs = self.forward(inputs)
loss = self.criterion(outputs, labels)
running_loss += loss.item()
_, preds = torch.max(torch.exp(outputs), 1)
update_classwise_accuracies(preds,labels,class_correct,class_totals)
accuracy = (100*np.sum(list(class_correct.values()))/np.sum(list(class_totals.values())))
self.train()
return (running_loss/len(validloader),accuracy) 

self.eval() is a PyTorch method that puts the model into evaluation mode. It tells PyTorch we only want to perform forward pass through the network and no backpropagation. It is opposite of the train() we had in our training loop. Whatever we put in torch.no_grad() block tells PyTorch not to compute gradients. We want to make sure that gradients are never calculated within the evaluation loop.

np.squeeze(preds.eq(labels.data.view_as(preds))) seems like a pretty obscure statement so let's break it down:

• The actual labels are contained in our DataLoader's data attribute.
• Predictions are the output of our network.
• The view_as method reorganizes a tensor according to the dimensions of the tensor passed as the argument. In our case, this statement will align the labels in the batch with the predictions tensor, i.e. batch_size x 10 since there are 10 classes of our network and our final fully connected layer would emit these many outputs for each batch.
• The eq method compares each row of a tensor and emits a 1 (True) where the rows are equal and 0 otherwise.
• The final result would be a 50 x 1 Tensor which we flatten by squeezing out extra batch dimension to make it a 50-dimensional vector (1-dimensional tensor), containing either 1s (where predictions are equal to labels) or 0s (where they are unequal).

_, preds = torch.max(torch.exp(outputs), 1)

• We will use log softmax along with the negative log-likelihood as loss (NLLLoss) function in our fully connected model (more on this later). Therefore, our outputs are expected to be log of probability values (also called logits). We don't strictly need to exponentiate the logits here as the max of the logits would still give us the same class index. We are doing it here just to make our predictions look like probabilities which sometimes helps in debugging. You are free to remove torch.exp call in the code if you want. torch.max returns a tuple containing the maximum value and the index of the maximum value within the tensor. Since the index in our case represents the classified category itself, we will only take that ignoring the actual probability.

## Evaluate Method

The purpose of the evaluate method is to assess the performance of our model after training has completed on a test dataset. The assumption is that we have labels available for the dataset we want to pass to this method.

The code is almost the same as validate. The only difference is that we don't have to calculate loss in this case since we are done with the training.

Since this method returns the overall accuracy as well as class-wise accuracies, we need another utility function get_accuracies. We also need class_names to get the actual names of the classes (if available). We will store the class names as a dictionary mapping ids (numbers) to class name strings when we create our transfer learning Model (later in this tutorial).

In [ ]:

from collections import defaultdict

def update_classwise_accuracies(preds,labels,class_correct,class_totals):
correct = np.squeeze(preds.eq(labels.data.view_as(preds)))
for i in range(labels.shape[0]):
label = labels.data[i].item()
class_correct[label] += correct[i].item()
class_totals[label] += 1

def get_accuracies(class_names,class_correct,class_totals):

accuracy = (100*np.sum(list(class_correct.values()))/np.sum(list(class_totals.values())))
class_accuracies = [(class_names[i],100.0*(class_correct[i]/class_totals[i]))
for i in class_names.keys() if class_totals[i] > 0]
return accuracy,class_accuracies

class Network(nn.Module):
...

self.eval()
self.model.to(self.device)
class_correct = defaultdict(int)
class_totals = defaultdict(int)
inputs, labels = inputs.to(self.device), labels.to(self.device)
outputs = self.forward(inputs)
ps = torch.exp(outputs)
_, preds = torch.max(ps, 1)
update_classwise_accuracies(preds,labels,class_correct,class_totals)

self.train()
return get_accuracies(self.class_names,class_correct,class_totals)

We get the class name and the accuracy of the specific class by dividing the correct predictions by the total number of images of that class we have in the test set. We put an extra condition that we have at least one image of a class to avoid dividing by 0.

## Predict Method

The predict method is used to predict or draw inference from our trained model to determine the class of images for which we do not have labels. This is the method that would be called when the model is deployed in real life.

It is very similar to evaluate, except that there are no labels and that we are also interested in probabilities as well as predicted classes. We may also want to know the predicted probabilities of more than one classes, e.g. top 3 most likely classes predicted along with their indices.

In [ ]:

class Network(nn.Module):
...
def predict(self,inputs,topk=1):
self.eval()
self.model.to(self.device)
inputs = inputs.to(self.device)
outputs = self.forward(inputs)
ps = torch.exp(outputs)
p,top = ps.topk(topk, dim=1)
return p,top

Since we need probabilities and (possibly) multiple ranked classeswe pass the topk argument that tells our function how many ranked classes with their probabilities we have to return. The topk method of a tensor in PyTorch returns k indices and their values from a tensor along a dimension (dim=1 means along each row, i.e. horizontally. Since our tensor is 50 x number of classes, this would return the topk classes and their probabilities in each row.)

## Fit Method

This is the main method that is called by the user of our class to kick off training. It implements the main training loop that implements the epoch loop.

It calls train_ method and validation periodically to monitor performance and overfitting, keeps track of the best accuracy achieved so far, and saves the best accuracy model and full model (along with its hyper-parameters and other variables to disk as a checkpoint). Checkpoints can be restored and training continued later if power is lost or training is disrupted due to some reason.

Let's build this method step by step below.

In [ ]:

class Network(nn.Module):
...

for epoch in range(epochs):
self.model.to(self.device)
print('epoch {:3d}/{}'.format(epoch+1,epochs))

self.optimizer,print_every)
if  validate_every and (epoch % validate_every == 0):
t2 = time.time()
time_elapsed = time.time() - t2
print(f"{time.asctime()}--Validation time {time_elapsed:.3f} seconds.."
f"Epoch {epoch+1}/{epochs}.. "
f"Epoch Training loss: {epoch_train_loss:.3f}.. "
f"Epoch validation loss: {epoch_validation_loss:.3f}.. "
f"validation accuracy: {epoch_accuracy:.3f}")

self.train()


### Saving the Best Accuracy Model

The fit function should also monitor the best accuracy achieved so far across all epochs and save the best accuracy model as soon as it gets a new one better than the previous best. This ensures that even without checkpoints, we should be able to retrieve our best model if our validation loss starts to go down while training.

This is a common scenario as training may take hours to complete and we may have to leave the system unattended. This way we can ensure that we always reload the best accuracy model's weights and use them for inference.

In [ ]:

from collections import defaultdict
import math

class Network(nn.Module):
def __init__(self,device=None):
...
self.best_accuracy = 0.

...

for epoch in range(epochs):
self.model.to(self.device)
print('epoch {:3d}/{}'.format(epoch+1,epochs))
self.optimizer,print_every)

if  validate_every and (epoch % validate_every == 0):
t2 = time.time()
time_elapsed = time.time() - t2
print(f"{time.asctime()}--Validation time {time_elapsed:.3f} seconds.."
f"Epoch {epoch+1}/{epochs}.. "
f"Epoch Training loss: {epoch_train_loss:.3f}.. "
f"Epoch validation loss: {epoch_validation_loss:.3f}.. "
f"validation accuracy: {epoch_accuracy:.3f}")
if self.best_accuracy == 0. or (epoch_accuracy > self.best_accuracy):
print('updating best accuracy: previous best = {:.3f} new best = {:.3f}'.format(self.best_accuracy,
epoch_accuracy))
self.best_accuracy = epoch_accuracy
torch.save(self.state_dict(),self.best_accuracy_file)

self.train() # just in case we forgot to put the model back to train mode in validate



torch.save() saves and serializes a tensor data structure with Python's pickle module. Here we are storing the model's state dictionary returned by state_dict() method that contains all the weights of the model's full graph (each tensor in the architecture). Note that the self.best_accuracy_file shall be the filename set during initialization of the model parameters (see next section).

### Setting and Getting Different Parameters and Hyperparameters

We need to set different model parameters and model hyperparameters. These include loss function (criterion), optimizer, dropout probability, learning rate, and some others. Here are four methods:

• set_criterion to create an instance of the loss function and set it on the model
• set_optimizer to create an instance of the optimizer and set it on the model
• set_model_params that calls the above two functions and sets additional hyperparameters on the model object
• get_model_params that retrieves the currently set parameters on a model. This will come handy when we want to save a full model checkpoint.

In [ ]:

class Network(nn.Module):
...

def set_criterion(self,criterion_name):
if criterion_name.lower() == 'nllloss':
self.criterion_name = 'NLLLoss'
self.criterion = nn.NLLLoss()
elif criterion_name.lower() == 'crossentropyloss':
self.criterion_name = 'CrossEntropyLoss'
self.criterion = nn.CrossEntropyLoss()

from torch import optim

self.optimizer_name = optimizer_name
elif optimizer.lower() == 'sgd':
print('setting optim SGD')
self.optimizer = optim.SGD(params,lr=lr)

def set_model_params(self,
criterion_name,
optimizer_name,
lr, # learning rate
dropout_p,
model_name,
best_accuracy,
best_accuracy_file,
class_names):

self.set_criterion(criterion_name)
self.set_optimizer(self.parameters(),optimizer_name,lr=lr)
self.lr = lr
self.dropout_p = dropout_p
self.model_name =  model_name
self.best_accuracy = best_accuracy
self.best_accuracy_file = best_accuracy_file
self.class_names = class_names

def get_model_params(self):
params = {}
params['device'] = self.device
params['model_name'] = self.model_name
params['optimizer_name'] = self.optimizer_name
params['criterion_name'] = self.criterion_name
params['lr'] = self.lr
params['dropout_p'] = self.dropout_p
params['best_accuracy'] = self.best_accuracy
params['best_accuracy_file'] = self.best_accuracy_file
params['class_names'] = self.class_names
return params

• set_criterion supports two loss functions: cross-entropy and NLLLoss. However, support for other loss functions can be trivially added by adding more if else statements.
• It is passed the name of the loss function and it instantiates an object using PyTorch API.
• set_optimizer similarly enables the optimizer by instantiating it using PyTorch API. It supports 'Adam' as default while SGD and Adadelta can be set. Again support for other optimizers can be easily added.
• set_model_params is a higher level method that calls set_criterion and set_optimizer as well as other parameters like model_name, current value of best accuracy, best_accuracy_file where we store the best accuracy, model weights, learning rate, and dropout probability.
• We have omitted sanity checking for correctness of the types of parameters for brevity (e.g. model_name and optimizer_name should be strings, and dropout_p and lr should be a float, etc.).
• The set_model_param method shall be called from the main model classes, e.g. transfer learning and fully connected models whose classes we shall next derive from this base network class.
• get_model_param simply returns the current parameters as a dictionary. It will be used in creating the checkpoint (see next).
• class_names is a dictionary that contains a mapping of class identifiers (integers) to class names (strings) if such a mapping is available.

### Saving a Model Checkpoint

Saving a checkpoint of a model is an important task when training deep learning models. This way we can comfortably execute long-running training loops. If any disruption happens, e.g. the machine crashes, power fails, Jupyter Notepook crashes, or any other unforeseen issue and our training is interrupted, we can restore from the last checkpoint and continue training. Our (potentially) hours of training shall not be lost.

Now, we will implement a method, save_checkpoint. Later in this tutorial, we will implement a utility function load_checkpoint when we have the derived classes from this base class for fully connected and transfer learning models and we know which type of model we need to instantiate (we will add that information to the store_chkpoint at that time).

In [ ]:

class Network(nn.Module):
...
def set_model_params(self,
criterion_name,
optimizer_name,
lr, # learning rate
dropout_p,
model_name,
best_accuracy,
best_accuracy_file,
chkpoint_file):

self.criterion_name = criterion_name
self.set_criterion(criterion_name)
self.optimizer_name = optimizer_name
self.set_optimizer(self.parameters(),optimizer_name,lr=lr)
self.lr = lr
self.dropout_p = dropout_p
self.model_name =  model_name
self.best_accuracy = best_accuracy
print('set_model_params: best accuracy = {:.3f}'.format(self.best_accuracy))
self.best_accuracy_file = best_accuracy_file
self.chkpoint_file = chkpoint_file

def get_model_params(self):
params = {}
params['device'] = self.device
params['model_name'] = self.model_name
params['optimizer_name'] = self.optimizer_name
params['criterion_name'] = self.criterion_name
params['lr'] = self.lr
params['dropout_p'] = self.dropout_p
params['best_accuracy'] = self.best_accuracy
print('get_model_params: best accuracy = {:.3f}'.format(self.best_accuracy))
params['best_accuracy_file'] = self.best_accuracy_file
params['chkpoint_file'] = self.chkpoint_file
print('get_model_params: chkpoint file = {}'.format(self.chkpoint_file))
return params

def save_chkpoint(self):
saved_model = {}
saved_model['params'] = self.get_model_params()
torch.save(saved_model,self.chkpoint_file)
print('checkpoint created successfully in {}'.format(self.chkpoint_file))


# Create a Fully Connected Class Derived From the Base Class

Now we are ready to create our first derived class for fully connected neural networks. Fully connected networks are traditionally called multilayer perceptrons (MLP) in the literature. In most deep learning frameworks (including PyTorch), they are simply called linear layers.

In order to have a functional class for a fully connected network, we will rely on PyTorch's nn.Linear module. Note: nn.Linear module is itself derived from nn.Module from which we derived our own network class.

A fully connected network consists of three basic pieces:

• Inputs
• Fully connected hidden layers with each one followed by a non-linear transformation (let's consider the non-linearity as part of the hidden layer instead of treating it as a separate layer)
• An output layer and the number of outputs

## Fully Connected Network Requirements

We need to meet the following requirements to create such a class:

• Ability to specify as many hidden layers as desired
• Ability to specify the number of inputs and outputs of the model
• Ability to define drop out and non-linearity ('relu,' tanh, etc.) for each layer
• Ability to define the output layer and prepare it for the classification task
• Set different parameters and hyperparameters of the model like optimizer, loss function, etc.

Given these requirements, let's define a class for a fully-connected model.

In [ ]:

class FC(Network):
def __init__(self,num_inputs,
num_outputs,
layers=[],
lr=0.003,
class_names=None,
dropout_p=0.2,
non_linearity='relu',
criterion_name='NLLLoss',
model_type='classifier',
best_accuracy=0.,
best_accuracy_file ='best_accuracy.pth',
chkpoint_file ='chkpoint_file.pth',
device=None):

super().__init__(device=device)

self.set_model_params(criterion_name,
optimizer_name,
lr,
dropout_p,
'FC',
best_accuracy,
best_accuracy_file,
chkpoint_file
)

• num_inputs is the total number of input features this network is going to accept.
• num_outputs is the total number of outputs this network is going to emit after passing through any hidden layers. In other words, this is the dimension of the output layer.
• Non-linearity is stored in the model as an attribute. Note that we do not pass the non-linearity to set_model_params as this is model specific and does not belong to the base class.
• We may have to implement our versions of set_model_params and get_model_params methods later if we want to set and get additional parameters specific to the model. This is like implementing our own " init " and then calling the parent's too. We do additional work in our code and then call the parent to do the common work.
• Layers is a list specifying the number of units in each hidden layer. The order of numbers in this list would also specify their order in the model.

## Defining the Network Using nn.Sequential

• nn.Sequential is a PyTorch method to create a simple sequential neural network that just concatenates the defined modules.
• At the time of execution, nn.Sequential automatically calls the forward methods of each module in the sequence.

Here we define an empty nn.Sequential first and then add the input module, hidden layers, and output module to it.

In [ ]:

class FC(Network):
def __init__(self,num_inputs,
num_outputs,
layers=[],
lr=0.003,
class_names=None,
dropout_p=0.2,
non_linearity='relu',
criterion_name='NLLLoss',
model_type='classifier',
best_accuracy=0.,
best_accuracy_file ='best_accuracy.pth',
chkpoint_file ='chkpoint_file.pth',
device=None):

super().__init__(device=device)

self.set_model_params(criterion_name,
optimizer_name,
lr,
dropout_p,
'FC',
best_accuracy,
best_accuracy_file,
chkpoint_file
)

self.non_linearity = non_linearity
self.model = nn.Sequential()
if len(layers) > 0:

for i in range(1,len(layers)):

else:


Here we create groups of layers and add them to the Sequential model. Each group consists of a linear layer followed by a non-linearity and dropout with probability passed as an argument. If we don't have any hidden layer we just add one layer to our sequential model with number of inputs and number of outputs. In this case, we don't add any non-linearity or dropout since non-linearity is typically added in hidden layers.

• nn.Linear is a PyTorch class that takes the number of inputs and the number of outputs and creates a linear model with internal forward function.
• Note that we are naming our output layer as 'out' and our hidden layers as 'fcX' where X is the layer number (1, 2 ..),

## Loss Functions for Classification

We can broadly divide linear networks into two types: regression and classification. Although there are many loss functions used for classification, two most common ones which can be generalized easily from 2 to any number of classes are:

• Negative Likelihood Log Loss or NLLLoss
• CrossEntropy Loss

### NLLLoss

The NLLLoss function is very simple. It assumes that its input is a probability. It just takes the -ve of the log of its input for each input and adds them up. You can read more about it here.

We first need to convert the outputs to probabilities before feeding to the NLLLoss. The simplest way to do that is to take the softmax of the inputs by taking the exponent of each input and dividing by the sum of the exponents (more info on the same link above). After this operation the outputs can be interpreted as probabilities (because they have been scaled or calibrated between 0 and 1), which are then fed to the NLLLoss, which outputs sum(-log(p)) where p is the output of each probability.

However, in PyTorch the NLLLoss function expects that the log has already been calculated and it just puts a -ve sign and sums up the inputs. Therefore, we need to take the log ourselves after softmax. There is a convenient function in PyTorch called LogSoftmax that does exactly that. So we will use it if our loss function is specified to be 'NLLLoss' by adding that after our output layer in the sequential.

### Cross-Entropy Loss

If we were using cross-entropy loss, we do nothing as the CrossEntropyLoss function will do what's required.

In [ ]:

class FC(Network):
def __init__(self,num_inputs,
num_outputs,
layers=[],
lr=0.003,
class_names=None,
dropout_p=0.2,
non_linearity='relu',
criterion_name='NLLLoss',
model_type='classifier',
best_accuracy=0.,
best_accuracy_file ='best_accuracy.pth',
chkpoint_file ='chkpoint_file.pth',
device=None):

super().__init__(device=device)

self.set_model_params(criterion_name,
optimizer_name,
lr,
dropout_p,
'FC',
best_accuracy,
best_accuracy_file,
chkpoint_file
)

self.non_linearity = non_linearity

self.model = nn.Sequential()

if len(layers) > 0:

for i in range(1,len(layers)):
inplace=True))

else:
if model_type.lower() == 'classifier' and criterion_name.lower() == 'nllloss':
self.num_inputs = num_inputs
self.num_outputs = num_outputs
self.layer_dims = layers
if class_names is not None:
self.class_names = class_names
else:
self.class_names = {str(k):v for k,v in enumerate(list(range(num_outputs)))}


### Flattening the Inputs

Before we can feed inputs to our FC network, we need to flatten the input tensor so that each row is just a one-dimensional tensor and we have a batch of those rows. In other words, the inputs have to be two-dimensional (rows by columns) as most of you might be familiar with tabular data (from CSV files, for example) used in machine learning. This is a requirement of the linear layer that it expects its data to be in batches of single dimensional tensors (vectors).

To achieve this, we simply have to change the view of our input tensors (if they are already two-dimensional, nothing will change in the view). To do so we define a simple one-linear function as a utility. This makes the code much more readable as we immediately know that a flattening operation is going on instead of a rather cryptic .view statement.

In [ ]:

def flatten_tensor(x):
return x.view(x.shape[0],-1)

class FC(Network):
def __init__(self,num_inputs,
num_outputs,
layers=[],
lr=0.003,
class_names=None,
dropout_p=0.2,
non_linearity='relu',
criterion_name='NLLLoss',
model_type='classifier',
best_accuracy=0.,
best_accuracy_file ='best_accuracy.pth',
chkpoint_file ='chkpoint_file.pth',
device=None):

super().__init__(device=device)

self.set_model_params(criterion_name,
optimizer_name,
lr,
dropout_p,
'FC',
best_accuracy,
best_accuracy_file,
chkpoint_file
)

self.non_linearity = non_linearity

self.model = nn.Sequential()

if len(layers) > 0:

for i in range(1,len(layers)):
inplace=True))

else:

if model_type.lower() == 'classifier' and criterion_name.lower() == 'nllloss':

self.num_inputs = num_inputs
self.num_outputs = num_outputs
self.layer_dims = layers

if class_names is not None:
self.class_names = class_names
else:
self.class_names = {str(k):v for k,v in enumerate(list(range(num_outputs)))}
def forward(self,x):
return self.model(flatten_tensor(x))


### Setting and Getting Dropout

We add two more convenience methods that give us the ability to change dropout probability any time we want. This might come in handy when we want to experiment quickly with different dropout probability values or may change dropout dynamically while training based on some condition, e.g. detecting heavy overfitting.

In [ ]:

class FC(Network):
...
def _get_dropout(self):
for layer in self.model:
if type(layer) == torch.nn.modules.dropout.Dropout:
return layer.p

def _set_dropout(self,p=0.2):
for layer in self.model:
if type(layer) == torch.nn.modules.dropout.Dropout:
print('FC: setting dropout prob to {:.3f}'.format(p))
layer.p=p

Here we are checking each layer for this type of module and, if true, we act accordingly in set and get methods. Note that the torch.nn.modules.dropout.Dropout has an attribute p where the dropout probability is stored.

There are four additional attributes of our FC model we need to save in order to restore it correctly. These are numb_inputs, num_outputs, layers, and class_names. Since these are quite specific to FC model, we should write FC model's versions of get_model_param and set_model_param methods that internally call the base class ones and also perform any additional stuff.

So let's do that and complete our class before writing our restore_model function.

In [ ]:

class FC(Network):
...

def set_model_params(self,
criterion_name,
optimizer_name,
lr,
dropout_p,
model_name,
model_type,
best_accuracy,
best_accuracy_file,
chkpoint_file,
num_inputs,
num_outputs,
layers,
class_names):
super(FC, self).set_model_params(criterion_name,
optimizer_name,
lr,
dropout_p,
model_name,
best_accuracy,
best_accuracy_file,
chkpoint_file
)

self.num_inputs = num_inputs
self.num_outputs = num_outputs
self.layer_dims = layers
self.model_type = model_type

if class_names is not None:
self.class_names = class_names
else:
self.class_names = {k:str(v) for k,v in enumerate(list(range(num_outputs)))}

def get_model_params(self):
params = super(FC, self).get_model_params()
params['num_inputs'] = self.num_inputs
params['num_outputs'] = self.num_outputs
params['layers'] = self.layer_dims
params['model_type'] = self.model_type
params['class_names'] = self.class_names
params['device'] = self.device
return params


Now let's create a load_chkpoint utility function which is given a checkpoint file to retrieve the model parameters and reconstruct the appropriate model. Since we have only one model type right now (FC), we will put a check for that model_type only and later add support for transfer learning and any other classes as we create them. The code is pretty straight forward as it gets the params dictionary from the chkpoint_file and calls the appropriate constructor and finally loads the state dictionary of the best accuracy model from the filename of the best accuracy model retrieved from the chkpoint_file.

In [ ]:

def load_chkpoint(chkpoint_file):

params = restored_data['params']

if params['model_type'].lower() == 'classifier':
net = FC( num_inputs=params['num_inputs'],
num_outputs=params['num_outputs'],
layers=params['layers'],
device=params['device'],
criterion_name = params['criterion_name'],
optimizer_name = params['optimizer_name'],
model_name = params['model_name'],
lr = params['lr'],
dropout_p = params['dropout_p'],
best_accuracy = params['best_accuracy'],
best_accuracy_file = params['best_accuracy_file'],
chkpoint_file = params['chkpoint_file'],
class_names =  params['class_names']
)

net.to(params['device'])

return net


This completes our FC class. Now we should test it before proceeding further. Let's test it on our MNIST dataset.

First, we should calculate the MNIST dataset's mean and std values. They can be calculated without getting into any memory issues in a couple seconds with the function we created earlier for this purpose.

In [10]:

train_data = datasets.MNIST(root='data',download=False,
transform = transforms.transforms.ToTensor())
mean_,std_= calculate_img_stats(train_data)
mean_,std_

Out[10]:

(tensor([0.0839, 0.2038, 0.1042]), tensor([0.2537, 0.3659, 0.2798]))

We create the transforms as before using the calculated mean and std values, and then apply them to our train and test sets, and then split our train set into train and validation. Remember that our split_image_data function just converts the test set into a DataLoader if it is given as an argument.

In [18]:

train_transform = transforms.Compose([transforms.RandomHorizontalFlip(),
transforms.RandomRotation(10),
transforms.ToTensor(),
transforms.Normalize([0.0839, 0.2038, 0.1042],[0.2537, 0.3659, 0.2798])
])

test_transform = transforms.Compose([transforms.ToTensor(),
transforms.Normalize([0.0839, 0.2038, 0.1042],[0.2537, 0.3659, 0.2798])
])

In [19]:

train_dataset = datasets.MNIST(root='data',download=False,train=True, transform = train_transform)
test_dataset = datasets.MNIST(root='data',download=False,train=False,transform = test_transform)

In [20]:

trainloader,validloader,testloader = split_image_data(train_dataset,test_dataset,batch_size=50)
len(trainloader),len(validloader),len(testloader)

Out[20]:

(960, 240, 200)

We create an FC layer with number of inputs = 784 which is obtained after flattening the image dimensions (1 x 28 x 28) and number of outputs = 10 since we have 10 classes (digits 0 to 9). Then, we arbitrarily select two hidden layers of 512 units each, and set the optimizer to Ada Delta (more on it in the next section). We set best accuracy and checkpoint files as appropriate.

In [21]:

net =  FC(num_inputs=784,
num_outputs=10,
layers=[512,512],
best_accuracy_file ='best_accuracy_mnist_fc_test.pth',
chkpoint_file ='chkpoint_file_mnist_fc_test.pth')
setting optim Ada Delta

## Optimizer Choices

Optimizer algorithms come in many variations and forms. Most of them try to optimize the basic gradient descent algorithm by varying the learning rate and other related parameters as they see the data. A full survey of optimizers is beyond the scope of this tutorial. For a detailed overview, click here

The main difference between most frequently used ones (adapted from the link above) are as follows:

• Batch Gradient Descent is the simplest of the optimizer algorithms and performs weight updates after looking at the entire dataset.
• SGD (Stochastic Gradient Descent) is on the other extreme and performs weight updates for each item (training example) in the dataset.
• Mini-batch GD is a variant of SGD and takes the best of both worlds. It updates weights after each mini-batch of data. In other words, pure SGD is mini-batch with a batch size of 1. Anything in between 1 and the entire dataset, we call it a mini-batch GD.
• Momentum is a method that helps accelerate SGD in the relevant direction and attempts to dampen too many oscillations when trying to converge to a minimum.
• Adagrad adapts the learning rate to the parameters, and applies different learning rates for updating different parameters, based on the past history of the squares of the magnitude of gradients of each parameter. The main advantage of Adagrad is that the user does not have to tune the learning rate manually.
• RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton in Lecture 6e of his Coursera class. RMSprop and Adadelta have both been developed independently around the same time stemming from the need to resolve Adagrad's radically diminishing learning rates.

In my experimentation, Adadelta gives the highest accuracy on image datasets in general, much better than Adam and SGD, especially on Cifar10, although I admit I haven't tried Adadelta and RMSProp much. You should try these with your experiments to see if they make any difference.

Next, we call our fit function, passing it to train and validation dataloaders, and train on 5 epochs, printing every 300 batches in each epoch while performing validation on every epoch (remember that the default of validate_every = 1).

In [ ]:

net.fit(trainloader,validloader,epochs=5,print_every=300)

We get the following output (partial output shown for brevity):

updating best accuracy: previous best = 95.883 new best = 95.992

This shows that we got reasonable accuracy (95.99) in only 5 epochs. Perhaps we can squeeze more juice out of it by training on a few more epochs.

So let's first test our save and load chkpoint functions and then continue training on another 10 epochs.

In [23]:

net.save_chkpoint()
get_model_params: best accuracy = 95.992
get_model_params: chkpoint file = chkpoint_file_mnist_fc_test.pth
checkpoint created successfully in chkpoint_file_mnist_fc_test.pth

We load the saved chkpoint into another variable to ensure that it is a new model.

In [24]:

net2 = load_chkpoint('chkpoint_file_mnist_fc_test.pth')

Out[24]:

load_chkpoint: best accuracy = 95.992
setting optim Ada Delta)

In [ ]:

net2.fit(trainloader,validloader,epochs=10,print_every=300)

updating best accuracy: previous best = 96.392 new best = 96.875

The best accuracy we could achieve on validation set after another 10 epochs is 96.875. Let's save and restore the model one more time before testing our evaluate method.

In [26]:

net2.save_chkpoint()

Out[26]:

get_model_params: best accuracy = 96.875
get_model_params: chkpoint file = chkpoint_file_mnist_fc_test.pth
checkpoint created successfully in chkpoint_file_mnist_fc_test.pth

In [27]:

net3 = load_chkpoint('chkpoint_file_mnist_fc_test.pth')

Out[27]:

load_chkpoint: best accuracy = 96.875
setting optim Ada Delta

In [28]:

net3.evaluate(testloader)

Out[28]:

(96.95,
[('0', 98.87755102040816),
('1', 99.03083700440529),
('2', 96.70542635658916),
('3', 94.75247524752474),
('4', 97.35234215885947),
('5', 95.73991031390135),
('6', 96.4509394572025),
('7', 96.78988326848248),
('8', 97.1252566735113),
('9', 96.33300297324084)])

Let's also test our predict function next. To do that, we need to convert our test loader into a Python iterator, and then get the next batch from it using "next" method of the iterator. If you are not familiar with Python iterators, please see any good tutorial such as this one.

In [29]:

iterator = iter(testloader)
imgs_,labels_ = next(iterator)

In [32]:

imgs_[0].shape,labels_[0].item()

Out[32]:

(torch.Size([1, 28, 28]), 7)

We can see above that first image of our first batch is 1 x 28 x 28 while its label = 7. We can verify this by displaying the image using Python's matplotlib library after converting the image to numpy and removing the extra dimension to make it only 28 x 28 instead of 1 x 28 x 28.

Note that to convert a PyTorch tensor to numpy array, simply use the .numpy() method available on PyTorch tensor objects.

In [47]:

import matplotlib.pyplot as plt
%matplotlib inline

fig = plt.figure(figsize=(40,10))
ax = fig.add_subplot(2,10, 1, xticks=[], yticks=[])
ax.imshow(np.squeeze(imgs_[0].numpy()), cmap='gray')

Out[47]:

<matplotlib.image.AxesImage at 0x7f62408487b8>

Now let's see if our model predicts the image correctly.

In [56]:

net3.predict(imgs_[0])[1].item()

Out[56]:

7

So our evaluate and predict methods seem to be working fine, and we are able to score around 97% on the test set with all individual accuracies in the mid 90s. This is pretty good given that we have only trained for 15 epochs for less than 3 minutes and are using a simple fully connected network without any fancy CNN stuff. Additionally, we have refactored our code into classes, utility functions, and are also able to save and restore models as we require.

In part 2 of this tutorial, we will learn how to create a transfer learning class and train it on Kaggle's much larger dataset. Stay tuned!

Author
Farhan Zaidi

Farhan Zaidi is an artificial intelligence enthusiast and founder of a training and consultancy firm in the machine learning and deep learning space. He has over 25 years of experience as a software architecture, designer, and developer. Currently, he is involved in building a recommendation engine for an IPTV system and developing computer vision based systems that use deep learning for smart cities, security, and surveillance applications.

Technical Reviewer