While a three-year-old child has a lot to learn about the world, one thing that he is already an expert in is making sense of what he sees. Our society is more technologically advanced than ever. We’ve sent people to the moon, have phones that can talk to us, and have radio stations that can be customized to play the music of our choice. Yet our most advanced machines still struggle to interpret what they see.
We have prototype cars that can drive for us, but they cannot differentiate between a crumpled paper bag on the road and a stone that should be avoided. We have fabulous megapixel cameras, but we have not delivered sight to the blind. Security cameras are everywhere, but they cannot detect when a child is drowning in a swimming pool.
As a society, we are collectively still blind when our machines are blind.
What is Computer Vision?
According to Prof. Fei-Fei Li, computer vision is defined as “a subset of mainstream artificial intelligence that deals with the science of making computers or machines visually enabled, i.e., they can analyze and understand an image.” Human vision starts at the biological camera, the eyes, which take one picture about every 200 milliseconds, while computer vision starts by providing an image as input to the machine. This makes computer vision a natural fit for a class of algorithms called the Convolutional Neural Network.
The basic building block of a neural network is a neuron, which loosely models the biological neuron. Similar to a biological neuron, an artificial neuron has input channels, a processing body, and an output channel, as shown in Figure 1.
Figure 1: Neuron – Basic Building Block of Artificial Neural Network
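The neuron in Figure 1 can be sketched in a few lines of Python. This is a minimal sketch under an assumption the article does not spell out: the processing body computes a weighted sum of the inputs plus a bias and passes it through a sigmoid activation. The names `neuron`, `weights`, and `bias` are illustrative only.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """One artificial neuron: input channels -> processing body -> output channel.

    Assumption: the processing body is a weighted sum followed by a sigmoid.
    """
    if len(inputs) != len(weights):
        raise ValueError("each input channel needs exactly one weight")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(weighted_sum)

# Three input channels with hand-picked (hypothetical) weights.
output = neuron(inputs=[0.5, -1.0, 2.0], weights=[0.4, 0.3, 0.1], bias=0.0)
print(output)
```

Other activation functions could be swapped in for the sigmoid without changing the overall shape of the computation.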
Just like the biological brain, these neuron-like nodes are connected in such a way that each node receives input from some nodes and sends output to other nodes, as shown in Figure 2.
Figure 2: A Small Neural Network
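To make the wiring in Figure 2 concrete, here is a small sketch of a network in which every node of one layer feeds every node of the next. The layer sizes, the weights, and the sigmoid activation are assumptions chosen for illustration; they are not taken from the figure.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """Compute the outputs of one layer of nodes.

    weights[j] holds the incoming connection weights of node j, so every
    node receives input from all nodes of the previous layer.
    """
    return [
        sigmoid(sum(x * w for x, w in zip(inputs, node_weights)) + b)
        for node_weights, b in zip(weights, biases)
    ]

def network_forward(inputs, layers):
    """Send the signal through the layers in order; each layer feeds the next."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations

# Hypothetical network: 2 inputs -> 3 hidden nodes -> 1 output node.
layers = [
    ([[0.1, 0.2], [-0.3, 0.4], [0.5, -0.6]], [0.0, 0.0, 0.0]),
    ([[0.7, -0.8, 0.9]], [0.0]),
]
prediction = network_forward([1.0, 2.0], layers)
print(prediction)
```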
When hundreds or thousands of these nodes are organized in the same fashion as neurons in the biological brain, they form an Artificial Neural Network. In practice, these networks can be so large that they end up with millions of nodes, billions of parameters, and trillions of connections between them, resulting in a humongous model.
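A quick back-of-the-envelope count shows how fast the parameters add up. The sketch below assumes a plain fully connected network in which every connection has one weight and every non-input node has one bias; the layer sizes are a hypothetical example, not a model described in this article.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of nodes per layer, input layer first.
    """
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights + biases

# Hypothetical example: a 224x224 RGB image flattened into a vector,
# two hidden layers of 4096 nodes each, and 1000 output classes.
sizes = [224 * 224 * 3, 4096, 4096, 1000]
total = count_parameters(sizes)
print(f"{total:,} parameters")
```

Even this modest stack lands in the hundreds of millions of parameters, almost all of them in the first layer, where every pixel is connected to every node.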
Computer vision enters the picture when we want to feed an image as an input with the intent that our machine will derive some intelligence out of it.
Figure 3: Computer Vision Process Flow
Convolutional Neural Network
A Convolutional Neural Network (CNN) is a class of deep feedforward neural networks (Figure 4) that is largely inspired by the biological visual system, in which each individual cortical neuron responds to stimuli only in a restricted region of the visual field known as its receptive field, i.e., a restricted subarea of the input. The receptive fields of different cortical neurons overlap in such a way that they collectively cover the entire image.
In a Convolutional Neural Network, each convolutional neuron processes data only for its receptive field, and the neurons are organized in such a way that they collectively also represent the entire image. Moreover, both the biological visual system and the CNN have a hierarchy of layers that progressively extract more and more features. These layers are arranged in increasing order of complexity, starting from simple visual representations such as edges, lines, and curves, and gradually moving to more complex representations such as faces and object instances. This results in the ability to understand complex images.
The architecture of a feedforward neural network looks something like this:
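The idea of overlapping receptive fields that together cover the whole image can be checked with a short sketch. It assumes a square image, square receptive fields, and no padding; the function name and the 5x5 and 3x3 sizes are illustrative.

```python
def receptive_fields(image_size, field_size, stride):
    """Return the set of (row, col) pixels seen by each neuron of one layer.

    Assumption: a square image, square fields, and no padding.
    """
    starts = range(0, image_size - field_size + 1, stride)
    return [
        {(r + dr, c + dc) for dr in range(field_size) for dc in range(field_size)}
        for r in starts
        for c in starts
    ]

fields = receptive_fields(image_size=5, field_size=3, stride=1)
all_pixels = {(r, c) for r in range(5) for c in range(5)}

print(len(fields))                          # number of neurons in the layer
print(len(fields[0] & fields[1]))           # pixels shared by two neighbouring neurons
print(set().union(*fields) == all_pixels)   # do the fields cover the whole image?
```

Each neuron sees only a 3x3 patch, neighbouring patches share pixels, and the union of all patches is the full image.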
Figure 4: Feed Forward Neural Network Architecture (Source)
Typically, a Convolutional Neural Network has the following layers:
1. Convolutional Layer or Conv Layer
The convolutional layer applies the convolution operation to the input and passes the result to the next layer. Each convolution operation emits the response of an individual neuron for its receptive field only.
2. Pooling Layer or Pool Layer
The pooling layer is introduced to reduce the spatial size of the output produced by the conv layer. It also helps the network detect higher-level details that are composed of lower-level building blocks, e.g., detecting a corner from the intersection of two edges.
3. Fully Connected Layer or FC Layer
The fully connected layer is the layer in which every node is connected to every node in its preceding and succeeding layers, as shown in Figure 4. The primary purpose of the above two layers is to extract information out of an image. The fully connected layer then maps the extracted information to the respective output.
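The three layers described above can be sketched in plain Python as follows. This is a simplified illustration, not a production implementation: it assumes a single-channel image, stride 1 with no padding for the convolution, max pooling over non-overlapping 2x2 windows, and hand-picked weights where a real CNN would learn them. As in most deep learning libraries, the kernel is not flipped, so the "convolution" here is strictly a cross-correlation.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image; each output is one neuron's response
    to its own receptive field (stride 1, no padding, kernel not flipped)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def max_pool(feature_map, size=2):
    """Shrink the map by keeping the largest value in each size x size window."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def fully_connected(features, weights, biases):
    """Connect every input feature to every output node."""
    return [
        sum(f * w for f, w in zip(features, node_weights)) + b
        for node_weights, b in zip(weights, biases)
    ]

# Hypothetical 6x6 image: dark on the left, bright on the right.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
# Hand-picked vertical-edge kernel (learned in a real CNN).
kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

feature_map = conv2d(image, kernel)                    # 6x6 -> 4x4
pooled = max_pool(feature_map)                         # 4x4 -> 2x2
flat = [value for row in pooled for value in row]      # 2x2 -> 4 features
scores = fully_connected(
    flat,
    weights=[[0.25, 0.25, 0.25, 0.25], [-0.25, -0.25, -0.25, -0.25]],
    biases=[0.0, 0.0],
)
print(feature_map, pooled, scores)
```

The convolution responds only where the edge falls inside a receptive field, pooling halves each spatial dimension, and the fully connected step turns the remaining features into one score per output.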
Apart from the above layers, CNNs can also have other components such as a batch normalization layer and dropout. These components can noticeably improve a convolutional neural network model. Each of the main layers is composed of learnable weights that we need to initialize before training. A batch norm layer alleviates a lot of headaches with properly initializing neural networks by explicitly forcing the activations throughout a network to take on a unit Gaussian distribution at the beginning of the training. In a similar way, dropout is an extremely effective and simple regularization technique which, during training, keeps each neuron active only with some probability p and sets it to zero otherwise.
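Both components can be illustrated with a simplified sketch. Two assumptions are worth flagging: the batch norm function below normalizes a single activation across a batch and omits the learnable scale and shift that a real batch norm layer carries, and the dropout function uses the "inverted dropout" variant, which divides the kept values by p so that the expected output is unchanged.

```python
import math
import random

def batch_norm(batch, eps=1e-5):
    """Normalize one activation across a batch to zero mean and unit variance.

    Simplification: the learnable scale and shift parameters are omitted.
    """
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(variance + eps) for x in batch]

def dropout(activations, p, rng):
    """Keep each neuron active with probability p, otherwise set it to zero.

    Kept values are divided by p (inverted dropout); this is meant to be
    applied during training only.
    """
    return [a / p if rng.random() < p else 0.0 for a in activations]

normalized = batch_norm([2.0, 4.0, 6.0, 8.0])
dropped = dropout([1.0] * 10, p=0.5, rng=random.Random(0))
print(normalized)
print(dropped)
```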
The three main layers are stacked on top of each other so that the CNN architecture looks like the following:
Figure 5: Convolutional Neural Network Architecture
Computer Vision Applications
There are many computer vision applications on the market. Below are just a few:
- Automatic inspection (image-based automated inspection), e.g., in manufacturing applications
- Assisting humans in identification tasks (identifying objects or species from their properties), e.g., a species identification system
- Controlling processes (for example, by monitoring robots), e.g., an industrial robot
- Detecting events, e.g., for visual surveillance or people counting
- Modeling objects or environments (for example, using drones to analyze the climatic factors that lead to changes in vegetation), e.g., medical image analysis or topographical modeling
- Navigation, e.g., by an autonomous vehicle or mobile robot
- Organizing information, e.g., for indexing databases of images and image sequences
How to think about a Computer Vision Application
We can think of building a computer vision application as finding tasks that require human visual expertise and deriving patterns from them. If such a task can be automated, it is a candidate for a computer vision application.
We can think of a computer vision application by keeping the following points in mind:
- Adapt Existing Jobs and Look for Modification: Looking at existing jobs for inspiration, we can devise a computer vision-based solution, e.g., computer vision can be used to detect vehicles that break traffic rules, read their number plates, and generate a fine slip for each. We can also look for existing applications that are facing some problems and search for a better solution.
- Brainstorm: We can brainstorm with our colleagues, friends, and family to gather problems and check to see if they can be solved using computer vision.
- Research: Everything will ultimately boil down to research. There is no escaping research when you are looking for ideas. The research will not only help you get new app ideas but will also help you explore the market for already existing applications.
References
- TED Talk by Prof. Fei-Fei Li
- Computer Vision on Wikipedia https://en.wikipedia.org/wiki/Computer_vision
- What is Deep Learning? https://in.mathworks.com/discovery/deep-learning.html
- Convolutional Neural Networks, CS231n by Stanford