This article is republished with the author's permission from GumGum's tech blog.
“Deep learning” is a phrase that gets thrown around a lot these days. It is seen as the next big thing, turning many heads and winning over those who initially dismissed it as a bubble. It is an active area of research, and its applications are being explored in just about everything you can think of.
To be clear, deep learning and neural networks are hardly new. They have been around for decades, but a lack of accessible, affordable computational power and available data long created a major bottleneck. The advent of more sophisticated algorithms, cheaper computation from GPUs, and data flowing in from all directions has since led to what can be considered a deep learning renaissance.
One of the major advantages that neural networks provide over traditional learning algorithms can be understood in terms of performance and amount of data.
As you can see from the chart below, the performance of traditional learning algorithms plateaus as the amount of data increases, while neural networks keep getting better and better. From here on I will assume that you have a basic understanding of how neural networks work, what a neuron is, and how layers of neurons can be connected to build complicated and powerful architectures. If not, I would highly recommend reading this blog first to get an idea.
Deep learning for natural language processing (NLP) is relatively new compared to its usage in, say, computer vision, which employs deep learning models to process images and videos. Before we dive into how deep learning works for NLP, let’s try and think about how the brain probably interprets text.
Take the following sentences as an example:
- “Hi, what’s up?” – This sentence holds some kind of meaning to it that the brain can map to a finite point in its infinite space of understanding.
- “Hi, how are you?” – Most of us will agree that this sentence is very similar to the previous one in terms of meaning and context, which means that the brain potentially has a very similar mapping for the two, i.e., it maps the points representing their meanings very close to each other.
- “Trying to understand DL!” – This sentence holds a meaning very different from the previous two, so its mapping lies some distance from the other two points. This distance represents the difference in the overall meaning of the sentences.
In order to interpret and represent the brain’s mapping, we will say that it has an infinite-dimensional space. That is, our brains are capable of processing and understanding an infinite number of minutely distinct concepts (recalling them is a different matter altogether), and each concept, sentence, document, etc., can be represented as a point in it, or, in other words, as an infinite-dimensional vector.
Now let’s trim that thought a bit to say that this particular brain can only represent concepts in a 5-dimensional vector space (not a very smart brain). This means that every point is a 5-dimensional vector, a way of embedding data.
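To make that concrete, here is a minimal sketch with made-up 5-dimensional vectors for the three example sentences. The numbers are invented purely for illustration, not learned from any data:

```python
import numpy as np

# Made-up 5-d "meaning" vectors for the three example sentences --
# purely illustrative; nothing here is learned from data.
sentences = {
    "Hi, what's up?":           np.array([0.9, 0.1, 0.2, 0.7, 0.3]),
    "Hi, how are you?":         np.array([0.8, 0.2, 0.1, 0.8, 0.3]),
    "Trying to understand DL!": np.array([0.1, 0.9, 0.8, 0.1, 0.6]),
}

greet1 = sentences["Hi, what's up?"]
greet2 = sentences["Hi, how are you?"]
other = sentences["Trying to understand DL!"]

# Similar meanings -> nearby points; different meanings -> distant points.
print(np.linalg.norm(greet1 - greet2) < np.linalg.norm(greet1 - other))  # True
```

Euclidean distance stands in here for the brain's notion of "difference in meaning"; any reasonable distance metric would make the same point.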
Since the smallest unit of meaning in text is the word, let’s try to understand how we embed words.
Simply put, word embeddings allow us to represent words in the form of vectors. But these are not just any vectors. The aim is to represent words via vectors so that similar words or words used in a similar context are close to each other while antonyms are far apart in the vector space.
Here are some words represented in the diagram above:
- Cat and dog: Both cute animals, can be pets, have two eyes, four legs, and one nose
- Audi and BMW: Both powerful expensive German automobile companies
- USC and UCLA: Both premier universities located in Los Angeles
The words that make up a pair (i.e., cat and dog) are very similar to each other, so they are mapped close together. However, the pairs themselves (i.e., cat and dog vs. Audi and BMW) are very different from each other, so they are mapped far apart.
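This notion of "close" and "far" is usually measured with cosine similarity. Here is a minimal sketch with hand-crafted toy vectors (real embeddings are learned from data and typically have 50 to 300 dimensions):

```python
import numpy as np

# Toy, hand-crafted 3-d vectors purely for illustration -- real
# embeddings are learned from a corpus, not written by hand.
vectors = {
    "cat":  np.array([0.9, 0.8, 0.1]),
    "dog":  np.array([0.8, 0.9, 0.2]),
    "audi": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim_cat_dog = cosine_similarity(vectors["cat"], vectors["dog"])
sim_cat_audi = cosine_similarity(vectors["cat"], vectors["audi"])
print(sim_cat_dog > sim_cat_audi)  # True: similar words score higher
```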
Word embeddings can also be trained to identify relations such as:
- KING - MAN + WOMAN = QUEEN
- PARIS - FRANCE + ENGLAND = LONDON
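These relations fall out of simple vector arithmetic. Here is a sketch with hand-crafted 2-dimensional vectors whose dimensions are deliberately chosen to encode "royalty" and "maleness", so the analogy works by construction; real embeddings learn such directions from data rather than having them designed:

```python
import numpy as np

# Hand-crafted 2-d vectors (dimensions: "royalty", "maleness"), chosen
# so the analogy works by construction -- illustrative only.
vecs = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, 0.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, 0.0]),
    "car":   np.array([0.2, 0.6]),
    "apple": np.array([0.1, 0.3]),
}

# KING - MAN + WOMAN should land near QUEEN.
target = vecs["king"] - vecs["man"] + vecs["woman"]

# Nearest word to the result, excluding the three query words.
candidates = {w: v for w, v in vecs.items() if w not in ("king", "man", "woman")}
answer = min(candidates, key=lambda w: np.linalg.norm(candidates[w] - target))
print(answer)  # queen
```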
There are a few techniques for learning these embeddings from a large enough corpus (e.g., Wikipedia), the most prevalent being word2vec and GloVe. We will go over both below.
Word2vec is an algorithm created by Google that utilizes two different types of model architecture for computing vector representations of words:
- Continuous Bag-of-Words (CBOW)
- Skip-gram
In the CBOW model, the aim is to fill in the missing word given its neighboring context. For example, when given the words “When”, “in”, “____”, “speak”, “French,” the algorithm is trained so that it knows that “France” is the obvious choice.
In the skip-gram model, the aim is to predict the context. For example, when given the word “France,” the algorithm is trained so that it predicts “When,” “in,” “speak,” and “French” as its neighboring words.
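Training data for both models comes from sliding a context window over raw text. Here is a sketch of how skip-gram (target, context) pairs could be generated; the sentence and window size are illustrative choices:

```python
# A sketch of generating skip-gram training pairs from raw text --
# the sentence and window size are illustrative choices.
sentence = "when in france speak french".split()
window = 2  # how many neighbors on each side count as "context"

pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

# For the center word "france", the model learns to predict its neighbors.
print([ctx for tgt, ctx in pairs if tgt == "france"])
# ['when', 'in', 'speak', 'french']
```

CBOW uses the same pairs in the opposite direction: the context words are the input and the center word is the prediction target.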
Let’s see what’s really happening in the above skip-gram diagram:
- Input Layer:
- Since we want to understand how to represent words, words are our input here. However, we can’t just feed a word in string form to a neural network.
- The way we represent individual words is through a unique index mapping, i.e. each word has a unique index. For instance, if we have V distinct words, then our objective is to learn the representation of each of these V words/indexes in the form of some D dimensional vector.
- We one-hot encode the word indexes, i.e. each word goes from being an index into a V dimensional vector of zeroes, with 1 only at the index it represents.
- So “France” is represented by something like [0, 0, 1, 0, 0, …, 0], a 1×V vector with a 1 at index 2, the index for “France”.
- Projection Layer:
- Since our vocabulary size is V and we want to learn a D dimensional representation for each word in the vocabulary, the projection layer is a V*D matrix.
- Output Layer:
- This layer takes the output of the Projection layer and creates a probability distribution using a softmax function across the V words. The learning phase tunes the projection layer so that eventually words like “When,” “in,” “speak,” and “French” have a higher probability than other words in V when “France” is the input.
After the training phase, the projection layer is picked up and used as the word embeddings for the V words. It simply becomes a lookup table where the i-th row is the embedding for the word with index i.
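The input and projection layers above can be sketched in a few lines of NumPy. The tiny vocabulary and random matrix are stand-ins for illustration, not a trained model:

```python
import numpy as np

np.random.seed(0)

# Toy vocabulary (V = 5); real vocabularies hold many thousands of words.
vocab = ["when", "in", "france", "speak", "french"]
V, D = len(vocab), 3  # D = embedding dimension (toy value)
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """A 1xV vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(V)
    vec[word_to_index[word]] = 1.0
    return vec

# Stand-in for the trained V x D projection matrix.
projection = np.random.randn(V, D)

# Multiplying a one-hot vector by the projection matrix just selects one
# row -- which is why the trained matrix doubles as a lookup table.
via_matmul = one_hot("france") @ projection
via_lookup = projection[word_to_index["france"]]
print(np.allclose(via_matmul, via_lookup))  # True
```

The equivalence shown in the last two lines is the reason the projection layer can be used directly as an embedding lookup table after training.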
GloVe is an algorithm developed by Stanford researchers who argue that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning.
Consider the following example:
Target words: ice, steam
Probe words: solid, gas, water, fashion
As you would expect, ice co-occurs with solid way more often than with gas, just like steam co-occurs more frequently with gas than with solid, and both ice and steam co-occur more frequently with water (being a common feature) and seldom with an unrelated word like fashion.
The ratios shown in the third row of the diagram really start to filter out the noise from non-discriminative words like water and fashion. Really large values correlate with properties of the numerator (ice), and really small values correlate with properties of the denominator (steam). This demonstrates how simple ratios can capture meaning, in this case thermodynamic phases: solid goes with ice and gas goes with steam.
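The ratio trick is easy to verify numerically. The probabilities below are made-up, order-of-magnitude values in the spirit of the ice/steam example, not figures from the GloVe paper:

```python
# Made-up, order-of-magnitude co-occurrence probabilities: how often
# each probe word appears near "ice" vs. near "steam".
p_ice   = {"solid": 2e-4, "gas": 7e-5, "water": 3e-3, "fashion": 2e-5}
p_steam = {"solid": 2e-5, "gas": 8e-4, "water": 2e-3, "fashion": 2e-5}

ratios = {k: p_ice[k] / p_steam[k] for k in p_ice}
for word, r in ratios.items():
    print(f"{word:>8}: {r:6.2f}")

# Large ratio  -> the probe relates to ice (solid)
# Small ratio  -> the probe relates to steam (gas)
# Ratio near 1 -> the probe is shared (water) or irrelevant (fashion)
```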
The usage of GloVe is the same as word2vec, just with a different way of learning the V*D embedding matrix.
Word embeddings can be considered an integral part of NLP models. As an unsupervised learning technique, they can be trained on any corpus without the need for human annotation. They provide a good starting point for training any neural network that takes text as input (you will have to convert the text to indices first), since they capture similarity and relations like the ones we saw in the examples above.
To sum it all up in one line:
Given a word, we map it to an index and learn a D-dimensional vector for that index, such that these vectors capture similarity and relations between words.