Word Embeddings: A Natural Language Processing Crash Course



The field of natural language processing (NLP) makes it possible to uncover patterns in large amounts of language data, from online reviews to audio recordings. But before data scientists can really dig into an NLP problem, they must lay the groundwork that helps a model make sense of the different units of language it will encounter.

Word embeddings are a set of feature engineering techniques widely used in predictive NLP modeling, particularly in deep learning applications. Word embeddings transform sparse vector representations of words into a dense, continuous vector space, enabling you to identify similarities between words and phrases — on a large scale — based on their context.

In this piece, I'll explain the reasoning behind word embeddings and demonstrate how to use these techniques to create clusters of similar words using data from 500,000 Amazon reviews of food. You can download the dataset to follow along.

Word Embeddings: How They Work

In a typical bag-of-words model, each word is considered a unique token with no relationship to other words. For example, the words "salt" and "seasoning" will be assigned unique IDs even though they may frequently appear within the same context or sentence. Word embedding is a set of feature engineering techniques that map sparse word vectors into continuous space based on the surrounding context. You can think of this process as embedding a high-dimensional word-vector representation into a lower-dimensional space.

This vector representation provides convenient properties for comparing words or phrases. For example, if "salt" and "seasoning" appear within the same context, the model will indicate that "salt" is conceptually closer to "seasoning" than, say, "chair."
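This closeness is usually measured with cosine similarity between the embedding vectors. Here's a minimal sketch with hypothetical, hand-picked 3-d vectors (real embeddings are learned and have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors: 1.0 means identical direction."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy vectors for illustration only -- not learned embeddings
salt      = np.array([0.9, 0.1, 0.2])
seasoning = np.array([0.8, 0.2, 0.3])
chair     = np.array([0.1, 0.9, 0.8])

print(cosine_similarity(salt, seasoning))  # high: the vectors point the same way
print(cosine_similarity(salt, chair))      # much lower
```

With learned embeddings, words sharing contexts end up with high cosine similarity in exactly this sense.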

There are several existing models for constructing word-embedding representations. Google's word2vec is one of the most widely used implementations due to its training speed and performance. Word2vec is a predictive model, which means that instead of utilizing word counts à la latent Dirichlet allocation (LDA), it is trained to predict a target word from the context of its neighboring words. The model first encodes each word using one-hot encoding, then feeds it into a hidden layer using a matrix of weights; the output of this process is the target word. The word embedding vectors are actually the weights of this fitted model. To illustrate, here's a simple visual:
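The key step can also be sketched in code: multiplying a one-hot vector by the hidden-layer weight matrix simply selects one row of that matrix, and that row is the word's embedding. The toy vocabulary and random weights below are purely illustrative:

```python
import numpy as np

vocab = ['salt', 'seasoning', 'chair']
word_to_id = {w: i for i, w in enumerate(vocab)}

embedding_dim = 4
rng = np.random.default_rng(0)
W = rng.normal(size=(len(vocab), embedding_dim))  # hidden-layer weight matrix

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

# A one-hot vector times W picks out a single row of W:
# that row is the word's embedding vector.
emb = one_hot('salt') @ W
assert np.allclose(emb, W[word_to_id['salt']])
```

Training nudges the rows of W so that words appearing in similar contexts end up with similar rows.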

Word2vec includes two "flavors" of word embedding model: continuous bag of words (CBOW) and skip-gram. The CBOW implementation looks at a sliding window of n words around the target word in order to make a prediction. The skip-gram model, on the other hand, does the opposite — it predicts the surrounding context words given a target word. For more information on skip-gram models, check out this academic paper.
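The difference between the two flavors is easiest to see in the training pairs they generate. A rough sketch (the function name and toy sentence are mine, for illustration):

```python
def training_pairs(tokens, window=2, skip_gram=False):
    """Build (input, target) training pairs from a token list."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if skip_gram:
            # skip-gram: predict each context word from the target word
            pairs += [(target, c) for c in context]
        else:
            # CBOW: predict the target word from its whole context
            pairs.append((context, target))
    return pairs

tokens = ['the', 'salt', 'adds', 'seasoning']
print(training_pairs(tokens, window=1))                  # CBOW pairs
print(training_pairs(tokens, window=1, skip_gram=True))  # skip-gram pairs
```

For the same window, skip-gram produces more (and simpler) pairs, which is part of why it tends to do better on rare words, while CBOW trains faster.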

Getting Started

So, why should you care about word embeddings? Word embeddings can be used for a variety of tasks in deep learning, such as sentiment analysis, syntactic parsing, named-entity recognition, and more. They can also:

  • Provide a more sophisticated way to represent words in numerical space by preserving word-to-word similarities based on context.
  • Provide a measure of similarity between words or phrases.
  • Be used as features in classification tasks.
  • Improve model performance.

Many interesting relationships can be uncovered using word embeddings, the most famous example being king - man + woman = queen. So, let's go ahead and try word embeddings out on the Amazon Fine Foods dataset!
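The famous analogy is just vector arithmetic followed by a nearest-neighbor lookup. Here's a toy sketch with hand-constructed 2-d vectors (not learned — they're chosen so the "gender" and "royalty" directions are consistent, purely to show the mechanics):

```python
import numpy as np

# Hand-made toy embeddings: dimension 0 ~ "maleness", dimension 1 ~ "royalty"
vecs = {
    'king':  np.array([0.9, 0.9]),
    'queen': np.array([0.1, 0.9]),
    'man':   np.array([0.9, 0.1]),
    'woman': np.array([0.1, 0.1]),
    'chair': np.array([0.5, 0.3]),
}

# king - man + woman = [0.1, 0.9]
target = vecs['king'] - vecs['man'] + vecs['woman']

def nearest(v, exclude):
    """Closest vocabulary word to v, skipping the query words."""
    return min((w for w in vecs if w not in exclude),
               key=lambda w: np.linalg.norm(vecs[w] - v))

print(nearest(target, exclude={'king', 'man', 'woman'}))  # -> queen
```

gensim exposes the same operation as `model.most_similar(positive=['king', 'woman'], negative=['man'])`, which we'll use below on the food reviews.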

Let's start with reading in our corpus. I'll be using a module available to DataScience.com customers called the Voice of the Customer Playbook that contains code for common text processing and modeling tasks, such as removing bad characters or stemming. It can also be used for topic modeling and opinion mining tasks.

# imports
%matplotlib inline

import os
import pandas as pd
import numpy
import matplotlib.pyplot as plt
import string
import re
from gensim import corpora
from gensim.models import Phrases
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

from ds_voc.text_processing import TextProcessing

# read in the reviews (path to the downloaded CSV file)
raw_df = pd.read_csv('Reviews.csv')

# sample for speed
raw_df = raw_df.sample(frac=0.1, replace=False)
print(raw_df.shape)

# grab review text
raw = list(raw_df['Text'])
print(len(raw))

Let's tokenize our sample and do the usual cleaning steps: removing bad characters, removing stop words, and stemming:

# word2vec expects a list of lists: each document is a list of tokens
te = TextProcessing()

cleaned = [te.default_clean(d) for d in raw]
sentences = [te.stop_and_stem(c) for c in cleaned]

Now we are ready to fit a model. Here we are using the gensim implementation:

from gensim.models import Word2Vec

model = Word2Vec(sentences=sentences, # tokenized sentences, list of lists of strings
                 size=300,     # dimensionality of the embedding vectors
                 workers=4,    # number of worker threads
                 min_count=20, # minimum frequency per token, filters out rare words
                 sample=0.05,  # downsampling rate for frequent words
                 sg=0,         # use skip-gram? if 0, use CBOW
                 hs=0)         # use hierarchical softmax? if 0, use negative sampling

X = model[model.wv.vocab]

Voila! That was easy. Now, what kind of questions can you ask this model? Recall that words occurring in similar contexts will be deemed similar to each other, thereby forming word "clusters." Start with a simple example by checking what kind of terms the model considers similar to "peanut":

print (model.most_similar('peanut'))

[(u'butter', 0.9887357950210571), (u'fruit', 0.9589880108833313), (u'crunchi', 0.9448184967041016), (u'potato', 0.9327490329742432), (u'textur', 0.9302218556404114), (u'nut', 0.9176014065742493), (u'tasti', 0.9175000190734863), (u'sweet', 0.9135239124298096), (u'appl', 0.9122942686080933), (u'soft', 0.9103059768676758)]

Not bad; the most similar token is, of course, "butter." The other tokens indicated here also seem intuitive. "Fruit" and "apple," however, are possibly the result of customers mentioning fruit jams and spread when they mention peanut butter. Let's try a few more examples:

print (model.most_similar('coffee'))

[(u'k', 0.8691866397857666), (u'starbuck', 0.862629771232605), (u'keurig', 0.85813969373703), (u'decaf', 0.8456668853759766), (u'blend', 0.840221643447876), (u'bold', 0.8374124765396118), (u'cup', 0.8330360651016235), (u'brew', 0.8262926340103149), (u'espresso', 0.8225802183151245), (u'roast', 0.812541127204895)]

print (model.most_similar('spice'))

 [(u'refresh', 0.9925233721733093), (u'caramel', 0.9756978750228882), (u'pepper', 0.9739495515823364), (u'cherri', 0.9737452268600464), (u'slightli', 0.9729464054107666), (u'cinnamon', 0.9727376699447632), (u'lemon', 0.9724155068397522), (u'blueberri', 0.9717040061950684), (u'sour', 0.971449613571167), (u'cocoa', 0.9712052345275879)]

Note that the most similar words to "spice" are "refresh" and "caramel"; "caramel" likely appears because combinations like "caramel apple spice" are popular. You can extend these queries not only to include multiple words in the context, but also to exclude certain words. For example, I am interested in protein-rich snacks, but I don't want to take a protein supplement:

print (model.most_similar(['snack', 'protein'], negative=['supplement']))
Here's what the model spits out:

[(u'chip', 0.7655218839645386), (u'bar', 0.7496042251586914), (u'potato', 0.7473998069763184), (u'peanut', 0.741823136806488), (u'feel', 0.7318717241287231), (u'cereal', 0.7217452526092529), (u'nut', 0.716484546661377), (u'butter', 0.7104200124740601), (u'healthi', 0.7084594964981079), (u'low', 0.7055443525314331)]

The model found that "chips," "bars," "peanuts," "nut," and terms like "healthy" live in the same word cluster as "snack" and "protein," but not in the same cluster as "supplement."

Of course, these are all simple examples, but don't forget that we trained the model on a small sample of reviews. I would advise you to extend the dataset in order to build more robust word embeddings.

Visualizing Word Vectors

To visualize the resulting word embeddings, we can use t-SNE, a dimensionality reduction technique that projects the embedding vectors into two dimensions:

# visualize food data
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X)

plt.rcParams['figure.figsize'] = [10, 10]
plt.scatter(X_tsne[:, 0], X_tsne[:, 1])       

We can see some clusters forming. Let's label each point and take a look at the resulting clusters. You can also use the bokeh library for interactive plots that allow zooming.

from bokeh.plotting import figure, show
from bokeh.io import push_notebook, output_notebook
from bokeh.models import ColumnDataSource, LabelSet

def interactive_tsne(text_labels, tsne_array):
    '''makes an interactive scatter plot with text labels for each point'''
    # define a dataframe to be used by bokeh context
    bokeh_df = pd.DataFrame(tsne_array, text_labels, columns=['x', 'y'])
    bokeh_df['text_labels'] = bokeh_df.index

    # interactive controls to include in the plot
    TOOLS = "hover, zoom_in, zoom_out, box_zoom, undo, redo, reset, box_select"

    p = figure(tools=TOOLS, plot_width=700, plot_height=700)

    # define data source for the plot
    source = ColumnDataSource(bokeh_df)

    # scatter plot
    p.scatter('x', 'y', source=source, fill_alpha=0.6)

    # text labels
    labels = LabelSet(x='x', y='y', text='text_labels', y_offset=8,
                      text_font_size="8pt", text_color="#555555",
                      source=source, text_align='center')
    p.add_layout(labels)

    # show plot inline
    output_notebook()
    show(p)

Some of these make intuitive sense, such as the cluster containing decaf, bean, and french in the lower left corner:

Additional Tips

We can also improve the word embeddings by adding bigrams and/or part-of-speech tags into the model. Part-of-speech tagging can be useful in situations where the same word may have multiple meanings. For example, seasoning can be both a noun and a verb and has different meanings depending on the part of speech.

import nltk

sent_w_pos = [nltk.pos_tag(d) for d in sentences]
sents = [[tup[0] + tup[1] for tup in d] for d in sent_w_pos]

model_pos = Word2Vec(sentences=sents,
                     sg=0,
                     hs=0)

X = model_pos[model_pos.wv.vocab]

We can also add bigrams to our model to allow for frequent word pairs to be grouped:

bigrams = Phrases(sentences)

model = Word2Vec(sentences=bigrams[sentences],
                 sg=0,
                 hs=0)

X = model[model.wv.vocab]

In Conclusion

Now you have some basic insight into word embeddings and can generate vectors using word2vec in gensim. This useful technique should be in your NLP toolbox, as it can come in handy in a variety of modeling tasks. For more information, request a demo of the DataScience.com Platform to see what playbooks, like our Voice of the Customer Playbook, and modeling resources are available to our customers.