So, why should you care about word embeddings? Word embeddings can be used for a variety of tasks in deep learning, such as sentiment analysis, syntactic parsing, named-entity recognition, and more. They can also:
- Provide a more sophisticated way to represent words in numerical space by preserving word-to-word similarities based on context.
- Provide a measure of similarity between words or phrases.
- Be used as features in classification tasks.
- Improve model performance.
Many interesting relationships can be uncovered using word embeddings, the most famous example being king - man + woman = queen. You can try that analogy yourself with one of gensim's pretrained models; a quick sketch, using gensim's downloader module and the "glove-wiki-gigaword-100" vectors as one example:
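import gensim.downloader as api
# pretrained GloVe vectors; downloaded automatically on first use
glove = api.load('glove-wiki-gigaword-100')
# "king" - "man" + "woman": the top hit should be "queen"
print(glove.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
So, let's go ahead and try word embeddings out on the Amazon Fine Foods dataset!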
Let's start by reading in our corpus. I'll be using a module available to DataScience.com customers called the Voice of the Customer Playbook that contains code for common text processing and modeling tasks, such as removing bad characters or stemming. It can also be used for topic modeling and opinion mining tasks.
import pandas as pd
import matplotlib.pyplot as plt
from gensim import corpora
from gensim.models import Phrases
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from ds_voc.text_processing import TextProcessing
# read in the reviews; the filename here assumes the standard
# Reviews.csv from the Kaggle Amazon Fine Food Reviews dataset
raw_df = pd.read_csv('Reviews.csv')
# sample for speed
raw_df = raw_df.sample(frac=0.1, replace=False)
# grab review text
raw = list(raw_df['Text'])
Let's tokenize our sample and do the usual cleaning steps: removing bad characters, dropping stop words, and stemming:
# word2vec expects a list of lists: each document is a list of tokens
te = TextProcessing()
cleaned = [te.default_clean(d) for d in raw]
sentences = [te.stop_and_stem(c) for c in cleaned]
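If you don't have access to the playbook, a rough stand-in for those two steps, built on the nltk imports from earlier, might look like this (the playbook's actual cleaning logic may differ):
stemmer = PorterStemmer()
stops = set(stopwords.words('english'))  # requires nltk.download('stopwords')
# lowercase and drop non-alphanumeric characters, then remove stop words and stem
cleaned_alt = [''.join(ch for ch in doc.lower() if ch.isalnum() or ch.isspace()) for doc in raw]
sentences_alt = [[stemmer.stem(tok) for tok in doc.split() if tok not in stops] for doc in cleaned_alt]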
Now we are ready to fit a model. Here we are using the gensim implementation:
from gensim.models import Word2Vec
model = Word2Vec(sentences=sentences, # tokenized sentences, list of lists of strings
                 size=300,            # size of embedding vectors
                 workers=4,           # number of worker threads
                 min_count=20,        # minimum frequency per token, filtering rare words
                 sample=0.05,         # weight of downsampling common words
                 sg=0,                # use skip-gram? if 0, then CBOW
                 hs=0)                # use hierarchical softmax? if 0, then negative sampling
# matrix of embedding vectors, one row per vocabulary token
X = model[model.wv.vocab]
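As a quick sanity check, you can confirm how many tokens survived the min_count filter and the shape of the embedding matrix (this uses the same pre-4.0 gensim API as the code above):
print(len(model.wv.vocab))  # number of tokens kept after filtering
print(X.shape)              # (vocabulary size, 300)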
Voila! That was easy. Now, what kind of questions can you ask this model? Recall that words occurring in similar contexts will be deemed similar to each other, thereby forming word "clusters." Start with a simple example by checking what kind of terms the model considers similar to "peanut":
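print(model.most_similar('peanut'))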
[(u'butter', 0.9887357950210571), (u'fruit', 0.9589880108833313), (u'crunchi', 0.9448184967041016), (u'potato', 0.9327490329742432), (u'textur', 0.9302218556404114), (u'nut', 0.9176014065742493), (u'tasti', 0.9175000190734863), (u'sweet', 0.9135239124298096), (u'appl', 0.9122942686080933), (u'soft', 0.9103059768676758)]
Not bad; the most similar token is, of course, "butter." The other tokens indicated here also seem intuitive. "Fruit" and "apple," however, are possibly the result of customers mentioning fruit jams and spreads when they mention peanut butter. Let's try a few more examples:
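First up, "coffee" (remember our tokens are stemmed, so it appears in the vocabulary as "coffe"):
print(model.most_similar('coffe'))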
[(u'k', 0.8691866397857666), (u'starbuck', 0.862629771232605), (u'keurig', 0.85813969373703), (u'decaf', 0.8456668853759766), (u'blend', 0.840221643447876), (u'bold', 0.8374124765396118), (u'cup', 0.8330360651016235), (u'brew', 0.8262926340103149), (u'espresso', 0.8225802183151245), (u'roast', 0.812541127204895)]
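Next, "spice":
print(model.most_similar('spice'))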
[(u'refresh', 0.9925233721733093), (u'caramel', 0.9756978750228882), (u'pepper', 0.9739495515823364), (u'cherri', 0.9737452268600464), (u'slightli', 0.9729464054107666), (u'cinnamon', 0.9727376699447632), (u'lemon', 0.9724155068397522), (u'blueberri', 0.9717040061950684), (u'sour', 0.971449613571167), (u'cocoa', 0.9712052345275879)]
Note that the most similar words to "spice" are "refresh" and "caramel"; the latter makes sense given that "caramel apple spice," for example, is a popular combination. You can extend these queries not only to include multiple words as context, but also to exclude certain words. For example, I am interested in protein-rich snacks, but I don't want to take a protein supplement:
print(model.most_similar(['snack', 'protein'], negative=['supplement']))
Here's what the model spits out:
[(u'chip', 0.7655218839645386), (u'bar', 0.7496042251586914), (u'potato', 0.7473998069763184), (u'peanut', 0.741823136806488), (u'feel', 0.7318717241287231), (u'cereal', 0.7217452526092529), (u'nut', 0.716484546661377), (u'butter', 0.7104200124740601), (u'healthi', 0.7084594964981079), (u'low', 0.7055443525314331)]
The model found that "chips," "bars," "peanuts," "nut," and terms like "healthy" live in the same word cluster as "snack" and "protein," but not in the same cluster as "supplement."
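To actually see these clusters rather than just query them, you can project the embedding matrix X from earlier down to two dimensions and plot it with the matplotlib import from the beginning. A minimal sketch, assuming scikit-learn is installed:
from sklearn.decomposition import PCA
# reduce the 300-dimensional embedding vectors to 2-D for plotting
coords = PCA(n_components=2).fit_transform(X)
plt.scatter(coords[:, 0], coords[:, 1], s=2)
plt.title('2-D PCA projection of the review vocabulary')
plt.show()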
Of course, these are all simple examples, but don't forget that we trained the model on a small sample of reviews. I would advise you to extend the dataset in order to build more robust word embeddings.