Using Natural Language Processing to Analyze Customer Feedback in Hotel Reviews

 

Prerequisites

Experience with the specific topic: Intermediate

Professional experience: Some industry experience

Introduction

With a multitude of social feedback platforms available, such as Yelp or TripAdvisor, more and more customers turn to reviews before deciding on their next hotel, restaurant, or service. According to a TripAdvisor study, 90% of travelers say reviews are influential when deciding where to stay.

To a business, customer reviews can contain loads of information that help to identify and fix existing issues, reward loyal customers, and win back dissatisfied ones. However, reading through hundreds of reviews is a tedious and time-consuming task. Instead, we can apply natural language processing to quickly summarize and sort reviews, delivering relevant information for appropriate departments to act on. In this blog post we demonstrate how to build and utilize such a model on a dataset of reviews for a luxury hotel.

Business Impact

Summarizing and categorizing reviews by their subject matter can save hundreds of reading hours. Instead, one can train a model on historical reviews and extract main topics mentioned. Then these topics can be used to quickly filter reviews and flag the ones requiring action. For example, the output of these insights can be used to address concerns about concierge service or to develop an approach for reaching out to unhappy customers. New incoming reviews can be automatically processed by the model, tagged with a topic, and passed along to appropriate channels for action. This also enables addressing issues quickly, and potentially retaining customers who otherwise would have churned.

The goal of this analysis is twofold:

  • Reduce the number of human hours spent reading and categorizing reviews
  • Increase customer retention by addressing concerns in a timely manner

Saving human hours of reading through reviews will help to improve other customer satisfaction metrics such as:

  • Response time
  • Number of review replies and complaints addressed
  • Customer likelihood to recommend (net promoter score)
  • Customer ratings
  • Customer retention

Instead of reading and trying to identify issues, the focus can shift to issue resolution and personalizing customer experience.

Method

We will use a technique called topic modeling, more specifically Latent Dirichlet Allocation (LDA). LDA is one of the most widely used techniques for topic modeling. It is a generative probabilistic model, meaning there is an assumed underlying probability distribution that generates the set of documents at hand. This makes it an appealing modeling option, as the generative model allows for scoring unseen documents.

LDA uses a hierarchical Bayesian model to represent each document as a mixture of unobserved topics. The goal of the algorithm is to discover these topics, each of which is represented by a distribution over words.

In a nutshell, Latent Dirichlet Allocation assumes the following process for generating documents:

For each document in corpus D:

  • Choose the document length $N \sim \mathrm{Poisson}(\xi)$
  • Choose the document-level topic parameter $\theta \sim \mathrm{Dir}(\alpha)$. This controls the topic mixture within the document
  • For each of the $N$ words $w_n$:
    • Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$ according to the mixture distribution chosen above
    • Choose a word $w_n \sim p(w_n \mid z_n, \beta)$ from that topic

Note that the Poisson assumption is there for completeness; as Blei puts it, "the Poisson assumption is not critical to anything that follows and more realistic document length distributions can be used as needed" (Blei 2003).
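To make this generative story concrete, here is a minimal simulation of the process above. The vocabulary, number of topics, and parameter values below are toy assumptions chosen purely for illustration; they are not outputs of the model trained later in this post.

import numpy

numpy.random.seed(0)

vocab = ["pool", "beach", "staff", "room", "dinner", "bar"]   # toy vocabulary
K = 2                                                         # assumed number of topics
alpha = numpy.ones(K) * 0.5                                   # Dirichlet prior on topic mixtures
# beta[k] is the word distribution for topic k (made-up values)
beta = numpy.array([[0.4, 0.3, 0.1, 0.1, 0.05, 0.05],
                    [0.05, 0.05, 0.1, 0.2, 0.3, 0.3]])

def generate_document(xi=8):
    '''Generate one toy document following the LDA generative process.'''
    n_words = numpy.random.poisson(xi)                  # N ~ Poisson(xi)
    theta = numpy.random.dirichlet(alpha)               # theta ~ Dir(alpha)
    words = []
    for _ in range(n_words):
        z = numpy.random.choice(K, p=theta)             # topic z_n ~ Multinomial(theta)
        w = numpy.random.choice(len(vocab), p=beta[z])  # word w_n ~ p(w | z_n, beta)
        words.append(vocab[w])
    return words

print(generate_document())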

LDA requires the number of topics to be specified as an input, and it characterizes each topic by a probability distribution over words. The trained model can also be used to extract a probability distribution over topics for each document.
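As a quick preview of both outputs, the snippet below sketches the corresponding gensim calls. It assumes a trained LdaModel named lda and a bag-of-words corpus named corpus already exist; the code that actually builds them appears later in this post.

# Sketch only: assumes `lda` (a trained gensim LdaModel) and `corpus`
# (a bag-of-words corpus) are already defined, as done later in this post.

# each topic as a probability distribution over words (top 10 words per topic)
for topic_id, words in lda.show_topics(num_topics=-1, num_words=10, formatted=False):
    print(topic_id, words)

# distribution over topics for a single (possibly unseen) document
print(lda.get_document_topics(corpus[0]))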

Limitations

  • The algorithm requires the number of topics K to be supplied. This parameter should be chosen or tuned to optimize some metric of topic quality, such as log-perplexity or topic coherence.

  • LDA is a "bag of words" model: all words in a document are assumed to be interchangeable, so the method does not guarantee that topics will appear coherent to a human. In fact, human intervention is often required to interpret LDA output and assign a more sensible label to each topic. We will see how this is the case later in the analysis.

Data

The dataset used here consists of 3,000 TripAdvisor reviews for a luxury hotel, with each review averaging 372 words.

To prepare raw reviews for modeling, the following data transformations were used:

  • Removed bad characters such as "~", "/", etc
  • Text converted to lower case
  • Stop words removed
  • Filtered rare words occurring fewer than 5 times
  • Filtered overly frequent words occurring in more than 20% of reviews

Utility functions used for these transformations are included below.

%matplotlib inline
import sys
sys.path.append('../')

import os
import pandas as pd
import numpy # need for gensim with no alias
import matplotlib.pyplot as plt
import seaborn as sns  # used for the topic term bar charts below
import string
import re
import logging 
from gensim import corpora 
from gensim.models import Phrases
from gensim.models.ldamodel import LdaModel
from gensim.models.ldamulticore import LdaMulticore
import pyLDAvis.gensim
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
nltk.download('stopwords')


import warnings
warnings.filterwarnings('ignore')


# Utility Functions
def default_clean(text):
    '''
    Removes default bad characters
    '''
    if not pd.isnull(text):
        # keep only printable characters
        text = ''.join(filter(lambda x: x in string.printable, text))
        bad_chars = set(["@", "+", '<br>', '<br />', '/', "'", '"', '\\',
                         '(', ')', '<p>', '\\n', '<', '>', '?', '#', ',',
                         '.', '[', ']', '%', '$', '&', ';', '!', ';', ':',
                         '-', "*", "_", "=", "}", "{"])
        for char in bad_chars:
            text = text.replace(char, " ")
        # strip digits
        text = re.sub(r'\d+', "", text)
        
    return text

def stop_and_stem(text, stem=True, stemmer = PorterStemmer()):
    '''
    Removes stopwords and does stemming
    '''
    stoplist = stopwords.words('english')

    if stem:
        text_stemmed = [[stemmer.stem(word) for word in document.lower().split()
                         if word not in stoplist] for document in text]
    else:
        text_stemmed = [[word for word in document.lower().split()
                 if word not in stoplist] for document in text]

    return text_stemmed


def make_corpus(parsed_text, bigrams=True, filter_extremes=False, below=5, above=0.1):
    '''
    Prepares corpus and dictionary with options for removing outliers or using bigrams
    '''

    if bigrams:
        # detect common bigrams and add them to the token stream
        bigram_model = Phrases(parsed_text)
        parsed_text = [bigram_model[doc] for doc in parsed_text]
        corpora_dict = corpora.Dictionary(parsed_text)
    else:
        corpora_dict = corpora.Dictionary(parsed_text)

    # Filter the dict to remove rare and overly frequent words
    if filter_extremes:
        print("Size of dict before filter: %s" % len(corpora_dict))
        corpora_dict.filter_extremes(no_below=below, no_above=above)
        print("Size of dict after filter: %s" % len(corpora_dict))


    # Convert the cleaned documents into bag of words
    corpus = [corpora_dict.doc2bow(t) for t in parsed_text]
    
    return corpora_dict, corpus


def data_transformation(input_data, bigrams=True, stem_flag=True, filter_extremes=False, below=5, above=0.1):
    '''
    Combines all data transformation steps: clean, drop stopwords, stem, make corpus
    '''

    clean_reviews = [default_clean(d).lower() for d in input_data]
    stemmed = stop_and_stem(clean_reviews, stem=stem_flag)
    dictn, data = make_corpus(stemmed, bigrams=bigrams, filter_extremes=filter_extremes, 
                              below=below, above=above)

    return data, dictn 
# read the raw review text file (dir_path points to the review data file)
f = open('../' + dir_path, 'r')
raw = f.read()
f.close()

# process the text file
lines = raw.splitlines()  # split on lines and carriage returns \n\r

reviews = [line.strip('') for line in lines if '' in line]

# clean up raw reviews and prepare dataset for model
corpus, dictionary = data_transformation(reviews, bigrams=False, stem_flag=False, filter_extremes=True, below=5, above=0.2)

Size of dict before filter: 20465
Size of dict after filter: 5754

Training the Model

We will use gensim, nltk, and pyLDAvis for building and visualizing the model. These libraries are part of the NLP dependency collection on the DataScience.com Platform, and are already pre-installed within a working session.

Note that when constructing the model, one is faced with several modeling choices that can help improve topic quality:

  • Including bigrams or trigrams in the vocabulary
  • Filtering rare words
  • Filtering overly frequent words that might not carry much information
  • Specifying training parameters, such as the number of passes through the corpus

This can all be done with the utility functions included here for convenience. First, let's start by choosing the number of topics (K) to cluster the reviews into. We will iterate through K = 2, ..., 20, fit an LDA model for each K on the training set, and calculate topic coherence on the test set. Topic coherence is a metric of topic quality that has been found to correlate with human judgement; higher topic coherence is better. We see that in this case, K = 3 has the highest coherence.

# some utility functions for running comparisons

def tuple_ith_element(lst, element=1):
    return [x[element] for x in lst]
    
    
def split_test_train(crps, train_prop=0.7):
    '''Randomly split a corpus into training and test sets.'''
    in_train = numpy.random.choice(len(crps), size=int(len(crps)*train_prop), replace=False)
    tr = [crps[i] for i in in_train]
    s = set(in_train)
    in_test = [i for i in range(len(crps)) if i not in s]
    ts = [crps[i] for i in in_test]

    return tr, ts

    
def search_param_space(corpus_iter, dict_iter, parallel=False,
                       metric = 'coherence', ntopics=[5], num_passes=1):
    ''' fit LDA model, with faster option using multicore implementation
        and calculate topic coherence for each K
    '''
    result = dict()
    
    if metric not in ['coherence', 'log_perplexity']:
        raise ValueError("Metric can only be 'coherence' or 'log_perplexity'")
    
    # split into training and testing sets
    train_corpus, test_corpus = split_test_train(corpus_iter)
    
    # iterate through topic list and fit a model for each
    for t in ntopics:
        print("Fitting %s topics..." % t)
        if parallel:
            lda_mod = LdaMulticore(train_corpus, id2word=dict_iter, num_topics=t, passes=num_passes)
        else:
            lda_mod = LdaModel(train_corpus, id2word=dict_iter, num_topics=t, alpha='auto', passes=num_passes)

        if metric == 'coherence':
            # top_topics returns (topic, coherence) pairs; average the coherence values
            topic_coh = tuple_ith_element(lda_mod.top_topics(test_corpus, num_words=20))
            result[str(t)] = numpy.mean(topic_coh)

        elif metric == 'log_perplexity':
            result[str(t)] = lda_mod.log_perplexity(test_corpus)

    return result
topic_search = numpy.arange(2, 20)
print(topic_search)

topic_coherence = search_param_space(corpus, dictionary, parallel=True,
                                     metric='coherence', ntopics=topic_search, num_passes=10)
plt.rcParams['figure.figsize'] = (10, 5)
x = tuple_ith_element(sorted(topic_coherence.items(), key=lambda x: x[1]))
labels = tuple_ith_element(sorted(topic_coherence.items(), key=lambda x: x[1]), 0)
xticklabels = ['K=' + t for t in labels]
plt.plot(range(len(x)), x)
plt.xticks(range(len(x)), xticklabels, rotation=70)
plt.ylabel('Topic Coherence')
plt.show()

Let's go ahead and train an LDA model using the gensim package. Here, the number of topics is specified as 3, although it is advisable to run your model for a few values of K and inspect the resulting topics and assignments before making the final decision.

numpy.random.seed(seed=44)

# number of topics
K=3

# Run LDA model to extract topics
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=K, alpha='auto', passes=10)

Let's take a look at the found topics and distribution of words for each:

# let's display the topics, represented by top 10 most probable words
lda.show_topics(K, num_words=10, formatted=False)
[(0,
  [(u'honeymoon', 0.0058731279787742384),
   (u'amazing', 0.0048606030168368243),
   (u'worth', 0.0040833844400031457),
   (u'perfect', 0.0038467153797996337),
   (u'dominican', 0.0037614657629653648),
   (u'enjoyed', 0.0036999145132155056),
   (u'feel', 0.0035744270333553669),
   (u'enjoy', 0.0034788356289153278),
   (u'spanish', 0.0034110479694000411),
   (u'fantastic', 0.0033365940208586941)]),
(1,
  [(u'told', 0.0057331904996217519),
   (u'desk', 0.0053025855901582932),
   (u'front', 0.0052319501972157801),
   (u'another', 0.0038407030342968765),
   (u'travel', 0.0036634423857131145),
   (u'star', 0.0036148921551896495),
   (u'problem', 0.0036022267703328869),
   (u'resorts', 0.0035453574325009628),
   (u'said', 0.0035424860988715313),
   (u'however', 0.0034485480827060877)]),

 (2,
  [(u'club', 0.0040233074323935145),
   (u'lunch', 0.0034729484148856063),
   (u'pretty', 0.0032145057465365521),
   (u'bring', 0.0031766637663410849),
   (u'problem', 0.0031154616761667597),
   (u'try', 0.0030271740164772016),
   (u'buffet', 0.0029188134181129931),
   (u'use', 0.0028724095735407048),
   (u'right', 0.0028384468513518168),
   (u'lot', 0.0027657897746246529)])]

Not bad, looks like there are some coherent themes within the topics.

Here, let's take a closer look at the words and phrases for each topic. These are the terms that are most relevant to each topic category, i.e. terms that are more likely to appear within that topic versus in the rest of the reviews.

More formally, the relevance of term w to topic k is defined as:

$$r(w, k \mid \lambda) = \lambda \log(\phi_{kw}) + (1 - \lambda) \log\left(\frac{\phi_{kw}}{p_w}\right)$$

where $\lambda$ is a user-specified parameter determining the specificity of the term within the topic, $\phi_{kw}$ is the probability of occurrence of term w in topic k, and $p_w$ is the marginal probability of term w within the corpus. Check out the pyLDAvis paper (Sievert and Shirley, 2014) for more details.
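As a small worked example of the formula, the snippet below computes relevance for two hypothetical terms; the probability values are invented for illustration and are not taken from the trained model.

import math

def relevance(phi_kw, p_w, lam=0.4):
    '''Relevance of term w to topic k for a given lambda.'''
    return lam * math.log(phi_kw) + (1 - lam) * math.log(phi_kw / p_w)

# a term that is common in the topic but rare in the corpus overall...
print(relevance(phi_kw=0.006, p_w=0.001))   # approx -0.97
# ...scores higher than an equally probable but globally common term
print(relevance(phi_kw=0.006, p_w=0.005))   # approx -1.94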

pyLDAvis can be used to calculate term relevance, with the parameter λ controlling the ordering of terms for a selected topic. For example, topic 1 can be characterized by terms such as honeymoon, amazing, and perfect. Decreasing λ brings up other positive terms such as romantic, incredible, etc. For brevity, we can assign the label Happy Honeymooners to this topic. Similarly, other topics can be assigned a label based on the terms customers tend to mention. It looks like the historical reviews fall into one of 3 main topics:

  1. Happy customers, often honeymooners or wedding guests
  2. Customer complaints
  3. Restaurant and bar related feedback

Assigning these descriptive short-hand labels is somewhat of a subjective process. It involves closer examination of topic terms and examples of reviews assigned to a particular topic.

 

# get term relevance
viz = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)

name_dict = {   0: "Happy Honeymooners", # 1 on the chart 
                1: "Customer Complaints",    # 2 on the chart
                2: "Restaurant and Bar Feedback",  # 3 on the chart
            }  

# specify the relevance weighting parameter
lambda_ = 0.4

# topic_info holds per-term log probability and log lift; combine them into relevance
viz_data = viz.topic_info
viz_data['relevance'] = lambda_ * viz_data['logprob'] + (1 - lambda_) * viz_data['loglift']

# plot the terms
plt.rcParams['figure.figsize'] = [20, 11]
fig, ax_ = plt.subplots(nrows=1, ncols=3)
ax = ax_.flatten()

for j in range(lda.num_topics):       
    df = viz.topic_info[viz.topic_info.Category=='Topic'+str(j+1)].sort_values(by='relevance', ascending=False).head(30)  
    
    df.set_index(df['Term'], inplace=True)
    sns.barplot(y="Term", x="Freq",  data=df, ax = ax[j])
    sns.set_style({"axes.grid": False})

    ax[j].set_xlim([df['Freq'].min()-1, df['Freq'].max()+1])
    ax[j].set_ylabel('')
    ax[j].set_title(name_dict[j], size=15)
    ax[j].tick_params(axis='y', labelsize=13)

Alternatively, one can visualize terms and their relevance using the interactive charts from pyLDAvis. The size of each circle on the plot below represents how prevalent the topic is among reviews, and the lambda slider controls the relevance ranking of terms for a particular topic. For example, the figure below displays the top 30 most relevant terms for the topic we labeled "Customer Complaints":
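If you are working in a Jupyter notebook, the prepared viz object from above can be rendered inline to produce this interactive chart; a minimal sketch (the HTML file name is just an example):

# render the interactive pyLDAvis chart inline in a notebook
pyLDAvis.enable_notebook()
pyLDAvis.display(viz)

# or save it as a standalone HTML page to share with others
pyLDAvis.save_html(viz, 'lda_topics.html')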

Let's see how many reviews fall into each of the 3 categories. Customer complaints make up about 20% of reviews, and the most frequent topic is happy customers. You can imagine a scenario where reviews containing complaints are routed to customer satisfaction and retention teams, while reviews pertaining to food and restaurants are directed to the food service team. Tagging incoming reviews with these topics will help reduce the number of hours needed to read and sift through the vast amount of information.

# Assign each review to its most probable topic
scored = lda[corpus]
topic_prob = [max(doc_topics, key=lambda item: item[1]) for doc_topics in scored]
scored_reviews = pd.DataFrame(list(zip(reviews, topic_prob)), columns=['Review', 'Main_Topic'])
scored_reviews[['Topic', 'Prob']] = scored_reviews['Main_Topic'].apply(pd.Series)
scored_reviews['Topic Name'] = scored_reviews['Topic'].map(name_dict)
df = scored_reviews['Topic Name'].value_counts(normalize=True)

plt.rcParams['axes.facecolor'] = 'white'
ax = df.plot(kind='barh', figsize=[8, 6], title='Reviews Per Category', color='#33A5D3')

# highlight the complaints category in a different color
highlight = 'Customer Complaints'
pos = df.index.get_loc(highlight)

ax.patches[pos].set_facecolor('#aa3333')

 

Making Use of the Model

Now that we have trained the model and labeled review categories, the final and perhaps most important step is to utilize model results. In this particular use case, one could consider:

  • Feeding new review summaries, along with labels based on the discovered topics, into a BI tool to be presented to customer service representatives (see the sketch below)
  • Filtering problematic reviews and bringing them to the team's attention
  • Tracking topic categories over time to monitor trends in customer feedback
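As a sketch of that first idea, the snippet below shows how a new incoming review could be tagged with one of the topic labels using the trained model and the transformation utilities defined earlier; the example review text is invented.

def tag_review(review_text, lda_model, corpora_dict, names=name_dict):
    '''
    Score a single new review with the trained LDA model and
    return its most likely topic label along with the probability.
    '''
    # reuse the same cleaning steps applied to the training reviews
    tokens = stop_and_stem([default_clean(review_text).lower()], stem=False)[0]
    bow = corpora_dict.doc2bow(tokens)
    topic_id, prob = max(lda_model.get_document_topics(bow), key=lambda item: item[1])
    return names[topic_id], prob

# invented example of a new incoming review
new_review = "The front desk told us our room was not ready and nobody followed up on the problem."
print(tag_review(new_review, lda, dictionary))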

This application of topic modeling is meant to show how unsupervised models can save time by summarizing text into manageable categories. Every data set is different, and often topic modeling is just the first step in making sense of text data. However, even with a simple model like the one demonstrated here, one can start gaining valuable insights about customers by quickly summarizing and visualizing reviews.

Citations

Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

Sievert, C. and Shirley, K. LDAvis: A Method for Visualizing and Interpreting Topics. Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, 2014.

Hongning Wang, Chi Wang, ChengXiang Zhai, and Jiawei Han. Learning Online Discussion Structures by Conditional Random Fields. The 34th Annual International ACM SIGIR Conference (SIGIR'2011), pp. 435-444, 2011.