Where to eat in Las Vegas:
Topic Models, Networks, & Yelp Reviews

Sooraj Subrahmannian and Kunal Kotian

Image source: https://www.flickr.com/photos/stratospherehotelcasino/4303487863

Background and Motivation

We wanted to experiment with network visualization, particularly in an interactive setting. We also think topic modeling is fun, so we decided to take a stab at using it to construct and visualize two networks, both driven by reviews of restaurants on Yelp. We thought it would be interesting to see whether something as unstructured as customer opinions would let us identify structure, such as communities, in a network of restaurants and in a network of review topics.

The Data

We started with a dataset of reviews released by Yelp and filtered it down to restaurants in Las Vegas with a minimum of 100 reviews each. This left us with reviews from 237 Las Vegas restaurants. Each review in the filtered dataset was processed using spaCy and regular expressions (in Python) to get the final tokenized text.
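
The filtering and tokenizing steps above can be sketched roughly as follows. This is a minimal illustration and not the code we actually ran: the record fields (business_id, city, text) are hypothetical, and a plain regular expression stands in for the spaCy processing so that the sketch needs only the standard library.

```python
import re
from collections import Counter

def filter_reviews(reviews, city="Las Vegas", min_reviews=100):
    """Keep reviews of businesses in `city` that have at least `min_reviews` reviews.

    `reviews` is a list of dicts with hypothetical keys: business_id, city, text.
    """
    in_city = [r for r in reviews if r["city"] == city]
    counts = Counter(r["business_id"] for r in in_city)
    return [r for r in in_city if counts[r["business_id"]] >= min_reviews]

# Assumption: lowercase alphabetic tokens, optionally with one apostrophe part.
# The real pipeline used spaCy plus regex; this regex is only a stand-in.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")

def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return TOKEN_RE.findall(text.lower())

# Made-up sample data, with the threshold lowered to 2 for illustration.
sample = [
    {"business_id": "a", "city": "Las Vegas", "text": "Great tacos!"},
    {"business_id": "a", "city": "Las Vegas", "text": "The tacos weren't bad."},
    {"business_id": "b", "city": "Las Vegas", "text": "Slow service."},
    {"business_id": "c", "city": "Phoenix", "text": "Nice patio."},
]
kept = filter_reviews(sample, min_reviews=2)
tokens = [tokenize(r["text"]) for r in kept]
```

With the sample data, only the two reviews of business "a" survive the filter, since "b" has too few reviews and "c" is not in Las Vegas.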

A Network of Restaurants

We trained a Latent Dirichlet Allocation (LDA) model on the tokenized reviews and used it to determine the ‘topic vector’ for each restaurant. This topic vector is a probability distribution over the topics emerging from the reviews, so it can be thought of as a compact way to represent all reviews for a restaurant. Using these topic vectors, we calculated the Jensen-Shannon distance between each pair of restaurants. Note that the Jensen-Shannon distance simply quantifies how similar two probability distributions are: the smaller the distance, the more similar the distributions. Finally, we applied a similarity threshold/cutoff to determine which nodes (restaurants) must be linked to form the network.
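
The distance-and-cutoff step described above can be sketched as follows. This is an illustration under stated assumptions rather than our exact implementation: the topic vectors are made up, logarithms are taken in base 2 so that the distance lies between 0 and 1, and similarity is assumed to be 1 minus the distance, with two nodes linked when their similarity is at least the cutoff.

```python
import math
from itertools import combinations

def js_distance(p, q):
    """Jensen-Shannon distance (square root of the JS divergence, base 2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding

def build_edges(vectors, cutoff):
    """Link each pair of nodes whose similarity (assumed 1 - distance) >= cutoff."""
    edges = []
    for (a, va), (b, vb) in combinations(vectors.items(), 2):
        similarity = 1.0 - js_distance(va, vb)
        if similarity >= cutoff:
            edges.append((a, b, similarity))
    return edges

# Made-up topic vectors for three hypothetical restaurants.
vectors = {
    "A": [0.70, 0.20, 0.10],
    "B": [0.60, 0.30, 0.10],
    "C": [0.05, 0.05, 0.90],
}
edges = build_edges(vectors, cutoff=0.8)
```

Here only A and B end up linked, because their topic vectors are close while C concentrates on a different topic. Lowering the cutoff to 0 links every pair.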

A Network of Topics Emerging from Reviews

Each 'topic' is a vector representing the probabilities of different words appearing in that topic. We obtained a collection of these topic vectors from the aggregated corpus of Las Vegas restaurant reviews. Once again, we calculated pairwise Jensen-Shannon distances, this time between topics, and then applied a similarity cutoff to form linkages between nodes.

Interactive Network Visualizations

Explore a Network of Topics Emerging from Restaurant Reviews

  • Nodes (topics) are sized by token contribution. Token contribution for each topic is the number of word tokens present in a topic divided by the total number of word tokens in the entire text corpus.
  • The similarity cutoff slider controls which nodes are linked at different cutoff levels. Setting the slider to a low cutoff value makes it 'easy' to link two nodes; setting it to a high value applies a stricter condition for establishing links between nodes.
  • Understand what the topics represent:
    • The second slider displays the top ‘characteristic words’ for each topic at different levels of 'relevance'.
    • Move the slider to the left to see characteristic words that are more closely tied to the selected topic, i.e. rare words that appear almost exclusively in the selected topic.
    • Move the slider to the right to see words related to the selected topic that are also commonly found in other topics.
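
The behavior of the relevance slider described above can be illustrated with one common definition of relevance, the one popularized by the LDAvis tool: a weighted mix of a word's log probability within the topic and its log 'lift' (probability within the topic divided by probability in the whole corpus). Whether the visualization here uses exactly this formula is an assumption, and the word probabilities below are made up.

```python
import math

def relevance(p_word_in_topic, p_word_in_corpus, lam):
    """LDAvis-style relevance: lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    lift = p_word_in_topic / p_word_in_corpus
    return lam * math.log(p_word_in_topic) + (1 - lam) * math.log(lift)

def top_words(topic_probs, corpus_probs, lam, n=3):
    """Return the n words with the highest relevance for one topic."""
    scores = {
        word: relevance(p, corpus_probs[word], lam)
        for word, p in topic_probs.items()
        if p > 0
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Made-up probabilities for a hypothetical topic and the overall corpus.
topic_probs = {"food": 0.30, "sushi": 0.10, "good": 0.25}
corpus_probs = {"food": 0.30, "sushi": 0.01, "good": 0.28}

exclusive = top_words(topic_probs, corpus_probs, lam=0.0, n=1)  # slider far left
frequent = top_words(topic_probs, corpus_probs, lam=1.0, n=1)   # slider far right
```

With the weight at 0 (slider far left) the rare but topic-specific word "sushi" ranks first; with the weight at 1 (slider far right) the generally common word "food" ranks first.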

Explore a Network of Restaurants Based on Reviews

  • All restaurants were split into two groups depending on whether or not they are located on the touristy Las Vegas ‘Strip’.
  • The similarity cutoff slider works the same as it does for the topic network plot (scroll up for description).

What you can do with this

The two interactive visualizations above enable easy exploration of the networks of restaurants as well as review topics. We would like our readers to take these visualizations for a spin; feel free to get in touch with us if you would like to share any interesting observations.

You can contact us via LinkedIn:

The GitHub repo below hosts all the code we used to build the visualizations:
https://github.com/kunal-kotian/visualizing_review_networks

Video Walkthrough of this Website