Twitter Analysis

Overview and Motivation

Social networks have changed the way people communicate with each other. Today the world is far more connected than it was before, and people are more open about voicing their opinions on issues that interest them, such as sports, careers, politics, and art.

Social networking websites like Twitter and Facebook are rich resources for learning about the preferences and interests of people in different demographics. The data collected from these websites is mostly raw text, but from its context we can infer the sentiments, opinions, and locations of the people who wrote it. Our goal for the project is to gather, group, and analyze such text data and provide meaningful insights in the form of visualizations.

For our project we decided to work with Twitter data collected over a period of more than a week. Tweets are complex and have many different attributes. We wanted to visualize this data and discover patterns in how, where, and when people tweet.

Related Work

We worked with a lot of numerical data during the class homework assignments, but never with text data. We felt that using text data for our project would give us the opportunity to think outside the box and come up with visualizations that draw meaningful insights from the underlying dataset. Also, since almost everyone uses social media, we felt that this data would give us a larger target audience. Our main goal is to develop a tool that can be used by people of all age groups.

Visualizations are often used to understand data from social media. One example is https://www.csc2.ncsu.edu/faculty/healey/tweet_viz/tweet_app/, a website that gives a sense of people’s opinions on current events. We would likewise like to create visualizations that make the data understandable for people of all age groups, connect the data with real events, and provide a bigger picture for everyone.

Questions

1. How many people are tweeting in each state on a given day?
2. What are the trending topics on Twitter?
3. Does sentiment differ between states in the United States?

Using the visualizations of the Twitter data, we can get a summary of what is happening throughout the country and, from the state-by-state sentiment analysis, how Twitter users feel about it. We can also learn at which times of day Twitter users are most and least active in different states.

These visualizations can benefit companies: by learning when most Twitter users are active and what they are interested in, companies can decide to whom, when, and where to advertise their products. The visualizations can also inform anyone who wants to learn what is happening on Twitter within a minute.

Data

The Twitter data comes from a project of the cse530s class. A script was implemented to programmatically stream tweets from the Twitter streaming APIs. Approximately 1 million tweets were downloaded over a period of more than a week. We are streaming more tweets from Twitter and hope to discover more interesting results. To process the data, we implemented scripts to extract hashtags from the tweets. To find meaningful features, we also used a database and SQL queries to filter and aggregate the tweets.
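The hashtag extraction step can be sketched as follows. This is a minimal illustration rather than the project's actual script: the regular expression, the function names `extract_hashtags` and `count_hashtags`, and the sample tweets are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed pattern: '#' followed by one or more word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the lowercased hashtags found in one tweet's text."""
    return [tag.lower() for tag in HASHTAG_RE.findall(text)]


def count_hashtags(texts):
    """Count hashtag occurrences across an iterable of tweet texts."""
    counts = Counter()
    for text in texts:
        counts.update(extract_hashtags(text))
    return counts


# Made-up sample tweets, for illustration only.
sample = [
    "Happy #Easter everyone! #easter #family",
    "No tags here",
    "Game night #NBA",
]
hashtag_counts = count_hashtags(sample)
```

Lowercasing merges variants such as #Easter and #easter, which keeps the counts behind a word cloud from being split across spellings.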

Links:
Twitter Streaming API: https://dev.twitter.com/streaming/overview
NHGIS: https://www.nhgis.org/

Source: The tweets were streamed with the script from the cse530s class project. We plan to use the same script to stream data for a longer period of time because we want to visualize its time dependence. We also need county and state boundary data to build the GIS visualization for the tweets. This boundary data also comes from the cse530s class and was downloaded from NHGIS.

Exploratory Data Analysis

Initially we decided to use a D3 Datamap with bubbles, where the size of each bubble represents the number of tweets and the color of each state represents the sentiment for that state. We also decided to provide a timeline feature that lets the user filter the map by date. Later we replaced the date slider with a themeriver, which displays how the number of tweets for a particular hashtag changes over time. This makes it easier to identify peaks in the number of tweets for particular hashtags. For example, there is a peak in the themeriver on Easter Day, 2017-04-16.
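The data behind such a themeriver is a count of tweets per hashtag per day. The sketch below shows one way to compute it. The input shape (a timestamp string plus a list of hashtags per tweet), the timestamp format, and the function name `daily_hashtag_counts` are assumptions for illustration, not the project's actual pipeline.

```python
from collections import defaultdict
from datetime import datetime


def daily_hashtag_counts(tweets):
    """Map each date (YYYY-MM-DD) to a dict of hashtag -> number of tweets.

    `tweets` is an iterable of (created_at, hashtags) pairs, where
    created_at is assumed to look like '2017-04-16 09:00:00'.
    """
    series = defaultdict(lambda: defaultdict(int))
    for created_at, hashtags in tweets:
        day = datetime.strptime(created_at, "%Y-%m-%d %H:%M:%S").date().isoformat()
        for tag in set(hashtags):  # count each tweet once per hashtag
            series[day][tag] += 1
    return {day: dict(tags) for day, tags in sorted(series.items())}


# Made-up sample data, for illustration only.
tweets = [
    ("2017-04-15 22:10:00", ["easter"]),
    ("2017-04-16 09:00:00", ["easter", "family"]),
    ("2017-04-16 09:30:00", ["easter"]),
]
series = daily_hashtag_counts(tweets)

# The day on which a hashtag peaks is the widest point of its stream.
peak_day = max(series, key=lambda d: series[d].get("easter", 0))
```

Each hashtag becomes one layer of the themeriver, and its thickness on a given day is the corresponding count.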

The user can interact with the visualization through several components: the map, the bubbles, the themeriver, and the search box. We also added a histogram that displays the number of positive, negative, and neutral tweets when the user hovers over a particular state. The map, the themeriver, the word clouds, and the bar plots are linked to each other. This gives users more freedom to narrow down the visualization according to their preferences.

Design Evolution

We largely kept the design layout described in the proposal. We originally planned a dashboard with three pages: the first providing a summary of the project, the second showing the map visualization, and the third showing the word clouds. Later we reduced the number of pages to two by placing the map and the word cloud on the same page. We made this change because linking the map to the word cloud lets the user interact with both at the same time.

We originally designed a slider to present a timeline for the map, but later switched to a themeriver because the slider did not convey much information. With the themeriver, we present the number of tweets for the most used hashtags over the week.

Implementation

We managed to implement all the features discussed in the first presentation. We made a few changes along the way, mainly to improve the user experience. We used D3, HTML, JavaScript, CSS, and Python to build our dashboard. On the Datamap, the color of each state represents the sentiment for that state and the size of each bubble represents the number of tweets. When the user hovers over a bubble, a histogram is generated that displays the number of positive, negative, and neutral tweets, along with a word cloud for that state. The user can also search for a particular hashtag in the search box. Our initial challenge was to reduce the response time required to generate the dynamic visualizations, since a responsive dashboard keeps the user engaged. Later we also decided to implement a uniform color scheme for the whole dashboard, which includes the summary page, the map, the themeriver, and the histogram. We also added a results page and a process book to document our findings and what we learned from the project.
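The per-state histogram can be precomputed from labeled tweets, which is one way to keep hover interactions fast. The sketch below assumes each tweet already carries a state code and a sentiment label; how the labels are produced is not covered here. The names `sentiment_histogram` and `dominant_sentiment`, and the rule of coloring a state by its most frequent label, are hypothetical choices for the example and not necessarily the rule used in the dashboard.

```python
from collections import Counter, defaultdict

LABELS = ("positive", "neutral", "negative")


def sentiment_histogram(rows):
    """Aggregate (state, label) pairs into per-state label counts."""
    per_state = defaultdict(Counter)
    for state, label in rows:
        if label not in LABELS:
            raise ValueError(f"unknown sentiment label: {label!r}")
        per_state[state][label] += 1
    # Fill in zeros so every state has all three bars.
    return {
        state: {label: counts[label] for label in LABELS}
        for state, counts in per_state.items()
    }


def dominant_sentiment(histogram):
    """Return the most frequent label (assumed coloring rule).

    Ties resolve to the label listed first in LABELS.
    """
    return max(LABELS, key=lambda label: histogram[label])


# Made-up sample data, for illustration only.
rows = [
    ("MO", "positive"),
    ("MO", "positive"),
    ("MO", "negative"),
    ("CA", "neutral"),
]
hist = sentiment_histogram(rows)
```

Serializing `hist` to JSON once lets the front end look up a state's counts on hover instead of recomputing them.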

Evaluation

From the data we were able to analyze the sentiments of people residing in different states. We found that people tend to tweet more on weekends. We managed to answer all our questions. We were able to get the number of tweets from each state and categorize the tweets as positive, neutral, or negative. We were also able to identify the trending topics from the word clouds. Our visualization gives the user the freedom to interact with the map, word clouds, histogram, and themeriver at the same time. Users can also customize their search using the search box. We could further improve our visualization by collecting data over several months, which would help us learn how the sentiment and nature of tweets vary across seasons.
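A weekend-versus-weekday comparison of this kind can be expressed as a single SQL aggregation. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The choice of SQLite, the `tweets` table, its column names, and the sample rows are all assumptions for illustration; the database used in the project may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (state TEXT, created_at TEXT)")

# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?)",
    [
        ("MO", "2017-04-14 12:00:00"),  # Friday
        ("MO", "2017-04-15 10:00:00"),  # Saturday
        ("MO", "2017-04-16 11:00:00"),  # Sunday
        ("CA", "2017-04-16 18:30:00"),  # Sunday
    ],
)

# In SQLite, strftime('%w', ...) returns '0' for Sunday through '6' for Saturday.
query = """
SELECT state,
       CASE WHEN strftime('%w', created_at) IN ('0', '6')
            THEN 'weekend' ELSE 'weekday' END AS day_type,
       COUNT(*) AS n
FROM tweets
GROUP BY state, day_type
ORDER BY state, day_type
"""
rows = conn.execute(query).fetchall()
conn.close()
```

Because a week has five weekdays and only two weekend days, dividing each count by the number of days in its group gives a fairer per-day comparison than the raw totals.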

Contact

Wint Yee Hnin

Yongzheng Huang

Krushnaraj Kamtekar