Big Data Tools for Business: Real-World Use Cases Project

Part 1: Donald Trump Communication Analysis on Twitter


We are big data analysts for a communication agency that wants to analyze Donald Trump's communication on Twitter.

The text file trump_tweets.txt contains a history of all of Trump's tweets from 2009 to November 19, 2020.

Using Spark, we will explore and analyze this data and produce visualizations.

Summary

First let's import pyspark and create a SparkContext

We divide our study into two parts: an analysis based on the text of the tweets, then an analysis based on their dates

1 - Text analysis

Step 1: Extracting text from the tweets

First we read the trump_tweets.txt file into the RDD rdd_trump_tweets

We split the lines to isolate the text from the date

We split the text into words to run our sentiment analysis
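The two splitting steps above can be sketched in plain Python. This is only an illustration: the layout of trump_tweets.txt is not shown in this notebook, so the tab separator, the timestamp format and the sample lines below are assumptions, and the helper names `split_line` and `to_words` are hypothetical. In Spark, the same functions could be passed to `map` and `flatMap` on rdd_trump_tweets.

```python
import re

# Hypothetical sample lines: a tab between the text and the timestamp
# is an assumption, not the documented layout of trump_tweets.txt.
sample_lines = [
    "Thank you for the great support!\t2016-11-09 06:36:58",
    "We will WIN, and win big.\t2020-10-30 21:15:02",
]

def split_line(line):
    """Return (text, date) for one line, splitting on the last tab."""
    text, _, date = line.rpartition("\t")
    return text, date

def to_words(text):
    """Lowercase the text and return its tokens (letters, digits, _ and '),
    keeping a leading # or @ so hashtags and mentions stay recognizable."""
    return re.findall(r"[#@]?[a-z0-9_']+", text.lower())

# Equivalent of rdd.map(split_line) followed by flatMap(to_words) on the text.
texts = [split_line(line)[0] for line in sample_lines]
words = [w for text in texts for w in to_words(text)]
```

Lowercasing before tokenizing matters here because the positive and negative word lists are matched by exact string equality later on.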

Here we create our own list of stop words: words that carry no meaning should not be part of the sentiment analysis

Step 2: Import stop, negative and positive words and group them by key

The word "trump" is included in the list of positive words. Since it is also the president's name, and one of the most used words in his tweets and retweets, we remove it from the list so that it does not distort our study

Step 3: Extract top negative and positive words

Top positive words

We join the positive words and words from the tweets

We reduce by key and sort by descending order based on the count to get the occurrences of each positive word in the tweets
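As a rough illustration of the join, reduce and sort steps, here is a plain-Python sketch. The word lists are tiny hypothetical samples, not the real lexicon or tweets; in Spark the same logic would typically use `join`, `reduceByKey` and `sortBy` on (word, 1) pairs.

```python
from collections import Counter

# Hypothetical excerpt of the positive lexicon; "trump" is dropped because
# it is the president's name and would distort the counts.
positive_words = {"great", "win", "thank", "trump"} - {"trump"}

# Hypothetical tweet tokens, already lowercased.
tweet_words = ["great", "win", "the", "trump", "great", "thank", "bad", "win", "great"]

# Join equivalent: keep only the tokens that appear in the positive lexicon.
matched = [w for w in tweet_words if w in positive_words]

# Equivalent of mapping each word to (word, 1) and reducing by key.
counts = Counter(matched)

# Sort by descending count, breaking ties alphabetically for a stable order.
top_positive = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```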

Top negative words

We follow the same procedure for negative words

Step 4: Visualization and sentiment score

We create a function to plot the positive and negative words

We can already see from this graph that Trump uses far more positive words than negative ones in his tweets, so we can expect a positive sentiment score. This may be surprising: Trump is often seen as an irritable president, which would suggest a negative score. However, he probably uses a different tone on Twitter than the one we see on TV.

We compute the global sentiment score: the difference between the number of positive words and the number of negative words
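A minimal sketch of this score, assuming the tweets have already been split into lowercase words. The function name `sentiment_score` and the sample data are hypothetical.

```python
def sentiment_score(words, positive, negative):
    """Number of positive word occurrences minus number of negative ones."""
    n_pos = sum(1 for w in words if w in positive)
    n_neg = sum(1 for w in words if w in negative)
    return n_pos - n_neg

# Hypothetical sample data.
positive = {"great", "win"}
negative = {"bad", "fake"}
words = ["great", "win", "fake", "news", "great"]

score = sentiment_score(words, positive, negative)  # 3 positive, 1 negative
```

Note that this score counts occurrences, not distinct words, so it grows with the size of the corpus. Dividing by the total number of matched words would give a value that is comparable across subsets of tweets.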

Our previous assumptions were confirmed with a rather high sentiment score

Step 5: The contextual words

We study and count the occurrences of words that are neither negative nor positive, which we call contextual words

We remove all the positive, negative and stop words from the tweet words to keep only the contextual words

We assign each word a value of 1 and reduce by key to obtain the occurrences of the contextual words, sorted in descending order
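The filtering and counting of contextual words can be sketched as follows. The word lists are hypothetical samples; in Spark, the removal step could be done with `subtractByKey` or a `filter` against a broadcast set, followed by `reduceByKey`.

```python
from collections import Counter

# Hypothetical word lists.
positive = {"great"}
negative = {"bad"}
stop_words = {"the", "is", "a", "and"}

# Hypothetical tweet tokens, already lowercased.
tweet_words = ["the", "president", "is", "great", "and", "the",
               "border", "is", "a", "bad", "border", "president", "border"]

# Remove every word that is positive, negative or a stop word.
excluded = positive | negative | stop_words
contextual = [w for w in tweet_words if w not in excluded]

# Count occurrences and sort by descending count.
contextual_counts = Counter(contextual).most_common()
```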

We create a function to plot the contextual words

From the words most used by Trump, we can see that his tweets are mainly focused on politics (as we expected): frequently used words are "president", "people", "America", "democrats", "votes", "Obama", "China", "border" and "Biden".

We use a wordcloud to get another visualization

Step 6: Hot topics

We wanted to focus our study on what seemed to us to be hot topics, namely Covid-19 and the presidential elections.

Covid-19

We filter and keep only the tweets containing the word covid
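A minimal sketch of this filter, assuming a case-insensitive substring match is what is wanted. The helper name `tweets_about` and the sample tweets are hypothetical; in Spark the same predicate could be passed to `filter`.

```python
def tweets_about(tweets, keyword):
    """Keep the tweets whose text contains the keyword, ignoring case."""
    keyword = keyword.lower()
    return [t for t in tweets if keyword in t.lower()]

# Hypothetical sample tweets.
tweets = [
    "Great progress on the COVID vaccine!",
    "Big rally tonight in Florida.",
    "Covid-19 numbers are going down.",
]

covid_tweets = tweets_about(tweets, "covid")
```

Because this is a substring match, the keyword "election" would also match "elections" and "reelection", which is convenient for a topic filter but can pull in unrelated words for short keywords.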

Then we join the tweet words with the positive and negative words

We create a function to plot the positive and negative words

In this graph we can see a higher number of negative words, which is understandable given the subject. President Trump appears less positive and more worried when tweeting about Covid-19: his tweets use strongly negative words such as "death", "virus", "hard", "risk", "emergency" and "crisis".

Our previous assumptions were confirmed with a more neutral sentiment score.

We create a list of words to remove from our study. These words are direct variants of the word covid and add no information.

We create a function to plot the most common contextual words

All the contextual words are related to Covid-19, the health emergency, the possible vaccine and also the influence the pandemic may have on the political world (elections).

The presidential elections

We filter and keep only the tweets containing the word election

We create a function to plot the positive and negative words

There is a mix of positive and negative words related to the election, so no clear sentiment emerges on this topic. More importantly, a more thorough study would be needed to distinguish the 2016 and 2020 elections.

We create a function to plot the most common contextual words related to the topic election.

This graph shows that Trump mentioned the 2016 election more than the 2020 election. His campaigning was probably different in the two elections, which may have influenced the final result. He also had to make himself known more in 2016 than in 2020.

Top '#'

We analyze the most used hashtags in Trump's tweets
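One possible way to count hashtags is a regular expression over the raw tweet text, sketched below. The helper name `top_tokens` and the sample tweets are hypothetical, and the notebook may instead filter the already split words on their first character.

```python
import re
from collections import Counter

def top_tokens(tweets, prefix, n=10):
    """Count tokens starting with `prefix` (e.g. '#'), ignoring case."""
    pattern = re.compile(re.escape(prefix) + r"\w+")
    counts = Counter(
        token.lower() for tweet in tweets for token in pattern.findall(tweet)
    )
    return counts.most_common(n)

# Hypothetical sample tweets.
tweets = [
    "#MAGA rally tonight! #Trump2016",
    "Get out and vote! #maga",
    "Thanks @foxandfriends #MAGA",
]

top_hashtags = top_tokens(tweets, "#")
```

The same helper works for tagged accounts by calling `top_tokens(tweets, "@")`.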

Among the most used hashtags we find many important slogans used by Trump in the last decade: #trump2016, certainly related to the election, #makeamericagreatagain (and its acronym #maga), related to the 2016 election campaign, and many other hashtags related to the world of politics.

Top @

We analyze the accounts most often tagged in Trump's tweets

The account most often mentioned in Trump's tweets is his own (thanks to retweets of other people's posts), followed mainly by TV channels, newspapers and other people related to politics.

2 - Time analysis

We want to analyze the years and hours of the day when Trump tweeted most. The problem is that some tweets are split over several lines. Since we are only interested in the dates, we keep only the lines that contain a date, then extract the date and discard the text that comes before it.
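The date extraction can be sketched with a regular expression anchored at the end of each line. The timestamp format `YYYY-MM-DD HH:MM:SS`, the sample lines and the names `DATE_RE` and `extract_date` are assumptions for illustration; the pattern has to be adapted to the actual layout of trump_tweets.txt.

```python
import re
from datetime import datetime

# Assumed timestamp format at the end of a line (hypothetical).
DATE_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s*$")

def extract_date(line):
    """Return the trailing timestamp of a line, or None if there is none."""
    match = DATE_RE.search(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")

# Hypothetical lines: the first tweet is split over two lines.
lines = [
    "First half of a tweet that continues",
    "on the next line\t2020-10-30 21:15:02",
    "A one-line tweet\t2016-11-09 06:36:58",
]

# Lines without a date are dropped, so each tweet is counted exactly once.
dates = [d for d in map(extract_date, lines) if d is not None]
```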

The number of dates also gives the total number of tweets Trump posted from 2009 to 2020

We identify the month and year in which Trump tweeted the most
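Once the dates are parsed, the busiest month is a count by (year, month) key. The sketch below uses hypothetical dates; in Spark this would be a `map` to ((year, month), 1) followed by `reduceByKey`.

```python
from collections import Counter
from datetime import datetime

# Hypothetical parsed tweet dates.
dates = [
    datetime(2020, 10, 30, 21, 15),
    datetime(2020, 10, 2, 14, 0),
    datetime(2016, 11, 9, 6, 36),
]

# Count tweets per (year, month) and take the most frequent key.
per_month = Counter((d.year, d.month) for d in dates)
busiest_month, n_tweets = per_month.most_common(1)[0]

# The same pattern gives the distribution per hour of day and per year.
per_hour = Counter(d.hour for d in dates)
per_year = Counter(d.year for d in dates)
```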

The period in which Trump tweeted the most was the autumn of 2020. Although he was already president at that time, the election was imminent and he presumably wanted to make his voice heard by the American people on Twitter as well.

We visualize the hours of the day in which Trump tweeted the most

Trump prefers to tweet in the early afternoon or in the evening: surprisingly, there is a fairly high number of tweets even at night, while he tweets very little in the morning.

We visualize the years in which Trump tweeted the most

The years in which Trump tweeted the most were the years before the election. He tweeted very little in his first years on Twitter (2009, 2010), when social media were not yet mass media, and also in the two years after he was elected president (2017 and 2018).