Hey, how’s it going, readers? As an introduction to the field of machine learning and data science, I thought this would be the perfect time to try to apply it to some real-world events.
As you’ve probably seen on television, the world has been watching the US presidential election for the last couple of days. The first Twitter datasets for 2020 are out, so I thought: why not?
Let’s try some natural language processing techniques on the data, like sentiment analysis, and see what we can come up with.
It’s just a fun little experiment.
Getting the dataset
The dataset is a bit of a pain to get for yourself. The Twitter IDs for the election tweets come from this GitHub repo:
https://github.com/echen102/us-pres-elections-2020
From those IDs, I retrieved the full tweets through the official API. First, let’s collect the IDs from the downloaded repo files:
import os

tweet_id = []
folderpath = "/content/drive/My Drive/Colab Notebooks/dataset/"

# Walk every subfolder of the dataset and collect the tweet IDs line by line
for root, dirs, files in os.walk(folderpath):
    for file in files:
        with open(os.path.join(root, file), 'r') as f:
            tweet_id.extend(line.strip() for line in f)
You will have to do the same. I automated this process using Tweepy.
Go to https://developer.twitter.com/en.html and apply for access. Fill in the details, and you’ll have to wait a day or two for confirmation. Next, click “Create an app” and fill in some details about your application. After completion, you will be given “Keys and tokens”.
You can easily fetch tweets through the tweepy.API client. You’ll need to enter your own auth tokens:
import tweepy

consumer_key = 'XX'
consumer_key_secret = 'XX'
access_token = 'XX-XX'
access_token_secret = 'XX'

auth = tweepy.OAuthHandler(consumer_key, consumer_key_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Fetch a single tweet by its ID
api.get_status(tweet_id[0])
Since Twitter forbids publishing its API results externally without a bunch of prior permissions, here is just an overview of the tweets I extracted (I took a subset; many tweets were not found, and the full dataset was over 4 GB):
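To give you an idea of how the hydration loop might look, here is a rough sketch. It assumes the tweet_id list from above; the subset size, the tweets_df name, and using the account’s display name as the 'handle' column are my own choices, and tweets that can no longer be found are simply skipped:

import pandas as pd
import tweepy

rows = []
for tid in tweet_id[:10000]:  # hypothetical subset size
    try:
        status = api.get_status(tid, tweet_mode='extended')
        rows.append({'handle': status.user.name, 'text': status.full_text})
    except tweepy.TweepError:  # tweepy.errors.TweepyException in tweepy 4.x
        # Deleted, protected, or suspended tweets raise an error; skip them
        continue

tweets_df = pd.DataFrame(rows)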
Preprocessing tweets
Next, we are going to clean up our tweets – especially the special characters and other symbols, since they would be disruptive to any model.
Let’s import a few more packages, and then remove punctuation and other noise from our tweets.
import re
import nltk
from string import punctuation
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

nltk.download('stopwords')

punctuation += '΄´’…“”–—―»«'
cache_english_stopwords = stopwords.words('english')

def tweet_clean(tweet):
    # Remove HTML special entities (e.g. &amp;)
    tweet_no_special_entities = re.sub(r'\&\w*;', '', tweet)
    # Remove tickers (e.g. $GE)
    tweet_no_tickers = re.sub(r'\$\w*', '', tweet_no_special_entities)
    # Remove hyperlinks
    tweet_no_hyperlinks = re.sub(r'https?:\/\/.*\/\w*', '', tweet_no_tickers)
    # Remove hashtags
    tweet_no_hashtags = re.sub(r'#\w*', '', tweet_no_hyperlinks)
    # Remove punctuation (keep @ so the tokenizer can strip handles later)
    tweet_no_punctuation = re.sub(r'[' + punctuation.replace('@', '') + ']+', ' ', tweet_no_hashtags)
    # Remove words of 1 or 2 letters
    tweet_no_small_words = re.sub(r'\b\w{1,2}\b', '', tweet_no_punctuation)
    # Collapse whitespace
    tweet_no_whitespace = re.sub(r'\s\s+', ' ', tweet_no_small_words)
    tweet_no_whitespace = tweet_no_whitespace.lstrip(' ')
    # Drop emojis and other non-ASCII characters
    tweet_no_emojis = tweet_no_whitespace.encode('ascii', 'ignore').decode('ascii')
    # Tokenize, lowercase, strip @handles, and shorten repeated characters
    tknzr = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)
    tw_list = tknzr.tokenize(tweet_no_emojis)
    # Remove stopwords
    list_no_stopwords = [i for i in tw_list if i not in cache_english_stopwords]
    tweet_filtered = ' '.join(list_no_stopwords)
    return tweet_filtered
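To get a feel for what the cleaning does, here is a quick check on a made-up tweet (the example text is mine, not from the dataset):

sample = "RT @someone: I can't believe this!! Check https://t.co/abc123 #Election2020 &amp; more"
print(tweet_clean(sample))
# prints something like: believe check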
Next, we call the above function on all the tweets iteratively:
# Mask of missing tweets so we can skip them
null = tweets.isnull()

clean_tweets = []
for i in range(len(tweets)):
    if null.iloc[i]:
        continue
    clean_tweets.append(tweet_clean(tweets.iloc[i]))
Finally, let’s take a look at our new clean data:
import pandas as pd

data_clean = clean_tweets
dat = pd.DataFrame(data_clean)
dat
Now that our preprocessing is done, we can create some interesting word clouds to visualize our data.
Data visualization
We can plot simple word clouds which give great insight, especially when they’re so easy to implement:
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt

# tweet_clean here is the DataFrame of cleaned tweets, with 'handle' and 'text' columns
text = " ".join(tweet_clean[tweet_clean['handle'] == "Joe Biden"]['text'].tolist())
wordcloud = WordCloud().generate(text)

# Display the generated image:
plt.figure(figsize=(10, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
And the same thing for Trump gives:
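The only thing that changes for the second cloud is the handle filter; a minimal sketch, assuming the same tweet_clean DataFrame and that the stored display name is "Donald J. Trump":

# "Donald J. Trump" is an assumed display name; adjust to whatever your data actually stores
text_trump = " ".join(tweet_clean[tweet_clean['handle'] == "Donald J. Trump"]['text'].tolist())

plt.figure(figsize=(10, 10))
plt.imshow(WordCloud().generate(text_trump), interpolation='bilinear')
plt.axis("off")
plt.show()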
Sentiment Analysis
Next, we can do some rapid sentiment analysis with a little transformer magic. If you want a tutorial on transformers, I will write a series on them later. For now, we can simply install and import:
!pip install transformers
from transformers import pipeline
sentiment = pipeline('sentiment-analysis')  # downloads the default English sentiment model
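As a quick sanity check of what the pipeline returns (the example sentence is mine), each call gives a list with a label and a confidence score:

print(sentiment("I love this election coverage!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]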
Let’s run through all the tweets and perform the analysis on each text. I have also added a simple progress counter:
sentiment_list = []
c = 0
# Indices at which we hit 0%, 1%, 2%, ... of the dataset
counter_list = [int(tweet_clean.shape[0] * x / 100) for x in range(100)]

for text in tweet_clean['text']:
    if c in counter_list:
        print("progress = {} %".format(counter_list.index(c)))
    sentiment_list.append(sentiment(text)[0]['label'])
    c += 1
And that is it! We can group each candidate’s tweets by positive and negative sentiment, and perhaps get a sense of who is viewed more favorably.
tweet_clean['sentiment'] = sentiment_list  # attach the predicted labels
tweet_clean.groupby(['handle', 'sentiment']).count()
We can get percentages too:
tweet_clean.groupby(['handle','sentiment']).count().apply(lambda g: g / g.sum())
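Note that the division above is over the grand total of tweets. If you instead want the positive/negative split within each candidate, one sketch (assuming the same tweet_clean DataFrame) is to normalize inside each handle group:

# Count tweets per (handle, sentiment) and divide by each candidate's total
counts = tweet_clean.groupby(['handle', 'sentiment']).size()
share_within_candidate = counts / counts.groupby(level='handle').transform('sum')
print(share_within_candidate)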
Conclusion
Hopefully, this was a learning experience for everyone. Apparently, both candidates attract more negative sentiment than positive. Try more fun things with this… until the results are out.