Hey, how’s it going, readers? As an introduction to the field of machine learning and data science, I thought this would be the perfect time to try to apply it to some real-world events.
As you’ve probably seen on television, the world has been watching the US presidential election for the last couple of days. The first Twitter datasets for 2020 are out, so I thought: why not?
Let’s try some natural language processing techniques on the data, like sentiment analysis, and see what we can come up with.
It’s just a fun little experiment.
Getting the dataset
The dataset is a bit of a pain to get for yourself. The Twitter IDs for the election tweets come from this GitHub repo:
https://github.com/echen102/us-pres-elections-2020
From those IDs, I retrieved the full tweets through the official API. First, let’s collect the IDs from the downloaded repo files:
import os

tweet_id = []
folderpath = "/content/drive/My Drive/Colab Notebooks/dataset/"

# Walk every subfolder of the dataset and collect the tweet IDs line by line
for root, dirs, files in os.walk(folderpath):
    for file in files:
        with open(os.path.join(root, file), 'r') as f:
            tweet_id.extend(line.strip() for line in f)
You will have to do the same. I automated this process using Tweepy.
Go to https://developer.twitter.com/en.html and apply for access. Fill in the details, and you’ll have to wait a day or two for confirmation. Next, click “Create an app” and fill in some details about your application. After completion, you will be given “Keys and tokens”.
You can easily fetch tweets through the tweepy.API client. You’ll need to enter your own auth tokens:
import tweepy

consumer_key = 'XX'
consumer_key_secret = 'XX'
access_token = 'XX-XX'
access_token_secret = 'XX'

auth = tweepy.OAuthHandler(consumer_key, consumer_key_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Fetch a single tweet by its ID
api.get_status(tweet_id[0])
Since Twitter forbids publishing its API results externally without a bunch of prior permissions, here is just an overview of the tweets I extracted (I took a subset; many tweets were not found, and the full dataset was over 4 GB):
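To give you an idea of how the hydration loop might look, here is a rough sketch. It assumes the tweet_id list from above; the subset size, the tweets_df name, and using the account’s display name as the 'handle' column are my own choices, and tweets that can no longer be found are simply skipped:

import pandas as pd
import tweepy

rows = []
for tid in tweet_id[:10000]:  # hypothetical subset size
    try:
        status = api.get_status(tid, tweet_mode='extended')
        rows.append({'handle': status.user.name, 'text': status.full_text})
    except tweepy.TweepError:  # tweepy.errors.TweepyException in tweepy 4.x
        # Deleted, protected, or suspended tweets raise an error; skip them
        continue

tweets_df = pd.DataFrame(rows)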
Preprocessing tweets
Next, we are going to clean up our tweets – especially the special characters and other symbols, since they would be disruptive to any model.
Let’s import a few more packages, and then remove punctuation and other noise from our tweets.
import re
import nltk
from string import punctuation
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

nltk.download('stopwords')

punctuation += '΄´’…“”–—―»«'
cache_english_stopwords = stopwords.words('english')

def tweet_clean(tweet):
    # Remove HTML special entities (e.g. &amp;)
    tweet_no_special_entities = re.sub(r'\&\w*;', '', tweet)
    # Remove tickers (e.g. $GE)
    tweet_no_tickers = re.sub(r'\$\w*', '', tweet_no_special_entities)
    # Remove hyperlinks
    tweet_no_hyperlinks = re.sub(r'https?:\/\/.*\/\w*', '', tweet_no_tickers)
    # Remove hashtags
    tweet_no_hashtags = re.sub(r'#\w*', '', tweet_no_hyperlinks)
    # Remove punctuation (keep @ so the tokenizer can strip handles later)
    tweet_no_punctuation = re.sub(r'[' + punctuation.replace('@', '') + ']+', ' ', tweet_no_hashtags)
    # Remove words of 1 or 2 letters
    tweet_no_small_words = re.sub(r'\b\w{1,2}\b', '', tweet_no_punctuation)
    # Collapse whitespace
    tweet_no_whitespace = re.sub(r'\s\s+', ' ', tweet_no_small_words)
    tweet_no_whitespace = tweet_no_whitespace.lstrip(' ')
    # Drop emojis and other non-ASCII characters
    tweet_no_emojis = tweet_no_whitespace.encode('ascii', 'ignore').decode('ascii')
    # Tokenize, lowercase, strip @handles, and shorten repeated characters
    tknzr = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)
    tw_list = tknzr.tokenize(tweet_no_emojis)
    # Remove stopwords
    list_no_stopwords = [i for i in tw_list if i not in cache_english_stopwords]
    tweet_filtered = ' '.join(list_no_stopwords)
    return tweet_filtered
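To get a feel for what the cleaning does, here is a quick check on a made-up tweet (the example text is mine, not from the dataset):

sample = "RT @someone: I can't believe this!! Check https://t.co/abc123 #Election2020 &amp; more"
print(tweet_clean(sample))
# prints something like: believe check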
Next, we call the above function on all the tweets iteratively:
# Mask of missing tweets so we can skip them
null = tweets.isnull()

clean_tweets = []
for i in range(len(tweets)):
    if null.iloc[i]:
        continue
    clean_tweets.append(tweet_clean(tweets.iloc[i]))
Finally, let’s take a look at our new clean data:
import pandas as pd

data_clean = clean_tweets
dat = pd.DataFrame(data_clean)
dat
Now that our preprocessing is done, we can create some interesting word clouds to visualize our data.
Data visualization
We can plot simple word clouds which give great insight, especially when they’re so easy to implement:
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt

# tweet_clean here is the DataFrame of cleaned tweets, with 'handle' and 'text' columns
text = " ".join(tweet_clean[tweet_clean['handle'] == "Joe Biden"]['text'].tolist())
wordcloud = WordCloud().generate(text)

# Display the generated image:
plt.figure(figsize=(10, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
And the same thing for Trump gives:
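The only thing that changes for the second cloud is the handle filter; a minimal sketch, assuming the same tweet_clean DataFrame and that the stored display name is "Donald J. Trump":

# "Donald J. Trump" is an assumed display name; adjust to whatever your data actually stores
text_trump = " ".join(tweet_clean[tweet_clean['handle'] == "Donald J. Trump"]['text'].tolist())

plt.figure(figsize=(10, 10))
plt.imshow(WordCloud().generate(text_trump), interpolation='bilinear')
plt.axis("off")
plt.show()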
Sentiment Analysis
Next, we can do some rapid sentiment analysis with a little transformer magic. If you want a tutorial on transformers, I will write a series on them later. For now, we can simply install and import:
!pip install transformers
from transformers import pipeline
sentiment = pipeline('sentiment-analysis')  # downloads the default English sentiment model
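As a quick sanity check of what the pipeline returns (the example sentence is mine), each call gives a list with a label and a confidence score:

print(sentiment("I love this election coverage!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]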
Let’s run through all the tweets and perform the analysis on each text. I have also added a simple progress counter:
sentiment_list = []
c = 0
# Indices at which we hit 0%, 1%, 2%, ... of the dataset
counter_list = [int(tweet_clean.shape[0] * x / 100) for x in range(100)]

for text in tweet_clean['text']:
    if c in counter_list:
        print("progress = {} %".format(counter_list.index(c)))
    sentiment_list.append(sentiment(text)[0]['label'])
    c += 1
And that is it! We can group each candidate’s tweets by positive and negative sentiment, and perhaps get a sense of who is viewed more favorably.
tweet_clean['sentiment'] = sentiment_list  # attach the predicted labels
tweet_clean.groupby(['handle', 'sentiment']).count()
We can get percentages too:
tweet_clean.groupby(['handle','sentiment']).count().apply(lambda g: g / g.sum())
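Note that the division above is over the grand total of tweets. If you instead want the positive/negative split within each candidate, one sketch (assuming the same tweet_clean DataFrame) is to normalize inside each handle group:

# Count tweets per (handle, sentiment) and divide by each candidate's total
counts = tweet_clean.groupby(['handle', 'sentiment']).size()
share_within_candidate = counts / counts.groupby(level='handle').transform('sum')
print(share_within_candidate)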
Conclusion
Hopefully, this was a learning experience for everyone. Apparently, both candidates attract more negative sentiment than positive. Try more fun things with this… until the results are out.