Analyzing News on the Week of the 2020 US Election
See which words were used most in CNN Politics headlines during the week of the 2020 US Election
The United States, one of the most powerful and influential countries in the world, is undergoing one of the most controversial and divisive elections the country has ever seen.
News media outlets like CNN, Fox News, CNBC and so on have been watching these events very closely. So what have they mostly been reporting about?
In this article, we will conduct an exploratory data analysis (EDA) on the news articles published by CNN Politics during the week of the 2020 US Election. We will use NLP tools to analyze the words that are most commonly used and also conduct a sentiment analysis.
Before we start: Dataset
We will be using an API made by Specrom Analytics: the Historical News API (see here).
To use it you will need to sign up with Algorithmia, but it's free (no credit card required) and you get 10,000 free credits, which is more than enough for thousands of API calls a month.
import Algorithmia
import pandas as pd

clean_list = []
input = {
    'domains': 'cnn.com',
    'topic': 'politics',
    'q': '',
    'qInTitle': '',
    'content': 'false',
    'page': '1',
    'author_only': 'false'
}

client = Algorithmia.client('ENTER_YOUR_ALGO_KEY')
algo = client.algo('specrom/Historical_News_API/0.2.2')
raw_dict = algo.pipe(input).result
clean_list = clean_list + raw_dict["Article"]

news_df = pd.DataFrame(clean_list)
news_df.head()
This gives an output of 1,074 articles, all dated from 30 October 2020 to 3 November 2020. For this analysis we keep only the 'description' and 'title' columns.
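Trimming the dataframe down to those two columns is a one-liner (a minimal sketch; the column names come straight from the API response above):

# Keep only the two columns used in the rest of the analysis
news_df = news_df[['title', 'description']]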
The final dataset looks like this:
NLP
“Natural language processing (NLP) is the field of understanding human language using computers.”
Excerpt from Rajesh Arumugam and Rajalingappaa Shanmugamani, "Hands-On Natural Language Processing with Python"
Natural Language Processing covers a broad range of techniques for processing and understanding human language. Some applications where NLP is used:
- Searching
- Analyzing sentiment
- Recognizing named entities
- Translating text
- Detecting spam
Most of the EDA on this dataset will be done using the NLP-friendly Python package NLTK.
Exploratory Data Analysis
Tokenization
First, download and import all the required libraries:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
nltk.download('punkt')      # tokenizer data needed by word_tokenize below
nltk.download('stopwords')  # stopword list used later on
Now let's do some headline analysis. To do this we must first gather all the words from the headlines into a single list. This process is called tokenization, a popular preprocessing step in NLP that breaks sentences into words so that each word becomes an element of a list.
from nltk.tokenize import word_tokenize

# Join every headline into one string (separated by spaces so adjacent headlines
# don't merge into a single word), then split the string into word tokens
title_list = news_df['title'].str.cat(sep=' ')
tok_title = word_tokenize(title_list)
Stopword and Punctuation Removal
It's time for an essential step in any NLP EDA: removing stopwords and punctuation.
Punctuation marks would only add noise to the tokenized headlines, so they are stripped out first.
Stopword removal is a common preprocessing step in NLP applications that drops very common English words such as "it", "is", and "he".
from nltk.corpus import stopwords

corpus = [word for word in tok_title]
stop = set(stopwords.words('english'))
In our case, there are also some tokens such as "CNN" and "CNNPolitics" that appear in the titles and must be removed as well. This is done as follows:
new_stop_words = ["'s", "n't", 'opinion', 'Video']

for i in corpus:
    if "CNN" in i:
        new_stop_words.append(i)

new_stop_words = list(set(new_stop_words))

# Add the new stopwords to the existing set of stopwords
stop = stop.union(new_stop_words)
Removing punctuation and stopwords:
from string import punctuation

corpus = [word for word in corpus if word not in punctuation]
corpus = [word for word in corpus if word not in stop]
Seeing the most common words used:
from collections import Counter

counter = Counter(corpus)
most = counter.most_common()

x, y = [], []
for word, count in most[:50]:
    if word not in stop:
        x.append(word)
        y.append(count)

fig_dims = 10, 10
fig, ax = plt.subplots(figsize=fig_dims)
sns.barplot(x=y, y=x, ax=ax, palette="Blues_d")
We can see that the words "Election" and "Trump" have each been used more than 200 times.
NGram Exploration
N-grams are simply contiguous sequences of n words, for example "river bank" or "The Three Musketeers".
If the number of words is two, it is called a bigram; for three words it is a trigram, and so on.
Looking at the most frequent n-grams can give you a better understanding of the context in which a word was used.
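To make the idea concrete, here is a tiny sketch using NLTK's ngrams helper on a made-up sentence (not part of the dataset):

from nltk.tokenize import word_tokenize
from nltk.util import ngrams

# A made-up example sentence, just to show what bigrams look like
tokens = word_tokenize("Election Day arrives in the United States")
print(list(ngrams(tokens, 2)))
# [('Election', 'Day'), ('Day', 'arrives'), ('arrives', 'in'), ('in', 'the'), ('the', 'United'), ('United', 'States')]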
To build a representation of our vocabulary we will use CountVectorizer. CountVectorizer is a simple way to tokenize, vectorize and represent the corpus in an appropriate form. It is available in sklearn.feature_extraction.text.
Bigrams:
from sklearn.feature_extraction.text import CountVectorizer

def get_top_ngram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(n, n)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx])
                  for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:10]

top_n_bigrams = get_top_ngram(news_df['title'], 2)[:10]
x, y = map(list, zip(*top_n_bigrams))

fig_dims = 10, 10
fig, ax = plt.subplots(figsize=fig_dims)
sns.barplot(x=y, y=x, ax=ax, palette="Blues_r")
Trigrams:
top_n_trigrams = get_top_ngram(news_df['title'], 3)[:10]
x, y = map(list, zip(*top_n_trigrams))

fig_dims = 10, 10
fig, ax = plt.subplots(figsize=fig_dims)
sns.barplot(x=y, y=x, ax=ax, palette="Blues_r")
Pentagrams (5-grams):
top_n_pentagrams = get_top_ngram(news_df['title'], 5)[:10]
x, y = map(list, zip(*top_n_pentagrams))

fig_dims = 10, 10
fig, ax = plt.subplots(figsize=fig_dims)
sns.barplot(x=y, y=x, ax=ax, palette="Blues_r")
As you can see, the larger n gets, the more redundant the phrases become.
Wordcloud
A wordcloud is a great way to represent text data. The size and color of each word that appears in the wordcloud indicate its frequency or importance.
Creating a wordcloud with Python is easy, but we need the data in the form of a corpus. Luckily, we created one when we visualized the common words earlier.
from wordcloud import WordCloud

def show_wordcloud(data):
    wordcloud = WordCloud(background_color='white', stopwords=stop, max_words=100,
                          max_font_size=30, scale=3, random_state=1)
    wordcloud = wordcloud.generate(str(data))

    fig = plt.figure(1, figsize=(15, 15))
    plt.axis('off')
    plt.imshow(wordcloud)
    plt.show()

show_wordcloud(corpus)
As we can see, words and phrases such as "Election Day" and "Trump" come up very often.
There are many parameters that can be adjusted. Some of the most prominent ones are:
- stopwords: The set of words that are blocked from appearing in the image.
- max_words: Indicates the maximum number of words to be displayed.
- max_font_size: maximum font size.
There are many more options to create beautiful word clouds.
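For instance, a small variation of the helper above (the colormap and canvas size below are arbitrary example values, not from the original plot) could look like this:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# A variation with an explicit canvas size and colormap (arbitrary example values)
wc = WordCloud(background_color='black', colormap='Blues',
               width=800, height=400, max_words=100,
               stopwords=stop, random_state=1).generate(str(corpus))

plt.figure(figsize=(15, 8))
plt.axis('off')
plt.imshow(wc, interpolation='bilinear')
plt.show()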
Sentiment Analysis
Sentiment analysis is a very common natural language processing task in which we determine whether a piece of text is positive, negative or neutral. It is very useful for finding the sentiment associated with reviews and comments, which can give us valuable insights from text data.
Here we are using an NLTK-friendly sentiment analysis package: TextBlob.
TextBlob is a Python library built on top of NLTK. It has been around for some time and is very easy and convenient to use.
The sentiment property of TextBlob returns two values (a quick example follows this list):
- polarity: a floating-point number in the range [-1, 1], where 1 means a positive statement and -1 means a negative statement.
- subjectivity: refers to how much the text is shaped by personal opinions and feelings rather than facts. Subjectivity is represented as a floating-point value in the range [0, 1].
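A minimal example (the sentence is made up, and the exact scores depend on the TextBlob version):

from textblob import TextBlob

# A made-up sentence to show the two scores
blob = TextBlob("The election results were surprisingly good news")
print(blob.sentiment.polarity)      # a value in [-1, 1]
print(blob.sentiment.subjectivity)  # a value in [0, 1]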
We apply TextBlob to our news descriptions to check the polarity.
# Sentiment Analysis
from textblob import TextBlob

def polarity(text):
    return TextBlob(text).sentiment.polarity

news_df['polarity'] = news_df['description'].apply(lambda x: polarity(x))
news_df['polarity'].hist(figsize=(10, 10))
A few of the headlines being analyzed, for reference:
- Early voting: Supreme Court moves in Pennsylvania and North Carolina set up potential post-election court fight over mail-in ballots - CNNPolitics
- US election 2020: What India thinks of the US election (opinion) - CNN
- Where Trump and Biden stand in CNN's latest poll of polls
- Biden says 'no excuse for looting' in wake of Wallace shooting - CNN Video
- Stay inside this Halloween with your household, doctors say - CNN
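To go a step further than the histogram, one option (a sketch for further exploration, not part of the original analysis) is to sort the dataframe by polarity and look at both extremes:

# Sort the articles by their polarity score and inspect the extremes
scored = news_df.sort_values('polarity')

print(scored[['title', 'polarity']].head())  # articles with the most negative descriptions
print(scored[['title', 'polarity']].tail())  # articles with the most positive descriptions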
Conclusion
In this article we saw how to get news data using the Historical News API by Specrom Analytics. We also saw data-cleaning methods such as tokenization and stopword removal. We then did an extensive EDA on the common words used in the news articles and ran a sentiment analysis on the news pieces themselves.
Hopefully, the tools and techniques used here have been useful to you.
Happy Exploring!