Analyzing News on the Week of the 2020 US Election

Before we start: Dataset

We will be using the Historical News API made by Specrom Analytics, which is served through Algorithmia.

import Algorithmia
import pandas as pd

clean_list = []
query = {  # named 'input' in the original; renamed so it doesn't shadow the Python builtin
    'domains': 'cnn.com',
    'topic': 'politics',
    'q': '',
    'qInTitle': '',
    'content': 'false',
    'page': '1',
    'author_only': 'false'
}
client = Algorithmia.client('ENTER_YOUR_ALGO_KEY')
algo = client.algo('specrom/Historical_News_API/0.2.2')
raw_dict = algo.pipe(query).result

clean_list = clean_list + raw_dict["Article"]
news_df = pd.DataFrame(clean_list)
news_df.head()
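The request above fetches only page 1. Assuming the API paginates results and keeps the same response shape on later pages (an assumption; check the Specrom docs), a small helper can pool articles across pages until one comes back empty:

```python
def collect_articles(fetch_page, max_pages=10):
    """Call fetch_page(page_number) for pages 1..max_pages and pool the results.

    fetch_page is any callable that returns a list of article dicts;
    an empty list signals that there are no more pages.
    """
    articles = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break
        articles.extend(batch)
    return articles

# Hypothetical wiring against the Algorithmia client (untested sketch):
# collect_articles(lambda p: algo.pipe({**query_params, 'page': str(p)}).result["Article"])
```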

NLP

“Natural language processing (NLP) is the field of understanding human language using computers.”

Excerpt from: Rajesh Arumugam and Rajalingappaa Shanmugamani, "Hands-On Natural Language Processing with Python"

Common NLP tasks include:

  • Analyzing sentiment
  • Recognizing named entities
  • Translating text
  • Detecting spam

Exploratory Data Analysis (EDA)

Tokenization

First, download and import all the required libraries:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # stopword list used in the next section

from nltk.tokenize import word_tokenize
title_list = news_df['title'].str.cat(sep=' ')  # join with a space so adjacent titles don't run together
tok_title = word_tokenize(title_list)
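word_tokenize splits text into words and breaks punctuation out into its own tokens. A rough stand-in using only the standard library illustrates the idea (NLTK's tokenizer handles many more edge cases than this sketch):

```python
import re

def simple_tokenize(text):
    # Match a word (optionally with an apostrophe clitic such as 's or n't),
    # or any single non-space punctuation character.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(simple_tokenize("Biden says 'no excuse for looting' - CNN"))
```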

Stopword and Punctuation Removal

It's time for an essential step in any NLP EDA: removing stopwords and punctuation.

corpus = list(tok_title)

from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

# Custom stopwords: tokenizer artifacts plus the CNN branding that dominates the titles
new_stop_words = ["'s", "n't", 'opinion', 'Video']
for word in corpus:
    if "CNN" in word:
        new_stop_words.append(word)
new_stop_words = list(set(new_stop_words))

# Add the new stopwords to the existing set
stop = stop.union(new_stop_words)

from string import punctuation
corpus = [word for word in corpus if word not in punctuation]  # drop punctuation tokens
corpus = [word for word in corpus if word not in stop]         # drop stopwords (case-sensitive)

from collections import Counter
counter = Counter(corpus)
most = counter.most_common()

x, y = [], []
for word, count in most[:50]:
    if word not in stop:
        x.append(word)
        y.append(count)

fig_dims = 10, 10
fig, ax = plt.subplots(figsize=fig_dims)
sns.barplot(x=y, y=x, ax=ax, palette="Blues_d")
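The filtering logic can be hard to follow across the list comprehensions; here it is again on a toy token list, with a tiny hard-coded stopword set standing in for NLTK's. Note the .lower() call, which makes the stopword check case-insensitive (the original's check is case-sensitive, so capitalized stopwords like "The" slip through):

```python
from string import punctuation

stop_demo = {'the', 'of', 'in', 'a', "'s", "n't"}  # tiny stand-in for stopwords.words('english')
tokens = ['Trump', "'s", 'lead', 'in', 'the', 'polls', ',', 'explained']

cleaned = [t for t in tokens if t not in punctuation]          # drop punctuation tokens
cleaned = [t for t in cleaned if t.lower() not in stop_demo]   # drop stopwords, ignoring case
print(cleaned)  # ['Trump', 'lead', 'polls', 'explained']
```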

NGram Exploration

N-grams are simply contiguous sequences of n words. If n is 2, the sequence is called a bigram (e.g. "Supreme Court"); for 3 words it is called a trigram, and so on.

from sklearn.feature_extraction.text import CountVectorizer

def get_top_ngram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(n, n)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx])
                  for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:10]

top_n_bigrams = get_top_ngram(news_df['title'], 2)
x, y = map(list, zip(*top_n_bigrams))
fig, ax = plt.subplots(figsize=(10, 10))
sns.barplot(x=y, y=x, ax=ax, palette="Blues_r")

top_n_trigrams = get_top_ngram(news_df['title'], 3)
x, y = map(list, zip(*top_n_trigrams))
fig, ax = plt.subplots(figsize=(10, 10))
sns.barplot(x=y, y=x, ax=ax, palette="Blues_r")

top_n_pentagrams = get_top_ngram(news_df['title'], 5)
x, y = map(list, zip(*top_n_pentagrams))
fig, ax = plt.subplots(figsize=(10, 10))
sns.barplot(x=y, y=x, ax=ax, palette="Blues_r")
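CountVectorizer does the heavy lifting above, but the underlying idea is just sliding a window of n tokens over each title and counting. A dependency-free sketch of the same computation (whitespace splitting stands in for CountVectorizer's tokenizer):

```python
from collections import Counter

def top_ngrams(titles, n, k=10):
    """Count the k most common n-grams across a list of title strings."""
    counts = Counter()
    for title in titles:
        tokens = title.lower().split()
        # Slide a window of n tokens over the title and join each window into one string
        counts.update(' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

print(top_ngrams(["supreme court moves", "supreme court fight"], 2, k=3))
```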

Wordcloud

A wordcloud is a great way to represent text data: the size and color of each word in the wordcloud indicates its frequency or importance.

from wordcloud import WordCloud

def show_wordcloud(data):
    wordcloud = WordCloud(background_color='white', stopwords=stop, max_words=100,
                          max_font_size=30, scale=3, random_state=1)
    wordcloud = wordcloud.generate(' '.join(data))  # join the token list; str(data) would include brackets and quotes
    fig = plt.figure(1, figsize=(15, 15))
    plt.axis('off')
    plt.imshow(wordcloud)
    plt.show()

show_wordcloud(corpus)
  • max_words: Indicates the maximum number of words to be displayed.
  • max_font_size: maximum font size.

Sentiment Analysis

Sentiment analysis is a very common natural language processing task in which we determine whether a text is positive, negative, or neutral. It is especially useful for finding the sentiment of reviews and comments, which can yield valuable insights from text data.

  • polarity: how positive or negative the text is, represented as a floating-point value in the range [-1, 1], where -1 is most negative and 1 is most positive.
  • subjectivity: refers to how much a judgment is shaped by personal opinions and feelings rather than facts. Subjectivity is represented as a floating-point value which lies in the range [0, 1].
# Sentiment Analysis
from textblob import TextBlob

def polarity(text):
    return TextBlob(text).sentiment.polarity

news_df['polarity'] = news_df['description'].apply(polarity)
news_df['polarity'].hist(figsize=(10, 10))
Some of the headlines in the dataset:

  • Early voting: Supreme Court moves in Pennsylvania and North Carolina set up potential post-election court fight over mail-in ballots - CNNPolitics
  • US election 2020: What India thinks of the US election (opinion) - CNN
  • Where Trump and Biden stand in CNN's latest poll of polls
  • Biden says 'no excuse for looting' in wake of Wallace shooting - CNN Video
  • Stay inside this Halloween with your household, doctors say - CNN
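Since TextBlob's polarity is a float in [-1, 1], the histogram can be turned into headline-level labels with a simple thresholding helper (the cutoff here is illustrative, not from the original analysis):

```python
def polarity_label(score, threshold=0.05):
    """Map a polarity score in [-1, 1] to a coarse sentiment label."""
    if score > threshold:
        return 'positive'
    if score < -threshold:
        return 'negative'
    return 'neutral'

for s in (0.6, -0.3, 0.0):
    print(s, polarity_label(s))
```

With a DataFrame in hand this would be applied as, e.g., news_df['polarity'].apply(polarity_label).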

Conclusion

In this article, we saw how to get news data using the Historical News API by Specrom Analytics. We covered data-cleaning methods such as tokenization and stopword removal, did an extensive EDA on the common words and n-grams used in the news titles, and ran a sentiment analysis on the news pieces themselves.
