CNN vs FoxNews vs AP: News Space¶


Summary¶

The following script revolves around embeddings of news article titles, which together make up what I'm going to call news space. We want to know which regions of news space are biased toward the left or the right. To this end, we look at the news space of CNN (left), Fox News (right), and AP (center).

While I hypothesized that there would be pockets of CNN-heavy and Fox-heavy regions around every news topic, denoting a left and a right spin on everything, what I found was that Fox doubles down on politics and does not talk as much about other categories compared to CNN and AP.

Embeddings¶

To understand what I did here, you have to understand what embeddings are. These are 768-dimensional vectors (of numbers), one per news article, made in such a way that articles that are similar in content end up near each other in this embedding space. To read more about this, please go to my article here and go to "Solution 3: Maps."
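
The embeddings used below were precomputed upstream of this script. For a concrete picture of where such vectors come from, here is a minimal sketch, assuming the sentence-transformers library and its all-mpnet-base-v2 model (which outputs 768-dimensional vectors); the actual model used to build this dataset may differ.

In [ ]:
# Illustrative sketch only -- the model name is an assumption, not necessarily
# the one used to build the .feather files loaded below
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-mpnet-base-v2')  # 768-dimensional sentence embeddings
titles = [
    'Senate passes sweeping spending bill',
    'Congress approves new budget deal',
    'Local bakery wins national pastry award',
]
vectors = model.encode(titles)              # shape: (3, 768)
print(cosine_similarity(vectors).round(2))  # the first two titles land closest together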

If you do not know how to code¶

The following document is very code heavy, so you're going to want to scroll all the way to the bottom. There, you will see interactive maps. Just know that each point is a news article title. Articles that are similar in content will be physically near each other on the map. Everything should be self-explanatory except KNN entropy. If this score is low, then that region is dominated by AP, CNN, or Fox News. If it is high, then that region is split relatively evenly between the three news sources.

In [ ]:
import pandas as pd

cnn = pd.read_feather('../data/CNN_sentence_embeddings.feather')
fox = pd.read_feather('../data/FoxNews_sentence_embeddings.feather')
ap = pd.read_feather('../data/AP_sentence_embeddings.feather')
In [ ]:
# For each df, only the first n sentences are used
num_sentences = 50000
cnn = cnn.iloc[:num_sentences]
fox = fox.iloc[:num_sentences]
ap = ap.iloc[:num_sentences]
In [ ]:
# Concatenate all three into a single data frame
df = pd.concat([cnn, fox, ap], ignore_index=True)
In [ ]:
from datetime import datetime

# Keep only articles from the last 60 days, measured relative to the date of the first article
df['Date'] = [i.replace('T', ' ') for i in df['Date']]
df['Date'] = [i.replace('Z', '') for i in df['Date']]
df['Date'] = [datetime.strptime(i, '%Y-%m-%d %H:%M:%S') for i in df['Date']]
df['Time_delta'] = [df['Date'][0] - i for i in df['Date']]
df['Time_delta'] = [i.days for i in df['Time_delta']]
df['Date'] = [str(i) for i in df['Date']]
df = df[df['Time_delta'] < 60].reset_index(drop=True)
In [ ]:
# Calculate the max time deltas for each news User 
max_time_delta = df.groupby('User')['Time_delta'].max().reset_index()
display(max_time_delta)

# Tabulate each User
user_tab = df.groupby('User')['Time_delta'].count().reset_index()
display(user_tab)
      User  Time_delta
0       AP          59
1      CNN          59
2  FoxNews          59

      User  Time_delta
0       AP        4335
1      CNN        6003
2  FoxNews       12246
In [ ]:
import numpy as np

np.random.seed(42)

# Randomly sample each User down to the minimum number of posts across Users
min_posts = min(df['User'].value_counts())
df = df.groupby('User').apply(lambda x: x.sample(min_posts)).reset_index(drop=True)

# Calculate the max time deltas for each news User 
max_time_delta = df.groupby('User')['Time_delta'].max().reset_index()
display(max_time_delta)

# Tabulate each User
user_tab = df.groupby('User')['Time_delta'].count().reset_index()
user_tab
      User  Time_delta
0       AP          59
1      CNN          59
2  FoxNews          59
Out[ ]:
      User  Time_delta
0       AP        4335
1      CNN        4335
2  FoxNews       4335
In [ ]:
# Keep only the first 768 columns (the embedding dimensions)
em = df.iloc[:, 0:768]
display(em.shape)
(13005, 768)

What we're going to do is take the manifold of articles and find the K-nearest neighbors of each article in the original 768-dimensional space. We're then going to look at the distribution of CNN, Fox, and AP within each article's set of neighbors. Are there pockets of only CNN? Only Fox? Only AP?

In [ ]:
import numpy as np
import pandas as pd
from annoy import AnnoyIndex

# Prepare the Annoy index
def build_annoy_index(em, num_trees=10):
    num_items, embedding_size = em.shape
    t = AnnoyIndex(embedding_size, 'angular')
    
    for i in range(num_items):
        t.add_item(i, em.iloc[i, :])
    
    t.build(num_trees)
    return t

# Get the KNNs for each item in the index
def get_knns_for_all(index, k, num_items):
    knns = [index.get_nns_by_item(i, k) for i in range(num_items)]
    return knns

# Build Annoy index
annoy_index = build_annoy_index(em)

# Get the KNNs for each item
k = 20
num_items = em.shape[0]
knn = get_knns_for_all(annoy_index, k, num_items)
In [ ]:
# Pull out the columns of the df that are NOT in em
df = df.iloc[:, 768:]
In [ ]:
# For the indices in question, get the corresponding column called User in df
def get_users_for_knns(knn, df):
    return df.iloc[knn, df.columns.get_loc('User')]

# Get the users for each KNN
users = [get_users_for_knns(knn[i], df) for i in range(num_items)]

# Visualize the first 3
users[0:3]
Out[ ]:
[0             AP
 2181          AP
 281           AP
 12833    FoxNews
 3722          AP
 6051         CNN
 8948     FoxNews
 694           AP
 2451          AP
 11763    FoxNews
 560           AP
 5065         CNN
 10942    FoxNews
 989           AP
 713           AP
 662           AP
 1247          AP
 2084          AP
 8966     FoxNews
 8976     FoxNews
 Name: User, dtype: object,
 1             AP
 4788         CNN
 1715          AP
 5001         CNN
 5825         CNN
 7594         CNN
 2939          AP
 1534          AP
 1783          AP
 2081          AP
 1635          AP
 3572          AP
 1115          AP
 10772    FoxNews
 7584         CNN
 3140          AP
 2442          AP
 1697          AP
 1671          AP
 5888         CNN
 Name: User, dtype: object,
 2        AP
 3240     AP
 6195    CNN
 235      AP
 8455    CNN
 1315     AP
 4145     AP
 3374     AP
 350      AP
 103      AP
 4138     AP
 1931     AP
 1242     AP
 1991     AP
 5546    CNN
 6366    CNN
 8221    CNN
 3482     AP
 7880    CNN
 6718    CNN
 Name: User, dtype: object]

What we're doing below is calculating the Shannon entropy of each article's set of neighbors. The idea is that lower entropy corresponds to neighborhoods that are more one-sided. E.g., for a KNN of 10, a split of 8 CNN, 1 Fox, and 1 AP has a lower entropy than 4 CNN, 3 Fox, and 3 AP. We want to know which regions of news space are one-sided and which are balanced across the news media outlets in question.

In [ ]:
# For each KNN, get the Shannon Entropy of the set of Users
def get_entropy_for_knn(users):
    return pd.Series(users).value_counts(normalize=True).apply(lambda p: -p*np.log(p)).sum()

# Get the entropy for each KNN
entropy = [get_entropy_for_knn(users[i]) for i in range(num_items)]

# Add the entropy to the df
df['knn_entropy'] = entropy
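
As a quick, purely illustrative sanity check of the claim above (not part of the original analysis), we can apply the same entropy function to two hypothetical neighborhoods of 10:

In [ ]:
# Illustration only: a lopsided neighborhood has lower entropy than a balanced one
lopsided = pd.Series(['CNN'] * 8 + ['FoxNews'] * 1 + ['AP'] * 1)
balanced = pd.Series(['CNN'] * 4 + ['FoxNews'] * 3 + ['AP'] * 3)
print(get_entropy_for_knn(lopsided))  # ~0.64
print(get_entropy_for_knn(balanced))  # ~1.09 (the maximum for three sources is ln(3) ~ 1.10)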

Among the low-entropy regions, which are primarily CNN, which are primarily Fox, and which are primarily AP? To answer this, we compute the per-neighborhood fraction of each outlet as a feature we can color by later.

In [ ]:
# Get per-knn fraction CNN or any user
def get_fraction_user_for_knn(users, user='CNN'):
    value_counts = users.value_counts(normalize=True)
    return value_counts.get(user, 0) 

# Get the fraction CNN for each KNN
fraction_cnn = [get_fraction_user_for_knn(users[i], user = 'CNN') for i in range(num_items)]
fraction_fox = [get_fraction_user_for_knn(users[i], user='FoxNews') for i in range(num_items)]
fraction_ap = [get_fraction_user_for_knn(users[i], user='AP') for i in range(num_items)]


# Add each of these to df
df['knn_fraction_cnn'] = fraction_cnn
df['knn_fraction_fox'] = fraction_fox
df['knn_fraction_ap'] = fraction_ap
In [ ]:
# Sort dataframe by entropy
df = df.sort_values(by='knn_entropy', ascending=False)
df
Out[ ]:
Url Date ID ConversationID Language Source User Likes Retweets Replies ... TCooutlinks RetweetedTweet QuotedTweet MentionedUsers Unnamed: 0 Time_delta knn_entropy knn_fraction_cnn knn_fraction_fox knn_fraction_ap
5373 https://twitter.com/CNN/status/162196936511234... 2023-02-04 20:30:06 1.621969e+18 1.621969e+18 en <a href="http://www.socialflow.com" rel="nofol... CNN 181 47 41 ... ['https://t.co/UIRpxEe4pT'] NaN NaN None NaN 58 1.096067 0.35 0.30 0.35
3544 https://twitter.com/AP/status/1625369082898808833 2023-02-14 05:39:21 1.625369e+18 1.625325e+18 en <a href="http://www.socialflow.com" rel="nofol... AP 246 159 24 ... ['https://t.co/MgxFoigPt4'] NaN None None NaN 48 1.096067 0.35 0.30 0.35
4424 https://twitter.com/CNN/status/163705758350353... 2023-03-18 11:45:17 1.637058e+18 1.637055e+18 en <a href="https://about.twitter.com/products/tw... CNN 124 29 12 ... [] NaN NaN None NaN 16 1.096067 0.35 0.30 0.35
11677 https://twitter.com/FoxNews/status/16427236576... 2023-04-03 03:00:14 1.642724e+18 1.642724e+18 en <a href="http://www.socialflow.com" rel="nofol... FoxNews 46 5 19 ... ['https://t.co/0RzFhkYkwK'] NaN NaN None NaN 0 1.096067 0.35 0.35 0.30
3364 https://twitter.com/AP/status/1632874963605528576 2023-03-06 22:45:03 1.632875e+18 1.632875e+18 en <a href="https://trueanthem.com/" rel="nofollo... AP 250 62 23 ... ['https://t.co/SdHqIe9y2Q'] NaN None None NaN 28 1.096067 0.35 0.30 0.35
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5960 https://twitter.com/CNN/status/164176480358366... 2023-03-31 11:30:06 1.641765e+18 1.641765e+18 en <a href="http://www.socialflow.com" rel="nofol... CNN 139 23 18 ... [] NaN NaN [User(username='EvaLongoria', id=110827653, di... NaN 3 0.000000 1.00 0.00 0.00
5961 https://twitter.com/CNN/status/162547622622264... 2023-02-14 12:45:06 1.625476e+18 1.625476e+18 en <a href="http://www.socialflow.com" rel="nofol... CNN 197 44 195 ... ['https://t.co/TOj1UROo9b'] NaN NaN None NaN 48 0.000000 1.00 0.00 0.00
7819 https://twitter.com/CNN/status/163783515809395... 2023-03-20 15:15:05 1.637835e+18 1.637835e+18 en <a href="http://www.socialflow.com" rel="nofol... CNN 245 49 40 ... [] NaN NaN [User(username='EvaLongoria', id=110827653, di... NaN 14 0.000000 1.00 0.00 0.00
6764 https://twitter.com/CNN/status/163894877786329... 2023-03-23 17:00:13 1.638949e+18 1.638949e+18 en <a href="http://www.socialflow.com" rel="nofol... CNN 298 51 35 ... [] NaN NaN [User(username='EvaLongoria', id=110827653, di... NaN 11 0.000000 1.00 0.00 0.00
7881 https://twitter.com/CNN/status/164270098714289... 2023-04-03 01:30:09 1.642701e+18 1.642701e+18 en <a href="http://www.socialflow.com" rel="nofol... CNN 207 39 24 ... [] NaN NaN None NaN 0 0.000000 1.00 0.00 0.00

13005 rows × 25 columns

We use UMAP as a visualization tool to look at our entropy and fraction-per-outlet scores. Note that we don't do the KNN-based computations on the 2-D map; we use the original 768 dimensions. This is because UMAP (and dimensionality reduction in general) is lossy: you lose information when you compress your data from 768 dimensions down to 2.

In [ ]:
import umap
np.random.seed(42)

# Run UMAP on the embeddings (use a new name so we don't shadow the umap module)
umap_df = umap.UMAP(densmap=True, n_components=2, random_state=42).fit_transform(em)

# Name the columns umap1 and umap2
umap_df = pd.DataFrame(umap_df, columns=['umap1', 'umap2'])

# Combine with the df (pd.concat aligns rows on the index)
df = pd.concat([df, umap_df], axis=1)
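
To make the lossiness concrete, here is an illustrative check (not part of the original analysis) that compares each article's neighbors in the original 768-dimensional space with its neighbors in the 2-D UMAP coordinates, using scikit-learn's NearestNeighbors. The overlap between the two neighbor sets is generally incomplete, which is exactly why the KNN statistics above were computed in the original space.

In [ ]:
# Illustration only: how well does the 2-D map preserve the 768-D neighborhoods?
from sklearn.neighbors import NearestNeighbors

# df was sorted by entropy earlier, so restore the original row order to line up
# with em / knn (both of which use the original 0..n-1 indexing)
coords = df.sort_index()[['umap1', 'umap2']].values

nn_2d = NearestNeighbors(n_neighbors=k).fit(coords)
knn_2d = nn_2d.kneighbors(coords, return_distance=False)

# Average fraction of each article's 768-D neighbors that are also its 2-D neighbors
overlap = np.mean([len(set(knn[i]) & set(knn_2d[i])) / k for i in range(num_items)])
print(overlap)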

Below we do a little bit of cleanup to make the visuals easier to navigate. We trim the data so we only look at the last two weeks, though we did the KNN computations on more than that.

In [ ]:
# Drop NaN values from Tweet -- commented out; it appears the NaNs come from the sort by Time_delta
# df = df.dropna(subset=['Tweet'])
df.Tweet  # Debug: inspect the Tweet column
Out[ ]:
5373     On Friday night, a new national record for low...
3544     BREAKING: Police say the suspect in the fatal ...
4424     CNN also identified more than 25 mariners who ...
11677    Melissa Rivers shares do's and don'ts of count...
3364     The war in Ukraine has created a surge in dema...
                               ...                        
5960     .@EvaLongoria’s culinary journey through Mexic...
5961     Here are 5️⃣ things you need to know today:\n\...
7819     .@EvaLongoria is coming to CNN in the new CNN ...
6764     Can you name a more iconic duo than chocolate ...
7881     Dutch cheese is the secret to this beloved Mex...
Name: Tweet, Length: 13005, dtype: object
In [ ]:
df['Tweet'] = [i.split('http')[0] for i in df['Tweet']]
df['Tweet'] = df['Tweet'].str.wrap(30)
df['Tweet'] = df['Tweet'].apply(lambda x: x.replace('\n', '<br>'))  # plotly hover text needs <br> rather than \n for line breaks

We visualize the output using plotly, an interactive graphics library. You can see the article title and other data when you hover over the points of interest.

In [ ]:
import plotly.express as px

# You need this so the plots will render in an exported HTML page
import plotly.io as pio
pio.renderers.default = "notebook_connected"

# Do a plotly scatter and color by entropy
fig = px.scatter(df, x='umap1', y='umap2', color='knn_entropy', size_max = 5, hover_data=['User', 'Tweet', 'Likes', 'knn_entropy'])
fig.show()

fig = px.scatter(df, x='umap1', y='umap2', color='User', size_max = 5, hover_data=['User', 'Tweet', 'Likes', 'knn_entropy'])
fig.show()

# Color by fraction CNN, Fox, or AP
fig = px.scatter(df, x='umap1', y='umap2', color='knn_fraction_cnn', size_max = 5, hover_data=['User', 'Tweet', 'Likes', 'knn_entropy'])
fig.show()

fig = px.scatter(df, x='umap1', y='umap2', color='knn_fraction_fox', size_max = 5, hover_data=['User', 'Tweet', 'Likes', 'knn_entropy'])
fig.show()

fig = px.scatter(df, x='umap1', y='umap2', color='knn_fraction_ap', size_max = 5, hover_data=['User', 'Tweet', 'Likes', 'knn_entropy'])
fig.show()

Now we can look at the KNN entropy of a given news media outlet. If a data point corresponding to Fox has a low KNN entropy, then it is likely in a neighborhood that is almost entirely Fox. This means that a lower average entropy for an outlet corresponds, globally, to more regions that are heavy in that outlet.

In [ ]:
# Make box plots of the entropy for each user, include points
fig = px.box(df, x='User', y='knn_entropy', points='all')
fig.show()
In [ ]:
# Calculate the mean and standard deviation of the entropy for each user
df.groupby('User').agg({'knn_entropy': ['mean', 'std']})
Out[ ]:
        knn_entropy
               mean       std
User
AP         0.845361  0.207332
CNN        0.850937  0.225392
FoxNews    0.805041  0.255270

In general, what I observe is that Fox has more Fox-heavy regions than CNN or AP have of their own, but not by a huge margin once you control for date and randomly sample each outlet down to the same number of posts (necessary because Fox publishes more news stories per day in general). The Fox-heavy regions correspond to the topic of politics, and are often political reactions to events (e.g., Republican X blasts Democratic president Y for incompetence with respect to news event Z). CNN-heavy (or Fox-light) regions involve non-political topics like science, art, and culture. There are also AP-heavy pockets, which I did not expect, as I thought AP would be spread evenly across news space.