Headline Analysis John Qiu William McKeehan Joshua Chavarria

Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare



Headline Analysis

John Qiu

William McKeehan

Joshua Chavarria


Test Questions

1. What is graph clustering?

2. What is one of the graph clustering algorithms that was implemented in our headline analysis?

3. What is the name of the API used to collect our data?


John Qiu

• Born in China, came to America at age 2 - Grew up in Franklin, TN

• BBA in Economics, Minor in Math - May 2014

• MS in Business Analytics - Dec 2016

• Work at Oak Ridge National Lab - Health Data Sciences Institute

• Focus on Natural Language Processing


William McKeehan

www.mckeehan.info


Joshua Chavarria

• Computer Science Major

• Hometown: Los Angeles, CA

• Interests:

• Gaming

• Soccer

• Guitar

• Traveling


Introduction

● With headline analysis we are clustering keywords in headlines from a variety of sources in order to compare them.

● Our hypothesis is that sources with different perspectives are going to have different associations within their headlines.

● (For example, CNN is more likely to have Trump in a headline with Russia, whereas Fox might have Trump mentioned with Business.)


Motivation

• We believe that by looking at the associations within the headlines of the sources, we can identify the different narratives of each source.

• Goal: Compare a subset of news sources in order to show that sources with differing perspectives have different associations within their headlines.


Outline

• Approach

• Overview

• Algorithms

• Applications

• Implementation

• Open Issues


Approach

1) Gather headlines from news sources

2) Extract entities

3) Note relationships between co-occurrences

4) Use clustering algorithms to aggregate the relationships and compare sources


Overview of Cluster Analysis

• Cluster Analysis is not an algorithm, but rather a group of algorithms.

• Any nonuniform data contains underlying structure due to the heterogeneity of the data. The process of identifying this structure in terms of grouping the data elements is called clustering.

• Graph clustering is the process of finding sets of related vertices in a graph and grouping them into “clusters”.

• This is a common technique amongst various fields, such as statistical data analysis, data mining, and pattern recognition.


Overview of Cluster Analysis: Visual Example


Overview of Cluster Analysis

• Given a data set, the goal of clustering is to divide the data set into clusters such that the elements assigned to a particular cluster are similar or connected.

• Desirable Cluster Properties in Graphs:
  • At least one path connecting each pair of vertices within a cluster.
  • If vertex u can’t reach vertex v, they should not be in the same cluster.
  • A subset of vertices forms a good cluster if the induced subgraph is dense.
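
The desirable properties above can be checked mechanically for a candidate cluster. The following is a minimal sketch, not the presenters' code: the function names and the toy edge list are hypothetical, and it assumes an unweighted undirected graph given as a list of unique vertex pairs without self-loops.

```python
from collections import deque

def induced_edges(edges, cluster):
    """Return the edges whose two endpoints both lie inside `cluster`."""
    return [(u, v) for u, v in edges if u in cluster and v in cluster]

def is_connected(edges, cluster):
    """True if every pair of vertices in `cluster` is joined by a path inside it."""
    cluster = set(cluster)
    if not cluster:
        return True
    neighbors = {v: set() for v in cluster}
    for u, v in induced_edges(edges, cluster):
        neighbors[u].add(v)
        neighbors[v].add(u)
    start = next(iter(cluster))
    seen, queue = {start}, deque([start])
    while queue:  # breadth-first search restricted to the cluster
        for nxt in neighbors[queue.popleft()] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen == cluster

def density(edges, cluster):
    """Fraction of possible vertex pairs in `cluster` that are joined by an edge."""
    n = len(set(cluster))
    if n < 2:
        return 0.0
    return len(induced_edges(edges, cluster)) / (n * (n - 1) / 2)

# Hypothetical keyword graph: an edge means two words shared a headline.
edges = [("trump", "russia"), ("russia", "putin"), ("trump", "putin"),
         ("putin", "health"), ("bill", "aca")]

good = {"trump", "russia", "putin"}
bad = {"trump", "bill"}
print(is_connected(edges, good), density(edges, good))
print(is_connected(edges, bad), density(edges, bad))
```

Here the first candidate is connected and fully dense, while the second contains two vertices that cannot reach each other and so fails the path requirement.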


Graph Clustering Algorithms: Intro

In a graph setting, clustering means partitioning the graph so that the total edge weight within a group is large and the total edge weight across groups is small.
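
As a rough illustration of this criterion, the sketch below splits the total edge weight of a weighted undirected graph into the part that stays inside groups and the part that crosses between them. The function name, edge weights, and group labels are made up for the example.

```python
def weight_split(weighted_edges, assignment):
    """Return (within, across): total edge weight inside groups vs. between groups.

    weighted_edges maps an undirected edge (u, v) to its weight;
    assignment maps each vertex to a group label.
    """
    within, across = 0.0, 0.0
    for (u, v), w in weighted_edges.items():
        if assignment[u] == assignment[v]:
            within += w
        else:
            across += w
    return within, across

# Hypothetical co-occurrence weights between headline keywords.
weighted_edges = {
    ("trump", "russia"): 5,
    ("trump", "putin"): 3,
    ("russia", "putin"): 4,
    ("putin", "health"): 1,
    ("health", "bill"): 6,
}
assignment = {"trump": 0, "russia": 0, "putin": 0, "health": 1, "bill": 1}

within, across = weight_split(weighted_edges, assignment)
print(within, across)
```

A partition with a large first value and a small second value matches the description above; comparing the pair across candidate partitions is one simple way to rank them.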


Algorithms: Hierarchical Clustering

• A global clustering algorithm that creates a hierarchical decomposition of sets of objects using a similarity matrix.

• Two Methods
  • Agglomerative Approach (Bottom-Up)
  • Divisive Approach (Top-Down)

• Advantages:
  • Easy to implement and more robust to noise.

• Disadvantages:
  • Computationally demanding for large data sets.
  • Hard to identify clusters from the dendrogram


Agglomerative Hierarchical Clustering Pseudo Code:

Using Cosine Similarity as Similarity Measure:

• Initialize all vertices as individual clusters
• Using the adjacency matrix, calculate the pairwise similarity between all vertices
• Merge the two most similar clusters, where the similarity between two clusters is either:
  • the similarity of their most similar pair of vertices (single-linkage clustering), or
  • the similarity of their least similar pair of vertices (complete-linkage clustering)
• Update the cluster similarities
• Repeat until the desired number of clusters remains

Complexity:
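
The agglomerative procedure can be sketched in plain Python as follows. This is an illustrative, naive version (roughly cubic time in the number of vertices), not the code used for the results in these slides. The function names and the toy adjacency matrix are hypothetical, and the matrix is assumed to carry 1s on its diagonal so that each vertex counts as its own neighbor when rows are compared.

```python
import math

def cosine(a, b):
    """Cosine similarity between two adjacency-matrix rows."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def agglomerative(adjacency, num_clusters, linkage="single"):
    """Merge the two most similar clusters until num_clusters remain."""
    n = len(adjacency)
    sim = [[cosine(adjacency[i], adjacency[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]  # every vertex starts as its own cluster
    # Single linkage scores two clusters by their most similar pair of vertices,
    # complete linkage by their least similar pair.
    pick = max if linkage == "single" else min
    while len(clusters) > num_clusters:
        best, best_pair = -math.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = pick(sim[i][j] for i in clusters[a] for j in clusters[b])
                if score > best:
                    best, best_pair = score, (a, b)
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy symmetric adjacency matrix: vertices 0-2 are tightly linked, 3-4 are
# linked, and one edge (2, 3) bridges the two groups.
adjacency = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
]
print(agglomerative(adjacency, 2, linkage="single"))
print(agglomerative(adjacency, 2, linkage="complete"))
```

On this small example both linkage rules recover the same two groups; on noisier graphs single linkage tends to chain clusters together through bridge edges, while complete linkage favors compact clusters.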


Applications

• Clustering is often used to automatically generate feature representation for data corresponding to a defined similarity measure.

• Specific uses include:
  • Dimensionality reduction
  • Multi-objective optimization
  • Outlier/Anomaly detection
  • Segmentation

• Applications:
  • Recommendation systems - classifying users based on preferences
  • Image Segmentation - classifying sections of images based on similar pixels


Implementation: Data Collection

• Collect/compare headlines

• EventRegistry.org
  • Free
  • Over 100,000 news publishers
  • API
  • Python Library


Bad Data Examples

• “DA seeks to revoke bond for accused drunk driver”

• “Levant Mediterranean dishes up small plates with big flavor”

• “Manalapan (2) at Colts Neck (19) - Girls Lacrosse”

• “Checheche Catholic priest in sex scandal - Nehanda Radio”


Headline Data Summary

Native Media   Source                # Articles   Political Affiliation
Agency         Reuters               426          NA
               Associated Press      688          NA
Cable          Fox News              184          Trump
               MSNBC                 60           Clinton
Internet       Breitbart             81           Trump
               The Huffington Post   254          Clinton
Network        ABC News              134          Both
               CBS News              78           Both
               NBC News              96           Both
Newspaper      New York Times        306          Clinton
Radio          NPR.org               158          Clinton


Descriptive Statistics


How Can We Use Clustering to Analyze our Headlines and Compare Sources?

We will be working with weighted undirected graphs to represent our data in two ways:

Word Level Representation:

Clustering a single source’s word co-occurrence graph gives an abstraction of related content that can be compared between sources.

Document Level Representation:

Use document representation similarity measures for all documents to reveal similarities.
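
A word-level graph of this kind can be built with a few lines of Python. The sketch below is an assumption about one reasonable construction, not the presenters' code: vertices are lowercased word tokens, and the weight of an undirected edge is the number of headlines in which both words appear. The function names and sample headlines are invented.

```python
import re
from collections import Counter
from itertools import combinations

def tokenize(headline):
    """Lowercase a headline and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", headline.lower())

def cooccurrence_graph(headlines):
    """Map each undirected edge (u, v), with u < v, to its co-occurrence count."""
    weights = Counter()
    for headline in headlines:
        words = sorted(set(tokenize(headline)))  # count each word once per headline
        for u, v in combinations(words, 2):
            weights[(u, v)] += 1
    return weights

# Hypothetical headlines from a single source.
headlines = [
    "Trump meets Putin",
    "Putin denies Russia claims",
    "Trump and Russia",
    "Trump Russia probe widens",
]
graph = cooccurrence_graph(headlines)
print(graph[("russia", "trump")])
print(graph[("putin", "russia")])
```

Storing each edge under a sorted pair keeps the graph undirected, so looking up a pair in either order only requires sorting the two words first. Building one such graph per source gives the inputs that the clustering step would then compare.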


How do Computers See/Read/Get Information From Text?

1) Learn to count words

2) Learn which words to count

3) Learn to produce representations of words


1) Term Document Vector/Matrix (Salton 1968)

Definition: A document D from a corpus with n unique terms can be represented by a Term Document Vector D = [d1,...,dn] of length n, where di counts the occurrences of the i-th term in D.

Pros:

• Quick to generate/normalize.

• Simple to interpret

• Introduced similarity measures to text data: Euclidean distance and centroid clustering (Salton 1975)

Cons:

• Huge dimensionality, but very sparse

• No language structure (word order is lost)

• Not how words work
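
To make the definition concrete, here is a small standard-library sketch that builds term document vectors and compares two of them with Euclidean distance. The `min_df` parameter (drop terms that appear in fewer than `min_df` documents) is an assumption suggested by the "mindf2 vocab size" lines in the output on the following slides; the function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a document and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocab(docs, min_df=1):
    """Sorted list of terms that occur in at least min_df documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    return sorted(term for term, count in doc_freq.items() if count >= min_df)

def term_vector(doc, vocab):
    """Term document vector: one count per vocabulary term, in vocab order."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocab]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

docs = [
    "Trump meets Putin",
    "Trump and Russia",
    "Russia probe widens",
    "Putin visits Beijing",
]
vocab = build_vocab(docs, min_df=2)
vectors = [term_vector(doc, vocab) for doc in docs]
print(len(build_vocab(docs)), len(vocab))
print(vocab)
print(vectors)
print(euclidean(vectors[0], vectors[1]))
```

Raising `min_df` shrinks the vocabulary sharply, which is the same effect visible in the reported vocabulary sizes, at the cost of discarding rare words that may be informative.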


Reuters

num articles: 426

orig vocab size 1587

mindf2 vocab size 607

vocab size 607

clust finished in 0.463397979736

words related to trump

right

rutte

fillon


Associated Press

---- Associated Press -------------------------------------

num articles: 688

orig vocab size 1687

mindf2 vocab size 860

vocab size 860

clust finished in 0.377697944641

words related to trump

conservative


---- Fox News -------------------------------------

num articles: 184

orig vocab size 938

mindf2 vocab size 277

vocab size 277

clust finished in 0.170491933823

words related to trump

to

2016

struggle

starts

but

own


---- MSNBC -------------------------------------

num articles: 60

orig vocab size 274

mindf2 vocab size 64

vocab size 64

clust finished in 0.00706195831299

words related to trump

up


Breitbart

num articles: 81

orig vocab size 559

mindf2 vocab size 137

vocab size 137

clust finished in 0.0235621929169

words related to trump

gorsuch

for

or

clinton


The Huffington Post

num articles: 254

orig vocab size 1239

mindf2 vocab size 362

vocab size 362

clust finished in 0.130997180939

words related to trump

election

didn

nomination

now

moonlight


ABC News

num articles: 134

orig vocab size 633

mindf2 vocab size 177

vocab size 177

clust finished in 0.0359399318695

words related to trump

lawmakers

aca

bill

listening

her

blueprint

himself

prosecutor


CBS News

num articles: 78

orig vocab size 457

mindf2 vocab size 87

vocab size 87

clust finished in 0.0120129585266

words related to trump

putin

health

russia


NBC News

num articles: 96

orig vocab size 513

mindf2 vocab size 176

vocab size 176

clust finished in 0.0229661464691

words related to trump

flynn


The New York Times

num articles: 306

orig vocab size 1262

mindf2 vocab size 345

vocab size 345

clust finished in 0.0602300167084

words related to trump

independence

let

pen

post

france

america

nears

ties

looks

foreign

pennsylvania

being

stories

at


Results

total num articles: 2465

orig vocab size 4636

mindf2 vocab size 2332

vocab size 2332

clust finished in 11.3553888798

words related to trump

governing

negotiate

feeling

that

camp

bad

citizens

gay

backing

demands

beijing

sparks

homes

partner

hike


2) Better Representations from Labeled Datasets

Part of Speech Tagging:

Brown Corpus (1960s): about 1,000,000 words tagged with part of speech

Lemmatization - mapping words to a root form:

E.g. [Franch, French] -> French


Open Issues

• Parameter selection

• Scalability

• Evaluation

• Fake News


Issue - Parameter selection

• How do you determine the parameter values to give as input to the clustering algorithm?


Issue - Scalability

• How do the runtime and memory consumption of the algorithm behave for massive input graphs?


Issue - Evaluation

• How do we decide which clustering is the best?


Issue - Fake News


References

http://www.lsi.upc.edu/~bejar/amlt/articulos/Graph%20Clustering03.pdf

http://world.mathigon.org/Graph_Theory

http://micans.org/mcl/

http://searchengineland.com/google-news-ranking-stories-30424

http://cs-people.bu.edu/mp/images/pap101a.pdf

https://en.wikipedia.org/wiki/Named-entity_recognition

https://en.wikipedia.org/wiki/Parse_tree


Discussion