BUZZ FEEDER: FINDING OUT THE TRENDS BEHIND WHAT’S TRENDING

Georgetown Data Science - Team BuzzFeed


Page 1: Georgetown Data Science - Team BuzzFeed

BUZZ FEEDER
FINDING OUT THE TRENDS BEHIND WHAT’S TRENDING

Page 2: Georgetown Data Science - Team BuzzFeed

TEAM

➔ Anurag Khaitan
➔ Josh Erb
➔ Walter Tyrna

Page 3: Georgetown Data Science - Team BuzzFeed

CONTEXT

Page 4: Georgetown Data Science - Team BuzzFeed

WHAT IS BUZZFEED?

“BuzzFeed is a cross-platform, global network for news and entertainment that generates seven billion views each month. BuzzFeed creates and distributes content for a global audience and utilizes proprietary technology to continuously test, learn and optimize.”

(buzzfeed website)

More than 7 billion monthly global content views

More than 200M monthly unique visitors to BuzzFeed.com

11 international editions including US, UK, Germany, Español, France, Spain, India, Canada, Mexico, Brazil, Australia and Japan

(buzzfeed website)

Page 5: Georgetown Data Science - Team BuzzFeed

PROBLEM

• There is good money to be made from consistently generating popular content on the internet.

• A significant portion (20%-30%) of BuzzFeed’s articles generate very little traffic.

Page 6: Georgetown Data Science - Team BuzzFeed

HYPOTHESIS

We believe there may be a correlation between the language associated with an article (title, description, tags, etc.) and how likely it is to go viral. We also believe that this likelihood is tied to the country in which an article goes viral.

Page 7: Georgetown Data Science - Team BuzzFeed

WHY DOES IT MATTER?

• 20%-30% of the articles we pulled gained little traction.

• BuzzFeed could hypothetically save money and improve user experience by informing content decisions with the topics that consistently draw readership.

Page 8: Georgetown Data Science - Team BuzzFeed

SOLUTION APPROACH

• Visualization to help identify underlying themes in a given dataset through three lenses: the title, the content of the article itself, or the tags ascribed to it by the author.

• Title Generator to suggest topics and themes based upon recent trends in social media, guiding the editing staff toward content that is likely to generate significant online traffic.

• Given a sufficient number of articles in our data and trending topics, we believe the output of a reasonable title generator can be fed into a predictor to help assess its potential virality.

Page 9: Georgetown Data Science - Team BuzzFeed

OUR EXPERIENCE

Page 10: Georgetown Data Science - Team BuzzFeed

INGESTION

“You need to start pulling data, like, now.”
- Ben Bengfort, 1st Day of Class

➔ Project required us to gather data from 5 separate public APIs

➔ Before anything else, it was necessary to automate the process of querying the APIs

➔ Set up an Ubuntu instance on Amazon Web Services’ Elastic Compute Cloud (EC2)

➔ Run a Python script hourly (crontab) to capture .json files on a server-side WORM -- 5 calls/hour, one each for Australia, Canada, India, the UK and the US (see the sketch below)

Data collection began: May 18, 2016.

Data collection ended: Aug 31, 2016.

Total raw data size in WORM: 1.16GB.

Number of records pulled: 330,000 (25 articles/hr each for 5 countries for 100 days)
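
A minimal sketch of that hourly pull, assuming a requests-based script; the endpoint URL, edition codes, and paths here are illustrative assumptions, not BuzzFeed’s actual API contract:

# pull_buzz.py -- run hourly via crontab: 0 * * * * /usr/bin/python3 /home/ubuntu/pull_buzz.py
import json
import time
from pathlib import Path
import requests

EDITIONS = ["aus", "ca", "in", "uk", "us"]   # Australia, Canada, India, UK, US
WORM_DIR = Path("/data/worm")                # write-once storage directory

def pull(edition):
    # Hypothetical feed endpoint standing in for the real public API URL.
    url = "https://www.buzzfeed.com/api/v2/feeds/%s" % edition
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    stamp = time.strftime("%Y%m%d%H")
    # One file per edition per hour; files are never modified afterward (WORM).
    (WORM_DIR / ("%s_%s.json" % (edition, stamp))).write_text(json.dumps(resp.json()))

if __name__ == "__main__":
    for edition in EDITIONS:
        pull(edition)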

Page 11: Georgetown Data Science - Team BuzzFeed

ARCHITECTURE

Page 12: Georgetown Data Science - Team BuzzFeed

WRANGLING

➔ Clean Raw Data
◆ Remove tags, images and other content outside the scope of our analysis
◆ Used insight from this to drop irrelevant variables and identify gaps that could be accounted for

➔ Understand Target Variable (Measure of Virality)
◆ A frequency column to understand how each article was “persisting”, as a measure of virality
◆ Understand the accuracy and applicability of the Number of Impressions provided in the data

➔ Capture all Instances, Features and Target Variables in a Postgres table to use downstream in the pipeline (load sketch below)
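
A minimal sketch of that capture step, assuming pandas and SQLAlchemy; the connection string, table name, and columns are hypothetical stand-ins for the team’s actual schema:

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string and table name.
engine = create_engine("postgresql://buzz:buzz@localhost:5432/buzzfeeder")

# One row per instance: cleaned features plus the target inputs.
clean = pd.DataFrame({
    "title": ["27 Food Hacks You Need This Summer"],
    "description": ["Because dinner should be easy."],
    "category": ["food"],
    "impressions": [120000],
    "frequency": [48],   # hours on the country page
})

clean.to_sql("articles", engine, if_exists="append", index=False)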

Page 13: Georgetown Data Science - Team BuzzFeed

WHAT DOES THE DATA LOOK LIKE?

[Chart: share of records by edition, as ordered in the legend - Australia 9%, Canada 5%, India 7%, UK 17%, US 62%]

Page 14: Georgetown Data Science - Team BuzzFeed

ANALYSIS

➔ Word Clouds
◆ What terms “jump out”?

➔ Natural Language Toolkit
◆ What sorts of analysis can we run on our textual data?

➔ Scikit-Learn
◆ What can machine learning models help us predict?

Page 15: Georgetown Data Science - Team BuzzFeed

TOP TERMS: Tags

Australia: game, thrones, australia, season, 6, fan, twitter, quiz, stark, hot

Canada: canada, canadian, news, social, quiz, animals, twitter, funny, lol, food

India: social, news, india, bollywood, indian, twitter, desi, khan, stories, women

UK: quiz, british, uk, food, trivia, twitter, you, funny, celebrity, 00s

US: test, quiz, food, recipes, you, funny, news, social, summer, music

● The United States, United Kingdom, and Canada share the most similar top tags (as well as titles), while Australia and India have more distinct preferences.

● Articles about Game of Thrones - and television in general - fare better on BuzzFeed Australia.

● “Women/woman” only appears on the top list for India, perhaps reflective of readership.

● Twitter does well across all five groups - evidence of the popularity of listicles (“27 Times Mindy Kaling Was Just Too Relatable On Twitter”)

Page 16: Georgetown Data Science - Team BuzzFeed

WORD CLOUDS: Tags
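
The cloud images themselves don’t survive the text export; as a stand-in, here is a minimal sketch of rendering one with the wordcloud package listed on the Tools slide, using toy tag text:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Toy stand-in for one country's concatenated tag text.
tag_text = "quiz quiz quiz food twitter funny news social food summer music"

# Word size scales with term frequency in the input text.
wc = WordCloud(width=800, height=400, background_color="white").generate(tag_text)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()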

Page 17: Georgetown Data Science - Team BuzzFeed

WORD CLOUDS: Titles

Page 18: Georgetown Data Science - Team BuzzFeed

TITLE GENERATOR

• Generated a corpus of all the unique titles from API pulls

• Natural Language Toolkit: TrigramCollocationFinder & TrigramAssocMeasures

• Grabbing the most likely subsequent words using likelihood ratios

• Introduced minor stochasticity to prevent it always producing the same titles (see the sketch below)

• Notable Examples:
– “Canada Goose Is Most Calories”
– “You More Hilary Duff or Lohan?”
– “What Game of Thrones Fan if You Guess We Thrones”
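
A minimal sketch of that pipeline, with a toy three-title corpus standing in for the real one; the seed bigram and top-k cutoff are illustrative choices:

import random
from collections import defaultdict
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder

# Toy corpus standing in for the unique titles collected from the API.
titles = [
    "which game of thrones character are you",
    "game of thrones fans will love this quiz",
    "game of thrones season 6 was the best",
]
words = [w for t in titles for w in t.split()]

finder = TrigramCollocationFinder.from_words(words)
# score_ngrams returns trigrams ordered from highest to lowest likelihood ratio.
scored = finder.score_ngrams(TrigramAssocMeasures().likelihood_ratio)

# Map each leading bigram to its continuations, best-scored first.
next_words = defaultdict(list)
for (w1, w2, w3), _score in scored:
    next_words[(w1, w2)].append(w3)

def generate_title(seed, length=8, top_k=3):
    # Walk the trigram chain; sampling among the top-k continuations adds
    # the minor stochasticity mentioned above.
    w1, w2 = seed
    out = [w1, w2]
    for _ in range(length - 2):
        candidates = next_words.get((w1, w2))
        if not candidates:
            break
        w1, w2 = w2, random.choice(candidates[:top_k])
        out.append(w2)
    return " ".join(out)

print(generate_title(("game", "of")))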

Page 19: Georgetown Data Science - Team BuzzFeed

FEATURE SELECTION
WHAT FEATURES ARE THE MOST TELLING - HYPOTHESIS

CATEGORY: SOME SIGNAL
There are 140+ categories on BuzzFeed. Is there a relationship between the categories and virality?

METAVALUE: TOO BROAD - NO SIGNAL
How many keywords are there? What is the relationship between virality and certain keywords?

➔ Each “Buzz” had 36 data points
◆ Some of these data points were standardized
◆ Some of them were not

➔ A significant number of these data points did not contain any signal

➔ Other than category, the only fields that contained signal held text/words contained in the article:
◆ Description, Title, Primary Keywords
◆ Tags, containing phrases and words

Page 20: Georgetown Data Science - Team BuzzFeed

TARGET: MEASURE OF VIRALITY

IMPRESSIONS
Number of times an article is viewed.

FREQUENCY
Number of hours an article stays on a country’s BuzzFeed page.

➔ Impressions: Inaccurate and aggregated measure in the snapshot

➔ Frequency: Another measure, but not always aligned with the corresponding impressions provided in the instance

➔ Some f(Impressions, Frequency) worked

➔ Needed to use the function to identify classes

➔ Log transformation to account for wide variability and skewed distribution, as follows:

Virality = log(Impressions × Frequency)
Non-Viral: Virality < mean - standard deviation
Viral: Virality >= mean - standard deviation
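
A minimal sketch of the transformation and labeling, with toy values standing in for the real impressions and frequency columns:

import numpy as np
import pandas as pd

# Toy rows standing in for the wrangled Postgres table.
df = pd.DataFrame({
    "impressions": [120000, 3500, 640000, 15000, 2100],
    "frequency":   [48, 3, 96, 12, 2],   # hours on the country page
})

# Log transform tames the wide variability and right skew of raw counts.
df["virality"] = np.log(df["impressions"] * df["frequency"])

# Class boundary: one standard deviation below the mean virality score.
threshold = df["virality"].mean() - df["virality"].std()
df["viral"] = (df["virality"] >= threshold).astype(int)   # 1 = Viral, 0 = Non-Viral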

Page 21: Georgetown Data Science - Team BuzzFeed

FEATURE ENGINEERING

ATTEMPTED OBVIOUS ONES

STOP WORDS OR COMMON WORDS COULD HAVE HELPED

➔ Title Length: Fairly constant and not a good indicator.

➔ Lists vs. Non-Lists: Contrary to our hypothesis, no such correlation in the data.

➔ Words in tags: To retain the context in the tags, we used individual phrases, as provided (simulated n-grams), and individual words (1-grams); see the sketch after this list.

➔ Low Document Frequency: No positive impact on predictability.

➔ High Document Frequency: Negative impact on the model’s predictability.

➔ Stop Words OR Common Words: Did not attempt due to time constraints.
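
A minimal sketch of the two tag featurizations described in the “Words in tags” item, using illustrative tags:

tags = ["game of thrones", "season 6", "quiz"]

# Phrase features: keep each tag whole (underscores simulate n-grams).
phrase_features = [t.replace(" ", "_") for t in tags]
# -> ['game_of_thrones', 'season_6', 'quiz']

# 1-gram features: split every tag into individual words.
word_features = [w for t in tags for w in t.split()]
# -> ['game', 'of', 'thrones', 'season', '6', 'quiz']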

Page 22: Georgetown Data Science - Team BuzzFeed

MODELING WITH SCIKIT-LEARN

Multinomial Naive Bayes and Logistic Regression, as follows:

Feature Selection: For each instance, we used all the text contained in Title, Description, Category, Primary Keywords and Phrases in Tags.

Document Frequency: Maximum and minimum document frequency, in increments of 10%... no impact.

vect = CountVectorizer()
Number of features output in vect: 70,000+

Model Selection: For both models, we did 12-fold cross-validation as follows:

skf = StratifiedKFold(y, n_folds=12, shuffle=True)
for train, test in skf: ...

Another cross-validation for both Multinomial NB and Logistic Regression, as follows:

cross_val_score(pipe, X, y, cv=12, scoring='accuracy').mean()
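
The StratifiedKFold call above uses the pre-0.18 scikit-learn signature. A minimal runnable sketch of the same setup against the current API, with toy rows and a 2-fold split standing in for the full corpus and 12 folds:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Each instance is the concatenated text of title, description, category,
# primary keywords and tag phrases; toy rows shown here.
X = [
    "quiz which game of thrones character are you",
    "news canada politics update from ottawa",
    "27 summer recipes you need to try food",
    "test how british are you trivia quiz",
    "bollywood stories khan desi india news",
    "animals lol funny food canada social",
]
y = [1, 0, 1, 0, 1, 0]   # 1 = Viral, 0 = Non-Viral

# n_splits=2 for the toy data; the slide's runs used 12 folds.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

for model in (MultinomialNB(), LogisticRegression(max_iter=1000)):
    pipe = make_pipeline(CountVectorizer(), model)
    score = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy").mean()
    print(type(model).__name__, round(score, 3))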

Page 23: Georgetown Data Science - Team BuzzFeed

MODEL RESULTS

Metric      Multinomial NB   Logistic Regression
Accuracy    0.839620         0.865165
AUC         0.699976         0.677515
F1          0.904905         0.922518
Precision   0.908419         0.898182
Recall      0.901438         0.948231

CROSS-VALIDATION ACCURACY SCORES

Multinomial Naive Bayes: 0.840168

Logistic Regression: 0.864645

Page 24: Georgetown Data Science - Team BuzzFeed

TOOLS

Page 25: Georgetown Data Science - Team BuzzFeed

NLTK

Word Cloud

Page 26: Georgetown Data Science - Team BuzzFeed

WHAT COULD BE DONE BETTER?

Page 27: Georgetown Data Science - Team BuzzFeed

ROOM FOR IMPROVEMENT

• BuzzFeed’s public API does not tell the whole story -- include data points from other sources.

• Limiting our focus to English-speaking countries limited our ability to see the impact of cultural context outside of the US content-engine’s orbit.

• With more time, we might apply a better methodology to the Title Generator.

• With more time, we might stand up the user-facing web application and capture user data to improve the model and generate better recommendations.

Page 28: Georgetown Data Science - Team BuzzFeed

QUESTIONS?