60
social data science Text as Data Sebastian Barfort August 18, 2016 University of Copenhagen Department of Economics 1/60

Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

social data scienceText as Data

Sebastian BarfortAugust 18, 2016

University of CopenhagenDepartment of Economics

1/60

Page 2: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

text all over the place

The widespread use of the Internet has led to anastronomical amount of digitized textual dataaccumulating every second through email, websites, andsocial media. The analysis of blog sites and social mediaposts can give new insights into human behaviors andopinions. At the same time, large-scale efforts to digitizepreviously published articles, books, and governmentdocuments have been underway, providing excitingopportunities for social scientists.

Imai (2016).

We need to learn how to think about and work with these kinds ofnew data

2/60

Page 3: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

resources

Names (selective): Will Lowe, Justin Grimmer, Kenneth Benoit,Margaret E. Roberts, Sven-Oliver Proksch, Suresh Naidy

R packages: tm, quanteda, stm, stringr, tidytext

3/60

Page 4: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

is your favorite professor biased?

Jelveh, Zubin, Bruce Kogut, and Suresh Naidu. “Detecting LatentIdeology in Expert Text: Evidence From Academic Papers inEconomics”. EMNLP. 2014.

Previous work on extracting ideology from text has focusedon domains where expression of political views is expected,but it’s unclear if current technology can work in domainswhere displays of ideology are considered inappropriate.We present a supervised ensemble n-gram model forideology extraction with topic adjustments and apply it toone such domain: research papers written by academiceconomists. We show economists’ political leanings can becorrectly predicted, that our predictions generalize to newdomains, and that they correlate with public policy-relevantresearch findings. We also present evidence thatunsupervised models can underperform in domains whereideological expression is discouraged. 4/60

Page 5: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

validation

We emphasize that the complexity of language implies thatautomated content analysis methods will never replacecareful and close reading of texts. Rather, the methods thatwe profile here are best thought of as amplifying andaugmenting careful reading and thoughtful analysis.Further, automated content methods are incorrect modelsof language. This means that the performance of any onemethod on a new data set cannot be guaranteed, andtherefore validation is essential when applying automatedcontent methods.

Grimmer and Stewart (2013).

5/60

Page 6: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

6/60

Page 7: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

bag of words

Bag of Words: The ordering and grammar of words does not informthe analysis.

Easy to construct sample sentences where word orderfundamentally changes the nature of the sentence, but for mostcommon tasks like measuring sentiment, topic modeling, etc. theydo not seem to matter (Grimmer and Stewart 2013)

7/60

Page 8: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

pre-processing

Stemming: Dimensionality reduction. Removes the ends of words toreduce the total number of unique words in the data.

Ex: family, families, families’, etc. all become famili.Stop words: Words that do not convey meaning but primarily servegrammatical purposes.

Uncommon Words: Typically, words that appear very often or veryrarely are excluded.

Also typically discard punctuation (although not always!),capitalization, etc.

8/60

Page 9: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

Classifying Documents into KnownCategories

9/60

Page 10: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

introduction

Inferring and assigning text to categories is perhaps most commonuse of contant analysis in the social sciences

Ex: Classifying ads as positive/negative, is legislation aboutenviorenment, etc.

Two broad approaches:

Dictionary Methods: Use relative frequency of key words to measurepresence of category in a given text

Supervised Learning: Build on and extend familiar manual codingtasks using algorithms

10/60

Page 11: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

dictionary methods

Perhaps the most simple and intuitive automated text classificationmethod

Use the rate at which key words appear in a text to classifydocuments into categories or to measure extent to which documentsbelong to particular category

Dictionary: a list of words that classify a particular collection ofwords

Note: For dictionary methods to work well, the scores attached toeach words must closely align with how the words are used in aparticular context

Dictionaries are rarely validated

11/60

Page 12: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

supervised learning methods

Dictionary methods require that we are able to apriori identify wordsthat separate classes

This can be wrong and/or inefficient

Supervised learning models are designed to automate the handcoding of documents

Supervised learning models: Human coders categorize a set ofdocuments by hand. The algorithm then “learns” how to sort thedocuments into categories using these training data and apply itspredictions to new unlabeled texts

12/60

Page 13: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

approach

1. Construct a training set2. Apply the supervised learning method using cross-validation3. Decide on “best” model and classify the remaining documents

13/60

Page 14: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

Classification with Unknown Categories

14/60

Page 15: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

introduction

Supervised and dictionary methods assume a well-defined set ofcategories

Often, this set of categories is difficult to derive beforehand

Is must be discovered from the text itself

Unsupervised Learning: Try to learn underlying features of textwithout explicitly imposing categories of interest

1. Estimate set of categories2. Assign documents (or part of documents) to those categories

Often: topic-models

15/60

Page 16: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

Measuring Latent Features in Texts

16/60

Page 17: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

introduction

Can we locate actors (politicians, newspapers, researchers) in anideological space using text data?

Assumption: Ideological dominance. Actors’ ideological preferencesdetermine what they discuss in texts.

Wordscores: Supervised learning approach. Special case ofdictionary method.

Wordfish: Unsupervised learning approach. Discover words thatdistinguish locations on a policy scale.

17/60

Page 18: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

wordscore

1. Select reference texts that define the position in the policyspace (e.g. a conservative and liberal politician)

2. Use training data to determine relative frequency of words.Creates a measure of how well various words separate thecategories

3. Use these word scores to scale remaining texts.

Disadvantage: Conflates policy dominance with stylistic differences

18/60

Page 19: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

Predicting Yelp Reviews

19/60

Page 20: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

introduction

David Robinson: Does sentiment analysis work? A tidy analysis ofYelp reviews

Sentiment analysis is often used by companies to quantify generalsocial media opinion (for example, using tweets about severalbrands to compare customer satisfaction).

One of the simplest and most common sentiment analysis methodsis to classify words as “positive” or “negative”, then to average thevalues of each word to categorize the entire document.

Can we use this approach to predict Yelp reviews?

20/60

Page 21: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

data

Can be downloaded from here

library(”readr”)library(”dplyr”)

infile = ”../nopub/yelp_academic_dataset_review.json”review_lines = read_lines(infile,

n_max = 50000,progress = FALSE)

21/60

Page 22: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

from json to data frame

library(”stringr”)library(”jsonlite”)

reviews_combined = str_c(”[”,str_c(review_lines,

collapse = ”, ”),”]”)

reviews = fromJSON(reviews_combined) %>%flatten() %>%tbl_df()

22/60

Page 23: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

tidy text data

Right now, there is one row for each review.

Remember bag of words assumption: predictors are at the word, notsentence level

-> We need to tidy the data

23/60

Page 24: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

library(”tidytext”)review_words = reviews %>%select(review_id, business_id, stars, text) %>%unnest_tokens(word, text)

review_words %>% dim

## [1] 5930037 4

24/60

Page 25: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

remove stop words

review_words = review_words %>%filter(!word %in% stop_words$word) %>%filter(str_detect(word, ”^[a-z’]+$”))

25/60

Page 26: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

review_id business_id stars word

Ya85v4eqdd6k9Od8HbQjyA 5UmKMjUEUNdYWqANhGckJw 4 hoagieYa85v4eqdd6k9Od8HbQjyA 5UmKMjUEUNdYWqANhGckJw 4 institutionYa85v4eqdd6k9Od8HbQjyA 5UmKMjUEUNdYWqANhGckJw 4 walkingYa85v4eqdd6k9Od8HbQjyA 5UmKMjUEUNdYWqANhGckJw 4 throwbackYa85v4eqdd6k9Od8HbQjyA 5UmKMjUEUNdYWqANhGckJw 4 ago

26/60

Page 27: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

afinn

AFINN = sentiments %>%filter(lexicon == ”AFINN”) %>%select(word, afinn_score = score)

27/60

Page 28: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

word afinn_score

abandon -2abandoned -2abandons -2abducted -2abduction -2

28/60

Page 29: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

reviews_sentiment = review_words %>%inner_join(AFINN, by = ”word”) %>%group_by(review_id, stars) %>%summarize(sentiment = mean(afinn_score))

29/60

Page 30: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

review_id stars sentiment

__-r0eC3hZlaejvuliC8zQ 5 4.0000000__77nP3Nf1wsGz5HPs2hdw 5 1.6000000__DK9Vsmyoo0zJQhIl5cbg 1 -2.1000000__ELCJ0wzDM2QNRfVUq26Q 5 3.5000000__esH_kgJZeS8k3i6HaG7Q 5 0.2142857

30/60

Page 31: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

df.review = reviews_sentiment %>%group_by(stars) %>%summarise(m.sentiment = mean(sentiment))

31/60

Page 32: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

1

2

3

4

5

1

2

3

4

5

0 1Sentiment Score (mean)

Num

ber

of S

tars

on

Yelp

32/60

Page 33: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

word frequency

review_words_counted = review_words %>%count(review_id, business_id, stars, word) %>%ungroup()

33/60

Page 34: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

review_id business_id stars word n

__-r0eC3hZlaejvuliC8zQ qemCNgjeYGcFsRwxW9x4xw 5 amazing 1__-r0eC3hZlaejvuliC8zQ qemCNgjeYGcFsRwxW9x4xw 5 attentive 1__-r0eC3hZlaejvuliC8zQ qemCNgjeYGcFsRwxW9x4xw 5 breakfast 1__-r0eC3hZlaejvuliC8zQ qemCNgjeYGcFsRwxW9x4xw 5 cheese 1__-r0eC3hZlaejvuliC8zQ qemCNgjeYGcFsRwxW9x4xw 5 chili 1

34/60

Page 35: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

word_summaries = review_words_counted %>%group_by(word) %>%summarize(businesses = n_distinct(business_id),

reviews = n(),uses = sum(n),average_stars = mean(stars)) %>%

ungroup() %>%arrange(reviews)

35/60

Page 36: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

0

5000

10000

15000

20000

10 1000reviews

coun

t

36/60

Page 37: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

word_summaries_filtered = word_summaries %>%filter(reviews >= 200, businesses >= 10)

37/60

Page 38: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

word businesses reviews uses average_stars

lives 162 200 205 3.785000regret 162 200 206 3.915000bloody 80 201 246 3.621891courses 84 201 242 3.800995crowds 130 201 208 3.766169

38/60

Page 39: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

0

50

100

150

200

1000 10000reviews

coun

t

39/60

Page 40: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

positive words

df.0 = word_summaries_filtered %>%arrange(-average_stars)

word businesses reviews uses average_stars

gem 272 504 509 4.482143superb 171 250 253 4.460000incredible 268 519 554 4.458574amazing 927 3696 4240 4.391775highly 736 1660 1729 4.388554

40/60

Page 41: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

negative words

df.1 = word_summaries_filtered %>%arrange(average_stars)

word businesses reviews uses average_stars

refused 178 205 226 1.604878worst 667 1215 1321 1.650206disgusting 252 321 347 1.735202rude 576 1005 1162 1.833831horrible 582 938 1043 1.835821

41/60

Page 42: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

gemsuperb amazinghighlyknowledgeablegardens perfectfavoritespersonable deliciousbeautiful

refused worstdisgusting

rudeawful terribletasteless

poorly waste poorchargedstale dirty managerattitudeexcuse

foodservice

nice

friendly

staff

love

peoplerestaurant

menu

delicious

bad

fresh

amazing

2

3

4

1000 10000# of reviews

Ave

rage

Sta

rs

42/60

Page 43: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

statistical analysis

1

2

3

4

5

−2.5 0.0 2.5 5.0sentiment

star

s

43/60

Page 44: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

cross validation

library(”purrr”)library(”modelr”)gen_crossv = function(pol,

data = reviews_sentiment){data %>%

crossv_mc(200) %>%mutate(mod = map(train,

~ lm(stars ~ poly(sentiment, pol),

data = .)),rmse.test = map2_dbl(mod, test, rmse),rmse.train = map2_dbl(mod, train, rmse)

)}

44/60

Page 45: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

set.seed(3000)df.cv = 1:8 %>%map_df(gen_crossv, .id = ”degree”)

45/60

Page 46: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

46/60

Page 47: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

Text Analysis of Donald Trump’s Tweets

47/60

Page 48: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

file = paste0(”http://varianceexplained.org/”,”files/”,”trump_tweets_df.rda”)

load(url(file))

48/60

Page 49: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

question

David Robinson: Text analysis of Trump’s tweets confirms he writesonly the (angrier) Android half

Who writes Donald Trump’s tweets?

They are written from two different devices: an iPhone and anAndroid

Can we examine quantitatively whether a tweet is written by DonalTrump himself or from someone on his staff?

49/60

Page 50: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

library(”tidyr”)

tweets = trump_tweets_df %>%select(id, statusSource, text, created) %>%extract(statusSource,

”source”, ”Twitter for (.*?)<”) %>%filter(source %in% c(”iPhone”, ”Android”))

50/60

Page 51: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

id source text created

762669882571980801 Android My economic policy speech will be carried live at 12:15 P.M. Enjoy! 2016-08-08 15:20:44762641595439190016 iPhone Join me in Fayetteville, North Carolina tomorrow evening at 6pm. Tickets now available at: https://t.co/Z80d4MYIg8 2016-08-08 13:28:20762439658911338496 iPhone #ICYMI: “Will Media Apologize to Trump?” https://t.co/ia7rKBmioA 2016-08-08 00:05:54762425371874557952 Android Michael Morell, the lightweight former Acting Director of C.I.A., and a man who has made serious bad calls, is a total Clinton flunky! 2016-08-07 23:09:08762400869858115588 Android The media is going crazy. They totally distort so many things on purpose. Crimea, nuclear, “the baby” and so much more. Very dishonest! 2016-08-07 21:31:46

51/60

Page 52: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

library(”tidytext”)library(”stringr”)

reg = ”([^A-Za-z\\d#@’]|’(?![A-Za-z\\d#@]))”tweet_words = tweets %>%filter(!str_detect(text, ’^”’)) %>%mutate(text =

str_replace_all(text,”https://t.co/[A-Za-z\\d]+|&amp;”, ””)) %>%unnest_tokens(word, text, token = ”regex”,

pattern = reg) %>%filter(!word %in% stop_words$word,

str_detect(word, ”[a-z]”))

52/60

Page 53: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

id source created word

676494179216805888 iPhone 2015-12-14 20:09:15 record676494179216805888 iPhone 2015-12-14 20:09:15 health676494179216805888 iPhone 2015-12-14 20:09:15 #makeamericagreatagain676494179216805888 iPhone 2015-12-14 20:09:15 #trump2016676509769562251264 iPhone 2015-12-14 21:11:12 accolade

53/60

Page 54: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

trump

bad

cruz

america

people

#makeamericagreatagain

clinton

crooked

#trump2016

hillary

0 50 100 150n

54/60

Page 55: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

android_iphone_ratios = tweet_words %>%count(word, source) %>%filter(sum(n) >= 5) %>%spread(source, n, fill = 0) %>%ungroup() %>%mutate_each(funs((. + 1) / sum(. + 1)), -word) %>%mutate(logratio = log2(Android / iPhone))

55/60

Page 56: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

#makeamericagreatagain#trump2016

join#americafirst

#votetrump#imwithyou

#crookedhillary#trumppence16

7pmtomorrow

agobrexit

jokemails

strongtalkingspentweakcrazybadly

−5.0 −2.5 0.0 2.5

56/60

Page 57: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

sentiment analysis

nrc = sentiments %>%filter(lexicon == ”nrc”) %>%select(word, sentiment)

57/60

Page 58: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

word sentiment

abacus trustabandon fearabandon negativeabandon sadnessabandoned anger

58/60

Page 59: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

sadness fear anger disgust

surprise anticipation trust joy

0

1

2

3

0

1

2

3

−1

0

1

2

3

0

1

2

0

1

2

−4

−2

0

2

−3−2−1

012

−3

−2

−1

0

1

badl

ycr

azy

lost

wor

sedi

sast

er lieba

dki

lling

unfa

irto

ugh

craz

yfig

htw

orse

disa

ster

bad

killi

ngpo

lice

cour

tch

ange

terr

oris

t

craz

yan

gry

fight hi

tdi

sast

er lieba

dki

lling

unfa

ircr

ime

angr

ym

ess

disa

ster lie

bad

john

unfa

irdi

rty

liar

win

ning

chan

cedi

sast

erde

alho

pevo

teex

citin

gsh

ottr

ump

won

derf

ulte

rror

ist

leav

ew

inni

ng

time

deal

calls

cour

tw

atch

hope

vote top

win

ning

tom

orro

w

sena

teec

onom

ypo

lice

deal

syst

emtr

ade

calls

expl

ain

hono

rre

port

ersa

fe

deal

spec

ial

hope

vote

won

derf

ullo

ve pay

true

win

ning

safe

And

roid

/ iP

hone

log

ratio

Android

iPhone

59/60

Page 60: Social Data Science - Text as Data · resources Names(selective):WillLowe,JustinGrimmer,KennethBenoit, MargaretE.Roberts,Sven-OliverProksch,SureshNaidy Rpackages:tm,quanteda,stm,stringr,tidytext

your turn

Continue working with these data in groups

Think of interesting patterns you can explore in the data

What about the timing of tweets? Could everything be included inone large model?

60/60