Quantitative Text Analysis using R and quanteda
Kenneth Benoit
University of Münster (27–28 June 2019)
Contents
1. Introductions
2. Fundamentals
3. Getting Started
4. Working with a corpus
5. Keywords-in-context
6. Importing text files
7. Tokenization
8. Creating a document-feature matrix
9. Dictionary (sentiment) Analysis
10. Textual Statistics
11. Classification
12. Topic modelling
13. Data wrangling and visualisation
Purpose of the workshop
This workshop is designed to
introduce the tools for quantitative text analysis using R
focus specifically on the quanteda package
demonstrate several major categories of analysis
answer any questions about text analysis that you might have
Whom this workshop is for
Those familiar with text analysis methods but not in R/quanteda
Those who have tried R/quanteda but want to become more proficient
The merely "text-curious"
No real pre-requisites: We're all here to learn
About me
Ken Benoit (http://kenbenoit.net)
Professor of Computational Social Science, Department of Methodology, London School of Economics and Political Science
Creator and maintainer of the quanteda R package and related tools
Founder of the Quanteda Initiative, a non-profit company aimed at advancing the development and dissemination of open-source tools for text analytics
My research: Political science (party competition), measurement, text analysis
Example from my research
Example from my research (cont.)
Your turn!
1. Name?
2. Department, degree?
3. Research interests?
4. Previous experience with text analysis / R?
5. Why are you interested in the workshop?
Assumptions, Concepts, and Examples

Workflow, demystified
Basic concepts and terminology
Texts: Organized into documents
Corpus: Collection of texts, often with associated document-level metadata, what I will call "document variables" or docvars
Stems: words with suffixes removed (using a set of rules)
Lemmas: canonical word form

Word   win  winning  wins  won
Stem   win  win      win   won
Lemma  win  win      win   win
Basic concepts and terminology (cont.)
Stopwords: words that are designated for exclusion from any analysis of text
Parts of speech: linguistic markers indicating the general category of a word's linguistic property, e.g. noun, verb, adjective, etc.
Named entities: a real-world object, such as persons, locations, organizations, products, etc., that can be denoted with a proper name, often a phrase, e.g. "University of Bergen" or "United Kingdom"
Multi-word expressions: sequences of words denoting a single concept (and that would often be a single word in German), e.g. value added tax (in German: Mehrwertsteuer)
Types and tokens
tokens: a sequence of characters that are grouped together as a useful semantic unit
words
could also include punctuation characters or symbols
stems or lemmas
multi-word expressions
named entities
usually, but not always, delimited by spaces
type: a unique token
toks <- tokens(c("A corpus is a set of documents.",
                 "This is the second document in the corpus"),
               remove_punct = TRUE)
toks

## tokens from 2 documents.
## text1 :
## [1] "A"         "corpus"    "is"        "a"         "set"       "of"
## [7] "documents"
##
## text2 :
## [1] "This"     "is"       "the"      "second"   "document" "in"
## [7] "the"      "corpus"

tokens_wordstem(toks)

## tokens from 2 documents.
## text1 :
## [1] "A"        "corpus"   "is"       "a"        "set"      "of"
## [7] "document"
##
## text2 :
## [1] "This"     "is"       "the"      "second"   "document" "in"
## [7] "the"      "corpus"

tokens_wordstem(toks) %>%
  tokens_remove(stopwords("english"))

## tokens from 2 documents.
## text1 :
## [1] "corpus"   "set"      "document"
##
## text2 :
## [1] "second"   "document" "corpus"

tokens_wordstem(toks) %>%
  tokens_remove(stopwords("english")) %>%
  types()

## [1] "corpus"   "set"      "document" "second"
Typical workflow steps
1. Lowercase
"a corpus is a set of documents."
"this is the second document in the corpus."
2. Remove stopwords and punctuation
"a corpus is a set of documents."
"this is the second document in the corpus."
3. Stem
"corpus set documents"
"second document corpus"
4. Tokenize
[corpus, set, document]
[second, document, corpus]
5. Create a document-feature matrix
a "bag-of-words" conversion of documents into a matrix that counts the features (types) by document
Document-feature matrix
Document 1: "A corpus is a set of documents."
Document 2: "This is the second document in the corpus."

             corpus  set  document  second
Document 1        1    1         1       0
Document 2        1    0         1       1
...
Document n        0    1         1       0
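The counting step behind this table can be sketched in base R. This is purely an illustration of the bag-of-words idea using two toy documents and whitespace tokenization, not how quanteda builds its sparse dfm objects:

```r
# Two toy documents, named so the rows of the matrix are labelled
docs <- c(doc1 = "a corpus is a set of documents",
          doc2 = "this is the second document in the corpus")

# Tokenize by whitespace and collect the vocabulary (the feature set)
toks <- strsplit(tolower(docs), "\\s+")
feats <- sort(unique(unlist(toks)))

# Count each feature per document: rows = documents, columns = features
dfm_manual <- t(sapply(toks, function(x) table(factor(x, levels = feats))))
dfm_manual[, c("corpus", "document", "second")]
```

Note that "documents" and "document" are counted as distinct features here; that is exactly the problem stemming addresses in step 3 above.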
Getting started

Installing quanteda
Install the quanteda package from CRAN:

install.packages("quanteda")

You should also install the readtext package:

install.packages("readtext")

Afterwards, load both packages:

library("quanteda")
library("readtext")
packageVersion("quanteda")

## [1] '1.4.5'

packageVersion("readtext")

## [1] '0.75'
Reproduce example in quanteda
Create a text corpus

library(quanteda)

# Create text corpus
corp <- corpus(c("A corpus is a set of documents.",
                 "This is the second document in the corpus."))

Exercise: Create this corpus and get a summary using summary(corp).

summary(corp)

## Corpus consisting of 2 documents:
##
##   Text Types Tokens Sentences
##  text1     8      8         1
##  text2     8      9         1
##
## Source: /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/workshops/modules/* on x86_64 by kbenoit
## Created: Tue Jun 25 20:54:20 2019
## Notes:
Reproduce example in quanteda
Create a document-feature matrix

dfm(corp, remove_punct = TRUE, remove = stopwords("en"), stem = TRUE)

## Document-feature matrix of: 2 documents, 4 features (25.0% sparse).
## 2 x 4 sparse Matrix of class "dfm"
##        features
## docs    corpus set document second
##   text1      1   1        1      0
##   text2      1   0        1      1

Question: How does the dfm change when we change the preprocessing steps?
The quanteda Infrastructure

quanteda: Quantitative Analysis of Textual Data
6 years of development, 28 releases
8,300 commits, 270,000 downloads
Design of the package
consistent grammar
flexible for power users, simple for beginners
analytic transparency and reproducibility
compatibility with other packages
pipelined workflow using magrittr's %>%
Quanteda Initiative
UK non-profit organization devoted to the promotion of open-source text analysis software
User, technical and development support
Teaching and workshops
https://quanteda.org
Additional packages
For some exercises, we will need quanteda.corpora:

install.packages("devtools")
devtools::install_github("quanteda/quanteda.corpora")

For POS tagging, entity recognition, and dependency parsing, you should install spacyr (not covered extensively).

install.packages("spacyr")

Installation instructions: http://spacyr.quanteda.io
Course resources
Documentation:
https://quanteda.io
https://readtext.quanteda.io
https://spacyr.quanteda.org
Tutorials: https://tutorials.quanteda.io
Cheatsheet: https://www.rstudio.com/resources/cheatsheets/

Kenneth Benoit, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. 2018. "quanteda: An R Package for the Quantitative Analysis of Textual Data." Journal of Open Source Software 3(30): 774.
Recap: Workflow
1. Corpus
Saves character strings and variables in a data frame
Combines texts with document-level variables
2. Tokens
Stores tokens in a list of vectors
Positional (string-of-words) analysis is performed using textstat_collocations(), tokens_ngrams(), and tokens_select(), or fcm() with a window option
3. Document-feature matrix (DFM)
Represents frequencies of features in documents in a matrix
The most efficient structure, but it does not have information on positions of words
Non-positional (bag-of-words) analyses are performed using many of the textstat_* and textmodel_* functions
Main function classes
Text corpus: corpus()
Tokenization: tokens()
Document-feature matrix: dfm()
Feature co-occurrence matrix: fcm()
Text statistics: textstat_*()
Text models: textmodel_*()
Plots: textplot_*()
Naming conventions and useful shortcuts
Recommended naming conventions for objects
In the quanteda style guide, we recommend naming objects consistently, e.g.:
Corpus: corp_...
Tokens object: toks_...
Dfm: dfmat_...
Text model: tmod_...

Useful RStudio keyboard shortcuts
Insert pipe operator (%>%): Shift+Cmd/Ctrl+M
Insert assignment operator (<-): Alt+-
More shortcuts for RStudio can be found here
Clarification
The following expressions result in the same output:

data_corpus_inaugural %>%
  tokens()

tokens(data_corpus_inaugural)
Working with corpus objects

Corpus functions in quanteda
corpus()
corpus_subset()
corpus_reshape()
corpus_segment()
corpus_sample()

Corpora in quanteda
data_corpus_inaugural
data_corpus_irishbudget2010
Additional corpora in the quanteda.corpora package
Using magrittr's pipe

data_corpus_inaugural %>%
  corpus_subset(President == "Obama") %>%
  ndoc()

## [1] 2

data_corpus_inaugural %>%
  corpus_subset(President == "Obama") %>%
  corpus_reshape(to = "sentences") %>%
  ndoc()

## [1] 198

Access number of types and tokens of a corpus

ntype(data_corpus_inaugural) %>% head()

## 1789-Washington 1793-Washington      1797-Adams  1801-Jefferson
##             625              96             826             717
##  1805-Jefferson    1809-Madison
##             804             535

ntoken(data_corpus_inaugural) %>% head()

## 1789-Washington 1793-Washington      1797-Adams  1801-Jefferson
##            1538             147            2578            1927
##  1805-Jefferson    1809-Madison
##            2381            1263
Overview of document-level variables

head(docvars(data_corpus_inaugural))

##                 Year  President FirstName
## 1789-Washington 1789 Washington    George
## 1793-Washington 1793 Washington    George
## 1797-Adams      1797      Adams      John
## 1801-Jefferson  1801  Jefferson    Thomas
## 1805-Jefferson  1805  Jefferson    Thomas
## 1809-Madison    1809    Madison     James
Exercise
1. Based on data_corpus_inaugural, create an object data_corpus_postwar (speeches since 1945).
2. Which speech has the most tokens? Which speech has the most types?
Note: You can find the documentation and examples using ? followed by the name of the function.
Solution

data_corpus_postwar <- data_corpus_inaugural %>%
  corpus_subset(Year > 1945)

# number of tokens per speech
data_corpus_postwar %>%
  ntoken() %>%
  sort(decreasing = TRUE) %>%
  head()

##     1985-Reagan     1981-Reagan 1953-Eisenhower      2009-Obama
##            2921            2790            2757            2711
##       1989-Bush     1949-Truman
##            2681            2513

# number of types per speech
data_corpus_postwar %>%
  ntype() %>%
  sort(decreasing = TRUE) %>%
  head()

##      2009-Obama     1985-Reagan     1981-Reagan 1953-Eisenhower
##             938             925             902             900
##      2013-Obama       1989-Bush
##             814             795
Adding document-level variables

# new docvar: Order
docvars(data_corpus_inaugural, "Order") <- 1:ndoc(data_corpus_inaugural)
head(docvars(data_corpus_inaugural, "Order"))

## [1] 1 2 3 4 5 6

# new docvar: PresidentFull
docvars(data_corpus_inaugural, "PresidentFull") <-
  paste(docvars(data_corpus_inaugural, "FirstName"),
        docvars(data_corpus_inaugural, "President"), sep = " ")

head(docvars(data_corpus_inaugural))

##                 Year  President FirstName Order     PresidentFull
## 1789-Washington 1789 Washington    George     1 George Washington
## 1793-Washington 1793 Washington    George     2 George Washington
## 1797-Adams      1797      Adams      John     3        John Adams
## 1801-Jefferson  1801  Jefferson    Thomas     4  Thomas Jefferson
## 1805-Jefferson  1805  Jefferson    Thomas     5  Thomas Jefferson
## 1809-Madison    1809    Madison     James     6     James Madison
Exercise
1. Use data_corpus_inaugural and reshape the entire corpus to the level of sentences and store the new corpus. What is the number of documents of the reshaped corpus?
2. Add a document-level variable to the reshaped corpus that counts the tokens per sentence.
3. Keep only sentences that are longer than 10 words.
4. Reshape the corpus back to the level of documents and store the corpus as data_corpus_inaugural_subset.
5. Optional: find a more efficient solution.
Solution

corp_sentences <- corpus_reshape(data_corpus_inaugural, to = "sentences")
docvars(corp_sentences, "number_tokens") <- ntoken(corp_sentences, remove_punct = TRUE)
ndoc(corp_sentences)

## [1] 5016

corp_sentence_subset <- corp_sentences %>%
  corpus_subset(number_tokens > 10)
ndoc(corp_sentence_subset)

## [1] 4267
Solution (cont.)

data_corpus_inaugural_subset <- corp_sentence_subset %>%
  corpus_reshape(to = "documents")

sum(ntoken(data_corpus_inaugural))

## [1] 149145

sum(ntoken(data_corpus_inaugural_subset))

## [1] 142617
Keywords-in-context

Keywords-in-context (kwic()) returns a list of one or more keywords and their immediate context.

kw_security <- kwic(data_corpus_inaugural, pattern = "security", window = 3)
head(kw_security, 3)

## [1789-Washington, 1497] government for the | security | of their union
##     [1813-Madison, 321]       seas and the | security | of an important
##     [1817-Monroe, 1610]      may form some | security | against these dangers

# Check number of occurrences
nrow(kw_security)

## [1] 65
KWIC with multiword expressions
Use phrase() to look up multiword expressions

kwic(data_corpus_inaugural, pattern = phrase("United States"), window = 3) %>%
  head(3)

##   [1789-Washington, 434:435]       people of the | United States | a Government instituted
##   [1789-Washington, 530:531]        those of the | United States | . Every step
##        [1797-Adams, 525:526] Constitution of the | United States | in a foreign
Exercise
1. In what context were "God" and "God bless" used in US presidential inaugural speeches after 1970?
2. Look up keywords-in-context for god, god bless (wrapping the latter in phrase()) in one kwic() call.
3. BONUS: Send the results of the kwic output to textplot_xray().
Solution

kw_god <- corpus_subset(data_corpus_inaugural, Year > 1970) %>%
  kwic("god", window = 3)

kw_god_bless <- corpus_subset(data_corpus_inaugural, Year > 1970) %>%
  kwic(phrase("god bless"), window = 3)

tail(kw_god_bless)

##   [2005-Bush, 2304:2305] freedom . May | God bless | you , and
##  [2009-Obama, 2699:2700]   Thank you . | God bless | you . And
##  [2009-Obama, 2704:2705]     you . And | God bless | the United States
##  [2013-Obama, 2303:2304]   Thank you . | God bless | you , and
##  [2017-Trump, 1652:1653]   Thank you , | God bless | you , and
##  [2017-Trump, 1657:1658]     you , and | God bless | America .

Solution (cont.)

textplot_xray(kw_god, kw_god_bless)
Setting up a project and importing text files

Setting up a project in RStudio
1. File -> New Project
2. Choose an existing folder or create a new folder
3. Create folders in the project, e.g.: code, data, output
4. Open the .Rproj file: no need to specify a working directory (setwd())!
Import multiple text files

# load the readtext package
library(readtext)

# data_dir is the location of sample files on your computer
data_dir <- system.file("extdata/", package = "readtext")

eu_data <- readtext(paste0(data_dir, "/txt/EU_manifestos/*.txt"),
                    docvarsfrom = "filenames",
                    docvarnames = c("unit", "context", "year", "language", "party"),
                    dvsep = "_",
                    encoding = "ISO-8859-1")

Structure of readtext data frame

head(eu_data)

## readtext object consisting of 6 documents and 5 docvars.
## # Description: df [6 x 7]
##   doc_id               text                unit  context  year language party
## * <chr>                <chr>               <chr> <chr>   <int> <chr>    <chr>
## 1 EU_euro_2004_de_PSE… "\"PES · PSE\"…"    EU    euro     2004 de       PSE
## 2 EU_euro_2004_de_V.t… "\"Gemeinsame\"…"   EU    euro     2004 de       V
## 3 EU_euro_2004_en_PSE… "\"PES · PSE\"…"    EU    euro     2004 en       PSE
## 4 EU_euro_2004_en_V.t… "\"Manifesto\n\"…"  EU    euro     2004 en       V
## 5 EU_euro_2004_es_PSE… "\"PES · PSE\"…"    EU    euro     2004 es       PSE
## 6 EU_euro_2004_es_V.t… "\"Manifesto\n\"…"  EU    euro     2004 es       V
Check encoding

file <- system.file("extdata/txt/EU_manifestos/EU_euro_2004_de_V.txt",
                    package = "readtext")
myreadtext <- readtext(file)
encoding(myreadtext)

## readtext object consisting of 1 document and 0 docvars.
## # Description: df [1 x 2]
##   doc_id                text
##   <chr>                 <chr>
## 1 EU_euro_2004_de_V.txt "\"Gemeinsame\"..."

## Probable encoding: ISO-8859-1
## (but note: detector often reports ISO-8859-1 when encoding is actually UTF-8.)
Exercise
1. Set up an RStudio project (.Rproj) in a new folder.
2. Download the ZIP file with inaugural speeches by German chancellors (https://tinyurl.com/corp-regierung).
3. Copy the folder into your project folder.
4. Import all text files using readtext(). (Hint: "name_of_folder/*" reads in all files)
5. Create a text corpus from this data frame.
Solution

library(readtext)
data_inaugural_ger <- readtext("data/inaugural_germany/*",
                               encoding = "UTF-8",
                               docvarsfrom = "filenames",
                               dvsep = "-",
                               docvarnames = c("Year", "Chancellor", "Party"))

data_corpus_inaugural_ger <- corpus(data_inaugural_ger)
summary(data_corpus_inaugural_ger, n = 4)

## Corpus consisting of 21 documents, showing 4 documents:
##
##                   Text Types Tokens Sentences Year Chancellor Party
##  1949-Adenauer-CDU.txt  2136   7414       292 1949   Adenauer   CDU
##  1953-Adenauer-CDU.txt  2630   9708       403 1953   Adenauer   CDU
##  1957-Adenauer-CDU.txt  2202   7608       296 1957   Adenauer   CDU
##  1961-Adenauer-CDU.txt  2484   8796       402 1961   Adenauer   CDU
##
## Source: /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/workshops/modules/* on x86_64 by kbenoit
## Created: Tue Jun 25 20:54:23 2019
## Notes:
Exercise
1. Create better docnames by pasting "Year" and "Chancellor".
2. Select speeches by Gerhard Schröder and Angela Merkel.
3. Check ?textplot_xray().
4. Choose words that might only appear in some of the speeches.
Solution

docnames(data_corpus_inaugural_ger) <- paste(docvars(data_corpus_inaugural_ger, "Year"),
                                             docvars(data_corpus_inaugural_ger, "Chancellor"),
                                             sep = ", ")

## alternative
docnames(data_corpus_inaugural_ger) <- paste(data_inaugural_ger$Year,
                                             data_inaugural_ger$Chancellor,
                                             sep = " ")

corp_schroeder_merkel <- data_corpus_inaugural_ger %>%
  corpus_subset(Chancellor %in% c("Merkel", "Schroeder"))

plot_xray <- textplot_xray(kwic(corp_schroeder_merkel, "Arbeits*"),
                           kwic(corp_schroeder_merkel, "Arbeitslos*"),
                           kwic(corp_schroeder_merkel, "Migration*"))

Solution (cont.)

plot_xray
Tokenization

Tokenization of a corpus

corp_immig <- corpus(data_char_ukimmig2010)
toks_immig <- tokens(corp_immig)
head(toks_immig[[1]], 20)

##  [1] "IMMIGRATION"  ":"            "AN"           "UNPARALLELED"
##  [5] "CRISIS"       "WHICH"        "ONLY"         "THE"
##  [9] "BNP"          "CAN"          "SOLVE"        "."
## [13] "-"            "At"           "current"      "immigration"
## [17] "and"          "birth"        "rates"        ","
Remove punctuation

toks_nopunct <- tokens(data_char_ukimmig2010, remove_punct = TRUE)
head(toks_nopunct[[1]], 20)

##  [1] "IMMIGRATION"  "AN"           "UNPARALLELED" "CRISIS"
##  [5] "WHICH"        "ONLY"         "THE"          "BNP"
##  [9] "CAN"          "SOLVE"        "At"           "current"
## [13] "immigration"  "and"          "birth"        "rates"
## [17] "indigenous"   "British"      "people"       "are"

Remove stopwords

toks_nostopw <- tokens_select(toks_immig, stopwords("en"), selection = "remove")
# Note: equivalent to
# toks_nostopw <- tokens_remove(toks_immig, stopwords("en"))
head(toks_nostopw[[1]], 20)

##  [1] "IMMIGRATION"  ":"            "UNPARALLELED" "CRISIS"
##  [5] "BNP"          "CAN"          "SOLVE"        "."
##  [9] "-"            "current"      "immigration"  "birth"
## [13] "rates"        ","            "indigenous"   "British"
## [17] "people"       "set"          "become"       "minority"
Customize stopword list
Stopwords for other languages: check the stopwords package
Remove a feature from the stopword list

"will" %in% stopwords("en")

## [1] TRUE

my_stopwords_en <- stopwords("en")[!stopwords("en") %in% c("will")]
"will" %in% my_stopwords_en

## [1] FALSE
head(toks_immig[[1]], 20)

##  [1] "IMMIGRATION"  ":"            "AN"           "UNPARALLELED"
##  [5] "CRISIS"       "WHICH"        "ONLY"         "THE"
##  [9] "BNP"          "CAN"          "SOLVE"        "."
## [13] "-"            "At"           "current"      "immigration"
## [17] "and"          "birth"        "rates"        ","

head(toks_nopunct[[1]], 20)

##  [1] "IMMIGRATION"  "AN"           "UNPARALLELED" "CRISIS"
##  [5] "WHICH"        "ONLY"         "THE"          "BNP"
##  [9] "CAN"          "SOLVE"        "At"           "current"
## [13] "immigration"  "and"          "birth"        "rates"
## [17] "indigenous"   "British"      "people"       "are"

head(toks_nostopw[[1]], 20)

##  [1] "IMMIGRATION"  ":"            "UNPARALLELED" "CRISIS"
##  [5] "BNP"          "CAN"          "SOLVE"        "."
##  [9] "-"            "current"      "immigration"  "birth"
## [13] "rates"        ","            "indigenous"   "British"
## [17] "people"       "set"          "become"       "minority"
Select certain terms

toks_immig_select <- toks_immig %>%
  tokens_select(pattern = c("immig*", "migra*"), padding = TRUE)
head(toks_immig_select[[1]], 20)

##  [1] "IMMIGRATION" ""            ""            ""            ""
##  [6] ""            ""            ""            ""            ""
## [11] ""            ""            ""            ""            ""
## [16] "immigration" ""            ""            ""            ""

toks_immig_no_pad <- toks_immig %>%
  tokens_select(pattern = c("immig*", "migra*"), padding = FALSE)
head(toks_immig_no_pad[[1]], 20)

##  [1] "IMMIGRATION" "immigration" "immigration" "immigrants"  "immigration"
##  [6] "immigration" "immigration" "immigrants"  "immigrant"   "immigrant"
## [11] "immigrants"  "immigrants"  "immigration" "Immigration" "immigration"
## [16] "immigrants"  "Immigration" "Immigration" "Immigration" "immigrants"

Select certain terms and their context

# specify number of surrounding words using window
toks_immig_window <- tokens_select(toks_immig, pattern = c('immig*', 'migra*'),
                                   padding = TRUE, window = 5)
head(toks_immig_window[[1]], 20)

##  [1] "IMMIGRATION"  ":"            "AN"           "UNPARALLELED"
##  [5] "CRISIS"       "WHICH"        ""             ""
##  [9] ""             ""             "SOLVE"        "."
## [13] "-"            "At"           "current"      "immigration"
## [17] "and"          "birth"        "rates"        ","
Compound multiword expressions

kw_multi <- kwic(data_char_ukimmig2010,
                 phrase(c('asylum seeker*', 'british citizen*')))
head(kw_multi, 5)

## [BNP, 1724:1725]       the honour and benefit of | British citizenship | has gone to people who
## [BNP, 1958:1959] all illegal immigrants and bogus | asylum seekers      | , including their dependents .
## [BNP, 2159:2160]         region concerned . An '  | asylum seeker       | ' who has crossed dozens
## [BNP, 2192:2193]        country . Because every ' | asylum seeker       | ' in Britain has crossed
## [BNP, 2218:2219]    there are currently no legal  | asylum seekers      | in Britain today . It

Compound multiword expressions
Preserve these expressions in bag-of-words analysis:

toks_comp <- tokens(data_char_ukimmig2010) %>%
  tokens_compound(phrase(c('asylum seeker*', 'british citizen*')))

kw_comp <- kwic(toks_comp, c('asylum_seeker*', 'british_citizen*'))
head(kw_comp, 4)

## [BNP, 1724]       the honour and benefit of | British_citizenship | has gone to people who
## [BNP, 1957] all illegal immigrants and bogus | asylum_seekers      | , including their dependents .
## [BNP, 2157]         region concerned . An '  | asylum_seeker       | ' who has crossed dozens
## [BNP, 2189]        country . Because every ' | asylum_seeker       | ' in Britain has crossed
Exercise
1. Tokenize data_corpus_irishbudget2010 and compound the following party names: fianna fáil, fine gael, and sinn féin.
2. Select only the three party names and a window of +-10 words.
3. How can we extract only the full sentences in which at least one of the parties is mentioned?
Solution

# 1. Tokenize `data_corpus_irishbudget2010` and compound party names
toks_ire <- data_corpus_irishbudget2010 %>%
  tokens() %>%
  tokens_compound(phrase(c("fianna fáil", "fine gael", "sinn féin")))

nrow(kwic(toks_ire, "fianna_fáil"))

## [1] 66

nrow(kwic(toks_ire, phrase("fianna fáil")))

## [1] 0

Solution (cont.)

# 2. Select only the three party names and the window of +-10 words
toks_ire_select <- toks_ire %>%
  tokens_keep(c("fianna_fáil", "fine_gael", "sinn_féin"), window = 10)

head(toks_ire_select[[2]], 15)

##  [1] "happening"   "today"       "."           "It"          "is"
##  [6] "happening"   ","           "however"     ","           "because"
## [11] "Fianna_Fáil" "failed"      "to"          "heed"        "the"

Note: To get the sentences that mention one of the parties, call corpus_reshape(to = "sentences") first, repeat the subsequent steps, and set a very large window (e.g., 1000).

# 3. Extract only full sentences in which
#    at least one party is mentioned
toks_ire_sentences <- data_corpus_irishbudget2010 %>%
  corpus_reshape(to = "sentences") %>%  # <-- important step!
  tokens() %>%
  tokens_compound(phrase(c("fianna fáil", "fine gael", "sinn féin"))) %>%
  tokens_keep(c("fianna_fáil", "fine_gael", "sinn_féin"), window = 1000)
N-grams

# Unigrams
tokens("insurgents killed in ongoing fighting")

## tokens from 1 document.
## text1 :
## [1] "insurgents" "killed"     "in"         "ongoing"    "fighting"

# Bigrams
tokens("insurgents killed in ongoing fighting") %>%
  tokens_ngrams(n = 2)

## tokens from 1 document.
## text1 :
## [1] "insurgents_killed" "killed_in"         "in_ongoing"
## [4] "ongoing_fighting"

Skipgrams

# Skipgrams
tokens("insurgents killed in ongoing fighting") %>%
  tokens_skipgrams(n = 2, skip = 0:1)

## tokens from 1 document.
## text1 :
## [1] "insurgents_killed" "insurgents_in"     "killed_in"
## [4] "killed_ongoing"    "in_ongoing"        "in_fighting"
## [7] "ongoing_fighting"
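The bigram step can be mimicked in base R for a single tokenized document. This is an illustrative sketch of what n = 2 does, not the tokens_ngrams() implementation:

```r
toks <- c("insurgents", "killed", "in", "ongoing", "fighting")

# Pair each token with its successor and join with "_"
bigrams <- paste(head(toks, -1), tail(toks, -1), sep = "_")
bigrams
```

A skipgram with skip = 1 would additionally pair each token with the token two positions ahead (e.g. "insurgents_in"), which is why the skipgram output above is a superset of the bigram output.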
Look up tokens from a dictionary

toks <- tokens(data_char_ukimmig2010)
dict <- dictionary(list(refugee = c('refugee*', 'asylum*'),
                        worker = c('worker*', 'employee*')))
print(dict)

## Dictionary object with 2 key entries.
## - [refugee]:
##   - refugee*, asylum*
## - [worker]:
##   - worker*, employee*

dict_toks <- tokens_lookup(toks, dictionary = dict)
head(dict_toks, 2)

## tokens from 2 documents.
## BNP :
## [1] "refugee" "worker"  "refugee" "refugee" "refugee" "refugee" "refugee"
## [8] "refugee" "refugee" "refugee" "refugee" "refugee" "refugee" "worker"
##
## Coalition :
## [1] "refugee"

The transition from tokens() to dfm()

dfm(dict_toks)

## Document-feature matrix of: 9 documents, 2 features (38.9% sparse).
## 9 x 2 sparse Matrix of class "dfm"
##               features
## docs           refugee worker
##   BNP               12      2
##   Coalition          1      0
##   Conservative       0      0
##   Greens             4      1
##   Labour             4      4
##   LibDem             6      0
##   PC                 3      0
##   SNP                1      0
##   UKIP               6      0
Summary of tokens functions
tokens()
tokens_tolower() / tokens_toupper()
tokens_wordstem()
tokens_compound()
tokens_lookup()
tokens_ngrams()
tokens_skipgrams()
tokens_select() / tokens_remove() / tokens_keep()
tokens_replace()
tokens_sample()
tokens_subset()
Remember to use ? to read the manual and examples for each function.
Exercise
1. Tokenize data_corpus_irishbudget2010.
2. Convert the tokens object: remove punctuation, change to lowercase, remove stopwords, and stem.
3. Get the number of types and tokens per speech.

Solution

toks_ire <- data_corpus_irishbudget2010 %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  tokens_tolower() %>%
  tokens_wordstem()

ire_ntype <- ntype(toks_ire)
ire_ntoken <- ntoken(toks_ire)
Document-feature matrix

Exercise
1. Create a document-feature matrix from the tokens object above.
2. Get the 50 most frequent terms using topfeatures().

Solution

toks_ire <- tokens(data_corpus_irishbudget2010)
dfmat_ire <- dfm(toks_ire)  # dfm() transforms to lowercase by default
topfeatures(dfmat_ire)

##  the    .   to    ,   of  and   in    a   is that
## 3598 2371 1633 1548 1537 1359 1231 1013  868  804

## alternative approach without tokens() -- less control!
dfmat_ire2 <- data_corpus_irishbudget2010 %>%
  dfm(remove_punct = TRUE, remove = stopwords("en"), stem = TRUE)
topfeatures(dfmat_ire2)

##  peopl budget govern minist   year    tax public econom    cut
##    273    272    271    204    201    195    179    172    172
##    job
##    148
Select features based on frequencies

nfeat(dfmat_ire)

## [1] 5140

dfmat_ire_trim1 <- dfm_trim(dfmat_ire, min_termfreq = 5)
nfeat(dfmat_ire_trim1)

## [1] 1265

dfmat_ire_trim2 <- dfm_trim(dfmat_ire, min_docfreq = 5)
nfeat(dfmat_ire_trim2)

## [1] 864

dfmat_ire_trim3 <- dfm_trim(dfmat_ire, max_docfreq = 0.1, docfreq_type = "prop")
nfeat(dfmat_ire_trim3)

## [1] 2716
Exercise
1. Create a dfm from the two documents below, and weight it using dfm_weight().
2. For dfm_weight(), try out the scheme arguments "count", "boolean" and "prop".
3. What are the advantages and problems of each weighting scheme?

texts <- c("apple is better than banana",
           "banana banana apple much better")
Solution

dfmat_fruits <- dfm(texts)
dfm_weight(dfmat_fruits, scheme = "count")

## Document-feature matrix of: 2 documents, 6 features (25.0% sparse).
## 2 x 6 sparse Matrix of class "dfm"
##        features
## docs    apple is better than banana much
##   text1     1  1      1    1      1    0
##   text2     1  0      1    0      2    1

dfm_weight(dfmat_fruits, scheme = "boolean")

## Document-feature matrix of: 2 documents, 6 features (25.0% sparse).
## 2 x 6 sparse Matrix of class "dfm"
##        features
## docs    apple is better than banana much
##   text1     1  1      1    1      1    0
##   text2     1  0      1    0      1    1

Solution (cont.)

dfm_weight(dfmat_fruits, scheme = "prop")

## Document-feature matrix of: 2 documents, 6 features (25.0% sparse).
## 2 x 6 sparse Matrix of class "dfm"
##        features
## docs    apple  is better than banana much
##   text1   0.2 0.2    0.2  0.2    0.2    0
##   text2   0.2   0    0.2    0    0.4  0.2
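The three schemes amount to simple matrix transformations. Here is a base-R sketch on the count matrix from the solution above (illustrative only, not the dfm_weight() implementation):

```r
# Count matrix matching the "count" output above
m <- rbind(text1 = c(apple = 1, is = 1, better = 1, than = 1, banana = 1, much = 0),
           text2 = c(apple = 1, is = 0, better = 1, than = 0, banana = 2, much = 1))

boolean <- (m > 0) * 1     # presence/absence: any positive count becomes 1
prop    <- m / rowSums(m)  # row proportions: counts divided by document length
```

Boolean weighting discards how often a word occurs; proportions control for document length, so "banana" gets 0.4 in text2 even though both documents have the same raw count scale.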
Sentiment analysis

Why are we analyzing Irish budget speeches?

Sentiment analysis
Sentiment analysis using the Lexicoder Sentiment Dictionary (data_dictionary_LSD2015)

summary(data_dictionary_LSD2015)

##              Length Class  Mode
## negative     2858   -none- character
## positive     1709   -none- character
## neg_positive 1721   -none- character
## neg_negative 2860   -none- character

docvars(data_corpus_irishbudget2010, "gov_opp") <-
  ifelse(docvars(data_corpus_irishbudget2010, "party") %in% c("FF", "Green"),
         "Government", "Opposition")

# tokenize and apply dictionary
toks_dict <- data_corpus_irishbudget2010 %>%
  tokens() %>%
  tokens_lookup(dictionary = data_dictionary_LSD2015)

# transform to a dfm
dfmat_dict <- dfm(toks_dict)

Sentiment analysis

print(dfmat_dict)

## Document-feature matrix of: 14 documents, 4 features (12.5% sparse).
## 14 x 4 sparse Matrix of class "dfm"
##                            features
## docs                        negative positive neg_positive neg_negative
##   Lenihan, Brian (FF)            188      397            2            1
##   Bruton, Richard (FG)           163      147            5            2
##   Burton, Joan (LAB)             225      266            8            3
##   Morgan, Arthur (SF)            260      249            5            2
##   Cowen, Brian (FF)              150      368            1            2
##   Kenny, Enda (FG)               104      146            3            3
##   ODonnell, Kieran (FG)           49       84            4            0
##   Gilmore, Eamon (LAB)           164      176            3            0
##   Higgins, Michael (LAB)          37       42            1            0
##   Quinn, Ruairi (LAB)             34       40            1            1
##   Gormley, John (Green)           17       56            0            0
##   Ryan, Eamon (Green)             24       78            1            2
##   Cuffe, Ciaran (Green)           38       56            0            0
##   OCaolain, Caoimhghin (SF)      154      145            1            1
Convert to a data frame using convert()

dict_output <- convert(dfmat_dict, to = "data.frame")

Estimate sentiment

dict_output$sent_score <- log((dict_output$positive + dict_output$neg_negative + 0.5) /
                              (dict_output$negative + dict_output$neg_positive + 0.5))

dict_output <- cbind(dict_output, docvars(data_corpus_irishbudget2010))
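To see what the formula does, here it is applied to one hypothetical speech with counts of the kind shown in the dfm above; the 0.5 added to numerator and denominator is a smoothing constant that avoids taking log(0) or dividing by zero:

```r
# Hypothetical dictionary counts for a single speech
positive <- 397; negative <- 188; neg_positive <- 2; neg_negative <- 1

# Log-ratio of positive signals (positive + negated negatives)
# to negative signals (negative + negated positives)
sent_score <- log((positive + neg_negative + 0.5) /
                  (negative + neg_positive + 0.5))
sent_score  # > 0 means a net positive tone; < 0 a net negative tone
```

Because the score is a log-ratio, it is symmetric around zero: swapping the positive and negative counts flips the sign.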
Plotting with ggplot2
Specify the data (a data.frame)
Specify the aesthetics (x-axis, y-axis, colours, shapes, etc.)
Choose a geometric object (e.g. scatterplot, boxplot)

# discover the mtcars dataset
head(mtcars, 3)

##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

# specify data and axes
library(ggplot2)
myplot <- ggplot(data = mtcars, aes(x = mpg, y = hp))

myplot

myplot + geom_point()

myplot +
  geom_point(aes(colour = factor(cyl))) +
  scale_colour_discrete(name = "Cylinders") +
  scale_shape_discrete(name = "Cylinders") +
  labs(x = "Miles per gallon", y = "Horsepower")

Plot sentiment

tplot_sent <- ggplot(dict_output,
                     aes(x = reorder(name, sent_score), y = sent_score,
                         colour = gov_opp)) +
  geom_point(size = 3) +
  coord_flip() +
  labs(x = "Speaker", y = "Estimated sentiment")

tplot_sent
Textual statistics

Simple frequency analysis

library("ggplot2")
ire_dfm <- data_corpus_irishbudget2010 %>%
  dfm(remove = stopwords("en"), remove_punct = TRUE, remove_symbols = TRUE)

ire_freqplot <- ire_dfm %>%
  textstat_frequency(n = 15) %>%
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL, y = "Frequency")

ire_freqplot

Frequency analysis for groups

ire_dfm_group <- ire_dfm %>%
  dfm_group(groups = "party")

# plot a wordcloud (not recommended...)
set.seed(34)
textplot_wordcloud(ire_dfm_group, comparison = TRUE, max_words = 40)
Relative frequency analysis (keyness)
"Keyness" assigns scores to features that occur differentially across different categories

docvars(data_corpus_irishbudget2010, "gov_opp") <-
  ifelse(docvars(data_corpus_irishbudget2010, "party") %in% c("FF", "Green"),
         "Government", "Opposition")

# compare government to opposition parties by chi^2
dfmat_key <- data_corpus_irishbudget2010 %>%
  dfm(groups = "gov_opp", remove = stopwords("english"),
      remove_punct = TRUE, remove_symbols = TRUE)

tstat_key <- textstat_keyness(dfmat_key, target = "Opposition")
tplot_key <- textplot_keyness(tstat_key, margin = 0.2, n = 10)

tplot_key
Term frequency-inverse document frequency (tf-idf)
Example: We have 100 political party manifestos, each with 1000 words. The first document contains 16 instances of the word "environment"; 40 of the manifestos contain the word "environment".
The term frequency is 16.
The document frequency is 100/40 = 2.5, or log(2.5) = 0.398.
The tf-idf will then be 16 × 0.398 = 6.37.
If the word had appeared in only 15 of the 100 manifestos, then the tf-idf would be 13.18.
tf-idf filters out common terms
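The arithmetic of this example can be checked directly in R; note that the numbers above imply a base-10 logarithm for the inverse document frequency:

```r
tf <- 16    # term frequency of "environment" in the first manifesto
N  <- 100   # total number of manifestos
df <- 40    # manifestos containing "environment"

idf <- log10(N / df)   # inverse document frequency: log10(2.5) ~ 0.398
tf * idf               # tf-idf ~ 6.37

# if the word appeared in only 15 of the 100 manifestos:
tf * log10(N / 15)     # ~ 13.18
```

The rarer the term across documents, the larger the idf, so the same raw count of 16 gets a much higher weight when only 15 documents contain the word. A term appearing in every document gets idf = log10(1) = 0 and is weighted away entirely.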
Apply tf-idf weighting

dfmat_ire_tfidf <- data_corpus_irishbudget2010 %>%
  dfm(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  dfm_group(groups = "party") %>%
  dfm_tfidf()

Plot tf-idf

tstat_freq_tfidf <- dfmat_ire_tfidf %>%
  textstat_frequency(n = 10, groups = "party", force = TRUE)

# plot frequencies
tplot_tfidf <- ggplot(data = tstat_freq_tfidf,
                      aes(x = factor(nrow(tstat_freq_tfidf):1), y = frequency)) +
  geom_point() +
  facet_wrap(~ group, scales = "free_y") +
  coord_flip() +
  scale_x_discrete(breaks = factor(nrow(tstat_freq_tfidf):1),
                   labels = tstat_freq_tfidf$feature) +
  labs(x = NULL, y = "tf-idf")

tplot_tfidf
Collocation analysis: a string-of-words approach

tstat_col <- data_corpus_irishbudget2010 %>%
  tokens() %>%
  textstat_collocations(min_count = 20, tolower = TRUE)
head(tstat_col, 7)

##      collocation count count_nested length   lambda        z
## 1          it is   188            0      2 3.683338 36.89575
## 2        will be   142            0      2 4.132301 35.59424
## 3    this budget   107            0      2 4.295062 31.81840
## 4        we have   119            0      2 3.382721 29.50924
## 5      more than    56            0      2 6.297167 28.82909
## 6 social welfare    70            0      2 8.081143 28.82286
## 7         in the   347            0      2 1.855052 27.87367
Exercise
1. Repeat the step above, but remove stopwords, and stem the tokens object.
2. Compare the most frequent collocations. What has changed?
3. Optional: Conduct a collocation analysis for the German inaugural speeches. Does it work well?

Solution

toks_ire_adjusted <- data_corpus_irishbudget2010 %>%
  tokens() %>%
  tokens_remove(pattern = stopwords("en")) %>%
  tokens_wordstem()

tstat_col_adjusted <- toks_ire_adjusted %>%
  textstat_collocations(min_count = 5, tolower = FALSE)
head(tstat_col_adjusted, 7)

##     collocation count count_nested length   lambda        z
## 1 public servic    68            0      2 5.467132 26.76130
## 2 social welfar    65            0      2 7.168046 25.84665
## 3 child benefit    36            0      2 6.875431 21.50678
## 4      per week    25            0      2 5.828731 19.40465
## 5     next year    34            0      2 5.200994 18.86080
## 6 public sector    30            0      2 4.089028 17.70144
## 7  Labour Parti    23            0      2 7.486375 17.12913

nrow(tstat_col_adjusted)

## [1] 224
Compound collocations
Before compounding

toks_ire <- toks_ire_adjusted
kwic(toks_ire, phrase("Green Party")) %>% head(4)

## [1] pattern
## <0 rows> (or 0-length row.names)

Compound collocations
Compound based on collocations

# compound based on collocations
toks_ire_comp <- toks_ire %>%
  tokens_compound(tstat_col_adjusted[tstat_col_adjusted$z > 3])

After compounding

nrow(kwic(toks_ire_comp, phrase("Fianna Fáil")))

## [1] 0

nrow(kwic(toks_ire_comp, phrase("Fianna_Fáil*")))

## [1] 70

Compound collocations
Check compounded words

toks_ire_comp %>%
  tokens_keep(pattern = "*_*") %>%
  dfm() %>%
  topfeatures()

##   fianna_fáil public_servic social_welfar     next_year child_benefit
##            61            47            41            38            30
## minist_financ      per_week public_sector   young_peopl   €_4_billion
##            30            25            25            23            20
Similarity between texts
Cosine similarity

(dfmat <- dfm(c("X Y Y Z Z Z", "X X X X Y Y Y Y Y Z Z Z Z Z Z")))

## Document-feature matrix of: 2 documents, 3 features (0.0% sparse).
## 2 x 3 sparse Matrix of class "dfm"
##        features
## docs    x y z
##   text1 1 2 3
##   text2 4 5 6

textstat_simil(dfmat, method = "cosine")

## textstat_simil object; method = "cosine"
##       text1 text2
## text1 1.000 0.975

cos(A, B) = (A · B) / (||A|| · ||B||) = Σᵢ AᵢBᵢ / (√(Σᵢ Aᵢ²) · √(Σᵢ Bᵢ²))
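For the two toy documents, the formula can be verified in base R as a check on the 0.975 reported above (this is not the textstat_simil() implementation):

```r
a <- c(x = 1, y = 2, z = 3)   # feature counts for text1
b <- c(x = 4, y = 5, z = 6)   # feature counts for text2

# dot product divided by the product of the vector norms
cos_sim <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
round(cos_sim, 3)
```

Because cosine similarity normalizes by vector length, the two documents score near 1 despite text2 being much longer: the counts point in almost the same direction.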
Cosine similarity

library(quanteda)
dfmat <- dfm(c("I use this sentence almost twice.",
               "I include this almost sentence twice.",
               "And here we have irrelevant content."))

tstat_simil <- textstat_simil(dfmat, method = "cosine")
tstat_simil

## textstat_simil object; method = "cosine"
##       text1 text2 text3
## text1 1.000 0.857 0.143
## text2 0.857 1.000 0.143
## text3 0.143 0.143 1.000

Why is the similarity between text1/text2 and text3 not 0?

BONUS: Transform matrix into dyadic data frame

tstat_simil_matrix <- as.matrix(tstat_simil)
tstat_simil_matrix[lower.tri(tstat_simil_matrix)] <- NA

tstat_simil_df <- tstat_simil_matrix %>%
  reshape2::melt() %>%
  subset(Var1 != Var2) %>%
  subset(!is.na(value))

tstat_simil_df

##    Var1  Var2     value
## 4 text1 text2 0.8571429
## 7 text1 text3 0.1428571
## 8 text2 text3 0.1428571
Exercise

1. Create a dfm from data_corpus_irishbudget2010.
2. Group it by party and run textstat_simil().
3. Which parties are most similar?
4. Do the substantive conclusions change when removing stopwords and punctuation, and stemming the dfm?
119
Solution

# similarities for documents
tstat_simil1 <- data_corpus_irishbudget2010 %>%
  dfm() %>%
  dfm_group(groups = "party") %>%
  textstat_simil(method = "cosine", margin = "documents")

tstat_simil1

## textstat_simil object; method = "cosine"
##          FF    FG Green   LAB    SF
## FF    1.000 0.954 0.945 0.963 0.968
## FG    0.954 1.000 0.950 0.983 0.979
## Green 0.945 0.950 1.000 0.951 0.938
## LAB   0.963 0.983 0.951 1.000 0.980
## SF    0.968 0.979 0.938 0.980 1.000
120
Solution

tstat_simil2 <- data_corpus_irishbudget2010 %>%
  dfm(remove = stopwords("en"), stem = TRUE, remove_punct = TRUE) %>%
  dfm_group(groups = "party") %>%
  textstat_simil(method = "cosine", margin = "documents")

tstat_simil2

## textstat_simil object; method = "cosine"
##          FF    FG Green   LAB    SF
## FF    1.000 0.621 0.657 0.666 0.695
## FG    0.621 1.000 0.645 0.795 0.791
## Green 0.657 0.645 1.000 0.655 0.651
## LAB   0.666 0.795 0.655 1.000 0.810
## SF    0.695 0.791 0.651 0.810 1.000
121
Lexical diversity

dfmat_ire <- data_corpus_irishbudget2010 %>%
  dfm(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)

# estimate Type-Token Ratio
tstat_lexdiv <- textstat_lexdiv(dfmat_ire, measure = "TTR")

df_lexdiv <- cbind(tstat_lexdiv, docvars(dfmat_ire))

head(df_lexdiv, 4)

##                                  document       TTR year debate number
## Lenihan, Brian (FF)   Lenihan, Brian (FF) 0.2142487 2010 BUDGET     01
## Bruton, Richard (FG) Bruton, Richard (FG) 0.2364312 2010 BUDGET     02
## Burton, Joan (LAB)     Burton, Joan (LAB) 0.2583187 2010 BUDGET     03
## Morgan, Arthur (SF)   Morgan, Arthur (SF) 0.2265236 2010 BUDGET     04
##                        foren    name party    gov_opp
## Lenihan, Brian (FF)    Brian Lenihan    FF Government
## Bruton, Richard (FG) Richard  Bruton    FG Opposition
## Burton, Joan (LAB)      Joan  Burton   LAB Opposition
## Morgan, Arthur (SF)   Arthur  Morgan    SF Opposition
122
Plot Type-Token Ratio

tplot_ttr <- ggplot(df_lexdiv, aes(x = reorder(document, TTR), y = TTR)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL)
123
tplot_ttr
124
Readability

tstat_read <- data_corpus_inaugural %>%
  textstat_readability(measure = c("Flesch.Kincaid", "Dale.Chall.old"))

tplot_read <- ggplot(tstat_read, aes(x = Flesch.Kincaid, y = Dale.Chall.old)) +
  geom_smooth() +
  geom_point()
125
Plot relationship between readability measures

tplot_read
126
Correlation between readability measures

cor.test(tstat_read$Flesch.Kincaid, tstat_read$Dale.Chall.old)

##
##  Pearson's product-moment correlation
##
## data:  tstat_read$Flesch.Kincaid and tstat_read$Dale.Chall.old
## t = 19.714, df = 56, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8920247 0.9611137
## sample estimates:
##       cor
## 0.9349112
127
Readability of inaugural speeches since 1945

tstat_read_subset <- data_corpus_inaugural %>%
  corpus_subset(Year > 1948) %>%
  textstat_readability(measure = "Flesch.Kincaid")

tplot_read_us <- ggplot(tstat_read_subset,
                        aes(x = reorder(document, Flesch.Kincaid),
                            y = Flesch.Kincaid)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL, y = "Readability (Flesch-Kincaid)")
128
Exercise: Wrapping it all up

1. Use data_corpus_guardian from the quanteda.corpora package: data_corpus_guardian <- quanteda.corpora::download('data_corpus_guardian').
2. Create a tokens object and remove stopwords and punctuation.
3. Keep only the term "Brexit*" and a window of 5 words.
4. Use fcm() to create a feature co-occurrence matrix.
5. Use textplot_network() to plot a network of co-occurrences of the 40 most frequent terms.
130
Solution

Import the Guardian corpus using quanteda.corpora's download() function.

data_corpus_guardian <- download('data_corpus_guardian')

toks_brexit <- data_corpus_guardian %>%
  tokens(remove_punct = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(stopwords("en"), padding = FALSE) %>%
  tokens_keep(pattern = "Brexit*", window = 5, case_insensitive = TRUE) %>%
  tokens_tolower()

fcmat_brexit <- toks_brexit %>% fcm(context = "window")

feat <- names(topfeatures(fcmat_brexit, 40))

fcmat_brexit_subset <- fcmat_brexit %>% fcm_select(feat)

tplot_brexit <- textplot_network(fcmat_brexit_subset)

## Registered S3 method overwritten by 'network':
##   method            from
##   summary.character quanteda

131

set.seed(134)
tplot_brexit
132
Supervised text classification

Supervised classification: separate documents into pre-existing categories

We need:

1. A hand-coded (labelled) dataset, to be split into a training set (used to train the classifier) and a validation/test set (used to validate the classifier)
2. A method to extrapolate from the hand-coding to unlabelled documents (classifier): Naive Bayes, regularized regression, SVM, k-nearest neighbours, ensemble methods
3. An approach to validate the classifier: cross-validation
4. A performance metric to choose the best classifier and avoid overfitting: confusion matrix, accuracy, precision, recall, F1 score
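A minimal base-R sketch of step 1, splitting labelled documents into training and test sets. The labels and the 80/20 split are illustrative assumptions, not part of the workflow shown later:

```r
set.seed(42)
labels <- c(rep("pos", 50), rep("neg", 50))   # hypothetical hand-coded labels

# draw 80% of document indices for training; the remainder forms the test set
train_idx <- sample(seq_along(labels), size = 0.8 * length(labels))
test_idx  <- setdiff(seq_along(labels), train_idx)

length(train_idx)  # 80 training documents
length(test_idx)   # 20 held-out test documents
```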
134
Goals of supervised classification

1. Flexibility
2. Efficiency
3. Interpretability
135
Unsupervised classification

Unsupervised methods scale documents based on patterns of similarity from the term-document matrix, without requiring a training step

Examples: Wordfish, topic models (to be covered soon)

Relative advantage: you do not have to know the dimension being scaled (also a disadvantage!)
136
Basic principles of supervised learning

1. Generalisation: a classifier learns to correctly predict output from given inputs not only in previously seen samples but also in previously unseen samples
2. Overfitting: a classifier learns to correctly predict output from given inputs in previously seen samples but fails to do so in previously unseen samples. This causes poor prediction/generalisation.
137
How to assess the performance of a classifier?
138
Important measures

Accuracy = correctly classified / total number of cases
         = (true positives + true negatives) / total number of cases

Precision = true positives / (true positives + false positives)

Recall = true positives / (true positives + false negatives)

F1 score = (2 × Precision × Recall) / (Precision + Recall)
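These measures are easy to compute directly from confusion-matrix counts. The counts below are taken from the Naive Bayes inaugural-speech example later in these slides (tp = 6, fp = 0, fn = 3, tn = 14):

```r
# confusion-matrix counts (from Example 2 later in the deck)
tp <- 6; fp <- 0; fn <- 3; tn <- 14

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1), 3)
##  accuracy precision    recall        f1
##     0.870     1.000     0.667     0.800
```

These values agree with the Precision, Recall, and F1 figures that caret::confusionMatrix() reports in that example.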
139
How do we get a labelled training set?

1. External sources of annotation
2. Expert annotation
3. Crowd-sourced coding

Wisdom of crowds -- aggregated judgments by non-experts converge to the judgments of experts, depending on the difficulty of the classification task (Benoit et al. 2016)

Platforms: Figure Eight (previously called CrowdFlower) or Amazon's MTurk
140
Naive Bayes: Probabilistic learning

Intuition: if we observe the term "fantastic" in a text, how likely is it that this text is a positive review?

1. Determine the frequency of the term in positive and negative reviews (prior).
2. Assess the probability of features given a particular class.
3. Get the probability of a document belonging to each class (posterior).
4. Which posterior is highest?
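A toy illustration of these steps in base R. The prior and the word frequencies are invented for the example, not estimated from any corpus:

```r
# hypothetical training quantities
p_pos <- 0.5          # prior: share of positive reviews
p_neg <- 0.5          # prior: share of negative reviews
p_fant_pos <- 0.09    # P("fantastic" | positive)
p_fant_neg <- 0.01    # P("fantastic" | negative)

# posterior via Bayes' rule: P(positive | "fantastic")
posterior_pos <- p_fant_pos * p_pos /
  (p_fant_pos * p_pos + p_fant_neg * p_neg)
posterior_pos
## [1] 0.9
```

Because "fantastic" is nine times more frequent in positive reviews, the posterior strongly favours the positive class.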
141
Naive Bayes

Advantages

Simple, fast, effective

Relatively small training set required (if classes are not too imbalanced)

Easy to obtain probabilities

Disadvantages

The assumption of conditional independence is problematic

If a feature is not in the training set, it is disregarded for the classification
142
Naive Bayes in quanteda: Example 1

# 1. get training set
dfmat_train <- dfm(c("positive bad negative horrible", "great fantastic nice"))
class <- c("neg", "pos")

# 2. train model
tmod_nb <- textmodel_nb(dfmat_train, class)

# 3. get unlabelled test set
dfmat_test <- dfm(c("bad horrible negative awful", "nice bad great", "great fantastic"))

# 4. predict class
predict(tmod_nb, dfmat_test, force = TRUE)

## Warning: 1 feature in newdata not used in prediction.

## text1 text2 text3
##   neg   pos   pos
## Levels: neg pos
143
Naive Bayes in quanteda: Example 2

Goal: predict whether a US inaugural speech was delivered before or after 1945

# create unique ID
docvars(data_corpus_inaugural, "id") <- 1:ndoc(data_corpus_inaugural)

# create dummy (before/after 1945)
docvars(data_corpus_inaugural, "pre_post1945") <-
  ifelse(docvars(data_corpus_inaugural, "Year") > 1945, "Post1945", "Pre1945")

# sample 35 speeches
set.seed(123)
speeches_select <- sample(1:58, size = 35, replace = FALSE)

# create training and test sets
dfmat_train <- data_corpus_inaugural %>%
  corpus_subset(id %in% speeches_select) %>%   # speech in id
  dfm()

dfmat_test <- data_corpus_inaugural %>%
  corpus_subset(!id %in% speeches_select) %>%  # speech not in id
  dfm()
144
Naive Bayes in quanteda: Example 2

# number of documents in training set
ndoc(dfmat_train)
## [1] 35

# train Naive Bayes model
tmod_nb <- textmodel_nb(dfmat_train, y = docvars(dfmat_train, "pre_post1945"))

# number of documents in test set
ndoc(dfmat_test)
## [1] 23

# predict class of held-out test set
tmod_nb_predict <- predict(tmod_nb, newdata = dfmat_test, force = TRUE)  # set force to TRUE or use dfm_match

## Warning: 1268 features in newdata not used in prediction.
145
Naive Bayes in quanteda: Example 2

# create cross-table
tab <- table(actual = docvars(dfmat_test, "pre_post1945"),
             predicted = tmod_nb_predict)

print(tab)

##           predicted
## actual     Post1945 Pre1945
##   Post1945        6       0
##   Pre1945         3      14
146
Naive Bayes in quanteda: Example 2

Assess accuracy using the caret package

library(caret)
performance <- caret::confusionMatrix(tab)

# get measures of classification performance
performance$byClass

##          Sensitivity          Specificity       Pos Pred Value
##            0.6666667            1.0000000            1.0000000
##       Neg Pred Value            Precision               Recall
##            0.8235294            1.0000000            0.6666667
##                   F1           Prevalence       Detection Rate
##            0.8000000            0.3913043            0.2608696
## Detection Prevalence    Balanced Accuracy
##            0.2608696            0.8333333
147
Topic models

Required packages: topicmodels and stm

install.packages("stm")
install.packages("topicmodels")

packageVersion("topicmodels")
## [1] '0.2.8'

packageVersion("stm")
## [1] '1.3.3'
149
What are topic models?

An algorithm to find the most important "themes/frames/topics" in a corpus

No further information or training data is necessary (an entirely unsupervised method)

The researcher only needs to specify the number of topics (more difficult than it sounds!)
150
Example of a study using topic models

Catalinac (2016)

Amy Catalinac (2016). "From Pork to Policy: The Rise of Programmatic Campaigning in Japanese Elections." The Journal of Politics 78(1): 1-18.

Japanese electoral system

Prior to 1994: single non-transferable vote in multi-member districts (SNTV-MMD)

Since 1994: mixed-member majoritarian (MMM) electoral system

Expectations:

More pork-barrel politics under SNTV-MMD

Under SNTV-MMD, candidates facing higher levels of intraparty competition relied more on pork-barrel politics
152
Catalinac (2016): Research design

A Japanese candidate manifesto
153
Catalinac (2016): Results
154
Catalinac (2016): Results
155
Catalinac (2016): Results
156
Assumptions

Each document consists of a mixture of topics (drawn from a Dirichlet distribution)

Each topic is correlated more strongly with certain terms

Most topic models rely on a bag-of-words assumption: the order of words within a document is irrelevant
157
Assumptions
158
Example from Blei (2012)
159
Latent Dirichlet Allocation

The most common form of topic modelling is Latent Dirichlet Allocation, or LDA. LDA works as follows:

1. Specify K (the number of topics).
2. Each word in the dfm is assigned randomly to one of the topics (involves a Dirichlet distribution). The numbers assigned across the topics add up to 1.
3. Topic assignments for each word are updated in an iterative fashion by updating the prevalence of the word across the topics, as well as the prevalence of the topics in the document.
4. Assignment stops after a user-specified threshold, or when iterations begin to have little impact on the probabilities assigned to words in the corpus.
5. Output:
   1. The words most frequently associated with each of the topics.
   2. The probability that each document within the corpus is associated with each of the topics. Probabilities add up to 1.

Source: Tutorial by Chris Bail.
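A minimal sketch of these steps on a toy corpus, using the topicmodels package; the example texts and K = 2 are arbitrary choices for illustration. quanteda's convert() bridges a dfm to the document-term format that topicmodels expects:

```r
library(quanteda)
library(topicmodels)

# toy corpus with two obvious themes (politics vs. film)
dfmat_toy <- dfm(c("tax budget spending tax vote",
                   "film actor film cinema movie",
                   "vote election tax budget"))

# fit a 2-topic LDA model; seed fixed for reproducibility
lda_fit <- LDA(convert(dfmat_toy, to = "topicmodels"), k = 2,
               control = list(seed = 123))

terms(lda_fit, 3)   # top 3 terms per topic
topics(lda_fit)     # most likely topic per document
```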
160
Example: CMU 2008 Political Blog Corpus

URL: http://www.sailing.cs.cmu.edu/main/?page_id=710

# load packages
library(quanteda)
library(readtext)
library(stm)

# url of blog posts
url_blogs <- "https://uclspp.github.io/datasets/data/poliblogs2008.zip"

# download blog posts
dat_blogs <- readtext(url_blogs)

# create a corpus
corp_blogs <- corpus(dat_blogs, text_field = "documents")

# sample 1,500 documents
set.seed(134)
corp_blogs_sample <- corpus_sample(corp_blogs, size = 1500)
161
STM example: create corpus

summary(corp_blogs_sample, 6)

## Corpus consisting of 1500 documents, showing 6 documents:
##
##                     Text Types Tokens Sentences  text           docname
##  poliblogs2008.csv.13132   162    322        11 13132 tpm0834400_0.text
##   poliblogs2008.csv.8362   196    324        10  8362 ha0831900_11.text
##   poliblogs2008.csv.8618   246    449        14  8618  ha0834700_3.text
##   poliblogs2008.csv.8344   193    530         8  8344  at0801200_4.text
##   poliblogs2008.csv.7206   169    266        11  7206  ha0821900_5.text
##  poliblogs2008.csv.11124   416    721        27 11124  tp0829400_9.text
##        rating day blog
##       Liberal 344  tpm
##  Conservative 319   ha
##  Conservative 347   ha
##  Conservative  12   at
##  Conservative 219   ha
##       Liberal 294   tp
##
## Source: /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/workshops/modules/* on x86_64 by kbenoit
## Created: Tue Jun 25 20:54:56 2019
## Notes:
162
STM example: create dfm and remove features

dfmat_blogs <- dfm(corp_blogs_sample,
                   stem = TRUE,
                   remove = stopwords("english"),
                   remove_punct = TRUE,
                   remove_numbers = TRUE)

dfmat_blogs_trim <- dfm_trim(dfmat_blogs, min_termfreq = 10, min_docfreq = 5)
163
STM example: convert to stm

# convert dfm to stm format
stm_blogs <- convert(dfmat_blogs_trim, to = "stm", docvars = docvars(dfmat_blogs))
164
Running the STM

Use the stm() function. K specifies the number of topics.

With K = 0 the stm package selects the number of topics automatically (based on an algorithm developed by Lee and Mimno (2014))

Specify covariates with prevalence
165
Running the STM

# run structural topic model
stm_object <- stm(documents = stm_blogs$documents,
                  vocab = stm_blogs$vocab,
                  data = stm_blogs$meta,
                  prevalence = ~ rating + s(day),
                  K = 20,
                  seed = 12345)

## Beginning Spectral Initialization
##   Calculating the gram matrix...
##   Finding anchor words...
##   ....................
##   Recovering initialization...
##   ........................................
## Initialization complete.
## ...................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 1 (approx. per word bound = -7.352)
## ...................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 2 (approx. per word bound = -7.136, relative change = 2.942e-02)
## ...................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 3 (approx. per word bound = -7.098, relative change = 5.204e-03)
166
plot(stm_object)
167
Estimate effects

# pick and label topics of interest
topics_of_interest <- c(3, 6, 16, 18)
topic_labels <- c("Obama", "Financial Crisis", "Bush", "Iraq")
168
Identify topics of interest

labelTopics(stm_object, c(3, 6))

## Topic 3 Top Words:
##  Highest Prob: media, say, think, peopl, go, know, get
##  FREX: media, matthew, rush, fox, beck, chris, guy
##  Lift: bachmann, limbaugh, beck, hardbal, matthew, luntz, o'reilli
##  Score: bachmann, beck, matthew, limbaugh, clinton, hillari, o'reilli
## Topic 6 Top Words:
##  Highest Prob: time, report, year, new, news, global, warm
##  FREX: warm, climat, ice, global, coverag, newspap, ap
##  Lift: sheet, sulzberg, temperatur, solar, warm, murdoch, pinch
##  Score: sheet, climat, ice, warm, temperatur, sulzberg, solar
169
Identify topics of interest

labelTopics(stm_object, c(16, 18))

## Topic 16 Top Words:
##  Highest Prob: bush, iraq, war, presid, administr, said, year
##  FREX: bush, saddam, war, hussein, iraq, georg, administr
##  Lift: gonzal, wmd, saddam, shoe, hussein, powel, colin
##  Score: bush, iraq, gonzal, saddam, powel, iraqi, rice
## Topic 18 Top Words:
##  Highest Prob: one, like, can, time, just, think, even
##  FREX: film, love, movi, dowd, rememb, allen, exit
##  Lift: pole, filmmak, film, movi, documentari, tuzla, ala
##  Score: pole, film, dowd, movi, allen, mom, hollywood
170
Meanings of Output

Highest Probability: words with the highest probability of occurring in the topic

FREX: "Frequency and Exclusivity" -- words that distinguish the topic from all other topics

Lift & Score: indicators from the lda/maptpx packages
171
Find representative documents of a topic

# note: output too long for slide
findThoughts(stm_object, texts(corp_blogs_sample), topics = topics_of_interest, n = 3)
172
Estimate effect of covariates

# estimate effect
stm_effects <- estimateEffect(topics_of_interest ~ rating + s(day),
                              stmobj = stm_object,
                              meta = stm_blogs$meta,
                              uncertainty = "Global")

summary(stm_effects)
173
Plot influence of discrete variable

plot(stm_effects,
     covariate = "rating",
     topics = topics_of_interest,
     model = stm_object,
     method = "difference",
     cov.value1 = "Liberal",
     cov.value2 = "Conservative",
     xlim = c(-0.1, 0.1),
     labeltype = "custom",
     custom.labels = topic_labels,  # pay attention to labelling!
     main = "Effect of Conservative vs. Liberal",
     xlab = "More Conservative ... More Liberal")

plot_effect_libcon <- recordPlot()
plot.new()
174
plot_effect_libcon
175
Plot effect of continuous variable

plot(stm_effects,
     covariate = "day",
     method = "continuous",
     topics = topics_of_interest,
     model = stm_object,
     labeltype = "custom",
     custom.labels = topic_labels,
     xaxt = "n",
     xlab = "Time",
     main = "Topic Prevalence Over Time")

# set the x-axis labels as month names
blog_dates <- seq(from = as.Date("2008-01-01"), to = as.Date("2008-12-01"), by = "month")
axis(1, blog_dates - min(blog_dates), labels = months(blog_dates))

plot_effect_time <- recordPlot()
plot.new()
176
plot_effect_time
177
Create (somewhat) better/tidier plots

tidystm: https://github.com/mikajoh/tidystm

stminsights: https://github.com/cschwem2er/stminsights
178
References and useful descriptions

Margaret Roberts (2017): Structural Topic Models

Chris Bail (2018): Topic Modelling
179
Install packages

We will work with the dplyr and ggplot2 packages.

install.packages("dplyr")
install.packages("ggplot2")

Having installed the packages, we need to load them in the R environment.

library(dplyr)
library(ggplot2)
180
Gapminder: our sample dataset

We will use the gapminder dataset throughout this part of the course.

install.packages("gapminder")

library(gapminder)
181
Getting to know a dataset

Understand the structure and variable coding of a dataset

str(gapminder)

## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num 28.8 30.3 32 34 36.1 ...
##  $ pop      : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 ...
##  $ gdpPercap: num 779 821 853 836 740 ...
182
Print the first six rows

head(gapminder)

## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

We can find out more information on the data using ?gapminder.
183
Data wrangling with dplyr

Important functions:

select: pick columns by name

filter: keep rows matching criteria

arrange: reorder rows

mutate: add new variables

group_by: group rows by category

summarise: reduce variables to values

Data visualisation with ggplot2
184
Data wrangling: Sample data frame

(df <- data.frame(colour = c("green", "black", "black", "green", "black", "red"),
                  value = 1:6))

##   colour value
## 1  green     1
## 2  black     2
## 3  black     3
## 4  green     4
## 5  black     5
## 6    red     6
185
Filter/subset a data frame

filter(df, colour == "green")

##   colour value
## 1  green     1
## 2  green     4
186
Applying the pipe (%>%) operator

df %>%
  filter(colour != "red") %>%
  filter(value >= 5)

##   colour value
## 1  black     5

df %>% filter(colour != "red" & value >= 5)

##   colour value
## 1  black     5
187
Exercise

1. Load the gapminder dataset (library(gapminder)).
2. Type ?gapminder and View(gapminder) to inspect the data.
3. Filter only observations from Norway (call the new object dat_nor).
4. Filter only observations from Europe or Africa (gap_europe_africa).
188
Solution

# load gapminder data
library(gapminder)

# variable names
names(gapminder)

## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"

# number of rows and columns
dim(gapminder)

## [1] 1704    6
189
Solution

dat_nor <- gapminder %>% filter(country == "Norway")

nrow(dat_nor)
## [1] 12

gap_europe_africa <- gapminder %>% filter(continent %in% c("Europe", "Africa"))
190
Data visualisation with ggplot2

Basic structure:

1. Call ggplot()
2. Specify data with data
3. Specify the coordinate system/axes with aes()

plot1 <- ggplot(data = gapminder, aes(x = year, y = lifeExp))
191
plot1
192
plot1 + geom_point()
193
Hands-on Exercise

1. Using gapminder, create a data frame with only observations from Europe or Africa in the year 2007.
2. Create a boxplot with continent on the x-axis and lifeExp on the y-axis.
194
Solution

dat_europe_africa_2007 <- gapminder %>%
  filter(continent %in% c("Europe", "Africa") & year == 2007)

plot_boxplot <- ggplot(dat_europe_africa_2007, aes(x = continent, y = lifeExp)) +
  geom_boxplot()
195
plot_boxplot
196
plot_boxplot + labs(x = "Continent", y = "Life Expectancy in 2007")
197
General advice

Save plots as PDF (a vector graphic and a small file):

Set the ggplot2 theme globally:

# example for saving a plot
ggsave("boxplot.pdf", plot_boxplot, width = 5, height = 7)

# example for setting a theme
ggplot2::theme_set(theme_minimal())
198
Density plots

plot_density <- ggplot(dat_europe_africa_2007, aes(x = lifeExp, fill = continent)) +
  geom_density(alpha = 0.7) +
  labs(x = "Life Expectancy in 2007")
199
plot_density
200
plot_density +
  scale_x_continuous(limits = c(0, 100)) +
  scale_fill_discrete(name = "Continent") +
  theme(legend.position = "bottom")
201
Boxplots with groups

gapminder_post_1980 <- gapminder %>%
  filter(year > 1980 & continent != "Oceania")

plot_boxes <- ggplot(data = gapminder_post_1980, aes(x = as.factor(year), y = lifeExp)) +
  geom_boxplot() +
  labs(x = "Year", y = "Life expectancy")
202
plot_boxes
203
Add facets for a factor variable

plot_boxes + facet_wrap(~ continent)
204
Arrange/reorder a dataset

gap_arranged <- gapminder %>% arrange(lifeExp)  # ascending

head(gap_arranged)

## # A tibble: 6 x 6
##   country      continent  year lifeExp     pop gdpPercap
##   <fct>        <fct>     <int>   <dbl>   <int>     <dbl>
## 1 Rwanda       Africa     1992    23.6 7290203      737.
## 2 Afghanistan  Asia       1952    28.8 8425333      779.
## 3 Gambia       Africa     1952    30    284320      485.
## 4 Angola       Africa     1952    30.0 4232095     3521.
## 5 Sierra Leone Africa     1952    30.3 2143249      880.
## 6 Afghanistan  Asia       1957    30.3 9240934      821.
205
Arrange dataset

gap_arranged_desc <- gapminder %>% arrange(-lifeExp)  # descending

head(gap_arranged_desc)

## # A tibble: 6 x 6
##   country          continent  year lifeExp       pop gdpPercap
##   <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
## 1 Japan            Asia       2007    82.6 127467972    31656.
## 2 Hong Kong, China Asia       2007    82.2   6980412    39725.
## 3 Japan            Asia       2002    82   127065841    28605.
## 4 Iceland          Europe     2007    81.8    301931    36181.
## 5 Switzerland      Europe     2007    81.7   7554661    37506.
## 6 Hong Kong, China Asia       2002    81.5   6762476    30209.
206
Select variables

gapminder_select <- gapminder %>% select(year, country, lifeExp)

head(gapminder_select, 2)

## # A tibble: 2 x 3
##    year country     lifeExp
##   <int> <fct>         <dbl>
## 1  1952 Afghanistan    28.8
## 2  1957 Afghanistan    30.3
207
Reorder variables within a data frame

gapminder_reordered <- gapminder %>% select(pop, year, everything())

head(gapminder_reordered, 2)

## # A tibble: 2 x 6
##       pop  year country     continent lifeExp gdpPercap
##     <int> <int> <fct>       <fct>       <dbl>     <dbl>
## 1 8425333  1952 Afghanistan Asia         28.8      779.
## 2 9240934  1957 Afghanistan Asia         30.3      821.
208
Exercise

1. Select country, continent, year, lifeExp and pop.
2. Arrange gapminder ascending by year and, within year, descending by population.
209
Solution

gapminder_arranged <- gapminder %>%
  select(-gdpPercap) %>%
  arrange(year, -pop)

# check
View(gapminder_arranged)
210
Group dataset and summarise within groups

df

##   colour value
## 1  green     1
## 2  black     2
## 3  black     3
## 4  green     4
## 5  black     5
## 6    red     6

# get sum of value variable
df %>% summarise(sum = sum(value))

##   sum
## 1  21
211
Add grouping variable

df %>%
  group_by(colour) %>%
  summarise(sum = sum(value))

## # A tibble: 3 x 2
##   colour   sum
##   <fct>  <int>
## 1 black     10
## 2 green      5
## 3 red        6
212
Group and summarise data frame

df %>%
  group_by(colour) %>%
  summarise(total = n(), sum = sum(value))

## # A tibble: 3 x 3
##   colour total   sum
##   <fct>  <int> <int>
## 1 black      3    10
## 2 green      2     5
## 3 red        1     6
213
Exercise

1. Use gapminder.
2. Group by year.
3. Calculate the average life expectancy and the maximum GDP per capita per year.
214
Solution

data_gap_summarised <- gapminder %>%
  group_by(year) %>%
  summarise(life_exp_mean = mean(lifeExp), gdp_max = max(gdpPercap))

data_gap_summarised

## # A tibble: 12 x 3
##     year life_exp_mean gdp_max
##    <int>         <dbl>   <dbl>
##  1  1952          49.1 108382.
##  2  1957          51.5 113523.
##  3  1962          53.6  95458.
##  4  1967          55.7  80895.
##  5  1972          57.6 109348.
##  6  1977          59.6  59265.
##  7  1982          61.5  33693.
##  8  1987          63.2  31541.
##  9  1992          64.2  34933.
## 10  1997          65.0  41283.
## 11  2002          65.7  44684.
## 12  2007          67.0  49357.
215
Create new variables with mutate

names(gapminder)

## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"

gapminder <- gapminder %>%
  group_by(year) %>%
  mutate(life_exp_mean = mean(lifeExp),
         diff_country_mean = lifeExp - life_exp_mean)

summary(gapminder$diff_country_mean)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -40.561  -9.583   1.293   0.000  10.240  23.612
216
Further information

Wickham, Hadley and Garrett Grolemund (2017). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Sebastopol: O'Reilly.

Healy, Kieran (2019). Data Visualization: A Practical Introduction. Princeton: Princeton University Press.
217
Course resources

Documentation:

https://quanteda.io

https://readtext.quanteda.io

https://spacyr.quanteda.org

Tutorials: https://tutorials.quanteda.io

Cheatsheet: https://www.rstudio.com/resources/cheatsheets/

Kenneth Benoit, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. 2018. "quanteda: An R Package for the Quantitative Analysis of Textual Data." Journal of Open Source Software 3(30): 774.
218