Quantitative Text Analysis using R and quanteda
Kenneth Benoit
University of Münster (27–28 June 2019)
Contents
1. Introductions
2. Fundamentals
3. Getting Started
4. Working with a corpus
5. Keywords-in-context
6. Importing text files
7. Tokenization
8. Creating a document-feature matrix
9. Dictionary (sentiment) Analysis
10. Textual Statistics
11. Classification
12. Topic modelling
13. Data wrangling and visualisation
Purpose of the workshop
This workshop is designed to
introduce the tools for quantitative text analysis using R
focus specifically on the quanteda package
demonstrate several major categories of analysis
answer any questions about text analysis that you might have
Whom this workshop is for
Those familiar with text analysis methods but not in R/quanteda
Those who have tried R/quanteda but want to become more proficient
The merely "text-curious"
No real pre-requisites: We're all here to learn
About me
Ken Benoit (http://kenbenoit.net)
Professor of Computational Social Science, Department of Methodology, London School of Economics and Political Science
Creator and maintainer of the quanteda R package and related tools
Founder of the Quanteda Initiative, a non-profit company aimed at advancing the development and dissemination of open-source tools for text analytics
My research: Political science (party competition), measurement, text analysis
Example from my research
Example from my research (cont.)
Your turn!
1. Name?
2. Department, degree?
3. Research interests?
4. Previous experience with text analysis / R?
5. Why are you interested in the workshop?
Assumptions, Concepts, and Examples

Workflow, demystified
Basic concepts and terminology
Texts: Organized into documents
Corpus: Collection of texts, often with associated document-level metadata, what I will call "document variables" or docvars
Stems: words with suffixes removed (using a set of rules)
Lemmas: canonical word form

Word   win  winning  wins  won
Stem   win  win      win   won
Lemma  win  win      win   win
Basic concepts and terminology (cont.)
Stopwords: words that are designated for exclusion from any analysis of text
Parts of speech: linguistic markers indicating the general category of a word's linguistic property, e.g. noun, verb, adjective, etc.
Named entities: a real-world object, such as persons, locations, organizations, products, etc., that can be denoted with a proper name, often a phrase, e.g. "University of Bergen" or "United Kingdom"
Multi-word expressions: sequences of words denoting a single concept (and that would often be a single word in German), e.g. value added tax (in German: Mehrwertsteuer)
Types and tokens
tokens: a sequence of characters that are grouped together as a useful semantic unit
words
could also include punctuation characters or symbols
stems or lemmas
multi-word expressions
named entities
usually, but not always, delimited by spaces
type: a unique token
toks <- tokens(c("A corpus is a set of documents.",
                 "This is the second document in the corpus"),
               remove_punct = TRUE)
toks

## tokens from 2 documents.
## text1 :
## [1] "A"         "corpus"    "is"        "a"         "set"       "of"
## [7] "documents"
##
## text2 :
## [1] "This"     "is"       "the"      "second"   "document" "in"
## [7] "the"      "corpus"

tokens_wordstem(toks)

## tokens from 2 documents.
## text1 :
## [1] "A"        "corpus"   "is"       "a"        "set"      "of"
## [7] "document"
##
## text2 :
## [1] "This"     "is"       "the"      "second"   "document" "in"
## [7] "the"      "corpus"

tokens_wordstem(toks) %>%
  tokens_remove(stopwords("english"))

## tokens from 2 documents.
## text1 :
## [1] "corpus"   "set"      "document"
##
## text2 :
## [1] "second"   "document" "corpus"

tokens_wordstem(toks) %>%
  tokens_remove(stopwords("english")) %>%
  types()

## [1] "corpus"   "set"      "document" "second"
Typical workflow steps
1. Lowercase
"a corpus is a set of documents."
"this is the second document in the corpus."
2. Remove stopwords and punctuation
"a corpus is a set of documents."
"this is the second document in the corpus."
3. Stem
"corpus set documents"
"second document corpus"
4. Tokenize
[corpus, set, document]
[second, document, corpus]
5. Create a document-feature matrix
a "bag-of-words" conversion of documents into a matrix that counts the features (types) by document
Document-feature matrix
Document 1: "A corpus is a set of documents."
Document 2: "This is the second document in the corpus."

             corpus  set  document  second
Document 1        1    1         1       0
Document 2        1    0         1       1
...
Document n        0    1         1       0
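The counting step behind this table can be sketched in base R. This is purely an illustration of the bag-of-words idea using two toy documents and whitespace tokenization, not how quanteda builds its sparse dfm objects:

```r
# Two toy documents, named so the rows of the matrix are labelled
docs <- c(doc1 = "a corpus is a set of documents",
          doc2 = "this is the second document in the corpus")

# Tokenize by whitespace and collect the vocabulary (the feature set)
toks <- strsplit(tolower(docs), "\\s+")
feats <- sort(unique(unlist(toks)))

# Count each feature per document: rows = documents, columns = features
dfm_manual <- t(sapply(toks, function(x) table(factor(x, levels = feats))))
dfm_manual[, c("corpus", "document", "second")]
```

Note that "documents" and "document" are counted as distinct features here; that is exactly the problem stemming addresses in step 3 above.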
Getting started

Installing quanteda
Install the quanteda package from CRAN:

install.packages("quanteda")

You should also install the readtext package:

install.packages("readtext")

Afterwards, load both packages:

library("quanteda")
library("readtext")
packageVersion("quanteda")

## [1] '1.4.5'

packageVersion("readtext")

## [1] '0.75'
Reproduce example in quanteda
Create a text corpus

library(quanteda)

# Create text corpus
corp <- corpus(c("A corpus is a set of documents.",
                 "This is the second document in the corpus."))

Exercise: Create this corpus and get a summary using summary(corp).

summary(corp)

## Corpus consisting of 2 documents:
##
##   Text Types Tokens Sentences
##  text1     8      8         1
##  text2     8      9         1
##
## Source: /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/workshops/modules/* on x86_64 by kbenoit
## Created: Tue Jun 25 20:54:20 2019
## Notes:
Reproduce example in quanteda
Create a document-feature matrix

dfm(corp, remove_punct = TRUE, remove = stopwords("en"), stem = TRUE)

## Document-feature matrix of: 2 documents, 4 features (25.0% sparse).
## 2 x 4 sparse Matrix of class "dfm"
##        features
## docs    corpus set document second
##   text1      1   1        1      0
##   text2      1   0        1      1

Question: How does the dfm change when we change the preprocessing steps?
The quanteda Infrastructure

quanteda: Quantitative Analysis of Textual Data
6 years of development, 28 releases
8,300 commits, 270,000 downloads
Design of the package
consistent grammar
flexible for power users, simple for beginners
analytic transparency and reproducibility
compatibility with other packages
pipelined workflow using magrittr's %>%
Quanteda Initiative
UK non-profit organization devoted to the promotion of open-source text analysis software
User, technical and development support
Teaching and workshops
https://quanteda.org
Additional packages
For some exercises, we will need quanteda.corpora:

install.packages("devtools")
devtools::install_github("quanteda/quanteda.corpora")

For POS tagging, entity recognition, and dependency parsing, you should install spacyr (not covered extensively).

install.packages("spacyr")

Installation instructions: http://spacyr.quanteda.io
Course resources
Documentation:
https://quanteda.io
https://readtext.quanteda.io
https://spacyr.quanteda.org
Tutorials: https://tutorials.quanteda.io
Cheatsheet: https://www.rstudio.com/resources/cheatsheets/

Kenneth Benoit, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. 2018. "quanteda: An R Package for the Quantitative Analysis of Textual Data." Journal of Open Source Software 3(30): 774.
Recap: Workflow
1. Corpus
Saves character strings and variables in a data frame
Combines texts with document-level variables
2. Tokens
Stores tokens in a list of vectors
Positional (string-of-words) analysis is performed using textstat_collocations(), tokens_ngrams(), and tokens_select(), or fcm() with a window option
3. Document-feature matrix (DFM)
Represents frequencies of features in documents in a matrix
The most efficient structure, but it does not have information on positions of words
Non-positional (bag-of-words) analyses are performed using many of the textstat_* and textmodel_* functions
Main function classes
Text corpus: corpus()
Tokenization: tokens()
Document-feature matrix: dfm()
Feature co-occurrence matrix: fcm()
Text statistics: textstat_*()
Text models: textmodel_*()
Plots: textplot_*()
Naming conventions and useful shortcuts
Recommended naming conventions for objects
In the quanteda style guide, we recommend naming objects consistently, e.g.:
Corpus: corp_...
Tokens object: toks_...
Dfm: dfmat_...
Text model: tmod_...

Useful RStudio keyboard shortcuts
Insert pipe operator (%>%): Shift+Cmd/Ctrl+M
Insert assignment operator (<-): Alt+-
More shortcuts for RStudio can be found here
Clarification
The following expressions result in the same output:

data_corpus_inaugural %>%
  tokens()

tokens(data_corpus_inaugural)
Working with corpus objects

Corpus functions in quanteda
corpus()
corpus_subset()
corpus_reshape()
corpus_segment()
corpus_sample()

Corpora in quanteda
data_corpus_inaugural
data_corpus_irishbudget2010
Additional corpora in the quanteda.corpora package
Using magrittr's pipe

data_corpus_inaugural %>%
  corpus_subset(President == "Obama") %>%
  ndoc()

## [1] 2

data_corpus_inaugural %>%
  corpus_subset(President == "Obama") %>%
  corpus_reshape(to = "sentences") %>%
  ndoc()

## [1] 198

Access number of types and tokens of a corpus

ntype(data_corpus_inaugural) %>% head()

## 1789-Washington 1793-Washington      1797-Adams  1801-Jefferson
##             625              96             826             717
##  1805-Jefferson    1809-Madison
##             804             535

ntoken(data_corpus_inaugural) %>% head()

## 1789-Washington 1793-Washington      1797-Adams  1801-Jefferson
##            1538             147            2578            1927
##  1805-Jefferson    1809-Madison
##            2381            1263
Overview of document-level variables

head(docvars(data_corpus_inaugural))

##                 Year  President FirstName
## 1789-Washington 1789 Washington    George
## 1793-Washington 1793 Washington    George
## 1797-Adams      1797      Adams      John
## 1801-Jefferson  1801  Jefferson    Thomas
## 1805-Jefferson  1805  Jefferson    Thomas
## 1809-Madison    1809    Madison     James
Exercise
1. Based on data_corpus_inaugural, create an object data_corpus_postwar (speeches since 1945).
2. Which speech has the most tokens? Which speech has the most types?
Note: You can find the documentation and examples using ? followed by the name of the function.
Solution

data_corpus_postwar <- data_corpus_inaugural %>%
  corpus_subset(Year > 1945)

# number of tokens per speech
data_corpus_postwar %>%
  ntoken() %>%
  sort(decreasing = TRUE) %>%
  head()

##     1985-Reagan     1981-Reagan 1953-Eisenhower      2009-Obama
##            2921            2790            2757            2711
##       1989-Bush     1949-Truman
##            2681            2513

# number of types per speech
data_corpus_postwar %>%
  ntype() %>%
  sort(decreasing = TRUE) %>%
  head()

##      2009-Obama     1985-Reagan     1981-Reagan 1953-Eisenhower
##             938             925             902             900
##      2013-Obama       1989-Bush
##             814             795
Adding document-level variables

# new docvar: Order
docvars(data_corpus_inaugural, "Order") <- 1:ndoc(data_corpus_inaugural)
head(docvars(data_corpus_inaugural, "Order"))

## [1] 1 2 3 4 5 6

# new docvar: PresidentFull
docvars(data_corpus_inaugural, "PresidentFull") <-
  paste(docvars(data_corpus_inaugural, "FirstName"),
        docvars(data_corpus_inaugural, "President"), sep = " ")

head(docvars(data_corpus_inaugural))

##                 Year  President FirstName Order     PresidentFull
## 1789-Washington 1789 Washington    George     1 George Washington
## 1793-Washington 1793 Washington    George     2 George Washington
## 1797-Adams      1797      Adams      John     3        John Adams
## 1801-Jefferson  1801  Jefferson    Thomas     4  Thomas Jefferson
## 1805-Jefferson  1805  Jefferson    Thomas     5  Thomas Jefferson
## 1809-Madison    1809    Madison     James     6     James Madison
Exercise
1. Use data_corpus_inaugural and reshape the entire corpus to the level of sentences and store the new corpus. What is the number of documents of the reshaped corpus?
2. Add a document-level variable to the reshaped corpus that counts the tokens per sentence.
3. Keep only sentences that are longer than 10 words.
4. Reshape the corpus back to the level of documents and store the corpus as data_corpus_inaugural_subset.
5. Optional: find a more efficient solution.
Solution

corp_sentences <- corpus_reshape(data_corpus_inaugural, to = "sentences")
docvars(corp_sentences, "number_tokens") <- ntoken(corp_sentences, remove_punct = TRUE)
ndoc(corp_sentences)

## [1] 5016

corp_sentence_subset <- corp_sentences %>%
  corpus_subset(number_tokens > 10)
ndoc(corp_sentence_subset)

## [1] 4267
Solution (cont.)

data_corpus_inaugural_subset <- corp_sentence_subset %>%
  corpus_reshape(to = "documents")

sum(ntoken(data_corpus_inaugural))

## [1] 149145

sum(ntoken(data_corpus_inaugural_subset))

## [1] 142617
Keywords-in-context

Keywords-in-context (kwic()) returns a list of one or more keywords and their immediate context.

kw_security <- kwic(data_corpus_inaugural, pattern = "security", window = 3)
head(kw_security, 3)

## [1789-Washington, 1497] government for the | security | of their union
##     [1813-Madison, 321]       seas and the | security | of an important
##     [1817-Monroe, 1610]      may form some | security | against these dangers

# Check number of occurrences
nrow(kw_security)

## [1] 65
KWIC with multiword expressions
Use phrase() to look up multiword expressions

kwic(data_corpus_inaugural, pattern = phrase("United States"), window = 3) %>%
  head(3)

##   [1789-Washington, 434:435]       people of the | United States | a Government instituted
##   [1789-Washington, 530:531]        those of the | United States | . Every step
##        [1797-Adams, 525:526] Constitution of the | United States | in a foreign
Exercise
1. In what context were "God" and "God bless" used in US presidential inaugural speeches after 1970?
2. Look up keywords-in-context for god, god bless (wrapping the latter in phrase()) in one kwic() call.
3. BONUS: Send the results of the kwic output to textplot_xray().
Solution

kw_god <- corpus_subset(data_corpus_inaugural, Year > 1970) %>%
  kwic("god", window = 3)

kw_god_bless <- corpus_subset(data_corpus_inaugural, Year > 1970) %>%
  kwic(phrase("god bless"), window = 3)

tail(kw_god_bless)

##   [2005-Bush, 2304:2305] freedom . May | God bless | you , and
##  [2009-Obama, 2699:2700]   Thank you . | God bless | you . And
##  [2009-Obama, 2704:2705]     you . And | God bless | the United States
##  [2013-Obama, 2303:2304]   Thank you . | God bless | you , and
##  [2017-Trump, 1652:1653]   Thank you , | God bless | you , and
##  [2017-Trump, 1657:1658]     you , and | God bless | America .

Solution (cont.)

textplot_xray(kw_god, kw_god_bless)
Setting up a project and importing text files

Setting up a project in RStudio
1. File -> New Project
2. Choose an existing folder or create a new folder
3. Create folders in the project, e.g.: code, data, output
4. Open the .Rproj file: no need to specify a working directory (setwd())!
Import multiple text files

# load the readtext package
library(readtext)

# data_dir is the location of sample files on your computer
data_dir <- system.file("extdata/", package = "readtext")

eu_data <- readtext(paste0(data_dir, "/txt/EU_manifestos/*.txt"),
                    docvarsfrom = "filenames",
                    docvarnames = c("unit", "context", "year", "language", "party"),
                    dvsep = "_",
                    encoding = "ISO-8859-1")

Structure of readtext data frame

head(eu_data)

## readtext object consisting of 6 documents and 5 docvars.
## # Description: df [6 x 7]
##   doc_id               text                unit  context  year language party
## * <chr>                <chr>               <chr> <chr>   <int> <chr>    <chr>
## 1 EU_euro_2004_de_PSE… "\"PES · PSE\"…"    EU    euro     2004 de       PSE
## 2 EU_euro_2004_de_V.t… "\"Gemeinsame\"…"   EU    euro     2004 de       V
## 3 EU_euro_2004_en_PSE… "\"PES · PSE\"…"    EU    euro     2004 en       PSE
## 4 EU_euro_2004_en_V.t… "\"Manifesto\n\"…"  EU    euro     2004 en       V
## 5 EU_euro_2004_es_PSE… "\"PES · PSE\"…"    EU    euro     2004 es       PSE
## 6 EU_euro_2004_es_V.t… "\"Manifesto\n\"…"  EU    euro     2004 es       V
Check encoding

file <- system.file("extdata/txt/EU_manifestos/EU_euro_2004_de_V.txt",
                    package = "readtext")
myreadtext <- readtext(file)
encoding(myreadtext)

## readtext object consisting of 1 document and 0 docvars.
## # Description: df [1 x 2]
##   doc_id                text
##   <chr>                 <chr>
## 1 EU_euro_2004_de_V.txt "\"Gemeinsame\"..."

## Probable encoding: ISO-8859-1
## (but note: detector often reports ISO-8859-1 when encoding is actually UTF-8.)
Exercise
1. Set up an RStudio project (.Rproj) in a new folder.
2. Download the ZIP file with inaugural speeches by German chancellors (https://tinyurl.com/corp-regierung).
3. Copy the folder into your project folder.
4. Import all text files using readtext(). (Hint: "name_of_folder/*" reads in all files)
5. Create a text corpus from this data frame.
Solution

library(readtext)
data_inaugural_ger <- readtext("data/inaugural_germany/*",
                               encoding = "UTF-8",
                               docvarsfrom = "filenames",
                               dvsep = "-",
                               docvarnames = c("Year", "Chancellor", "Party"))

data_corpus_inaugural_ger <- corpus(data_inaugural_ger)
summary(data_corpus_inaugural_ger, n = 4)

## Corpus consisting of 21 documents, showing 4 documents:
##
##                   Text Types Tokens Sentences Year Chancellor Party
##  1949-Adenauer-CDU.txt  2136   7414       292 1949   Adenauer   CDU
##  1953-Adenauer-CDU.txt  2630   9708       403 1953   Adenauer   CDU
##  1957-Adenauer-CDU.txt  2202   7608       296 1957   Adenauer   CDU
##  1961-Adenauer-CDU.txt  2484   8796       402 1961   Adenauer   CDU
##
## Source: /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/workshops/modules/* on x86_64 by kbenoit
## Created: Tue Jun 25 20:54:23 2019
## Notes:
Exercise
1. Create better docnames by pasting "Year" and "Chancellor".
2. Select speeches by Gerhard Schröder and Angela Merkel.
3. Check ?textplot_xray().
4. Choose words that might only appear in some of the speeches.
Solution

docnames(data_corpus_inaugural_ger) <- paste(docvars(data_corpus_inaugural_ger, "Year"),
                                             docvars(data_corpus_inaugural_ger, "Chancellor"),
                                             sep = ", ")

## alternative
docnames(data_corpus_inaugural_ger) <- paste(data_inaugural_ger$Year,
                                             data_inaugural_ger$Chancellor,
                                             sep = " ")

corp_schroeder_merkel <- data_corpus_inaugural_ger %>%
  corpus_subset(Chancellor %in% c("Merkel", "Schroeder"))

plot_xray <- textplot_xray(kwic(corp_schroeder_merkel, "Arbeits*"),
                           kwic(corp_schroeder_merkel, "Arbeitslos*"),
                           kwic(corp_schroeder_merkel, "Migration*"))

Solution (cont.)

plot_xray
Tokenization

Tokenization of a corpus

corp_immig <- corpus(data_char_ukimmig2010)
toks_immig <- tokens(corp_immig)
head(toks_immig[[1]], 20)

##  [1] "IMMIGRATION"  ":"            "AN"           "UNPARALLELED"
##  [5] "CRISIS"       "WHICH"        "ONLY"         "THE"
##  [9] "BNP"          "CAN"          "SOLVE"        "."
## [13] "-"            "At"           "current"      "immigration"
## [17] "and"          "birth"        "rates"        ","
Remove punctuation

toks_nopunct <- tokens(data_char_ukimmig2010, remove_punct = TRUE)
head(toks_nopunct[[1]], 20)

##  [1] "IMMIGRATION"  "AN"           "UNPARALLELED" "CRISIS"
##  [5] "WHICH"        "ONLY"         "THE"          "BNP"
##  [9] "CAN"          "SOLVE"        "At"           "current"
## [13] "immigration"  "and"          "birth"        "rates"
## [17] "indigenous"   "British"      "people"       "are"

Remove stopwords

toks_nostopw <- tokens_select(toks_immig, stopwords("en"), selection = "remove")
# Note: equivalent to
# toks_nostopw <- tokens_remove(toks_immig, stopwords("en"))
head(toks_nostopw[[1]], 20)

##  [1] "IMMIGRATION"  ":"            "UNPARALLELED" "CRISIS"
##  [5] "BNP"          "CAN"          "SOLVE"        "."
##  [9] "-"            "current"      "immigration"  "birth"
## [13] "rates"        ","            "indigenous"   "British"
## [17] "people"       "set"          "become"       "minority"
Customize stopword list
Stopwords for other languages: check the stopwords package
Remove a feature from the stopword list

"will" %in% stopwords("en")

## [1] TRUE

my_stopwords_en <- stopwords("en")[!stopwords("en") %in% c("will")]
"will" %in% my_stopwords_en

## [1] FALSE
head(toks_immig[[1]], 20)

##  [1] "IMMIGRATION"  ":"            "AN"           "UNPARALLELED"
##  [5] "CRISIS"       "WHICH"        "ONLY"         "THE"
##  [9] "BNP"          "CAN"          "SOLVE"        "."
## [13] "-"            "At"           "current"      "immigration"
## [17] "and"          "birth"        "rates"        ","

head(toks_nopunct[[1]], 20)

##  [1] "IMMIGRATION"  "AN"           "UNPARALLELED" "CRISIS"
##  [5] "WHICH"        "ONLY"         "THE"          "BNP"
##  [9] "CAN"          "SOLVE"        "At"           "current"
## [13] "immigration"  "and"          "birth"        "rates"
## [17] "indigenous"   "British"      "people"       "are"

head(toks_nostopw[[1]], 20)

##  [1] "IMMIGRATION"  ":"            "UNPARALLELED" "CRISIS"
##  [5] "BNP"          "CAN"          "SOLVE"        "."
##  [9] "-"            "current"      "immigration"  "birth"
## [13] "rates"        ","            "indigenous"   "British"
## [17] "people"       "set"          "become"       "minority"
Select certain terms

toks_immig_select <- toks_immig %>%
  tokens_select(pattern = c("immig*", "migra*"), padding = TRUE)
head(toks_immig_select[[1]], 20)

##  [1] "IMMIGRATION" ""            ""            ""            ""
##  [6] ""            ""            ""            ""            ""
## [11] ""            ""            ""            ""            ""
## [16] "immigration" ""            ""            ""            ""

toks_immig_no_pad <- toks_immig %>%
  tokens_select(pattern = c("immig*", "migra*"), padding = FALSE)
head(toks_immig_no_pad[[1]], 20)

##  [1] "IMMIGRATION" "immigration" "immigration" "immigrants"  "immigration"
##  [6] "immigration" "immigration" "immigrants"  "immigrant"   "immigrant"
## [11] "immigrants"  "immigrants"  "immigration" "Immigration" "immigration"
## [16] "immigrants"  "Immigration" "Immigration" "Immigration" "immigrants"

Select certain terms and their context

# specify number of surrounding words using window
toks_immig_window <- tokens_select(toks_immig, pattern = c('immig*', 'migra*'),
                                   padding = TRUE, window = 5)
head(toks_immig_window[[1]], 20)

##  [1] "IMMIGRATION"  ":"            "AN"           "UNPARALLELED"
##  [5] "CRISIS"       "WHICH"        ""             ""
##  [9] ""             ""             "SOLVE"        "."
## [13] "-"            "At"           "current"      "immigration"
## [17] "and"          "birth"        "rates"        ","
Compound multiword expressions

kw_multi <- kwic(data_char_ukimmig2010,
                 phrase(c('asylum seeker*', 'british citizen*')))
head(kw_multi, 5)

## [BNP, 1724:1725]       the honour and benefit of | British citizenship | has gone to people who
## [BNP, 1958:1959] all illegal immigrants and bogus | asylum seekers      | , including their dependents .
## [BNP, 2159:2160]         region concerned . An '  | asylum seeker       | ' who has crossed dozens
## [BNP, 2192:2193]        country . Because every ' | asylum seeker       | ' in Britain has crossed
## [BNP, 2218:2219]    there are currently no legal  | asylum seekers      | in Britain today . It

Compound multiword expressions
Preserve these expressions in bag-of-words analysis:

toks_comp <- tokens(data_char_ukimmig2010) %>%
  tokens_compound(phrase(c('asylum seeker*', 'british citizen*')))

kw_comp <- kwic(toks_comp, c('asylum_seeker*', 'british_citizen*'))
head(kw_comp, 4)

## [BNP, 1724]       the honour and benefit of | British_citizenship | has gone to people who
## [BNP, 1957] all illegal immigrants and bogus | asylum_seekers      | , including their dependents .
## [BNP, 2157]         region concerned . An '  | asylum_seeker       | ' who has crossed dozens
## [BNP, 2189]        country . Because every ' | asylum_seeker       | ' in Britain has crossed
Exercise
1. Tokenize data_corpus_irishbudget2010 and compound the following party names: fianna fáil, fine gael, and sinn féin.
2. Select only the three party names and a window of +-10 words.
3. How can we extract only the full sentences in which at least one of the parties is mentioned?
Solution

# 1. Tokenize `data_corpus_irishbudget2010` and compound party names
toks_ire <- data_corpus_irishbudget2010 %>%
  tokens() %>%
  tokens_compound(phrase(c("fianna fáil", "fine gael", "sinn féin")))

nrow(kwic(toks_ire, "fianna_fáil"))

## [1] 66

nrow(kwic(toks_ire, phrase("fianna fáil")))

## [1] 0

Solution (cont.)

# 2. Select only the three party names and the window of +-10 words
toks_ire_select <- toks_ire %>%
  tokens_keep(c("fianna_fáil", "fine_gael", "sinn_féin"), window = 10)

head(toks_ire_select[[2]], 15)

##  [1] "happening"   "today"       "."           "It"          "is"
##  [6] "happening"   ","           "however"     ","           "because"
## [11] "Fianna_Fáil" "failed"      "to"          "heed"        "the"

Note: To get the sentences that mention one of the parties, call corpus_reshape(to = "sentences") first, repeat the subsequent steps, and set a very large window (e.g., 1000).

# 3. Extract only full sentences in which
#    at least one party is mentioned
toks_ire_sentences <- data_corpus_irishbudget2010 %>%
  corpus_reshape(to = "sentences") %>%  # <-- important step!
  tokens() %>%
  tokens_compound(phrase(c("fianna fáil", "fine gael", "sinn féin"))) %>%
  tokens_keep(c("fianna_fáil", "fine_gael", "sinn_féin"), window = 1000)
N-grams

# Unigrams
tokens("insurgents killed in ongoing fighting")

## tokens from 1 document.
## text1 :
## [1] "insurgents" "killed"     "in"         "ongoing"    "fighting"

# Bigrams
tokens("insurgents killed in ongoing fighting") %>%
  tokens_ngrams(n = 2)

## tokens from 1 document.
## text1 :
## [1] "insurgents_killed" "killed_in"         "in_ongoing"
## [4] "ongoing_fighting"

Skipgrams

# Skipgrams
tokens("insurgents killed in ongoing fighting") %>%
  tokens_skipgrams(n = 2, skip = 0:1)

## tokens from 1 document.
## text1 :
## [1] "insurgents_killed" "insurgents_in"     "killed_in"
## [4] "killed_ongoing"    "in_ongoing"        "in_fighting"
## [7] "ongoing_fighting"
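The bigram step can be mimicked in base R for a single tokenized document. This is an illustrative sketch of what n = 2 does, not the tokens_ngrams() implementation:

```r
toks <- c("insurgents", "killed", "in", "ongoing", "fighting")

# Pair each token with its successor and join with "_"
bigrams <- paste(head(toks, -1), tail(toks, -1), sep = "_")
bigrams
```

A skipgram with skip = 1 would additionally pair each token with the token two positions ahead (e.g. "insurgents_in"), which is why the skipgram output above is a superset of the bigram output.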
Look up tokens from a dictionary

toks <- tokens(data_char_ukimmig2010)
dict <- dictionary(list(refugee = c('refugee*', 'asylum*'),
                        worker = c('worker*', 'employee*')))
print(dict)

## Dictionary object with 2 key entries.
## - [refugee]:
##   - refugee*, asylum*
## - [worker]:
##   - worker*, employee*

dict_toks <- tokens_lookup(toks, dictionary = dict)
head(dict_toks, 2)

## tokens from 2 documents.
## BNP :
## [1] "refugee" "worker"  "refugee" "refugee" "refugee" "refugee" "refugee"
## [8] "refugee" "refugee" "refugee" "refugee" "refugee" "refugee" "worker"
##
## Coalition :
## [1] "refugee"

The transition from tokens() to dfm()

dfm(dict_toks)

## Document-feature matrix of: 9 documents, 2 features (38.9% sparse).
## 9 x 2 sparse Matrix of class "dfm"
##               features
## docs           refugee worker
##   BNP               12      2
##   Coalition          1      0
##   Conservative       0      0
##   Greens             4      1
##   Labour             4      4
##   LibDem             6      0
##   PC                 3      0
##   SNP                1      0
##   UKIP               6      0
Summary of tokens functions
tokens()
tokens_tolower() / tokens_toupper()
tokens_wordstem()
tokens_compound()
tokens_lookup()
tokens_ngrams()
tokens_skipgrams()
tokens_select() / tokens_remove() / tokens_keep()
tokens_replace()
tokens_sample()
tokens_subset()
Remember to use ? to read the manual and examples for each function.
Exercise
1. Tokenize data_corpus_irishbudget2010.
2. Convert the tokens object: remove punctuation, change to lowercase, remove stopwords, and stem.
3. Get the number of types and tokens per speech.

Solution

toks_ire <- data_corpus_irishbudget2010 %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  tokens_tolower() %>%
  tokens_wordstem()

ire_ntype <- ntype(toks_ire)
ire_ntoken <- ntoken(toks_ire)
Document-feature matrix

Exercise
1. Create a document-feature matrix from the tokens object above.
2. Get the 50 most frequent terms using topfeatures().

Solution

toks_ire <- tokens(data_corpus_irishbudget2010)
dfmat_ire <- dfm(toks_ire)  # dfm() transforms to lowercase by default
topfeatures(dfmat_ire)

##  the    .   to    ,   of  and   in    a   is that
## 3598 2371 1633 1548 1537 1359 1231 1013  868  804

## alternative approach without tokens() -- less control!
dfmat_ire2 <- data_corpus_irishbudget2010 %>%
  dfm(remove_punct = TRUE, remove = stopwords("en"), stem = TRUE)
topfeatures(dfmat_ire2)

##  peopl budget govern minist   year    tax public econom    cut
##    273    272    271    204    201    195    179    172    172
##    job
##    148
Select features based on frequencies

nfeat(dfmat_ire)

## [1] 5140

dfmat_ire_trim1 <- dfm_trim(dfmat_ire, min_termfreq = 5)
nfeat(dfmat_ire_trim1)

## [1] 1265

dfmat_ire_trim2 <- dfm_trim(dfmat_ire, min_docfreq = 5)
nfeat(dfmat_ire_trim2)

## [1] 864

dfmat_ire_trim3 <- dfm_trim(dfmat_ire, max_docfreq = 0.1, docfreq_type = "prop")
nfeat(dfmat_ire_trim3)

## [1] 2716
Exercise
1. Create a dfm from the two documents below, and weight it using dfm_weight().
2. For dfm_weight(), try out the scheme arguments "count", "boolean" and "prop".
3. What are the advantages and problems of each weighting scheme?

texts <- c("apple is better than banana",
           "banana banana apple much better")
Solution

dfmat_fruits <- dfm(texts)
dfm_weight(dfmat_fruits, scheme = "count")

## Document-feature matrix of: 2 documents, 6 features (25.0% sparse).
## 2 x 6 sparse Matrix of class "dfm"
##        features
## docs    apple is better than banana much
##   text1     1  1      1    1      1    0
##   text2     1  0      1    0      2    1

dfm_weight(dfmat_fruits, scheme = "boolean")

## Document-feature matrix of: 2 documents, 6 features (25.0% sparse).
## 2 x 6 sparse Matrix of class "dfm"
##        features
## docs    apple is better than banana much
##   text1     1  1      1    1      1    0
##   text2     1  0      1    0      1    1

Solution (cont.)

dfm_weight(dfmat_fruits, scheme = "prop")

## Document-feature matrix of: 2 documents, 6 features (25.0% sparse).
## 2 x 6 sparse Matrix of class "dfm"
##        features
## docs    apple  is better than banana much
##   text1   0.2 0.2    0.2  0.2    0.2    0
##   text2   0.2   0    0.2    0    0.4  0.2
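The three schemes amount to simple matrix transformations. Here is a base-R sketch on the count matrix from the solution above (illustrative only, not the dfm_weight() implementation):

```r
# Count matrix matching the "count" output above
m <- rbind(text1 = c(apple = 1, is = 1, better = 1, than = 1, banana = 1, much = 0),
           text2 = c(apple = 1, is = 0, better = 1, than = 0, banana = 2, much = 1))

boolean <- (m > 0) * 1     # presence/absence: any positive count becomes 1
prop    <- m / rowSums(m)  # row proportions: counts divided by document length
```

Boolean weighting discards how often a word occurs; proportions control for document length, so "banana" gets 0.4 in text2 even though both documents have the same raw count scale.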
Sentiment analysis

Why are we analyzing Irish budget speeches?

Sentiment analysis
Sentiment analysis using the Lexicoder Sentiment Dictionary (data_dictionary_LSD2015)

summary(data_dictionary_LSD2015)

##              Length Class  Mode
## negative     2858   -none- character
## positive     1709   -none- character
## neg_positive 1721   -none- character
## neg_negative 2860   -none- character

docvars(data_corpus_irishbudget2010, "gov_opp") <-
  ifelse(docvars(data_corpus_irishbudget2010, "party") %in% c("FF", "Green"),
         "Government", "Opposition")

# tokenize and apply dictionary
toks_dict <- data_corpus_irishbudget2010 %>%
  tokens() %>%
  tokens_lookup(dictionary = data_dictionary_LSD2015)

# transform to a dfm
dfmat_dict <- dfm(toks_dict)

Sentiment analysis

print(dfmat_dict)

## Document-feature matrix of: 14 documents, 4 features (12.5% sparse).
## 14 x 4 sparse Matrix of class "dfm"
##                            features
## docs                        negative positive neg_positive neg_negative
##   Lenihan, Brian (FF)            188      397            2            1
##   Bruton, Richard (FG)           163      147            5            2
##   Burton, Joan (LAB)             225      266            8            3
##   Morgan, Arthur (SF)            260      249            5            2
##   Cowen, Brian (FF)              150      368            1            2
##   Kenny, Enda (FG)               104      146            3            3
##   ODonnell, Kieran (FG)           49       84            4            0
##   Gilmore, Eamon (LAB)           164      176            3            0
##   Higgins, Michael (LAB)          37       42            1            0
##   Quinn, Ruairi (LAB)             34       40            1            1
##   Gormley, John (Green)           17       56            0            0
##   Ryan, Eamon (Green)             24       78            1            2
##   Cuffe, Ciaran (Green)           38       56            0            0
##   OCaolain, Caoimhghin (SF)      154      145            1            1
Convert to a data frame using convert()

dict_output <- convert(dfmat_dict, to = "data.frame")

Estimate sentiment

dict_output$sent_score <- log((dict_output$positive + dict_output$neg_negative + 0.5) /
                              (dict_output$negative + dict_output$neg_positive + 0.5))

dict_output <- cbind(dict_output, docvars(data_corpus_irishbudget2010))
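To see what the formula does, here it is applied to one hypothetical speech with counts of the kind shown in the dfm above; the 0.5 added to numerator and denominator is a smoothing constant that avoids taking log(0) or dividing by zero:

```r
# Hypothetical dictionary counts for a single speech
positive <- 397; negative <- 188; neg_positive <- 2; neg_negative <- 1

# Log-ratio of positive signals (positive + negated negatives)
# to negative signals (negative + negated positives)
sent_score <- log((positive + neg_negative + 0.5) /
                  (negative + neg_positive + 0.5))
sent_score  # > 0 means a net positive tone; < 0 a net negative tone
```

Because the score is a log-ratio, it is symmetric around zero: swapping the positive and negative counts flips the sign.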
Plotting with ggplot2
Specify the data (a data.frame)
Specify the aesthetics (x-axis, y-axis, colours, shapes, etc.)
Choose a geometric object (e.g. scatterplot, boxplot)

# discover the mtcars dataset
head(mtcars, 3)

##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

# specify data and axes
library(ggplot2)
myplot <- ggplot(data = mtcars, aes(x = mpg, y = hp))

myplot

myplot + geom_point()

myplot +
  geom_point(aes(colour = factor(cyl))) +
  scale_colour_discrete(name = "Cylinders") +
  scale_shape_discrete(name = "Cylinders") +
  labs(x = "Miles per gallon", y = "Horsepower")

Plot sentiment

tplot_sent <- ggplot(dict_output,
                     aes(x = reorder(name, sent_score), y = sent_score,
                         colour = gov_opp)) +
  geom_point(size = 3) +
  coord_flip() +
  labs(x = "Speaker", y = "Estimated sentiment")

tplot_sent
Textual statistics

Simple frequency analysis

library("ggplot2")
ire_dfm <- data_corpus_irishbudget2010 %>%
  dfm(remove = stopwords("en"), remove_punct = TRUE, remove_symbols = TRUE)

ire_freqplot <- ire_dfm %>%
  textstat_frequency(n = 15) %>%
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL, y = "Frequency")

ire_freqplot

Frequency analysis for groups

ire_dfm_group <- ire_dfm %>%
  dfm_group(groups = "party")

# plot a wordcloud (not recommended...)
set.seed(34)
textplot_wordcloud(ire_dfm_group, comparison = TRUE, max_words = 40)
Relative frequency analysis (keyness)
"Keyness" assigns scores to features that occur differentially across different categories

docvars(data_corpus_irishbudget2010, "gov_opp") <-
  ifelse(docvars(data_corpus_irishbudget2010, "party") %in% c("FF", "Green"),
         "Government", "Opposition")

# compare government to opposition parties by chi^2
dfmat_key <- data_corpus_irishbudget2010 %>%
  dfm(groups = "gov_opp", remove = stopwords("english"),
      remove_punct = TRUE, remove_symbols = TRUE)

tstat_key <- textstat_keyness(dfmat_key, target = "Opposition")
tplot_key <- textplot_keyness(tstat_key, margin = 0.2, n = 10)

tplot_key
Term frequency-inverse document frequency (tf-idf)
Example: We have 100 political party manifestos, each with 1000 words. The first document contains 16 instances of the word "environment"; 40 of the manifestos contain the word "environment".
The term frequency is 16.
The document frequency is 100/40 = 2.5, or log(2.5) = 0.398.
The tf-idf will then be 16 × 0.398 = 6.37.
If the word had appeared in only 15 of the 100 manifestos, then the tf-idf would be 13.18.
tf-idf filters out common terms
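The arithmetic of this example can be checked directly in R; note that the numbers above imply a base-10 logarithm for the inverse document frequency:

```r
tf <- 16    # term frequency of "environment" in the first manifesto
N  <- 100   # total number of manifestos
df <- 40    # manifestos containing "environment"

idf <- log10(N / df)   # inverse document frequency: log10(2.5) ~ 0.398
tf * idf               # tf-idf ~ 6.37

# if the word appeared in only 15 of the 100 manifestos:
tf * log10(N / 15)     # ~ 13.18
```

The rarer the term across documents, the larger the idf, so the same raw count of 16 gets a much higher weight when only 15 documents contain the word. A term appearing in every document gets idf = log10(1) = 0 and is weighted away entirely.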
Apply tf-idf weighting

dfmat_ire_tfidf <- data_corpus_irishbudget2010 %>%
  dfm(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  dfm_group(groups = "party") %>%
  dfm_tfidf()

Plot tf-idf

tstat_freq_tfidf <- dfmat_ire_tfidf %>%
  textstat_frequency(n = 10, groups = "party", force = TRUE)

# plot frequencies
tplot_tfidf <- ggplot(data = tstat_freq_tfidf,
                      aes(x = factor(nrow(tstat_freq_tfidf):1), y = frequency)) +
  geom_point() +
  facet_wrap(~ group, scales = "free_y") +
  coord_flip() +
  scale_x_discrete(breaks = factor(nrow(tstat_freq_tfidf):1),
                   labels = tstat_freq_tfidf$feature) +
  labs(x = NULL, y = "tf-idf")

tplot_tfidf
Collocation analysis: a string-of-words approach

tstat_col <- data_corpus_irishbudget2010 %>%
  tokens() %>%
  textstat_collocations(min_count = 20, tolower = TRUE)
head(tstat_col, 7)

##      collocation count count_nested length   lambda        z
## 1          it is   188            0      2 3.683338 36.89575
## 2        will be   142            0      2 4.132301 35.59424
## 3    this budget   107            0      2 4.295062 31.81840
## 4        we have   119            0      2 3.382721 29.50924
## 5      more than    56            0      2 6.297167 28.82909
## 6 social welfare    70            0      2 8.081143 28.82286
## 7         in the   347            0      2 1.855052 27.87367
Exercise
1. Repeat the step above, but remove stopwords, and stem the tokens object.
2. Compare the most frequent collocations. What has changed?
3. Optional: Conduct a collocation analysis for the German inaugural speeches. Does it work well?

Solution

toks_ire_adjusted <- data_corpus_irishbudget2010 %>%
  tokens() %>%
  tokens_remove(pattern = stopwords("en")) %>%
  tokens_wordstem()

tstat_col_adjusted <- toks_ire_adjusted %>%
  textstat_collocations(min_count = 5, tolower = FALSE)
head(tstat_col_adjusted, 7)

##     collocation count count_nested length   lambda        z
## 1 public servic    68            0      2 5.467132 26.76130
## 2 social welfar    65            0      2 7.168046 25.84665
## 3 child benefit    36            0      2 6.875431 21.50678
## 4      per week    25            0      2 5.828731 19.40465
## 5     next year    34            0      2 5.200994 18.86080
## 6 public sector    30            0      2 4.089028 17.70144
## 7  Labour Parti    23            0      2 7.486375 17.12913

nrow(tstat_col_adjusted)

## [1] 224
Compound collocations
Before compounding

toks_ire <- toks_ire_adjusted
kwic(toks_ire, phrase("Green Party")) %>% head(4)

## [1] pattern
## <0 rows> (or 0-length row.names)

Compound collocations
Compound based on collocations

# compound based on collocations
toks_ire_comp <- toks_ire %>%
  tokens_compound(tstat_col_adjusted[tstat_col_adjusted$z > 3])

After compounding

nrow(kwic(toks_ire_comp, phrase("Fianna Fáil")))

## [1] 0

nrow(kwic(toks_ire_comp, phrase("Fianna_Fáil*")))

## [1] 70

Compound collocations
Check compounded words

toks_ire_comp %>%
  tokens_keep(pattern = "*_*") %>%
  dfm() %>%
  topfeatures()

##   fianna_fáil public_servic social_welfar     next_year child_benefit
##            61            47            41            38            30
## minist_financ      per_week public_sector   young_peopl   €_4_billion
##            30            25            25            23            20
Similarity between texts
Cosine similarity

(dfmat <- dfm(c("X Y Y Z Z Z", "X X X X Y Y Y Y Y Z Z Z Z Z Z")))

## Document-feature matrix of: 2 documents, 3 features (0.0% sparse).
## 2 x 3 sparse Matrix of class "dfm"
##        features
## docs    x y z
##   text1 1 2 3
##   text2 4 5 6

textstat_simil(dfmat, method = "cosine")

## textstat_simil object; method = "cosine"
##       text1 text2
## text1 1.000 0.975

cos(A, B) = (A · B) / (||A|| · ||B||) = Σᵢ AᵢBᵢ / (√(Σᵢ Aᵢ²) · √(Σᵢ Bᵢ²))
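For the two toy documents, the formula can be verified in base R as a check on the 0.975 reported above (this is not the textstat_simil() implementation):

```r
a <- c(x = 1, y = 2, z = 3)   # feature counts for text1
b <- c(x = 4, y = 5, z = 6)   # feature counts for text2

# dot product divided by the product of the vector norms
cos_sim <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
round(cos_sim, 3)
```

Because cosine similarity normalizes by vector length, the two documents score near 1 despite text2 being much longer: the counts point in almost the same direction.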
Cosine similarity

library(quanteda)
dfmat <- dfm(c("I use this sentence almost twice.",
               "I include this almost sentence twice.",
               "And here we have irrelevant content."))

tstat_simil <- textstat_simil(dfmat, method = "cosine")
tstat_simil

## textstat_simil object; method = "cosine"
##       text1 text2 text3
## text1 1.000 0.857 0.143
## text2 0.857 1.000 0.143
## text3 0.143 0.143 1.000

Why is the similarity between text1/text2 and text3 not 0?

BONUS: Transform matrix into dyadic data frame

tstat_simil_matrix <- as.matrix(tstat_simil)
tstat_simil_matrix[lower.tri(tstat_simil_matrix)] <- NA

tstat_simil_df <- tstat_simil_matrix %>%
  reshape2::melt() %>%
  subset(Var1 != Var2) %>%
  subset(!is.na(value))

tstat_simil_df

##    Var1  Var2     value
## 4 text1 text2 0.8571429
## 7 text1 text3 0.1428571
## 8 text2 text3 0.1428571
Exercise

1. Create a dfm from data_corpus_irishbudget2010.
2. Group it by party and run textstat_simil().
3. Which parties are most similar?
4. Do the substantive conclusions change when removing stopwords and punctuation, and stemming the dfm?
119
Solution

# similarities for documents
tstat_simil1 <- data_corpus_irishbudget2010 %>%
  dfm() %>%
  dfm_group(groups = "party") %>%
  textstat_simil(method = "cosine", margin = "documents")

tstat_simil1

## textstat_simil object; method = "cosine"
##          FF    FG Green   LAB    SF
## FF    1.000 0.954 0.945 0.963 0.968
## FG    0.954 1.000 0.950 0.983 0.979
## Green 0.945 0.950 1.000 0.951 0.938
## LAB   0.963 0.983 0.951 1.000 0.980
## SF    0.968 0.979 0.938 0.980 1.000
120
Solution

tstat_simil2 <- data_corpus_irishbudget2010 %>%
  dfm(remove = stopwords("en"), stem = TRUE, remove_punct = TRUE) %>%
  dfm_group(groups = "party") %>%
  textstat_simil(method = "cosine", margin = "documents")

tstat_simil2

## textstat_simil object; method = "cosine"
##          FF    FG Green   LAB    SF
## FF    1.000 0.621 0.657 0.666 0.695
## FG    0.621 1.000 0.645 0.795 0.791
## Green 0.657 0.645 1.000 0.655 0.651
## LAB   0.666 0.795 0.655 1.000 0.810
## SF    0.695 0.791 0.651 0.810 1.000
121
Lexical diversity

dfmat_ire <- data_corpus_irishbudget2010 %>%
  dfm(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)

# estimate Type-Token Ratio
tstat_lexdiv <- textstat_lexdiv(dfmat_ire, measure = "TTR")

df_lexdiv <- cbind(tstat_lexdiv, docvars(dfmat_ire))

head(df_lexdiv, 4)

##                                  document       TTR year debate number
## Lenihan, Brian (FF)   Lenihan, Brian (FF) 0.2142487 2010 BUDGET     01
## Bruton, Richard (FG) Bruton, Richard (FG) 0.2364312 2010 BUDGET     02
## Burton, Joan (LAB)     Burton, Joan (LAB) 0.2583187 2010 BUDGET     03
## Morgan, Arthur (SF)   Morgan, Arthur (SF) 0.2265236 2010 BUDGET     04
##                        foren    name party    gov_opp
## Lenihan, Brian (FF)    Brian Lenihan    FF Government
## Bruton, Richard (FG) Richard  Bruton    FG Opposition
## Burton, Joan (LAB)      Joan  Burton   LAB Opposition
## Morgan, Arthur (SF)   Arthur  Morgan    SF Opposition
122
Plot Type-Token Ratio

tplot_ttr <- ggplot(df_lexdiv, aes(x = reorder(document, TTR), y = TTR)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL)
123
tplot_ttr
124
Readability

tstat_read <- data_corpus_inaugural %>%
  textstat_readability(measure = c("Flesch.Kincaid", "Dale.Chall.old"))

tplot_read <- ggplot(tstat_read, aes(x = Flesch.Kincaid, y = Dale.Chall.old)) +
  geom_smooth() +
  geom_point()
125
Plot relationship between readability measures

tplot_read
126
Correlation between readability measures

cor.test(tstat_read$Flesch.Kincaid, tstat_read$Dale.Chall.old)

##
##  Pearson's product-moment correlation
##
## data:  tstat_read$Flesch.Kincaid and tstat_read$Dale.Chall.old
## t = 19.714, df = 56, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8920247 0.9611137
## sample estimates:
##       cor
## 0.9349112
127
Readability of inaugural speeches since 1945

tstat_read_subset <- data_corpus_inaugural %>%
  corpus_subset(Year > 1948) %>%
  textstat_readability(measure = "Flesch.Kincaid")

tplot_read_us <- ggplot(tstat_read_subset,
                        aes(x = reorder(document, Flesch.Kincaid),
                            y = Flesch.Kincaid)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL, y = "Readability (Flesch-Kincaid)")
128
Exercise: Wrapping it all up

1. Use data_corpus_guardian from the quanteda.corpora package: data_corpus_guardian <- quanteda.corpora::download('data_corpus_guardian').
2. Create a tokens object and remove stopwords and punctuation.
3. Keep only the term "Brexit*" and a window of 5 words.
4. Use fcm() to create a feature co-occurrence matrix.
5. Use textplot_network() to plot a network of co-occurrences of the 40 most frequent terms.
130
Solution

Import the Guardian corpus using quanteda.corpora's download() function.

data_corpus_guardian <- download('data_corpus_guardian')

toks_brexit <- data_corpus_guardian %>%
  tokens(remove_punct = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(stopwords("en"), padding = FALSE) %>%
  tokens_keep(pattern = "Brexit*", window = 5, case_insensitive = TRUE) %>%
  tokens_tolower()

fcmat_brexit <- toks_brexit %>% fcm(context = "window")

feat <- names(topfeatures(fcmat_brexit, 40))

fcmat_brexit_subset <- fcmat_brexit %>% fcm_select(feat)

tplot_brexit <- textplot_network(fcmat_brexit_subset)

## Registered S3 method overwritten by 'network':
##   method            from
##   summary.character quanteda

131

set.seed(134)
tplot_brexit
132
Supervised text classification

Supervised classification: separate documents into pre-existing categories

We need:

1. A hand-coded (labelled) dataset, to be split into a training set (used to train the classifier) and a validation/test set (used to validate the classifier)
2. A method to extrapolate from the hand-coding to unlabelled documents (classifier): Naive Bayes, regularized regression, SVM, k-nearest neighbours, ensemble methods
3. An approach to validate the classifier: cross-validation
4. A performance metric to choose the best classifier and avoid overfitting: confusion matrix, accuracy, precision, recall, F1 score
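A minimal base-R sketch of step 1, splitting labelled documents into training and test sets. The labels and the 80/20 split are illustrative assumptions, not part of the workflow shown later:

```r
set.seed(42)
labels <- c(rep("pos", 50), rep("neg", 50))   # hypothetical hand-coded labels

# draw 80% of document indices for training; the remainder forms the test set
train_idx <- sample(seq_along(labels), size = 0.8 * length(labels))
test_idx  <- setdiff(seq_along(labels), train_idx)

length(train_idx)  # 80 training documents
length(test_idx)   # 20 held-out test documents
```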
134
Goals of supervised classification

1. Flexibility
2. Efficiency
3. Interpretability
135
Unsupervised classification

Unsupervised methods scale documents based on patterns of similarity from the term-document matrix, without requiring a training step

Examples: Wordfish, topic models (to be covered soon)

Relative advantage: you do not have to know the dimension being scaled (also a disadvantage!)
136
Basic principles of supervised learning

1. Generalisation: a classifier learns to correctly predict output from given inputs not only in previously seen samples but also in previously unseen samples
2. Overfitting: a classifier learns to correctly predict output from given inputs in previously seen samples but fails to do so in previously unseen samples. This causes poor prediction/generalisation.
137
How to assess the performance of a classifier?
138
Important measures

Accuracy = correctly classified / total number of cases
         = (true positives + true negatives) / total number of cases

Precision = true positives / (true positives + false positives)

Recall = true positives / (true positives + false negatives)

F1 score = (2 × Precision × Recall) / (Precision + Recall)
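These measures are easy to compute directly from confusion-matrix counts. The counts below are taken from the Naive Bayes inaugural-speech example later in these slides (tp = 6, fp = 0, fn = 3, tn = 14):

```r
# confusion-matrix counts (from Example 2 later in the deck)
tp <- 6; fp <- 0; fn <- 3; tn <- 14

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1), 3)
##  accuracy precision    recall        f1
##     0.870     1.000     0.667     0.800
```

These values agree with the Precision, Recall, and F1 figures that caret::confusionMatrix() reports in that example.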
139
How do we get a labelled training set?

1. External sources of annotation
2. Expert annotation
3. Crowd-sourced coding

Wisdom of crowds -- aggregated judgments by non-experts converge to the judgments of experts, depending on the difficulty of the classification task (Benoit et al. 2016)

Platforms: Figure Eight (previously called CrowdFlower) or Amazon's MTurk
140
Naive Bayes: Probabilistic learning

Intuition: if we observe the term "fantastic" in a text, how likely is it that this text is a positive review?

1. Determine the frequency of the term in positive and negative reviews (prior).
2. Assess the probability of features given a particular class.
3. Get the probability of a document belonging to each class (posterior).
4. Which posterior is highest?
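A toy illustration of these steps in base R. The prior and the word frequencies are invented for the example, not estimated from any corpus:

```r
# hypothetical training quantities
p_pos <- 0.5          # prior: share of positive reviews
p_neg <- 0.5          # prior: share of negative reviews
p_fant_pos <- 0.09    # P("fantastic" | positive)
p_fant_neg <- 0.01    # P("fantastic" | negative)

# posterior via Bayes' rule: P(positive | "fantastic")
posterior_pos <- p_fant_pos * p_pos /
  (p_fant_pos * p_pos + p_fant_neg * p_neg)
posterior_pos
## [1] 0.9
```

Because "fantastic" is nine times more frequent in positive reviews, the posterior strongly favours the positive class.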
141
Naive Bayes

Advantages

Simple, fast, effective

Relatively small training set required (if classes are not too imbalanced)

Easy to obtain probabilities

Disadvantages

The assumption of conditional independence is problematic

If a feature is not in the training set, it is disregarded for the classification
142
Naive Bayes in quanteda: Example 1

# 1. get training set
dfmat_train <- dfm(c("positive bad negative horrible", "great fantastic nice"))
class <- c("neg", "pos")

# 2. train model
tmod_nb <- textmodel_nb(dfmat_train, class)

# 3. get unlabelled test set
dfmat_test <- dfm(c("bad horrible negative awful", "nice bad great", "great fantastic"))

# 4. predict class
predict(tmod_nb, dfmat_test, force = TRUE)

## Warning: 1 feature in newdata not used in prediction.

## text1 text2 text3
##   neg   pos   pos
## Levels: neg pos
143
Naive Bayes in quanteda: Example 2

Goal: predict whether a US inaugural speech was delivered before or after 1945

# create unique ID
docvars(data_corpus_inaugural, "id") <- 1:ndoc(data_corpus_inaugural)

# create dummy (before/after 1945)
docvars(data_corpus_inaugural, "pre_post1945") <-
  ifelse(docvars(data_corpus_inaugural, "Year") > 1945, "Post1945", "Pre1945")

# sample 35 speeches
set.seed(123)
speeches_select <- sample(1:58, size = 35, replace = FALSE)

# create training and test sets
dfmat_train <- data_corpus_inaugural %>%
  corpus_subset(id %in% speeches_select) %>%   # speech in id
  dfm()

dfmat_test <- data_corpus_inaugural %>%
  corpus_subset(!id %in% speeches_select) %>%  # speech not in id
  dfm()
144
Naive Bayes in quanteda: Example 2

# number of documents in training set
ndoc(dfmat_train)
## [1] 35

# train Naive Bayes model
tmod_nb <- textmodel_nb(dfmat_train, y = docvars(dfmat_train, "pre_post1945"))

# number of documents in test set
ndoc(dfmat_test)
## [1] 23

# predict class of held-out test set
tmod_nb_predict <- predict(tmod_nb, newdata = dfmat_test, force = TRUE)  # set force to TRUE or use dfm_match

## Warning: 1268 features in newdata not used in prediction.
145
Naive Bayes in quanteda: Example 2

# create cross-table
tab <- table(actual = docvars(dfmat_test, "pre_post1945"),
             predicted = tmod_nb_predict)

print(tab)

##           predicted
## actual     Post1945 Pre1945
##   Post1945        6       0
##   Pre1945         3      14
146
Naive Bayes in quanteda: Example 2

Assess accuracy using the caret package

library(caret)
performance <- caret::confusionMatrix(tab)

# get measures of classification performance
performance$byClass

##          Sensitivity          Specificity       Pos Pred Value
##            0.6666667            1.0000000            1.0000000
##       Neg Pred Value            Precision               Recall
##            0.8235294            1.0000000            0.6666667
##                   F1           Prevalence       Detection Rate
##            0.8000000            0.3913043            0.2608696
## Detection Prevalence    Balanced Accuracy
##            0.2608696            0.8333333
147
Topic models

Required packages: topicmodels and stm

install.packages("stm")
install.packages("topicmodels")

packageVersion("topicmodels")
## [1] '0.2.8'

packageVersion("stm")
## [1] '1.3.3'
149
What are topic models?

An algorithm to find the most important "themes/frames/topics" in a corpus

No further information or training data is necessary (an entirely unsupervised method)

The researcher only needs to specify the number of topics (more difficult than it sounds!)
150
Example of a study using topic models

Catalinac (2016)

Amy Catalinac (2016). "From Pork to Policy: The Rise of Programmatic Campaigning in Japanese Elections." The Journal of Politics 78(1): 1-18.

Japanese electoral system

Prior to 1994: single non-transferable vote in multi-member districts (SNTV-MMD)

Since 1994: mixed-member majoritarian (MMM) electoral system

Expectations:

More pork-barrel politics under SNTV-MMD

Under SNTV-MMD, candidates facing higher levels of intraparty competition relied more on pork-barrel politics
152
Catalinac (2016): Research design

A Japanese candidate manifesto
153
Catalinac (2016): Results
154
Catalinac (2016): Results
155
Catalinac (2016): Results
156
Assumptions

Each document consists of a mixture of topics (drawn from a Dirichlet distribution)

Each topic is correlated more strongly with certain terms

Most topic models rely on a bag-of-words assumption: the order of words within a document is irrelevant
157
Assumptions
158
Example from Blei (2012)
159
Latent Dirichlet Allocation

The most common form of topic modelling is Latent Dirichlet Allocation, or LDA. LDA works as follows:

1. Specify K (the number of topics).
2. Each word in the dfm is assigned randomly to one of the topics (involves a Dirichlet distribution). The numbers assigned across the topics add up to 1.
3. Topic assignments for each word are updated in an iterative fashion by updating the prevalence of the word across the topics, as well as the prevalence of the topics in the document.
4. Assignment stops after a user-specified threshold, or when iterations begin to have little impact on the probabilities assigned to words in the corpus.
5. Output:
   1. The words most frequently associated with each of the topics.
   2. The probability that each document within the corpus is associated with each of the topics. Probabilities add up to 1.

Source: Tutorial by Chris Bail.
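A minimal sketch of these steps on a toy corpus, using the topicmodels package; the example texts and K = 2 are arbitrary choices for illustration. quanteda's convert() bridges a dfm to the document-term format that topicmodels expects:

```r
library(quanteda)
library(topicmodels)

# toy corpus with two obvious themes (politics vs. film)
dfmat_toy <- dfm(c("tax budget spending tax vote",
                   "film actor film cinema movie",
                   "vote election tax budget"))

# fit a 2-topic LDA model; seed fixed for reproducibility
lda_fit <- LDA(convert(dfmat_toy, to = "topicmodels"), k = 2,
               control = list(seed = 123))

terms(lda_fit, 3)   # top 3 terms per topic
topics(lda_fit)     # most likely topic per document
```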
160
Example: CMU 2008 Political Blog Corpus

URL: http://www.sailing.cs.cmu.edu/main/?page_id=710

# load packages
library(quanteda)
library(readtext)
library(stm)

# url of blog posts
url_blogs <- "https://uclspp.github.io/datasets/data/poliblogs2008.zip"

# download blog posts
dat_blogs <- readtext(url_blogs)

# create a corpus
corp_blogs <- corpus(dat_blogs, text_field = "documents")

# sample 1,500 documents
set.seed(134)
corp_blogs_sample <- corpus_sample(corp_blogs, size = 1500)
161
STM example: create corpus

summary(corp_blogs_sample, 6)

## Corpus consisting of 1500 documents, showing 6 documents:
##
##                     Text Types Tokens Sentences  text           docname
##  poliblogs2008.csv.13132   162    322        11 13132 tpm0834400_0.text
##   poliblogs2008.csv.8362   196    324        10  8362 ha0831900_11.text
##   poliblogs2008.csv.8618   246    449        14  8618  ha0834700_3.text
##   poliblogs2008.csv.8344   193    530         8  8344  at0801200_4.text
##   poliblogs2008.csv.7206   169    266        11  7206  ha0821900_5.text
##  poliblogs2008.csv.11124   416    721        27 11124  tp0829400_9.text
##        rating day blog
##       Liberal 344  tpm
##  Conservative 319   ha
##  Conservative 347   ha
##  Conservative  12   at
##  Conservative 219   ha
##       Liberal 294   tp
##
## Source: /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/workshops/modules/* on x86_64 by kbenoit
## Created: Tue Jun 25 20:54:56 2019
## Notes:
162
STM example: create dfm and remove features

dfmat_blogs <- dfm(corp_blogs_sample,
                   stem = TRUE,
                   remove = stopwords("english"),
                   remove_punct = TRUE,
                   remove_numbers = TRUE)

dfmat_blogs_trim <- dfm_trim(dfmat_blogs, min_termfreq = 10, min_docfreq = 5)
163
STM example: convert to stm

# convert dfm to stm format
stm_blogs <- convert(dfmat_blogs_trim, to = "stm", docvars = docvars(dfmat_blogs))
164
Running the STM

Use the stm() function. K specifies the number of topics.

With K = 0 the stm package selects the number of topics automatically (based on an algorithm developed by Lee and Mimno (2014))

Specify covariates with prevalence
165
Running the STM

# run structural topic model
stm_object <- stm(documents = stm_blogs$documents,
                  vocab = stm_blogs$vocab,
                  data = stm_blogs$meta,
                  prevalence = ~ rating + s(day),
                  K = 20,
                  seed = 12345)

## Beginning Spectral Initialization
##   Calculating the gram matrix...
##   Finding anchor words...
##   ....................
##   Recovering initialization...
##   ........................................
## Initialization complete.
## ...................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 1 (approx. per word bound = -7.352)
## ...................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 2 (approx. per word bound = -7.136, relative change = 2.942e-02)
## ...................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 3 (approx. per word bound = -7.098, relative change = 5.204e-03)
166
plot(stm_object)
167
Estimate effects

# pick and label topics of interest
topics_of_interest <- c(3, 6, 16, 18)
topic_labels <- c("Obama", "Financial Crisis", "Bush", "Iraq")
168
Identify topics of interest

labelTopics(stm_object, c(3, 6))

## Topic 3 Top Words:
##  Highest Prob: media, say, think, peopl, go, know, get
##  FREX: media, matthew, rush, fox, beck, chris, guy
##  Lift: bachmann, limbaugh, beck, hardbal, matthew, luntz, o'reilli
##  Score: bachmann, beck, matthew, limbaugh, clinton, hillari, o'reilli
## Topic 6 Top Words:
##  Highest Prob: time, report, year, new, news, global, warm
##  FREX: warm, climat, ice, global, coverag, newspap, ap
##  Lift: sheet, sulzberg, temperatur, solar, warm, murdoch, pinch
##  Score: sheet, climat, ice, warm, temperatur, sulzberg, solar
169
Identify topics of interest

labelTopics(stm_object, c(16, 18))

## Topic 16 Top Words:
##  Highest Prob: bush, iraq, war, presid, administr, said, year
##  FREX: bush, saddam, war, hussein, iraq, georg, administr
##  Lift: gonzal, wmd, saddam, shoe, hussein, powel, colin
##  Score: bush, iraq, gonzal, saddam, powel, iraqi, rice
## Topic 18 Top Words:
##  Highest Prob: one, like, can, time, just, think, even
##  FREX: film, love, movi, dowd, rememb, allen, exit
##  Lift: pole, filmmak, film, movi, documentari, tuzla, ala
##  Score: pole, film, dowd, movi, allen, mom, hollywood
170
Meanings of Output

Highest Probability: words with the highest probability of occurring in the topic

FREX: "Frequency and Exclusivity" -- words that distinguish the topic from all other topics

Lift & Score: indicators from the lda/maptpx packages
171
Find representative documents of a topic

# note: output too long for slide
findThoughts(stm_object, texts(corp_blogs_sample), topics = topics_of_interest, n = 3)
172
Estimate effect of covariates

# estimate effect
stm_effects <- estimateEffect(topics_of_interest ~ rating + s(day),
                              stmobj = stm_object,
                              meta = stm_blogs$meta,
                              uncertainty = "Global")

summary(stm_effects)
173
Plot influence of discrete variable

plot(stm_effects,
     covariate = "rating",
     topics = topics_of_interest,
     model = stm_object,
     method = "difference",
     cov.value1 = "Liberal",
     cov.value2 = "Conservative",
     xlim = c(-0.1, 0.1),
     labeltype = "custom",
     custom.labels = topic_labels,  # pay attention to labelling!
     main = "Effect of Conservative vs. Liberal",
     xlab = "More Conservative ... More Liberal")

plot_effect_libcon <- recordPlot()
plot.new()
174
plot_effect_libcon
175
Plot effect of continuous variable

plot(stm_effects,
     covariate = "day",
     method = "continuous",
     topics = topics_of_interest,
     model = stm_object,
     labeltype = "custom",
     custom.labels = topic_labels,
     xaxt = "n",
     xlab = "Time",
     main = "Topic Prevalence Over Time")

# set the x-axis labels as month names
blog_dates <- seq(from = as.Date("2008-01-01"), to = as.Date("2008-12-01"), by = "month")
axis(1, blog_dates - min(blog_dates), labels = months(blog_dates))

plot_effect_time <- recordPlot()
plot.new()
176
plot_effect_time
177
Create (somewhat) better/tidier plots

tidystm: https://github.com/mikajoh/tidystm

stminsights: https://github.com/cschwem2er/stminsights
178
References and useful descriptions

Margaret Roberts (2017): Structural Topic Models

Chris Bail (2018): Topic Modelling
179
Install packages

We will work with the dplyr and ggplot2 packages.

install.packages("dplyr")
install.packages("ggplot2")

Having installed the packages, we need to load them in the R environment.

library(dplyr)
library(ggplot2)
180
Gapminder: our sample dataset

We will use the gapminder dataset throughout this part of the course.

install.packages("gapminder")

library(gapminder)
181
Getting to know a dataset

Understand the structure and variable coding of a dataset

str(gapminder)

## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num 28.8 30.3 32 34 36.1 ...
##  $ pop      : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 ...
##  $ gdpPercap: num 779 821 853 836 740 ...
182
Print the first six rows

head(gapminder)

## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

We can find out more information on the data using ?gapminder.
183
Data wrangling with dplyr

Important functions:

select: pick columns by name

filter: keep rows matching criteria

arrange: reorder rows

mutate: add new variables

group_by: group rows by category

summarise: reduce variables to values

Data visualisation with ggplot2
184
Data wrangling: Sample data frame

(df <- data.frame(colour = c("green", "black", "black", "green", "black", "red"),
                  value = 1:6))

##   colour value
## 1  green     1
## 2  black     2
## 3  black     3
## 4  green     4
## 5  black     5
## 6    red     6
185
Filter/subset a data frame

filter(df, colour == "green")

##   colour value
## 1  green     1
## 2  green     4
186
Applying the pipe (%>%) operator

df %>%
  filter(colour != "red") %>%
  filter(value >= 5)

##   colour value
## 1  black     5

df %>% filter(colour != "red" & value >= 5)

##   colour value
## 1  black     5
187
Exercise

1. Load the gapminder dataset (library(gapminder)).
2. Type ?gapminder and View(gapminder) to inspect the data.
3. Filter only observations from Norway (call the new object dat_nor).
4. Filter only observations from Europe or Africa (gap_europe_africa).
188
Solution

# load gapminder data
library(gapminder)

# variable names
names(gapminder)

## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"

# number of rows and columns
dim(gapminder)

## [1] 1704    6
189
Solution

dat_nor <- gapminder %>% filter(country == "Norway")

nrow(dat_nor)
## [1] 12

gap_europe_africa <- gapminder %>% filter(continent %in% c("Europe", "Africa"))
190
Data visualisation with ggplot2

Basic structure:

1. Call ggplot()
2. Specify data with data
3. Specify the coordinate system/axes with aes()

plot1 <- ggplot(data = gapminder, aes(x = year, y = lifeExp))
191
plot1
192
plot1 + geom_point()
193
Hands-on Exercise

1. Using gapminder, create a data frame with only observations from Europe or Africa in the year 2007.
2. Create a boxplot with continent on the x-axis and lifeExp on the y-axis.
194
Solution

dat_europe_africa_2007 <- gapminder %>%
  filter(continent %in% c("Europe", "Africa") & year == 2007)

plot_boxplot <- ggplot(dat_europe_africa_2007, aes(x = continent, y = lifeExp)) +
  geom_boxplot()
195
plot_boxplot
196
plot_boxplot + labs(x = "Continent", y = "Life Expectancy in 2007")
197
General advice

Save plots as PDF (a vector graphic and a small file):

Set the ggplot2 theme globally:

# example for saving a plot
ggsave("boxplot.pdf", plot_boxplot, width = 5, height = 7)

# example for setting a theme
ggplot2::theme_set(theme_minimal())
198
Density plots

plot_density <- ggplot(dat_europe_africa_2007, aes(x = lifeExp, fill = continent)) +
  geom_density(alpha = 0.7) +
  labs(x = "Life Expectancy in 2007")
199
plot_density
200
plot_density +
  scale_x_continuous(limits = c(0, 100)) +
  scale_fill_discrete(name = "Continent") +
  theme(legend.position = "bottom")
201
Boxplots with groups

gapminder_post_1980 <- gapminder %>%
  filter(year > 1980 & continent != "Oceania")

plot_boxes <- ggplot(data = gapminder_post_1980, aes(x = as.factor(year), y = lifeExp)) +
  geom_boxplot() +
  labs(x = "Year", y = "Life expectancy")
202
plot_boxes
203
Add facets for a factor variable

plot_boxes + facet_wrap(~ continent)
204
Arrange/reorder a dataset

gap_arranged <- gapminder %>% arrange(lifeExp)  # ascending

head(gap_arranged)

## # A tibble: 6 x 6
##   country      continent  year lifeExp     pop gdpPercap
##   <fct>        <fct>     <int>   <dbl>   <int>     <dbl>
## 1 Rwanda       Africa     1992    23.6 7290203      737.
## 2 Afghanistan  Asia       1952    28.8 8425333      779.
## 3 Gambia       Africa     1952    30    284320      485.
## 4 Angola       Africa     1952    30.0 4232095     3521.
## 5 Sierra Leone Africa     1952    30.3 2143249      880.
## 6 Afghanistan  Asia       1957    30.3 9240934      821.
205
Arrange dataset

gap_arranged_desc <- gapminder %>% arrange(-lifeExp)  # descending

head(gap_arranged_desc)

## # A tibble: 6 x 6
##   country          continent  year lifeExp       pop gdpPercap
##   <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
## 1 Japan            Asia       2007    82.6 127467972    31656.
## 2 Hong Kong, China Asia       2007    82.2   6980412    39725.
## 3 Japan            Asia       2002    82   127065841    28605.
## 4 Iceland          Europe     2007    81.8    301931    36181.
## 5 Switzerland      Europe     2007    81.7   7554661    37506.
## 6 Hong Kong, China Asia       2002    81.5   6762476    30209.
206
Select variables

gapminder_select <- gapminder %>% select(year, country, lifeExp)

head(gapminder_select, 2)

## # A tibble: 2 x 3
##    year country     lifeExp
##   <int> <fct>         <dbl>
## 1  1952 Afghanistan    28.8
## 2  1957 Afghanistan    30.3
207
Reorder variables within a data frame

gapminder_reordered <- gapminder %>% select(pop, year, everything())

head(gapminder_reordered, 2)

## # A tibble: 2 x 6
##       pop  year country     continent lifeExp gdpPercap
##     <int> <int> <fct>       <fct>       <dbl>     <dbl>
## 1 8425333  1952 Afghanistan Asia         28.8      779.
## 2 9240934  1957 Afghanistan Asia         30.3      821.
208
Exercise

1. Select country, continent, year, lifeExp and pop.
2. Arrange gapminder ascending by year and, within year, descending by population.
209
Solution

gapminder_arranged <- gapminder %>%
  select(-gdpPercap) %>%
  arrange(year, -pop)

# check
View(gapminder_arranged)
210
Group dataset and summarise within groups

df

##   colour value
## 1  green     1
## 2  black     2
## 3  black     3
## 4  green     4
## 5  black     5
## 6    red     6

# get sum of value variable
df %>% summarise(sum = sum(value))

##   sum
## 1  21
211
Add grouping variable

df %>%
  group_by(colour) %>%
  summarise(sum = sum(value))

## # A tibble: 3 x 2
##   colour   sum
##   <fct>  <int>
## 1 black     10
## 2 green      5
## 3 red        6
212
Group and summarise data frame

df %>%
  group_by(colour) %>%
  summarise(total = n(), sum = sum(value))

## # A tibble: 3 x 3
##   colour total   sum
##   <fct>  <int> <int>
## 1 black      3    10
## 2 green      2     5
## 3 red        1     6
213
Exercise

1. Use gapminder.
2. Group by year.
3. Calculate the average life expectancy and the maximum GDP per capita per year.
214
Solution

data_gap_summarised <- gapminder %>%
  group_by(year) %>%
  summarise(life_exp_mean = mean(lifeExp), gdp_max = max(gdpPercap))

data_gap_summarised

## # A tibble: 12 x 3
##     year life_exp_mean gdp_max
##    <int>         <dbl>   <dbl>
##  1  1952          49.1 108382.
##  2  1957          51.5 113523.
##  3  1962          53.6  95458.
##  4  1967          55.7  80895.
##  5  1972          57.6 109348.
##  6  1977          59.6  59265.
##  7  1982          61.5  33693.
##  8  1987          63.2  31541.
##  9  1992          64.2  34933.
## 10  1997          65.0  41283.
## 11  2002          65.7  44684.
## 12  2007          67.0  49357.
215
Create new variables with mutate

names(gapminder)

## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"

gapminder <- gapminder %>%
  group_by(year) %>%
  mutate(life_exp_mean = mean(lifeExp),
         diff_country_mean = lifeExp - life_exp_mean)

summary(gapminder$diff_country_mean)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -40.561  -9.583   1.293   0.000  10.240  23.612
216
Further information

Wickham, Hadley and Garrett Grolemund (2017). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Sebastopol: O'Reilly.

Healy, Kieran (2019). Data Visualization: A Practical Introduction. Princeton: Princeton University Press.
217
Course resources

Documentation:

https://quanteda.io

https://readtext.quanteda.io

https://spacyr.quanteda.org

Tutorials: https://tutorials.quanteda.io

Cheatsheet: https://www.rstudio.com/resources/cheatsheets/

Kenneth Benoit, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. 2018. "quanteda: An R Package for the Quantitative Analysis of Textual Data." Journal of Open Source Software 3(30): 774.
218