Computerized Text Analysis: The Practical and Ethical Use of Social Media Data For Social Science Research
Lucas Czarnecki
University of Calgary, Dept. of Political Science
My Research
Psychological differences between conservatives and liberals
How innate traits like personality, temperament, and moral impulsivities shape ideology
“Do differences in ideology manifest in language use?”
photo by: Luci Gutiérrez
Before I even get started…
Go to www.eventbrite.ca
Using “Find Your Next Experience” search for:
“I Want To Know More About…Computerized Text Analysis”
The Link will take you to:
Please Download…
You will need:
Optional, but highly recommended:
My Two Goals for this Workshop
First Goal
To provide practical information (regarding data collection, preprocessing, and analysis)
Second Goal
To have a wider discussion on text and social media data (the history and the current state of affairs)
Common Theme Throughout: The Practical and Ethical Use of Social Media Data
What Is Computerized Text Analysis?
Not one method or approach, but many
A Swiss Army Knife for studying language use and text data
any automatized process for categorizing or uncovering latent meaning in word use within or across files with computer-readable formats
History of Text Analysis
1901: The idea has roots in the earliest days of psychology (e.g. Freud & psycholinguistics)
1921: Rorschach projective tests linked word use with psychological drives
1950s: Early work in content analysis (still using human coders and judges)
1966-1978: Philip Stone et al. work on the first computerized text analysis program
1990s: Programs like General Inquirer become available on small personal computers
1992-1994: Earliest work on Linguistic Inquiry and Word Count
Computerized Text Analysis: Why Now?
“We are in the midst of a technological revolution whereby, for the first time, researchers can link daily word use to a broad array of real-world behaviors... to detect meaning in a wide variety of experimental settings, including to show attentional focus, emotionality, social relationships, thinking styles, and individual differences.”
Tausczik & Pennebaker (2010)
ENIAC (1943) – The World’s First* Computer
Occupied ~1,800 square feet
Weighed about 30 tons
…and no pictures of cats
*ENIAC was constructed at the University of Pennsylvania: construction began in 1943 and was completed in 1946.
IBM’s 1956 5MB Hard-Drive
50 24-inch discs stacked together
Took up 16 sq. ft.
Cost IBM ~$35,000 annually
Stored an impressive 5MB of data.
weighed just less than 1 ton…now that’s progress!
A Technological Revolution
Computers are becoming increasingly powerful, while simultaneously less expensive
Graph by: Max Roser – Our World in Data
An Information Revolution
Source: BBVA API-Market
Vids/Pics aside… a lot (if not most) of this is text data
The untapped world of unstructured data
The vast majority of data available online right now is unstructured
Hard to estimate, but roughly 80 to 90% of all data online is unstructured
Much of this is text
Roughly 2.5 billion GB of data are created per day
It is a very exciting time for computerized text analysis!
Before we start data mining…we need to talk about Ethics
Our technology has outpaced our philosophy
In our Communities
In Law and Government
…and in Academia
In Law & Government – e.g. Surveillance Programs
Constitutions/Conventions Evolve Slowly…
…Technology does not
Ethics aside…the NSA Shows us how much Data is really out there
East Germany’s Stasi vs. the NSA
https://apps.opendatacity.de/stasi-vs-nsa/english.html
Academia – e.g. Facebook Emotion Manipulation Experiment (2014)
Universities Must Update Their Ethics in the Face of New Methods
My Ethics Application
What my research involves…
Scraped: Posts from Party Leaders (N=1,712)
Harper = 525 | Trudeau = 531 | Mulcair = 656
Scraped: Data on Facebook Users
Comments = 297,830 | Likes = 3,296,035 | Shares = 855,381
…none of these people know they are in my study
The process of applying for ethics approval…
The Rigorous Ethics Approval Process
Made my life easier
…but I should not be granted this luxury
Getting Started: A Very, very, very brief Intro to R
Let’s start with scalars
Enter:
X <- 3
Print (display) your data:
X
Or:
X = 4
R as a calculator, enter:
X * 5
Save calculations as distinctly new objects:
Y <- X * 5
Now for vectors:
Enter:
A <- c(2,4,6,8)
R loves functions (e.g. concatenate, mean, etc.)
Functions make life easier
Without functions…calculations would need to be done manually, e.g.:
(2+4+6+8)/4
Or:
mean(A)
Vector of character strings!
This will give you an error:
list_of_names <- c(bob,jane,david,mark,john)
Instead:
list_of_names <- c("bob","jane","david","mark","john")
list_of_candidates <- c("Justin Trudeau", "Stephen Harper", "Tom Mulcair", "Gilles Duceppe", "Elizabeth May")
Humble beginnings
Let’s start with a vector of numbers, example:
v <- c(39.5,31.9,19.7,4.7,3.4)
Assign names to elements, enter:
names(v) <- c("Justin Trudeau", "Stephen Harper", "Tom Mulcair", "Gilles Duceppe", "Elizabeth May")
Print v to view data:
v
Look up specific data:
v["Stephen Harper"]
Getting Familiar with Text Data
Sentences from Alice In Wonderland
Alice <- c("But I don't want to go among mad people, Alice remarked",
           "Oh you can't help that, said the Cat, we're all mad here. I'm mad. You're mad",
           "How do you know I'm mad, said Alice",
           "You must be, said the Cat, or you wouldn't have come here")
Print:
Alice
FINALLY, Matrices:
Your first matrix:
myMatrix <- matrix(data=c(1,2,3,4,5,6), ncol=3)
myMatrix
Also works with text (very common):
myMatrix <- matrix(data=c("you", "can", "also", "include", "words", "here"), ncol=3)
myMatrix
Functions and Packages
Packages are here to help!!!
install.packages("tm")
require(tm) OR library(tm)
To save time:
Needed <- c("tm", "lsa", "Rfacebook", "twitteR", "ggplot2", "devtools", "wordcloud", "biclust", "cluster", "igraph", "fpc")
lapply(Needed, require, character.only = TRUE)
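One caveat with loading many packages at once: require() only returns FALSE (with a warning) when a package is not installed, so it is easy to miss. A minimal sketch of a check that can be run first is shown here; find_missing is a hypothetical helper name, not part of any package, and some of the listed packages may no longer be available from CRAN.

```r
# Packages named on the slide.
needed <- c("tm", "lsa", "Rfacebook", "twitteR", "ggplot2", "devtools",
            "wordcloud", "biclust", "cluster", "igraph", "fpc")

# Hypothetical helper: return the requested packages that are not installed.
find_missing <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

missing_pkgs <- find_missing(needed)
print(missing_pkgs)

# Uncomment to install whatever is missing (needs an internet connection):
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```

Once nothing is reported as missing, the lapply() call loads everything as intended.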
In a nutshell…why use R?
1) It is open source (freeware)
2) It is incredibly flexible
3) The R Community (R-bloggers, stack exchange, etc.)
4) Increasingly popular in academia
5) Multi-disciplinary
6) Constantly evolving features
An Overview of Text As Data Methods
From: Justin Grimmer & Brandon Stewart’s (2013) “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts”
Political Analysis, 21: 267-297
Computerized Text Analysis: A How To Guide
Step One: Data Collection
Data Are Everywhere
1. Existing Corpora (e.g. Government Collections / University Archives)
   I. POLTEXT – Université Laval
   II. Project Gutenberg
   III. Presidential Speech Archive – University of Virginia
   IV. Natural Speech Corpora also available
2. Electronic Sources
   I. Web Scraping
   II. Social Media
   III. Blogs
3. Undigitized Text (e.g. old manuscripts, treaties, debate proceedings, election records)
Tip: Many R Packages (e.g. tm, quanteda, lsa, etc.) include corpora you can practice on
Getting Started: Become a Facebook Developer
Google: “Facebook For Developers”
Site: https://developers.facebook.com
Note: You will need a FB Account to sign in.
Your page After Registration
To Begin:
1) Click on “My Apps” (top right)
1. Google: “Facebook Developers”
2. Sign in to Facebook – you will need an account
3. Click on “My Apps”
4. Click “Create New App”
2) Create App ID
(there will be some pop up windows)
3) Enter security question
4) Choose Platform
5) WWW
6) Skip Quick Start
7) Go to Dashboard (automatic)
FROM DASHBOARD
Never share your App Secret!!!!!
8) Copy your App ID AND your App Secret (the latter will require you to enter your password)
9) Return to R / RStudio
DASHBOARD / Settings / Basic
10) Click Add Platform
11) Select “Website”
12) Paste http://localhost:1410/ into Site URL
13) Save Changes
14) Go back to R and hit Enter
That’s it. All ready to go!
<- If successful
Click Continue:
Web browser should display:
Authentication complete. Please close this page and return to R.
Authentication complete. Authentication successful.
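Because the App Secret should never be shared, one option is to keep it out of your scripts entirely. The sketch below reads credentials from environment variables using base R only. The variable names FB_APP_ID and FB_APP_SECRET and the helper read_credential are assumptions made for illustration; they are not required by Facebook or by any R package.

```r
# Hypothetical helper: fetch a credential from an environment variable
# and stop with a clear message if it has not been set.
read_credential <- function(var_name) {
  value <- Sys.getenv(var_name, unset = NA)
  if (is.na(value) || value == "") {
    stop("Environment variable ", var_name, " is not set")
  }
  value
}

# Example usage (commented out because the variables must exist first,
# for instance in a private ~/.Renviron file that is never shared):
# app_id     <- read_credential("FB_APP_ID")
# app_secret <- read_credential("FB_APP_SECRET")
```

A script written this way can be shared with colleagues or posted online without exposing the secret itself.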
Getting Started: Become a Twitter Developer
Your page After Registration
To Begin:
Go to: https://dev.twitter.com/
1. Click “My Apps”
2. Then “Create New App”
You will come to this page
1. Give your app a name
2. A description
3. For website type http://test.de/
4. Leave Callback URL blank
5. Agree to conditions
6. Create Your Twitter App
In R/RStudio
consumer_key <- 'Your_Consumer_Key'
consumer_secret <- 'Your_Consumer_Secret'
access_token <- 'Your_Token_Here'
access_secret <- 'Your Secret Here'
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
Now that you are a developer…What can you do?
A Short Summary:
Collect data by location – e.g. using Twitter’s “geotags”
Collect data by topic – using keywords or hashtags
Collect data on groups – using Facebook GroupIDs
Select data according to parameters such as time – e.g. posts made during the Canadian Election Campaign (4/Aug/2015 until 19/Oct/2015)
Scrape timelines – e.g. President Obama’s @POTUS account
Remove noise – preprocessing
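Selecting data by time can also be done after collection. The sketch below keeps only posts dated within the campaign window mentioned above, using base R. The posts data frame and its column names are made up for illustration; real data from an API would have its own structure.

```r
# Made-up posts for illustration only.
posts <- data.frame(
  id = 1:4,
  created = as.Date(c("2015-07-30", "2015-08-04", "2015-09-15", "2015-10-20")),
  message = c("before the campaign", "campaign begins",
              "mid campaign", "after election day"),
  stringsAsFactors = FALSE
)

campaign_start <- as.Date("2015-08-04")
campaign_end   <- as.Date("2015-10-19")

# Keep posts made on or between the two dates.
in_campaign <- posts$created >= campaign_start & posts$created <= campaign_end
campaign_posts <- posts[in_campaign, ]
print(campaign_posts$id)
```

Here posts 2 and 3 are kept, since both boundary dates are treated as inside the window.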
A Couple Quick Examples
The Signal Through the Noise
Step Two: Preprocessing (Cleaning up our data)
Preprocessing
“a series of operations on data by a computer in order to retrieve or transform or classify information” (Oxford Dictionary)
Operations that transform words into numbers for the purposes of future analysis
A few common procedures
tm_map(x, FUN,…) – from the “tm” package
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
# wrapping tolower in content_transformer() keeps documents in tm's format,
# so a separate PlainTextDocument step is not needed
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, myStopwords)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, removeWords, myStopwords)
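To see what these cleaning steps do to a single sentence, here is a rough base R sketch that needs no packages. It mimics the idea of each step rather than reproducing tm's implementation; preprocess and my_stopwords are hypothetical names, and the stopword list is a tiny made-up one.

```r
# Hypothetical helper: clean one string and return its remaining tokens.
preprocess <- function(text, stopwords = character(0)) {
  text <- gsub("[[:punct:]]", "", text)             # remove punctuation
  text <- gsub("[[:digit:]]", "", text)             # remove numbers
  text <- tolower(text)                             # lower-case everything
  tokens <- unlist(strsplit(text, "[[:space:]]+"))  # split on whitespace
  tokens <- tokens[tokens != ""]                    # drop empty tokens
  tokens[!tokens %in% stopwords]                    # drop stopwords
}

my_stopwords <- c("but", "i", "to")  # made-up list for illustration
cleaned <- preprocess("But I don't want to go among mad people, Alice remarked",
                      my_stopwords)
print(cleaned)
```

Note that removing punctuation turns "don't" into "dont", which is one reason to inspect the output of every preprocessing step before analysis.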
Stemming / Lemmatization
the process of reducing a word to its most basic form (Porter, 1980)
Party – Parties – Partying → Parti*
Common Problem: University Students Partying Rather than Studying Political Parties
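The collision described above can be reproduced with a deliberately naive stemmer. This is a toy suffix rule written for this one example, not the Porter algorithm; naive_stem is a hypothetical function name.

```r
# Toy stemmer: collapses a few endings onto "i". Not the Porter algorithm.
naive_stem <- function(words) {
  words <- tolower(words)
  sub("(ies|ying|y)$", "i", words)
}

stems <- naive_stem(c("Party", "Parties", "Partying"))
print(stems)
```

All three words end up with the same stem, so after stemming a corpus about political parties and one about students partying look identical on this term.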
A Quick Example
Finding the Signal Through the Noise
Step Three: Analysis
Two Different Approaches
Word Count Strategy
  Top down
  A bag of words approach
  Relative frequency of words
  Usually a percentage (target word / total words)
Word Pattern Analysis
  Bottom up
  Word association
  Word covariation
  Term-document matrix
See: Pennebaker et al. (2003) “Psychological Aspects of Natural Language Use: Our Words, Our Selves.” Annual Review of Psychology 54: 549-50.
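Both approaches start from simple counts. The sketch below, in base R with two made-up documents, computes a relative frequency (target word / total words, as a percentage) and then builds a small term-document matrix. The object names are illustrative only.

```r
# Two made-up, already cleaned documents.
docs <- c(d1 = "we are all mad here",
          d2 = "i am mad you are mad")
tokens <- strsplit(docs, " ", fixed = TRUE)

# Word count strategy: percentage of a document's words equal to the target.
relative_freq <- function(doc_tokens, target) {
  100 * sum(doc_tokens == target) / length(doc_tokens)
}
print(sapply(tokens, relative_freq, target = "mad"))

# Word pattern analysis starts from a term-document matrix:
# one row per term, one column per document, cells hold raw counts.
vocab <- sort(unique(unlist(tokens)))
tdm <- vapply(tokens,
              function(tk) as.integer(table(factor(tk, levels = vocab))),
              integer(length(vocab)))
rownames(tdm) <- vocab
print(tdm)
```

In practice a package such as tm builds this matrix for you; the point here is only to show what it contains.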
Latent Semantic Analysis
Knowledge Acquisition and A Theory on Meaning
Chomsky – language acquisition & a universal grammar
Vs.
Locke – The Blank Slate: “no innate rules for processing data”
A Solution to Plato’s Problem
A Must read: Landauer, T.K. & Dumais, S.T. (1997). “A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge.” Psychological Review, 104(2), 211-240.
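The core computation behind latent semantic analysis is a singular value decomposition of a term-document matrix, truncated to a few dimensions. The following is a bare-bones sketch in base R on made-up counts. It skips the weighting step usually applied before the decomposition, and the choice of k = 2 is arbitrary.

```r
# Toy term-document counts (made up): rows are terms, columns are documents.
tdm <- matrix(c(2, 1, 0, 0,
                1, 2, 0, 0,
                0, 0, 1, 2,
                0, 0, 2, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("vote", "election", "cat", "rabbit"),
                              paste0("doc", 1:4)))

decomp <- svd(tdm)

k <- 2  # number of latent dimensions to keep (arbitrary here)
term_space <- decomp$u[, 1:k] %*% diag(decomp$d[1:k])
rownames(term_space) <- rownames(tdm)

# Cosine similarity between two term vectors in the reduced space.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

sim_same  <- cosine(term_space["vote", ], term_space["election", ])
sim_cross <- cosine(term_space["vote", ], term_space["cat", ])
print(c(same_topic = sim_same, different_topic = sim_cross))
```

Terms that occur in the same documents end up close together in the reduced space, while terms that never co-occur stay apart. The lsa package mentioned earlier wraps this workflow for real corpora.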
Word Count Strategy – Two Parts
(e.g. Linguistic Inquiry and Word Count)
The Processing Component
The Dictionary
Package “quanteda” – allows you to import LIWC dictionaries into R
Word Category | Examples | Psychological Correlate(s)
First-Person Singular | I, me, mine | Honest, depressed, low status, personal, emotional, informal
First-Person Plural | We, us, our | Detached, high status, socially connected to group (sometimes)
Third-Person Singular | She, him, her, he | Social interests, social support
Articles | A, an, the | Interest in objects and things, deference to authority
Negative emotion (e.g. anxiety & anger) | Hate, angry, mad, worried, concerned | Emotional state
Exclusivity | But, without, exclude | Cognitive complexity, honesty
Future/Past/Present Tense | Will, gonna, am, doing, went, ran, had | e.g. goal orientations (forward vs. past focused)
Social processes | Mate, talk, they, child | Social concerns, social support
The Heart of Word Count Software – The Dictionary
Caution w/ Correlation
For Fun
Quickly scrape your friends’ and colleagues’ Twitter accounts
(makes use of LIWC’s dictionary)
This is… Anxiety (avoidance-based)
  Overestimates risk
  Status-quo oriented (risk-avoidance choices)
  Risk reduction (concerned with uncertainty)
This is… Anger (approach-based)
  Underestimates risk
  Change oriented (risk-seeking choices)
  Moral anger (addresses injustices “no dessert!”)
Liberals
Personality (Big 5):
Openness to New Experiences
Conservatives
Personality (Big 5):
Conscientiousness
Neuroticism
Predisposition != Determinism
Tracking Sentiment Over Time
Frequency of Anger-related words from Facebook Commentators during the 2015 Canadian General Election Campaign
Tracking Sentiment Over Time
Frequency of Positive Emotion words from Facebook Commentators during the 2015 Canadian General Election Campaign
Creating Your Own Dictionaries
Word Category
%
01 HarmVirtue
02 HarmVice
03 FairnessVirtue
04 FairnessVice
05 IngroupVirtue
06 InGroupVice
07 AuthorityVirtue
08 AuthorityVice
09 PurityVirtue
10 PurityVice
11 MoralityGeneral
%
Target Words
compassion* 01
empath* 01
sympath* 01
…
class 07
Bourgeoisie 07
…
austerity 09
integrity 09 11
Data are organized hierarchically
* indicates a word stem
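To make the dictionary format concrete, here is a minimal base R sketch that reads a shortened version of the layout above and counts category hits in a list of tokens. It is an illustration of the idea only, not how LIWC or quanteda process dictionaries. The helper names are hypothetical, and entries containing regular-expression special characters other than a trailing * are not handled.

```r
# A shortened dictionary in the same layout as above.
dic_lines <- c("%",
               "01 HarmVirtue",
               "07 AuthorityVirtue",
               "09 PurityVirtue",
               "11 MoralityGeneral",
               "%",
               "compassion* 01",
               "class 07",
               "integrity 09 11")

# Lines between the two "%" markers define categories; the rest are entries.
marks <- which(dic_lines == "%")
cat_lines  <- dic_lines[(marks[1] + 1):(marks[2] - 1)]
word_lines <- dic_lines[(marks[2] + 1):length(dic_lines)]

cat_parts <- strsplit(cat_lines, "[[:space:]]+")
categories <- setNames(vapply(cat_parts, `[`, character(1), 2),
                       vapply(cat_parts, `[`, character(1), 1))

# "compassion*" becomes "^compassion"; a plain word must match exactly.
entry_to_regex <- function(entry) {
  if (grepl("\\*$", entry)) {
    paste0("^", sub("\\*$", "", entry))
  } else {
    paste0("^", entry, "$")
  }
}

# Count how many tokens fall into each category.
score_tokens <- function(tokens) {
  counts <- setNames(integer(length(categories)), categories)
  for (line in strsplit(word_lines, "[[:space:]]+")) {
    hits <- sum(grepl(entry_to_regex(line[1]), tokens))
    for (code in line[-1]) {
      counts[categories[code]] <- counts[categories[code]] + hits
    }
  }
  counts
}

scores <- score_tokens(c("compassionate", "class", "classes", "integrity"))
print(scores)
```

"compassionate" matches the stem compassion*, "classes" does not match the exact entry class, and "integrity" counts toward both of its categories.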
Grimmer & Stewart (2013)
There is no single best method for computerized text analysis
Four Principles of Automated Text Analysis
1) All quantitative models of language are wrong – but some are useful
   Language models are inherently reductionist
2) Quantitative methods for text amplify resources & augment humans
   Computers cannot replace humans (…yet)
3) There is no globally best method for automated text analysis
3.1) Your method will depend on: i) the hypothesis you are testing, and ii) your source(s) of data
4) Validate, validate, validate
We need to work together…across disciplines
Adapted from – Grimmer & Stewart (2013). “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis.
Regardless of your Method…A key concern is always validation
Psychological Processes | Examples of Dictionary Words | Words in Category | Internal Consistency (Uncorrected α) | Internal Consistency (Corrected α)
Psych. Affect | happy, cried | 1393 | 0.18 | 0.57
Pos. Emotions | love, nice, sweet | 620 | 0.23 | 0.64
Neg. Emotions | hurt, ugly, nasty | 744 | 0.17 | 0.55
Anxiety | worried, fearful | 116 | 0.31 | 0.73
Anger | hate, kill, annoyed | 230 | 0.16 | 0.53
Sadness | crying, grief, sad | 136 | 0.28 | 0.70
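For readers unfamiliar with internal consistency, the sketch below shows the textbook formula for Cronbach's α in base R. It is offered only to explain what the statistic measures. It is not the procedure used to produce the values in the table, and it does not cover the correction behind the last column. The scores are made up.

```r
# Rows are texts (or respondents); columns are the items being compared.
cronbach_alpha <- function(items) {
  items <- as.matrix(items)
  k <- ncol(items)
  item_vars <- apply(items, 2, var)   # variance of each item
  total_var <- var(rowSums(items))    # variance of the summed score
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# Made-up scores for illustration only.
scores <- cbind(item1 = c(1, 2, 3, 4, 5),
                item2 = c(2, 3, 4, 5, 6),
                item3 = c(1, 3, 2, 5, 4))
alpha <- cronbach_alpha(scores)
print(alpha)
```

Values near 1 mean the items rise and fall together; low values mean the items in a category behave quite differently from one another across texts.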
Thank you
Questions???