I want to know more about computerized text analysis

Computerized Text Analysis: The Practical and Ethical Use of Social Media Data For Social Science Research

Lucas Czarnecki, University of Calgary, Dept. of Political Science

My Research

Psychological differences between conservatives and liberals

How innate traits like personality, temperament, and moral impulsivities shape ideology

“Do differences in ideology manifest in language use?”

photo by: Luci Gutiérrez

Before I even get started…

Go to www.eventbrite.ca

Using “Find Your Next Experience” search for:

“I Want To Know More About…Computerized Text Analysis”

The Link will take you to:

Please Download…

You will need:

Optional, but highly recommended:

My Two Goals for this Workshop

First Goal

To provide practical information (regarding data collection, preprocessing, and analysis)

Second Goal

To have a wider discussion on text and social media data (the history and the current state of affairs)

Common Theme Throughout: The Practical and Ethical Use of Social Media Data

What Is Computerized Text Analysis?

Not one method or approach, but many

A Swiss Army Knife for studying language use and text data

Any automated process for categorizing or uncovering latent meaning in word use, within or across files in computer-readable formats

History of Text Analysis

1901: The idea has roots in the earliest days of psychology (e.g. Freud & psycholinguistics)

1921: Rorschach projective tests linked word use with psychological drives

1950s: Early work in content analysis (still using human coders and judges)

1966-1978: Philip Stone et al. work on the first computerized text analysis program

1990s: Programs like General Inquirer become available on small personal computers

1992-1994: Earliest work on Linguistic Inquiry and Word Count

Computerized Text Analysis: Why Now?

“We are in the midst of a technological revolution whereby, for the first time, researchers can link daily word use to a broad array of real-world behaviors... to detect meaning in a wide variety of experimental settings, including to show attentional focus, emotionality, social relationships, thinking styles, and individual differences.”

Tausczik & Pennebaker (2007)

ENIAC (1943) – The World’s First* Computer

Occupied ~1,800 square feet

Weighed almost 50 tons

…and no pictures of cats

ENIAC was constructed by the University of Pennsylvania; construction began in 1943 and was completed in 1946.

IBM’s 1956 5MB Hard-Drive

50 24-inch discs stacked together

Took up 16 sq. ft.

Cost IBM ~$35,000 annually

Stored an impressive 5MB of data.

weighed just less than 1 ton…now that’s progress!

A Technological Revolution

Computers are becoming increasingly powerful, while simultaneously less expensive

Graph by: Max Roser – Our World in Data

An Information Revolution

Source: BBVA API-Market

An Information Revolution

Videos and pictures aside… a lot (if not most) of this is text data

Source: BBVA API-Market

The untapped world of unstructured data

The vast majority of data available online right now is unstructured. It is hard to estimate, but roughly 80 to 90% of all data online is unstructured, and much of this is text. Roughly 2.5 billion GB of data are created per day.

It is a very exciting time for computerized text analysis!

Before we start data mining…we need to talk about Ethics

Our technology has outpaced our philosophy

In our Communities

In Law and Government

…and in Academia

In Law & Government – e.g. Surveillance Programs

Constitutions/Conventions Evolve Slowly…

…Technology does not

Ethics aside…the NSA Shows us how much Data is really out there

East Germany’s Stasi vs. the NSA

https://apps.opendatacity.de/stasi-vs-nsa/english.html


Academia – e.g. Facebook Emotion Manipulation Experiment (2014)

Universities Must Update Their Ethics in the Face of New Methods

My Ethics Application

What my research involves…

Scraped: Posts from Party Leaders (N = 1,712)
Harper = 525 | Trudeau = 531 | Mulcair = 656

Scraped: Data on Facebook Users
Comments = 297,830 | Likes = 3,296,035 | Shares = 855,381

…none of these people know they are in my study

The process of applying for ethics approval…

The Rigorous Ethics Approval Process

Made my life easier

…but I should not be granted this luxury

Getting Started: A Very, very, very brief Intro to R

Let’s start with scalars

Enter: X <- 3

Print (display) your data: X

Or: X = 4

R as a calculator, enter: X * 5

Save calculations as distinctly new objects: Y <- X * 5

Now for vectors:

Enter: A <- c(2,4,6,8)

R loves functions (e.g. concatenate, mean, etc.)

Functions make life easier

Without functions…calculations would need to be done manually, e.g.: (2+4+6+8)/4

Or: mean(A)

Vector of character strings!

This will give you an error: list_of_names <- c(bob,jane,david,mark,john)

Instead: list_of_names <- c("bob","jane","david","mark","john")

list_of_candidates <- c("Justin Trudeau", "Stephen Harper", "Tom Mulcair", "Gilles Duceppe", "Elizabeth May")

Humble beginnings

Let’s start with a vector of numbers, for example: v <- c(39.5,31.9,19.7,4.7,3.4)

Assign names to elements, enter: names(v) <- c("Justin Trudeau", "Stephen Harper", "Tom Mulcair", "Gilles Duceppe", "Elizabeth May")

Print v to view data: v

Look up specific data: v["Stephen Harper"]

Getting Familiar with Text Data

Sentences from Alice In Wonderland

Alice <- c("But I don't want to go among mad people, Alice remarked",
           "Oh you can't help that, said the Cat, we're all mad here. I'm mad. You're mad",
           "How do you know I'm mad, said Alice",
           "You must be, said the Cat, or you wouldn't have come here")

Print: Alice

FINALLY, Matrices:

Your first matrix:
myMatrix <- matrix(data=c(1,2,3,4,5,6), ncol=3)
myMatrix

Also works with text (very common):
myMatrix <- matrix(data=c("you", "can", "also", "include", "words", "here"), ncol=3)
myMatrix

Functions and Packages

Packages are here to help!!!

Install once: install.packages("tm")

Then load it with either require(tm) or library(tm)

To save time:
Needed <- c("tm", "lsa", "Rfacebook", "twitteR", "ggplot2", "devtools", "wordcloud", "biclust", "cluster", "igraph", "fpc")
lapply(Needed, require, character.only = TRUE)
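Note that require() only loads packages that are already installed. A minimal sketch, assuming you first want to see which packages are still missing; find_missing is a hypothetical helper name, not part of any package, and the install step is left commented out so nothing is downloaded by accident:

```r
# Hypothetical helper: return the packages in `needed` that are not yet installed.
find_missing <- function(needed) {
  installed <- rownames(installed.packages())
  setdiff(needed, installed)
}

needed <- c("tm", "lsa", "ggplot2", "wordcloud")
missing_pkgs <- find_missing(needed)

# Install only what is missing (uncomment to actually download).
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)

print(missing_pkgs)
```

Running this before lapply(Needed, require, character.only = TRUE) makes it easier to tell a missing package from a typo in its name.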

In a nutshell…why use R?

1) It is open source (freeware)
2) It is incredibly flexible
3) The R Community (R-bloggers, Stack Exchange, etc.)
4) Increasingly popular in academia
5) Multi-disciplinary
6) Constantly evolving features

An Overview of Text As Data Methods

From: Justin Grimmer & Brandon Stewart (2013), “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis, 21: 267-297.

Computerized Text Analysis: A How To Guide

Step One: Data Collection




Data Are Everywhere

1. Existing Corpora (e.g. government collections / university archives), e.g.
   I. POLTEXT – Université Laval
   II. Project Gutenberg
   III. Presidential Speech Archive – University of Virginia
   IV. Natural speech corpora also available

2. Electronic Sources
   I. Web scraping
   II. Social media
   III. Blogs

3. Undigitized Text (e.g. old manuscripts, treaties, debate proceedings, election records)

Tip: Many R Packages (e.g. tm, quanteda, lsa, etc.) include corpora you can practice on

Getting Started: Become a Facebook Developer

Google: “Facebook For Developers”

Site: https://developers.facebook.com

Note: You will need a FB Account to sign in.

Your page After Registration

To Begin:

1) Click on “My Apps” (top right)



1. Google: “Facebook Developers”
2. Sign in to Facebook (you will need an account)
3. Click on “My Apps”
4. Click “Create New App”


2) Create App ID

(there will be some pop up windows)

3) Enter security question

4) Choose Platform

5) WWW

6) Skip Quick Start



7) Go to Dashboard (automatic)

FROM DASHBOARD

Never share your App Secret!!!!!

8) Copy your App ID AND your App Secret (the latter will require you to enter your password)

9) Return to R / RStudio

DASHBOARD / Settings / Basic


10) Click Add Platform

11) Select “Website”

12) Paste http://localhost:1410/ into Site URL

13) Save Changes

14) Go back to R and hit Enter


If successful, that’s it. All ready to go!

Click Continue:

Web browser should display:

Authentication complete. Please close this page and return to R.

Authentication complete. Authentication successful.

Getting Started: Become a Twitter Developer

Your page After Registration

To Begin: Go to https://dev.twitter.com/


1. Click “My Apps”
2. Then “Create New App”

You will come to this page

1. Give your app a name
2. Add a description
3. For website, type http://test.de/
4. Leave Callback URL blank
5. Agree to conditions
6. Create Your Twitter App



In R/RStudio

consumer_key <- 'Your_Consumer_Key'
consumer_secret <- 'Your_Consumer_Secret'
access_token <- 'Your_Token_Here'
access_secret <- 'Your_Secret_Here'

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
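Because keys and secrets should never be shared, one option is to keep them out of the script entirely. A minimal sketch, assuming the credentials are stored in environment variables; the variable name TWITTER_CONSUMER_KEY and the helper get_credential are hypothetical choices for illustration:

```r
# Hypothetical helper: read one credential from an environment variable
# and stop with a clear message if it has not been set.
get_credential <- function(var_name) {
  value <- Sys.getenv(var_name, unset = "")
  if (value == "") {
    stop("Environment variable not set: ", var_name)
  }
  value
}

# For illustration only: in practice, set this outside your script
# (for example in a .Renviron file) so it never appears in shared code.
Sys.setenv(TWITTER_CONSUMER_KEY = "Your_Consumer_Key")

consumer_key <- get_credential("TWITTER_CONSUMER_KEY")
```

The same pattern would apply to the consumer secret, access token, and access secret before they are passed to setup_twitter_oauth().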

Now that you are a developer…What can you do?

A Short Summary:

Collect data by location – e.g. using Twitter’s “geotags”
Collect data by topic – using keywords or hashtags
Collect data on groups – using Facebook GroupIDs
Select data according to parameters such as time – e.g. posts made during the Canadian Election Campaign (4/Aug/2015 until 19/Oct/2015)
Scrape timelines – e.g. President Obama’s @POTUS account
Remove noise – preprocessing
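Selecting data by time can be done after collection with plain R. A minimal sketch, assuming the posts are already in a data frame; the posts object and its column names are made up for illustration, and only the campaign dates come from the slide:

```r
# Hypothetical data: a few posts with their creation dates.
posts <- data.frame(
  message = c("Pre-campaign post", "Campaign kickoff", "Debate night", "Post-election post"),
  created = as.Date(c("2015-07-15", "2015-08-04", "2015-09-17", "2015-10-25")),
  stringsAsFactors = FALSE
)

# Campaign window: 4 Aug 2015 until 19 Oct 2015.
campaign_start <- as.Date("2015-08-04")
campaign_end   <- as.Date("2015-10-19")

# Keep only posts made during the campaign (both end dates included).
in_campaign <- posts$created >= campaign_start & posts$created <= campaign_end
campaign_posts <- posts[in_campaign, ]

print(campaign_posts$message)
```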

A Couple Quick Examples

The Signal Through the Noise

Step Two: Preprocessing (Cleaning up our data)

Preprocessing

“a series of operations on data by a computer in order to retrieve or transform or classify information” (Oxford Dictionary)

Operations that transform words into numbers for the purposes of future analysis




A few common procedures

tm_map(x, FUN,…) – from the “tm” package

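# myStopwords is a character vector you define yourself, e.g. myStopwords <- stopwords("english")
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
# Wrap base functions such as tolower in content_transformer() so the corpus structure is kept
# (this replaces the older tolower + PlainTextDocument workaround)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, myStopwords)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)   # requires the SnowballC package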
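To see what these procedures do to a piece of text, here is a minimal sketch of the same kinds of steps in base R, without the tm package. The helper clean_text and the short stop word list are hypothetical, and the steps are ordered for this illustration rather than copied from the tm sequence above:

```r
# Hypothetical helper: clean a character vector using base R only.
clean_text <- function(x, stop_words) {
  x <- tolower(x)                          # lower-case everything
  x <- gsub("[[:punct:]]", " ", x)         # replace punctuation with spaces
  x <- gsub("[[:digit:]]", " ", x)         # replace digits with spaces
  # Remove each stop word when it appears as a whole word.
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, " ", x, perl = TRUE)
  x <- gsub("\\s+", " ", x)                # collapse runs of whitespace
  trimws(x)                                # drop leading/trailing spaces
}

my_stop_words <- c("the", "a", "to", "of")   # illustrative list only
raw <- "The 3 leaders spoke to a crowd of 500, twice!"
print(clean_text(raw, my_stop_words))
```

Stemming is left out here because it needs a proper stemming algorithm.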

Stemming / Lemmatization

the process of reducing a word to its most basic form (Porter, 1980)

Party – Parties – Partying → Parti*

Common Problem: University Students Partying Rather than Studying Political Parties
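The collision is easy to reproduce. A toy sketch, assuming a deliberately crude rule set: toy_stem is a hypothetical function written for this illustration and is NOT the Porter algorithm, which uses a much larger set of rules:

```r
# Toy stemmer for illustration only (NOT the Porter algorithm):
# it rewrites a few endings so the three "party" forms share one stem.
toy_stem <- function(words) {
  words <- tolower(words)
  sub("(ies|ying|y)$", "i", words)
}

words <- c("Party", "Parties", "Partying")
print(toy_stem(words))
```

All three forms collapse to the same stem, so after stemming a word count cannot tell students partying from political parties. Inspecting the original sentences around a stem is one way to catch this.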

A Quick Example

Finding the Signal Through the Noise

Step Three: Analysis




Two Different Approaches

Word Count Strategy
Top down
A bag-of-words approach
Relative frequency of words
Usually a percentage (target word / total words)

Word Pattern Analysis
Bottom up
Word association
Word covariation
Term-document matrix
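Both starting points can be sketched in a few lines of base R. This is a minimal illustration on three made-up documents, assuming the text is already lower-case with no punctuation; packages such as tm build the term-document matrix for you:

```r
docs <- c(doc1 = "we are all mad here",
          doc2 = "how do you know i am mad",
          doc3 = "you must be mad or you would not have come here")

# Split each document into words.
tokens <- strsplit(docs, " ", fixed = TRUE)

# Word count strategy: relative frequency of one target word, as a percentage.
relative_freq <- function(words, target) {
  100 * sum(words == target) / length(words)
}
pct_mad <- sapply(tokens, relative_freq, target = "mad")

# Word pattern analysis starts from a term-document matrix:
# one row per term, one column per document, cells hold counts.
vocab <- sort(unique(unlist(tokens)))
tdm <- sapply(tokens, function(words) as.integer(table(factor(words, levels = vocab))))
rownames(tdm) <- vocab

print(round(pct_mad, 1))
print(tdm[c("mad", "you"), ])
```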

See: Pennebaker et al. (2003) “Psychological Aspects of Natural Language Use: Our Words, Our Selves.” Annual Review of Psychology 54: 549-50.

Latent Semantic Analysis: Knowledge Acquisition and a Theory of Meaning

Chomsky – language acquisition & a universal grammar

Vs.

Locke – The Blank Slate: “no innate rules for processing data”

A Solution to Plato’s Problem

A must read: Landauer, T.K. & Dumais, S.T. (1997). “A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge.” Psychological Review, 104(2), 211-240.
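The core computation behind Latent Semantic Analysis is a truncated singular value decomposition of a term-document matrix. A minimal sketch in base R, assuming a tiny made-up matrix and an arbitrary choice of k = 2 dimensions; real applications use weighted counts and far more dimensions, and the lsa package wraps these steps:

```r
# Tiny hypothetical term-document matrix (rows = terms, columns = documents).
tdm <- matrix(c(2, 1, 0, 0,
                1, 2, 0, 0,
                0, 0, 2, 1,
                0, 0, 1, 2),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("vote", "ballot", "cat", "kitten"),
                              c("d1", "d2", "d3", "d4")))

# Step 1: singular value decomposition of the matrix.
s <- svd(tdm)

# Step 2: keep only the k largest dimensions (k = 2 is an arbitrary choice here).
k <- 2
term_space <- s$u[, 1:k] %*% diag(s$d[1:k])
rownames(term_space) <- rownames(tdm)

# Step 3: compare terms by the cosine of the angle between their reduced vectors.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

print(cosine(term_space["vote", ], term_space["ballot", ]))
print(cosine(term_space["vote", ], term_space["cat", ]))
```

In this toy example "vote" and "ballot" have different raw counts but end up pointing the same way in the reduced space, while "vote" and "cat" remain unrelated.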

Word Count Strategy – Two Parts (e.g. Linguistic Inquiry and Word Count)

1) The Processing Component
2) The Dictionary

Package “quanteda” – allows you to import LIWC dictionaries into R

Word Category | Examples | Psychological Correlate(s)
First-Person Singular | I, me, mine | Honest, depressed, low status, personal, emotional, informal
First-Person Plural | we, us, our | Detached, high status, socially connected to group (sometimes)
Third-Person Singular | she, him, her, he | Social interests, social support
Articles | a, an, the | Interest in objects and things, deference to authority
Negative emotion (e.g. anxiety & anger) | hate, angry, mad, worried, concerned | Emotional state
Exclusivity | but, without, exclude | Cognitive complexity, honesty
Future/Past/Present Tense | will, gonna, am, doing, went, ran, had | e.g. goal orientations (forward vs. past focused)
Social processes | mate, talk, they, child | Social concerns, social support
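The word count strategy boils down to looking up each word in a dictionary and reporting category percentages. A minimal sketch in base R, assuming a mini-dictionary built from a few of the example words above; it is NOT the LIWC dictionary, and category_percentages is a hypothetical helper name:

```r
# Illustrative mini-dictionary using a few of the example words above.
dictionary <- list(
  first_person_singular = c("i", "me", "mine"),
  first_person_plural   = c("we", "us", "our"),
  negative_emotion      = c("hate", "angry", "mad", "worried", "concerned")
)

# Percentage of words in `text` that fall in each dictionary category.
category_percentages <- function(text, dictionary) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[words != ""]
  sapply(dictionary, function(entries) 100 * mean(words %in% entries))
}

text <- "We are worried, and I am angry that our taxes went up."
print(round(category_percentages(text, dictionary), 1))
```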

The Heart of Word Count Software – The Dictionary

Caution w/ Correlation

For Fun

Quickly scrape your friends’ and colleagues’ Twitter accounts

(makes use of LIWC’s dictionary)

This is… Anxiety (avoidance-based)
Overestimates risk
Status-quo oriented (risk-avoidance choices)
Risk reduction (concerned with uncertainty)

This is… Anger (approach-based)
Underestimates risk
Change oriented (risk-seeking choices)
Moral anger (addresses injustices “no dessert!”)

Liberals

Personality (Big 5): Openness to New Experiences

Conservatives

Personality (Big 5): Conscientiousness, Neuroticism

Predisposition != Determinism

Tracking Sentiment Over Time

Frequency of Anger-related words from Facebook Commentators during the 2015 Canadian General Election Campaign

Tracking Sentiment Over Time

Frequency of Positive Emotion words from Facebook Commentators during the 2015 Canadian General Election Campaign
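Charts like these come from scoring each comment and then averaging the scores by time period. A minimal sketch in base R, assuming made-up comments, an illustrative three-word anger list (not LIWC's), and monthly rather than daily or weekly grouping:

```r
# Hypothetical comments with dates; real data would come from your scraper.
comments <- data.frame(
  created = as.Date(c("2015-08-10", "2015-08-20", "2015-09-05", "2015-09-25", "2015-10-15")),
  text = c("i hate this plan", "great speech today", "so angry and mad",
           "i am angry about the debate", "proud to vote"),
  stringsAsFactors = FALSE
)

anger_words <- c("hate", "angry", "mad")   # illustrative list, not LIWC's

# Percentage of anger words in one comment.
anger_pct <- function(text) {
  words <- strsplit(tolower(text), " ", fixed = TRUE)[[1]]
  100 * mean(words %in% anger_words)
}
comments$anger <- sapply(comments$text, anger_pct, USE.NAMES = FALSE)

# Average the per-comment percentages within each month of the campaign.
comments$month <- format(comments$created, "%Y-%m")
monthly_anger <- tapply(comments$anger, comments$month, mean)

print(round(monthly_anger, 1))
```

The resulting vector can be passed to plot() or ggplot2 to draw the time series.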

Creating Your Own Dictionaries

Word Category

%
01 HarmVirtue
02 HarmVice
03 FairnessVirtue
04 FairnessVice
05 IngroupVirtue
06 InGroupVice
07 AuthorityVirtue
08 AuthorityVice
09 PurityVirtue
10 PurityVice
11 MoralityGeneral
%

Target Words

compassion* 01
empath* 01
sympath* 01
…
class 07
Bourgeoisie 07
…
austerity 09
integrity 09 11

Data are organized hierarchically

* indicates a word stem
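To make the layout concrete, here is a minimal sketch that reads a dictionary in this format and looks words up in it. It assumes fields separated by single spaces, lower-case entries, and entries without regular-expression special characters other than the trailing "*"; the helper names are hypothetical, and in practice a package such as quanteda can import dictionary files for you:

```r
# A tiny dictionary in the layout shown above: categories between the two "%" lines,
# then one target word per line followed by its category number(s).
dic_lines <- c("%", "01 HarmVirtue", "07 AuthorityVirtue", "09 PurityVirtue",
               "11 MoralityGeneral", "%",
               "compassion* 01", "class 07", "integrity 09 11")

breaks <- which(dic_lines == "%")
cat_lines  <- dic_lines[(breaks[1] + 1):(breaks[2] - 1)]
word_lines <- dic_lines[(breaks[2] + 1):length(dic_lines)]

# Map category numbers to category names.
cat_parts <- strsplit(cat_lines, " ", fixed = TRUE)
categories <- setNames(sapply(cat_parts, `[`, 2), sapply(cat_parts, `[`, 1))

# Turn a dictionary entry into a regular expression: a trailing "*" marks a stem.
entry_to_regex <- function(entry) {
  if (grepl("\\*$", entry)) {
    paste0("^", sub("\\*$", "", entry))   # stem: match any word starting with it
  } else {
    paste0("^", entry, "$")               # otherwise: exact match only
  }
}

# Return the category names that a single word belongs to.
lookup <- function(word) {
  hits <- character(0)
  for (line in word_lines) {
    parts <- strsplit(line, " ", fixed = TRUE)[[1]]
    if (grepl(entry_to_regex(parts[1]), tolower(word))) {
      hits <- c(hits, categories[parts[-1]])
    }
  }
  unname(hits)
}

print(lookup("compassionate"))
print(lookup("integrity"))
```

A word can belong to more than one category, as "integrity" does here, which is why the lookup returns a vector rather than a single name.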

Grimmer & Stewart (2013)

There is no single best method for computerized text analysis

Four Principles of Automated Text Analysis

1) All quantitative models of language are wrong – but some are useful
   Language models are inherently reductionist

2) Quantitative methods for text amplify resources & augment humans
   Computers cannot replace humans (…yet)

3) There is no globally best method for automated text analysis
   3.1) Your method will depend on: i) the hypothesis you are testing, and ii) your source(s) of data

4) Validate, validate, validate
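One common form of validation is to compare the automated coding with human coding of the same documents. A minimal sketch in base R, assuming made-up labels for eight documents; it illustrates the idea and is not a procedure taken from Grimmer & Stewart:

```r
# Hypothetical labels for eight documents: 1 = angry, 0 = not angry.
human_coded <- c(1, 0, 1, 1, 0, 0, 1, 0)   # judged by a human reader
machine     <- c(1, 0, 0, 1, 0, 1, 1, 0)   # produced by the automated method

# Cross-tabulate the two codings (a confusion matrix).
confusion <- table(human = human_coded, machine = machine)

# Share of documents on which human and machine agree.
accuracy <- mean(human_coded == machine)

# Of the documents the machine flagged, how many did the human also flag?
precision <- sum(machine == 1 & human_coded == 1) / sum(machine == 1)
# Of the documents the human flagged, how many did the machine find?
recall <- sum(machine == 1 & human_coded == 1) / sum(human_coded == 1)

print(confusion)
print(c(accuracy = accuracy, precision = precision, recall = recall))
```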

We need to work together…across disciplines

Adapted from Grimmer & Stewart (2013). “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis.

Regardless of your Method…A key concern is always validation

Psychological Processes | Examples of Dictionary Words | Words in Category | Internal Consistency (Uncorrected α) | Internal Consistency (Corrected α)
Psych. Affect | happy, cried | 1393 | 0.18 | 0.57
Pos. Emotions | love, nice, sweet | 620 | 0.23 | 0.64
Neg. Emotions | hurt, ugly, nasty | 744 | 0.17 | 0.55
Anxiety | worried, fearful | 116 | 0.31 | 0.73
Anger | hate, kill, annoyed | 230 | 0.16 | 0.53
Sadness | crying, grief, sad | 136 | 0.28 | 0.70
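Internal consistency of this kind is usually summarized with Cronbach's α. A minimal sketch of the standard formula in base R, assuming a small made-up score matrix; it is not necessarily the exact procedure (or the correction) used to produce the values in the table:

```r
# Cronbach's alpha for a matrix with one row per text and one column per item
# (here an "item" could be one dictionary word, scored per text).
cronbach_alpha <- function(items) {
  k <- ncol(items)
  item_vars <- apply(items, 2, var)        # variance of each item
  total_var <- var(rowSums(items))         # variance of the summed scale
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# Hypothetical scores: 5 texts, 3 items.
scores <- matrix(c(1, 2, 1,
                   2, 2, 3,
                   3, 4, 3,
                   4, 4, 5,
                   5, 6, 5),
                 ncol = 3, byrow = TRUE)

print(round(cronbach_alpha(scores), 3))
```

Values near 1 mean the items rise and fall together across texts; values near 0 mean they behave almost independently.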

Thank you

Questions???
