Computerized Text Analysis: The Practical and Ethical Use of Social Media Data For Social Science Research. Lucas Czarnecki, University of Calgary, Dept. of Political Science

I want to know more about computerized text analysis


Page 1:

Computerized Text Analysis: The Practical and Ethical Use of Social Media Data For Social Science Research

Lucas Czarnecki, University of Calgary, Dept. of Political Science

Page 2:

My Research

Psychological differences between conservatives and liberals

How innate traits like personality, temperament, and moral impulsivities shape ideology

“Do differences in ideology manifest in language use?”

photo by: Luci Gutiérrez

Page 3:

Before I even get started…

Go to www.eventbrite.ca

Using “Find Your Next Experience”, search for:

“I Want To Know More About…Computerized Text Analysis”

The Link will take you to:

Page 4:

Please Download…

You will need:

Optional, but highly recommended:

Page 5:

My Two Goals for this Workshop

First Goal

To provide practical information (on data collection, preprocessing, and analysis)

Second Goal

To have a wider discussion on text and social media data (the history and the current state of affairs)

Common Theme Throughout: The Practical and Ethical Use of Social Media Data

Page 6:

What Is Computerized Text Analysis?

Not one method or approach, but many

A Swiss Army Knife for studying language use and text data

Any automated process for categorizing word use, or uncovering its latent meaning, within or across files in computer-readable formats

Page 7:

History of Text Analysis

1901: The idea has roots in the earliest days of psychology (e.g. Freud & psycholinguistics)
1921: Rorschach projective tests linked word use with psychological drives
1950s: Early work in content analysis (still using human coders and judges)
1966-1978: Philip Stone et al. work on the first computerized text analysis program
1990s: Programs like General Inquirer become available on small personal computers
1992-1994: Earliest work on Linguistic Inquiry and Word Count

Page 8:

Computerized Text Analysis: Why Now?

“We are in the midst of a technological revolution whereby, for the first time, researchers can link daily word use to a broad array of real-world behaviors... to detect meaning in a wide variety of experimental settings, including to show attentional focus, emotionality, social relationships, thinking styles, and individual differences.”

Tausczik & Pennebaker (2010)

Page 9:

ENIAC (1943) – The World’s First* Computer

Occupied ~1,800 square feet
Weighed almost 50 tons

…and no pictures of cats

ENIAC was constructed by the University of Pennsylvania; construction began in 1943 and was completed in 1946.

Page 10:

IBM’s 1956 5MB Hard-Drive

50 24-inch discs stacked together
Took up 16 sq. ft.

Cost IBM ~$35,000 annually

Stored an impressive 5MB of data.

Weighed just less than 1 ton…now that’s progress!

Page 11:

A Technological Revolution

Computers are becoming increasingly powerful, while simultaneously less expensive

Graph by: Max Roser – Our World in Data

Page 12:

An Information Revolution

Source: BBVA API-Market

Page 13:

An Information Revolution

Vids/pics aside… a lot (if not most) of this is text data

Source: BBVA API-Market

Page 14:

The untapped world of unstructured data

The vast majority of data available online right now is unstructured. It is hard to estimate, but roughly 80 to 90% of all data online is unstructured, and much of this is text. Roughly 2.5 billion GB of data are created per day.

It is a very exciting time for computerized text analysis!

Page 15:

Before we start data mining…we need to talk about Ethics

Our technology has outpaced our philosophy

In our Communities

In Law and Government

…and in Academia

Page 16:

In Law & Government – e.g. Surveillance Programs

Constitutions/Conventions Evolve Slowly…

…Technology does not

Page 17:

Ethics aside…the NSA Shows us how much Data is really out there

East Germany’s Stasi vs. the NSA

https://apps.opendatacity.de/stasi-vs-nsa/english.html

Page 18:

Academia – e.g. Facebook Emotion Manipulation Experiment (2014)

Universities Must Update Their Ethics in the Face of New Methods

Page 19:

My Ethics Application

What my research involves…

Scraped: Posts from Party Leaders (N = 1,712)
Harper = 525 | Trudeau = 531 | Mulcair = 656

Scraped: Data on Facebook Users
Comments = 297,830 | Likes = 3,296,035 | Shares = 855,381

…none of these people know they are in my study

The process of applying for ethics approval…

Page 20:

The Rigorous Ethics Approval Process

Made my life easier

…but I should not be granted this luxury

Page 21:

Getting Started: A Very, very, very brief Intro to R

Page 22:

Let’s start with scalars

Enter:
X <- 3

Print (display) your data:
X

Or:
X = 4

R as a calculator, enter:
X * 5

Save calculations as distinct new objects:
Y <- X * 5

Page 23:

Now for vectors:

Enter:
A <- c(2,4,6,8)

R loves functions (e.g. concatenate, mean, etc.)

Functions make life easier

Without functions…calculations would need to be done manually, e.g.:
(2+4+6+8)/4

Or:
mean(A)
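The scalar and vector steps above can be run together as one short script. This is only a minimal sketch: the object names X, Y and A follow the slides, while manual_mean and function_mean are names introduced here for illustration.

```r
# Scalars: assign with <- (or =), then reuse the object in calculations
X <- 3
Y <- X * 5            # Y is a new object; X itself is unchanged

# Vectors: c() concatenates values into a single object
A <- c(2, 4, 6, 8)

# The manual calculation and the function give the same answer
manual_mean   <- (2 + 4 + 6 + 8) / 4
function_mean <- mean(A)

print(Y)
print(manual_mean == function_mean)
```

The payoff of mean(A) over the manual version is that it keeps working when A grows to thousands of elements.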

Page 24:

Vector of character strings!

This will give you an error:
list_of_names <- c(bob,jane,david,mark,john)

Instead:
list_of_names <- c("bob","jane","david","mark","john")

list_of_candidates <- c("Justin Trudeau", "Stephen Harper", "Tom Mulcair", "Gilles Duceppe", "Elizabeth May")

Page 25:

Humble beginnings

Let’s start with a vector of numbers, for example:
v <- c(39.5,31.9,19.7,4.7,3.4)

Assign names to elements, enter:
names(v) <- c("Justin Trudeau", "Stephen Harper", "Tom Mulcair", "Gilles Duceppe", "Elizabeth May")

Print v to view data:
v

Look up specific data:
v["Stephen Harper"]
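Once a vector has names, a few more look-ups become possible. The sketch below reuses the values and names from the slide; the particular queries are just examples chosen here, not part of the original workshop.

```r
# Values and names taken from the slide
v <- c(39.5, 31.9, 19.7, 4.7, 3.4)
names(v) <- c("Justin Trudeau", "Stephen Harper", "Tom Mulcair",
              "Gilles Duceppe", "Elizabeth May")

v[c("Justin Trudeau", "Tom Mulcair")]   # look up several names at once
v[v > 10]                               # keep elements that meet a condition
names(v)[which.max(v)]                  # name attached to the largest value
sort(v, decreasing = TRUE)              # order from largest to smallest
```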

Page 26:

Getting Familiar with Text Data

Sentences from Alice In Wonderland

Alice <- c("But I don't want to go among mad people, Alice remarked",
           "Oh you can't help that, said the Cat, we're all mad here. I'm mad. You're mad",
           "How do you know I'm mad, said Alice",
           "You must be, said the Cat, or you wouldn't have come here")

Print:
Alice
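A few base-R functions are enough to start poking at a character vector like this one. The following is a rough sketch using the Alice sentences from the slide; the whitespace tokenizer and the helper names (words, clean, mad_counts) are simplifications introduced here, not a recommended pipeline.

```r
Alice <- c(
  "But I don't want to go among mad people, Alice remarked",
  "Oh you can't help that, said the Cat, we're all mad here. I'm mad. You're mad",
  "How do you know I'm mad, said Alice",
  "You must be, said the Cat, or you wouldn't have come here"
)

length(Alice)          # number of sentences in the vector
nchar(Alice)           # characters per sentence
grepl("Cat", Alice)    # which sentences mention the Cat?

# Split each sentence into lower-case words (a simple whitespace tokenizer)
words <- strsplit(tolower(Alice), "[[:space:]]+")

# Strip punctuation, then count how often "mad" appears in each sentence
clean <- lapply(words, function(w) gsub("[[:punct:]]", "", w))
mad_counts <- sapply(clean, function(w) sum(w == "mad"))
mad_counts
```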

Page 27:

FINALLY, Matrices:

Your first matrix:
myMatrix <- matrix(data=c(1,2,3,4,5,6), ncol=3)
myMatrix

Also works with text (very common):
myMatrix <- matrix(data=c("you", "can", "also", "include", "words", "here"), ncol=3)
myMatrix
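One detail worth seeing for yourself is the order in which matrix() fills its cells. The sketch below uses the same six numbers as the slide; by_col and by_row are names made up here to contrast the default fill with byrow = TRUE.

```r
# matrix() fills column by column unless byrow = TRUE is given
by_col <- matrix(data = c(1, 2, 3, 4, 5, 6), ncol = 3)
by_row <- matrix(data = c(1, 2, 3, 4, 5, 6), ncol = 3, byrow = TRUE)

dim(by_col)      # number of rows, number of columns
by_col[1, ]      # first row when filled by column
by_row[1, ]      # first row when filled by row
by_col[2, 3]     # a single cell: row 2, column 3
```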

Page 28:

Functions and Packages

Packages are here to help!!!

install.packages("tm")
require(tm)   # or: library(tm)

To save time:
Needed <- c("tm", "lsa", "Rfacebook", "twitteR", "ggplot2", "devtools", "wordcloud", "biclust", "cluster", "igraph", "fpc")
lapply(Needed, require, character.only = TRUE)
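Because require() only returns FALSE (with a warning) when a package is missing, it can help to check what is installed before loading everything. The following is a small sketch of one way to do that; missing_packages() is a helper written for this example, and the shortened package list is only illustrative.

```r
needed <- c("tm", "lsa", "ggplot2", "wordcloud")

# Return the entries of `pkgs` that are not installed on this machine
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  pkgs[!pkgs %in% installed]
}

to_install <- missing_packages(needed)
if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)   # uncomment to install them
}
```

The install line is left commented out so that the script never downloads anything without you asking it to.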

Page 29:

In a nutshell…why use R?

1) It is open source (freeware)
2) It is incredibly flexible
3) The R Community (R-bloggers, Stack Exchange, etc.)
4) Increasingly popular in academia
5) Multi-disciplinary
6) Constantly evolving features

Page 30:

An Overview of Text As Data Methods

From: Justin Grimmer & Brandon Stewart (2013), “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts”

Political Analysis, 21: 267-297

Page 31:

Computerized Text Analysis: A How To Guide

Step One: Data Collection

Page 32:

An Overview of Text As Data Methods

From: Justin Grimmer & Brandon Stewart (2013), “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts”

Political Analysis, 21: 267-297

Page 33:

Data Are Everywhere

1. Existing Corpora (e.g. Government Collections / University Archives), e.g.
   I. POLTEXT – Université Laval
   II. Project Gutenberg
   III. Presidential Speech Archive – University of Virginia
   IV. Natural Speech Corpora also available

2. Electronic Sources
   I. Web Scraping
   II. Social Media
   III. Blogs

3. Undigitized Text (e.g. old manuscripts, treaties, debate proceedings, election records)

Tip: Many R packages (e.g. tm, quanteda, lsa, etc.) include corpora you can practice on

Page 34:

Getting Started: Become a Facebook Developer

Google: “Facebook For Developers”
Site: https://developers.facebook.com

Note: You will need a FB Account to sign in.

Your page After Registration

To Begin:

1) Click on “My Apps” (top right)

Page 35:

Getting Started: Become a Facebook Developer

1. GOOGLE: “Facebook Developers”
2. Sign in to Facebook – you will need an account
3. CLICK on “My Apps”
4. Click “Create New App”

Page 36:

Getting Started: Become a Facebook Developer

2) Create App ID

(there will be some pop up windows)

3) Enter security question

4) Choose Platform

5) WWW

6) Skip Quick Start

Page 37:

Getting Started: Become a Facebook Developer

6) Click -> Skip Quick Start

7) Go to Dashboard (automatic)

Page 38:

FROM DASHBOARD

Never share your App Secret!!!!!

8) Copy your App ID AND your App Secret (the latter will require you to enter your password)

9) Return to R / RStudio

Page 39:

DASHBOARD / Settings / Basic

Never share your App Secret!!!!!

9) Click Add Platform

10) Select “Website”

11) Paste: http://localhost:1410/ Into Site URL

12) Save Changes

13) Go back to R – Hit Enter

Page 40:

DASHBOARD / Settings / Basic

11) Paste: http://localhost:1410/ Into Site URL

12) Save Changes

13) Go back to R – Hit Enter

Page 41:

That’s it. All ready to go!

<- If successful

Click Continue:

Web browser should display:

Authentication complete. Please close this page and return to R.

Authentication complete. Authentication successful.

Page 42:

Getting Started: Become a Twitter Developer

Your page After Registration

To Begin:
Go to: https://dev.twitter.com/

Page 43:

Getting Started: Become a Twitter Developer

1. Click “My Apps”
2. Then “Create New App”

You will come to this page

1. Give your app a name
2. A description
3. For website type http://test.de/
4. Leave Callback URL blank
5. Agree to conditions
6. Create Your Twitter App

Page 44:

Getting Started: Become a Twitter Developer

In R/RStudio

library(twitteR)   # provides setup_twitter_oauth()

consumer_key <- "Your_Consumer_Key"
consumer_secret <- "Your_Consumer_Secret"
access_token <- "Your_Token_Here"
access_secret <- "Your_Secret_Here"

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

Page 45:

Now that you are a developer…What can you do?

A Short Summary:

Collect data by location – e.g. using Twitter’s “geotags”
Collect data by topic – using keywords or hashtags
Collect data on groups – using Facebook GroupIDs
Select data according to parameters such as time – e.g. posts made during the Canadian Election Campaign (4/Aug/2015 until 19/Oct/2015)
Scrape timelines – e.g. President Obama’s @POTUS account
Remove noise – preprocessing
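Selecting by time, as in the campaign-window example above, can be sketched in base R once the posts sit in a data frame. The data frame below is entirely made up for illustration (its column names message and created are assumptions); real data would come from whichever scraping call you use.

```r
# Hypothetical posts; in practice these would come from a scraping call
posts <- data.frame(
  message = c("Pre-campaign post", "Campaign launch",
              "Debate night", "Post-election post"),
  created = as.Date(c("2015-07-20", "2015-08-04",
                      "2015-09-17", "2015-10-25")),
  stringsAsFactors = FALSE
)

campaign_start <- as.Date("2015-08-04")
campaign_end   <- as.Date("2015-10-19")

# Keep only posts made inside the campaign window (both ends inclusive)
in_window <- posts$created >= campaign_start & posts$created <= campaign_end
campaign_posts <- posts[in_window, ]
nrow(campaign_posts)
```

Converting timestamps with as.Date() first matters: comparing dates stored as plain text can silently give the wrong order.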

Page 46:

A Couple Quick Examples

Page 47:

The Signal Through the Noise

Step Two: Preprocessing (Cleaning up our data)

Page 48:

Preprocessing

“a series of operations on data by a computer in order to retrieve or transform or classify information” (Oxford Dictionary)

Operations that transform words into numbers for the purposes of future analysis

Page 49:

An Overview of Text As Data Methods

From: Justin Grimmer & Brandon Stewart (2013), “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts”

Political Analysis, 21: 267-297

Page 50:

A few common procedures

tm_map(x, FUN, …) – from the “tm” package

docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, content_transformer(tolower))   # wrap base functions so the corpus structure is kept
docs <- tm_map(docs, removeWords, myStopwords)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, removeWords, myStopwords)       # remove stop words again after stemming
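To see what these steps actually do to a piece of text, here is a rough base-R imitation of the same sequence on two invented sentences. It is a sketch for intuition only, not how the tm package is implemented; preprocess(), docs_raw and my_stopwords are all names made up for this example, and stemming is left out.

```r
docs_raw <- c("The 2 Parties debated taxes, jobs & the economy!",
              "Voters   were NOT impressed...")

my_stopwords <- c("the", "were", "not")   # hypothetical stop word list

preprocess <- function(x, stopwords) {
  x <- gsub("[[:punct:]]", "", x)                 # remove punctuation
  x <- gsub("[[:digit:]]", "", x)                 # remove numbers
  x <- tolower(x)                                 # lower-case everything
  pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)          # remove stop words
  x <- gsub("[[:space:]]+", " ", x)               # collapse runs of whitespace
  trimws(x)                                       # drop leading/trailing spaces
}

docs_clean <- preprocess(docs_raw, my_stopwords)
print(docs_clean)
```

Order matters here: lower-casing before stop word removal is what lets "The" and "NOT" match the lower-case stop word list.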

Page 51:

Stemming / Lemmatization

The process of reducing a word to its most basic form (Porter, 1980)

Party – Parties – Partying → Parti*

Common Problem: University Students Partying Rather than Studying Political Parties
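The collision described above is easy to reproduce. The sketch below uses a deliberately crude suffix stripper written for this example (crude_stem() is not the Porter algorithm and handles only a few endings); a real analysis would use a proper stemmer such as stemDocument from tm.

```r
# A deliberately crude suffix stripper (NOT the Porter algorithm):
# it only rewrites words ending in "ies", "ying" or "y"
crude_stem <- function(words) {
  sub("(ies|ying|y)$", "i", tolower(words), perl = TRUE)
}

crude_stem(c("party", "parties", "partying"))

# Two very different sentences end up sharing the same stem
students  <- c("students", "partying")
political <- c("political", "parties")
crude_stem(students)[2] == crude_stem(political)[2]
```

Once both texts contain the same stem, a word count can no longer tell the student party from the political party; only the surrounding words can.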

Page 52:

A Quick Example

Page 53:

Finding the Signal Through the Noise

Step Three: Analysis

Page 54:

An Overview of Text As Data Methods

From: Justin Grimmer & Brandon Stewart (2013), “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts”

Political Analysis, 21: 267-297

Page 55:

Two Different Approaches

Word Count Strategy
Top down
A bag-of-words approach
Relative frequency of words
Usually a percentage (target word / total words)

Word Pattern Analysis
Bottom up
Word association
Word covariation
Term-document matrix

See: Pennebaker et al. (2003) “Psychological Aspects of Natural Language Use: Our Words, Our Selves.” Annual Review of Psychology 54: 549-50.
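Both approaches can be illustrated on a handful of toy documents in base R. This is a minimal sketch: the three documents are invented, the tokenizer is a plain split on spaces, and the names pct_target and tdm are chosen here for illustration.

```r
docs <- c(doc1 = "we are all mad here",
          doc2 = "i am mad and you are mad",
          doc3 = "we must be going")

tokens <- strsplit(docs, " ", fixed = TRUE)

# Word count strategy: percentage of each document's words equal to a target
target <- "mad"
pct_target <- sapply(tokens, function(w) 100 * sum(w == target) / length(w))

# Word pattern analysis starts from a term-document matrix:
# rows are terms, columns are documents, cells are counts
vocab <- sort(unique(unlist(tokens)))
tdm <- sapply(tokens, function(w) as.vector(table(factor(w, levels = vocab))))
rownames(tdm) <- vocab

pct_target
tdm["mad", ]
```

The percentage answers a question you bring to the text (top down), while the matrix keeps every word so that patterns of co-occurrence can be explored afterwards (bottom up).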

Page 56:

Latent Semantic Analysis

Knowledge Acquisition and A Theory on Meaning

Chomsky – language acquisition & a universal grammar

Vs.

Locke – The Blank Slate: “no innate rules for processing data”

A Solution to Plato’s Problem

A must read: Landauer, T.K. & Dumais, S.T. (1997). “A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge.” Psychological Review, 104(2), 211-240.
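Computationally, the core of latent semantic analysis is a truncated singular value decomposition of a term-document matrix. The sketch below shows that step with base R's svd() on a tiny invented matrix; the counts, the choice of k = 2, and the cosine() helper are assumptions for illustration, and real applications (e.g. with the lsa package) add weighting and far larger corpora.

```r
# Toy term-document matrix (counts are invented for illustration)
tdm <- matrix(c(2, 1, 0, 0,
                1, 2, 0, 0,
                0, 0, 1, 2,
                0, 0, 2, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("tax", "budget", "health", "hospital"),
                              paste0("doc", 1:4)))

decomp <- svd(tdm)   # tdm = U %*% diag(d) %*% t(V)
k <- 2               # number of latent dimensions to keep

# Term coordinates in the k-dimensional latent space
term_space <- decomp$u[, 1:k] %*% diag(decomp$d[1:k])
rownames(term_space) <- rownames(tdm)

# Cosine similarity between two term vectors
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

cosine(term_space["tax", ], term_space["budget", ])
cosine(term_space["tax", ], term_space["health", ])
```

In this toy case the terms that share documents end up pointing in the same direction in the reduced space, while terms from the other block stay unrelated.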

Page 57:

Word Count Strategy – Two Parts
(e.g. Linguistic Inquiry and Word Count)

The Processing Component The Dictionary

Package “quanteda” – allows you to import LIWC dictionaries into R

Page 58:

Word Category | Examples | Psychological Correlate(s)
First-Person Singular | I, me, mine | Honest, depressed, low status, personal, emotional, informal
First-Person Plural | We, us, our | Detached, high status, socially connected to group (sometimes)
Third-Person Singular | She, him, her, he | Social interests, social support
Articles | A, an, the | Interest in objects and things, deference to authority
Negative emotion (e.g. anxiety & anger) | Hate, angry, mad, worried, concerned | Emotional state
Exclusivity | But, without, exclude | Cognitive complexity, honesty
Future/Past/Present Tense | Will, gonna, am, doing, went, ran, had | e.g. goal orientations (forward vs past focused)
Social processes | Mate, talk, they, child | Social concerns, social support

The Heart of Word Count Software – The Dictionary

Caution w/ Correlation
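The dictionary idea can be sketched in a few lines of base R: look each word up in a category list and report the share of words that match. The two categories below use only the example words from the table above, so this is a toy stand-in and not LIWC's actual dictionary; the sentence and object names are invented for illustration.

```r
# Toy dictionary built from the example words in the table (not LIWC's lists)
dictionary <- list(
  first_person_singular = c("i", "me", "mine"),
  negative_emotion      = c("hate", "angry", "mad", "worried", "concerned")
)

text  <- "I am worried that my vote will not count and I hate that"
words <- strsplit(tolower(text), "[^a-z']+")[[1]]
words <- words[words != ""]

# Percentage of all words that fall into each category
category_pct <- sapply(dictionary, function(entries) {
  100 * sum(words %in% entries) / length(words)
})
round(category_pct, 1)
```

The output says how often a category is used, not why; that is the reason for the caution about correlation.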

Page 59:

For Fun

Quickly scrape your friends’ and colleagues’ Twitter accounts

(makes use of LIWC’s dictionary)

Page 60:

This is… Anxiety (avoidance-based)
Overestimates risk
Status-quo oriented (risk-avoidance choices)
Risk reduction (concerned with uncertainty)

This is… Anger (approach-based)
Underestimates risk
Change oriented (risk-seeking choices)
Moral anger (addresses injustices “no dessert!”)

Liberals
Personality (Big 5): Openness to New Experiences

Conservatives
Personality (Big 5): Conscientiousness, Neuroticism

Predisposition != Determinism

Page 61:

Tracking Sentiment Over Time

Frequency of Anger-related words from Facebook Commentators during the 2015 Canadian General Election Campaign

Page 62:

Tracking Sentiment Over Time

Frequency of Positive Emotion words from Facebook Commentators during the 2015 Canadian General Election Campaign
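A time series like the ones on these two slides comes from counting category words per comment and then aggregating by date. The following is a sketch under stated assumptions: the comments, dates and the three-word anger list are invented here, and the study's own counts relied on a full dictionary rather than this toy list.

```r
# Hypothetical comments with dates; real data would be scraped comments
comments <- data.frame(
  date = as.Date(c("2015-08-04", "2015-08-04", "2015-08-05", "2015-08-05")),
  text = c("so angry about this", "great speech today",
           "i hate this policy and i am furious", "nice rally"),
  stringsAsFactors = FALSE
)
anger_words <- c("angry", "hate", "furious")   # toy list, not LIWC's

tokens <- strsplit(tolower(comments$text), " ", fixed = TRUE)
comments$anger <- sapply(tokens, function(w) sum(w %in% anger_words))
comments$total <- lengths(tokens)

# Daily anger rate: anger words as a percentage of all words that day
daily <- aggregate(cbind(anger, total) ~ date, data = comments, FUN = sum)
daily$anger_pct <- 100 * daily$anger / daily$total
daily
```

Dividing by the day's total word count keeps busy days from looking angrier simply because more was written.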

Page 63:

Creating Your Own Dictionaries

Word Category

%
01 HarmVirtue
02 HarmVice
03 FairnessVirtue
04 FairnessVice
05 IngroupVirtue
06 IngroupVice
07 AuthorityVirtue
08 AuthorityVice
09 PurityVirtue
10 PurityVice
11 MoralityGeneral
%

Target Words

compassion* 01
empath* 01
sympath* 01
…
class 07
Bourgeoisie 07
…
austerity 09
integrity 09 11

Data are organized hierarchically

* indicates a word stem
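Matching words against stem entries like these comes down to turning each trailing * into a wildcard. The sketch below does this with regular expressions in base R; the five entries are copied from the slide with their categories written out by name, while to_regex() and categories_for() are helpers invented for this example rather than functions from any package.

```r
# A few entries in the stem-plus-category style shown above
entries <- data.frame(
  pattern  = c("compassion*", "empath*", "sympath*", "class", "austerity"),
  category = c("HarmVirtue", "HarmVirtue", "HarmVirtue",
               "AuthorityVirtue", "PurityVirtue"),
  stringsAsFactors = FALSE
)

# "empath*" becomes "^empath.*$"; entries without * must match exactly
to_regex <- function(pattern) {
  paste0("^", sub("*", ".*", pattern, fixed = TRUE), "$")
}

# Return every category whose pattern matches the word
categories_for <- function(word, entries) {
  hits <- vapply(entries$pattern,
                 function(p) grepl(to_regex(p), tolower(word)),
                 logical(1))
  unique(entries$category[hits])
}

categories_for("Empathy", entries)
categories_for("classes", entries)
```

Note the asymmetry: "Empathy" matches the stem empath*, but "classes" does not match the exact entry class, which is why the choice between a stem and a whole word matters when writing a dictionary.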

Page 64:

Grimmer & Stewart (2013)

There is no single best method for computerized text analysis

Four Principles of Automated Text Analysis

1) All quantitative models of language are wrong – but some are useful
   Language models are inherently reductionist
2) Quantitative methods for text amplify resources & augment humans
   Computers cannot replace humans (…yet)
3) There is no globally best method for automated text analysis
   3.1) Your method will depend on: i) the hypothesis you are testing, and ii) your source(s) of data
4) Validate, validate, validate

We need to work together…across disciplines

Adapted from – Grimmer & Stewart (2013). “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis, 21: 267-297.

Page 65:

Regardless of your Method…A key concern is always validation

Psychological Processes | Examples of Dictionary Words | Words in Category | Internal Consistency (Uncorrected α) | Internal Consistency (Corrected α)
Psych. Affect | happy, cried | 1393 | 0.18 | 0.57
Pos. Emotions | love, nice, sweet | 620 | 0.23 | 0.64
Neg. Emotions | hurt, ugly, nasty | 744 | 0.17 | 0.55
Anxiety | worried, fearful | 116 | 0.31 | 0.73
Anger | hate, kill, annoyed | 230 | 0.16 | 0.53
Sadness | crying, grief, sad | 136 | 0.28 | 0.70
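For readers who have not met internal consistency before, the standard Cronbach's α can be computed in a few lines. This sketch uses simulated data and the textbook formula; cronbach_alpha() is a helper written for this example, and it does not reproduce how the corrected figures in the table were obtained.

```r
# Cronbach's alpha for a cases-by-items numeric matrix:
# alpha = k / (k - 1) * (1 - sum of item variances / variance of row totals)
cronbach_alpha <- function(items) {
  k <- ncol(items)
  item_vars <- apply(items, 2, var)
  total_var <- var(rowSums(items))
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# Simulated data: 200 texts, 5 items that share one underlying factor
set.seed(42)
factor_score <- rnorm(200)
items <- sapply(1:5, function(i) factor_score + rnorm(200))

cronbach_alpha(items)
```

In a dictionary setting the "items" are the words in a category and the "cases" are text samples, so α asks whether people who use one word in the category also tend to use the others.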

Page 66:

Thank you

Questions???