Text Mining with R for Social Science Research

Ryan Wesslen

Outline

Hour 1: Fundamentals of Text Mining with R
• Why text, examples and AlchemyAPI demo (10 min)
• “Bag of Words” Overview (10 min)
• Text Preprocessing & Visualizations using R Studio (30 min)
• 10 minute break

Hour 2: Applications
• Federalist Papers (History/Political Science) – 30 min
  • Naïve Bayes Classifier using Word Occurrences
  • K-Nearest Neighbor Classifier using Topic Modeling
• Federal Reserve Beige Book (Economics) – 30 min
  • Lexicon-based Sentiment Analysis

Objective
• Learn the basics of text preprocessing using R.
• Learn the “bag of words” approach to text analytics (tokenizing, cleaning, word clouds, associations, visualizations)
• Run three text mining applications for social sciences: Text Classification (Naïve Bayes & K-Nearest Neighbors), Topic Modeling (LDA) and Lexicon-based Sentiment Analysis
  • Datasets: the Federalist Papers (Poli/Hist) and the Fed Reserve Beige Book (Econ)
• Learn about resources (papers, textbooks, lecture notes, blogs) to encourage further research in text mining and natural language processing (NLP).

Why analyze text?
• Growing
• Interesting
• Untapped

Big Data: Internetlivestats.com

Language Technology

mostly solved
• Spam detection (Classification): “Let’s go to Agra!” ✓ / “Buy V1AGRA …” ✗
• Part-of-speech (POS) tagging: “Colorless green ideas sleep furiously.” → ADJ ADJ NOUN VERB ADV
• Named entity recognition (NER): “Einstein met with UN officials in Princeton” → PERSON, ORG, LOC

making good progress
• Sentiment analysis: “Best roast chicken in San Francisco!” / “The waiter ignored us for 20 minutes.”
• Coreference resolution: “Carter told Mubarak he shouldn’t run again.”
• Word sense disambiguation (WSD): “I need new batteries for my mouse.”
• Parsing: “I can see Alcatraz from the window!”
• Machine translation (MT): “第13届上海国际电影节开幕…” → “The 13th Shanghai International Film Festival…”
• Information extraction (IE): “You’re invited to our dinner party, Friday May 27 at 8:30” → Party, May 27, add

still really hard
• Question answering (QA): “Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?”
• Paraphrase: “XYZ acquired ABC yesterday” / “ABC has been taken over by XYZ”
• Summarization: “The Dow Jones is up”, “The S&P500 jumped”, “Housing prices rose” → “Economy is good”
• Dialog: “Where is Citizen Kane playing in SF?” “Castro Theatre at 7:30. Do you want a ticket?”

Source: Dan Jurafsky

Why else is text mining difficult?
• non-standard English: “Great job @justinbieber! Were SOO PROUD of what youve accomplished! U taught us 2 #neversaynever & you yourself should never give up either♥”
• segmentation issues: the New York-New Haven Railroad
• idioms: dark horse, get cold feet, lose face, throw in the towel
• neologisms: unfriend, Retweet, bromance
• tricky entity names: “Where is A Bug’s Life playing …”, “Let It Be was recorded …”, “… a mutation on the for gene …”
• sarcasm: A: “I love Justin Bieber. Do you like him too?” B: “Yeah. Sure. I absolutely love him.”

Source: Dan Jurafsky (modified)

AlchemyAPI Example
• Go to http://www.alchemyapi.com/
• Click on the homepage
• As an introduction, copy/paste Federalist Paper #10: https://www.congress.gov/resources/display/content/The+Federalist+Papers#TheFederalistPapers-10
• Click and explore!

Federalist Papers & Text Classification

1:10pm

Alexander Hamilton

Genius.com’s “Non-Stop” Lyrics

Federalist Paper setup
• Not so true story, bro (about how many papers each wrote)
• Reality: the authorship of twelve papers is disputed
  • Hamilton claimed authorship before he was killed; Madison disputed those claims eight years later.
  • Adair (1944), Mosteller & Wallace (1963), Fung (2003), Collins et al. (2004)
• Three tasks:
  • Pre-process and explore the data (word cloud, associations, etc.)
  • Naïve Bayes classification to predict the author of the 12 disputed papers based on word counts.
  • Topic modeling to identify key themes, and k-nearest neighbors to predict the author based on the papers’ topics.

Basic Text Terminology

Corpus

Document

Term

“Bag of Words” Approach

• Simplest way to quantify text
  • Counts how often each term occurs in each document
  • Result: a Document-Term Matrix
• Ignores word order
  • N-grams (uni-, bi-, tri-, etc.) keep some local word order
• Good at classification
  • e.g. a spam filter
• Bad at semantic meaning

Source: Chris Manning
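To make the document-term matrix concrete, here is a minimal sketch in plain Python (the workshop's own scripts are in R and are not reproduced here). The function names `tokenize`, `ngrams` and `document_term_matrix`, and the two toy documents, are assumptions made for illustration only.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()

def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def document_term_matrix(docs, n=1):
    """Build (vocabulary, matrix): one row per document, one column per term."""
    counts = [Counter(ngrams(tokenize(d), n)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

# Two invented toy documents, not taken from the Federalist Papers.
docs = ["the people of the union", "the union of the states"]
vocab, dtm = document_term_matrix(docs)
print(vocab)  # column labels (terms)
for row in dtm:
    print(row)  # term counts for one document
```

Because only counts are stored, "a b" and "b a" produce the same row. Calling `document_term_matrix(docs, n=2)` switches the columns to bigrams, which is one way to retain a little word order.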

Preprocessing

• Tokenization
• Cleaning: Lower case, white space, punctuation
• Stemming, Lemmatization and/or Collocations
• Filter: remove stop words

Tokenize → Clean → Stem → Filter

Example: “Then a hurricane came, and devastation reigned” becomes “then a hurricane came and devastation reigned” after lowercasing and punctuation removal.
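The four stages above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the workshop's R code: the stop-word list is a tiny invented one, and `stem` is a toy suffix stripper rather than a real stemming algorithm such as Porter's.

```python
import string

# A tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"a", "and", "the", "then", "of", "to"}

def tokenize(text):
    """Split the raw text on whitespace."""
    return text.split()

def clean(tokens):
    """Lowercase each token and strip ASCII punctuation; drop empty tokens."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = [t.lower().translate(table) for t in tokens]
    return [t for t in cleaned if t]

def stem(tokens):
    """Toy stemmer: strip one common suffix. NOT the Porter algorithm."""
    suffixes = ("ing", "ed", "s")
    out = []
    for t in tokens:
        for suf in suffixes:
            # Only strip if at least three characters remain.
            if t.endswith(suf) and len(t) - len(suf) >= 3:
                t = t[: -len(suf)]
                break
        out.append(t)
    return out

def remove_stop_words(tokens):
    """Filter out tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

text = "Then a hurricane came, and devastation reigned"
tokens = remove_stop_words(stem(clean(tokenize(text))))
print(tokens)  # ['hurricane', 'came', 'devastation', 'reign']
```

The order of the stages matters: stop words are matched after lowercasing, so "Then" is removed only because cleaning ran first.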

Part 1: R Studio, Working Directory & R Packages
• Open R Studio and FederalistPapers.R (see GitHub site)

1:20pm Code Lines: 1 - 49

Part 2: Load csv (text) file & view

Code Lines: 50-79

Part 3a: Pre-processing

Federalist Paper 1: Before

Federalist Paper 1: After

Code Lines: 71-88

Part 3b: Additional Pre-processing

Federalist Paper 1: After

Code Lines: 89-104

Part 4a: Document-Term Matrices

Code Lines: 142-149

Part 4b: Word Cloud

Code Lines: 151-165

1:30pm

Part 4c: Term Frequencies

Code Lines: 167-171

Part 4d: Word Associations

Code Lines: 173-188

Part 4e: Word Clustering: Hierarchical

Code Lines: 189-201

Part 4f: K-Means Word Clustering

Code Lines: 202-207

Part 5: Redo with Stemming, Bigrams and additional stop words

Uncomment (CTRL + SHIFT + C) and run lines 107-139, then rerun lines 141-206

Code Lines: 107-139

Classification Models (Overview)
• Classification models predict class labels
  • Class labels = categories
  • For example, binary (yes or no), ordinal (high, medium, low) or nominal (dog, cat, kangaroo)
• Classification models are a type of supervised learning, as the class labels (“y variables”) are known (observed).
• Determining the authorship of the disputed Federalist papers is a binary classification problem, as each disputed paper was written by one of two authors: Hamilton or Madison.

1:50pm - 2pm

Types of Classification Models
• There are many different models (algorithms) that can be used for classification problems.
  • Examples: Logistic Regression, Decision Tree, Support Vector Machine, Neural Networks
• We are going to use Naïve Bayes and k-nearest neighbors.
• We will use different feature variables (X variables)
  • Naïve Bayes = Word Presence (1/0) as X Variables
  • k-Nearest Neighbors = Topic Probabilities as X Variables

Naïve Bayes
• Naïve Bayes is an algorithm based on Bayes’ Theorem (conditional probability)
• Updates the probability of the predicted class (e.g. who is the author) based on words found in the class (author’s papers).
  • Example in spam filters: the word “Viagra” increases the odds that an email is spam
• Pro: Simple, can handle many features (x variables)
• Con: Difficult to interpret, subject to assumptions (e.g. independence of x variables)
• See these slides for a deeper overview of Naïve Bayes and its assumptions.
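A minimal sketch of Naïve Bayes on word-presence (1/0) features, written in plain Python rather than the workshop's R. It assumes a Bernoulli model with add-one smoothing; whether the workshop script makes the same choices is not stated in the slides. The training documents are invented toy examples, not the Federalist data.

```python
import math

def train_bernoulli_nb(docs, labels, vocab):
    """Estimate P(class) and P(word present | class) with add-one smoothing."""
    model = {}
    for cls in set(labels):
        cls_docs = [set(d) for d, y in zip(docs, labels) if y == cls]
        prior = len(cls_docs) / len(docs)
        cond = {w: (sum(w in d for d in cls_docs) + 1) / (len(cls_docs) + 2)
                for w in vocab}
        model[cls] = (prior, cond)
    return model

def predict(model, doc, vocab):
    """Return the class with the highest log posterior for one document."""
    present = set(doc)
    scores = {}
    for cls, (prior, cond) in model.items():
        score = math.log(prior)
        for w in vocab:
            p = cond[w]
            # Independence assumption: each word contributes separately.
            score += math.log(p if w in present else 1 - p)
        scores[cls] = score
    return max(scores, key=scores.get)

# Invented toy documents (lists of tokens); labels are for illustration only.
train_docs = [["upon", "there", "state"], ["upon", "power"],
              ["whilst", "state"], ["whilst", "power", "there"]]
train_labels = ["Hamilton", "Hamilton", "Madison", "Madison"]
vocab = ["upon", "whilst", "state", "power", "there"]

model = train_bernoulli_nb(train_docs, train_labels, vocab)
print(predict(model, ["upon", "state", "power"], vocab))  # Hamilton
print(predict(model, ["whilst", "there"], vocab))         # Madison
```

Working in log probabilities avoids numerical underflow when the vocabulary is large, and the smoothing keeps a word that never appears for one author from forcing that author's probability to zero.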

Basics of Predictive Modeling
• In predictive modeling, datasets are divided into training and test (sometimes called validation) sets
• Federalist Papers:
  • Training Dataset = 65 papers* with known author (known label)
  • Test Dataset = 12 papers with disputed author (missing label)
  * Excludes papers written by John Jay (five) and papers written jointly by Madison & Hamilton (three)
• Our objective is to build a model that accurately predicts the authors in the training dataset.
• After building the model, apply it to the test dataset to predict the authors of the 12 disputed papers.
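The train/test workflow above can be sketched as follows in Python. The records, the field names `paper` and `author`, and the majority-class "model" are all assumptions chosen to keep the example short; a real classifier would take the place of the baseline.

```python
from collections import Counter

def split_by_label(records):
    """Separate records with a known label (training) from those without (test)."""
    train = [r for r in records if r["author"] is not None]
    test = [r for r in records if r["author"] is None]
    return train, test

def accuracy(predicted, actual):
    """Share of predictions that match the known labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical records; paper ids and labels are made up for illustration.
records = [
    {"paper": "A", "author": "Hamilton"},
    {"paper": "B", "author": "Hamilton"},
    {"paper": "C", "author": "Madison"},
    {"paper": "D", "author": None},  # disputed: label missing
]
train, test = split_by_label(records)

# Baseline "model": always predict the most common training author.
majority = Counter(r["author"] for r in train).most_common(1)[0][0]

train_accuracy = accuracy([majority] * len(train), [r["author"] for r in train])
test_predictions = {r["paper"]: majority for r in test}
print(train_accuracy)
print(test_predictions)
```

Accuracy measured on the training set tends to overstate how well a model generalizes. Since the disputed papers have no labels to check against, cross-validation on the labeled papers is one common way to get a more honest estimate.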

Part 6: Naïve Bayes Pre-Processing

Code Lines: 208-219

Conditional Probabilities

Update

Code Lines: 231-241

Odds Ratios

Code Lines: 242-248

Train Naïve Bayes and Predict Training Dataset

Code Lines: 250-273

Predict Test (Disputed) Dataset

Code Lines: 275-290

Part 7: Running Topic Modeling

This will take about 4 mins, depending on the computer you run it on

Code Lines: 295-308

Topic Modeling Overview

Source: David Blei (link to article)

Create LDAVis Tool & Label Topics

Code Lines: 295-308

LDAVis Package to Visualize Topics

Open the index.html file in the “Federalist” folder in your working directory. Use Firefox; it is not supported by Chrome or IE.

Topic Clustering Heatmap via R Shiny

Code Lines: 321-349

Topic Clustering (Plot)

Nearest Neighbors

Code Lines: 350-370
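A minimal sketch of k-nearest neighbors on topic probabilities, in plain Python rather than the workshop's R. It assumes Euclidean distance and a simple majority vote; the slides do not say which distance or value of k the script uses. The topic proportions below are invented.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training points closest to the query."""
    order = sorted(range(len(train_x)),
                   key=lambda i: euclidean(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical topic proportions (3 topics per paper); values are invented.
train_x = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.8, 0.1, 0.1],
           [0.1, 0.2, 0.7], [0.2, 0.2, 0.6], [0.1, 0.3, 0.6]]
train_y = ["Hamilton"] * 3 + ["Madison"] * 3

print(knn_predict(train_x, train_y, [0.65, 0.25, 0.10]))  # Hamilton
print(knn_predict(train_x, train_y, [0.15, 0.25, 0.60]))  # Madison
```

An odd k avoids tied votes in a two-class problem. Predictions can change with k and with the number of topics, which is one reason to check how stable the results are.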

Predictions

• Naïve Bayes predicts 9 of the 12 papers as written by Madison.

• K-NN predicts only 4 of the 12 papers as written by Madison.

• Why? How stable are these results?

Code Lines: 371-373

Beige Book & Sentiment Analysis

2:30pm

Sentiment Analysis
• Two main types of textual information: facts and opinions.
• Search engines are optimized for facts.
• Sentiment Analysis is a growing attempt (not completely solved) to optimize the discovery of opinions.
• Opinion Mining, or Sentiment Analysis, is an attempt to recognize the opinion or sentiment that a person holds toward an object.

Source: Richard Heimann

Where do we find sentiment?
• Movies / Books: Are the reviews of this movie/book positive or negative?
• Product Sales: What do people think of the new iPhone?
• Public Sentiment: How do consumers feel about the economy? How is consumer sentiment affecting sales by sector?
• Politics: How are voters polarized, if at all, around a candidate or policy?
• Prediction: Stock Prices, Election Outcomes, Market Trends, Product Sales

Source: Richard Heimann

Three Types of Sentiment Analysis
• Dictionary-Based Sentiment Analysis
  • i.e. Is an attitude toward an object positive or negative?
  • Build a dictionary of positive / negative words and count net occurrences
• Supervised Learning for Sentiment Analysis
  • i.e. Given data we have seen in the past, can we predict class assignment for our polarity measure (positive/neutral/negative)?
  • e.g. Naive Bayes, MaxEnt, SVM
• Unsupervised Sentiment Analysis
  • i.e. No dictionaries, no labeled data, no training algorithms. Instead, scale words (often bi-grams) and users on a single dimension.
  • e.g. latent variable models – Item Response Theory (IRT)

Source: Richard Heimann
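The dictionary-based approach (count positive words, subtract negative words) can be sketched in a few lines of Python. The two lexicons and the three sentences are invented for illustration; real sentiment lexicons contain thousands of entries, and this is not the workshop's R scoring script.

```python
import re

# Tiny invented lexicons (assumption); real ones are far larger.
POSITIVE = {"strong", "growth", "improved", "gains"}
NEGATIVE = {"weak", "decline", "slowed", "losses"}

def sentiment_score(text, positive=POSITIVE, negative=NEGATIVE):
    """Net sentiment: (# positive words) - (# negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    return pos - neg

# Made-up sentences in the style of an economic report.
reports = [
    "Retail sales improved and hiring showed strong gains.",
    "Manufacturing slowed and exports were weak.",
    "Activity was mixed: growth in services, decline in housing.",
]
for r in reports:
    print(sentiment_score(r), r)
```

The three sentences score 3, -2 and 0. The last case shows a limitation of the method: a net score of zero cannot distinguish a neutral text from one with offsetting positive and negative words.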

Federal Reserve Beige Book
• The Beige Book is a report published by the United States Federal Reserve Board (FRB) eight times a year.
• The Beige Book has been in publication since 1985 and is now published online.
• The report is published by each of the 12 Federal Reserve Bank districts.
• The content is rather anecdotal. The report draws on interviews with key business contacts, economists, market experts, and others to get their opinions about the economy.
• The data used in this book can be found on GitHub, as well as the Python code for all the scraping and parsing.

Source: Richard Heimann

Beige Book Case Study: Initial Steps

• Step 1: Download SentimentBeigeBook.zip from https://github.com/wesslen/BeigeBookSentimentAnalysis

• Step 2: Save into a local directory. Open R Studio

• Step 3: Open “sentiment_analysis.R” and “sentiment.R”

Step 1: Working Directory
• Run “sentiment.R”. This script counts positive minus negative words (the net score) in a document, given the sentiment (lexicon) dictionary. It will be used later on.
• Set the working directory based on where you downloaded the zip file contents.
• Note: For Windows, write the path with forward slashes (“C:/Directory/Folder/”) or doubled backslashes (“C:\\Directory\\Folder\\”), because a single backslash is an escape character in R strings.

Step 2: Import in Dictionaries

Step 3: Import Corpus / Text

Step 4: Pre-Processing

Step 5: Create corpus & tokenize

Step 6: Stemming & Stop Words

Step 7: Term-Document Matrix

Step 8: Explore Common Words

Step 9: Add more pos/neg words

Step 10: Word Associations

Step 11: Word Cloud

Step 12: Sentiment Scoring

First six records of BB.sentiment

Step 13: Normalizing Scores

First six records of BB.sentiment (updated)
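The slides do not show how the scores are normalized. One common choice, assumed here, is z-score scaling: subtract the mean and divide by the sample standard deviation (the same default behavior as R's `scale()`). The sketch below is in Python with invented raw scores.

```python
import statistics

def scale_scores(scores):
    """Center on the mean and divide by the sample standard deviation."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation (n - 1)
    return [(s - mean) / sd for s in scores]

raw = [3, -2, 0, 5, -1]  # hypothetical raw net-sentiment scores
scaled = scale_scores(raw)
print([round(z, 2) for z in scaled])
```

After scaling, the scores have mean 0 and standard deviation 1, so a value of 2 reads as "two standard deviations more positive than the average report", whatever the length of the documents or the size of the lexicon.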

Step 14: Score Histograms

Raw Scored Sentiment

Scaled Scored Sentiment

Step 15: Plot Historical Sentiment

Step 16: Run beigebookplots.R

Concluding Thoughts
• Bag of words approach is a simple text mining framework
  • Works well for exploratory analysis, classification and basic sentiment analysis.
  • Deeper models are needed to identify semantic meaning (e.g. GloVe, recurrent neural networks, see Stanford Deep Learning NLP class materials)
• R is a great tool for simple, visual-based text mining
  • However, R has limitations (scale, functions, etc.)
  • Python (nltk) and Java are better for large-scale, PhD-dissertation research
• Text mining is an iterative process
  • There is not a single model or method that always works – depends on context!
  • If your initial results are vague, enhance pre-processing
    • e.g. remove more stop words, try bi/trigrams, try stemming or lemmas, customize lexicon
• Text mining is an art, not a science.
  • Need domain experience AND algorithms
  • If you’re a social scientist, make friends with a computer scientist (and vice versa).

Project Mosaic & Next Workshop
• Project Mosaic offers consulting, workshops and other collaborative research opportunities.
• Upcoming Workshops: https://projectmosaic.uncc.edu/events-list/
• Next month: workshop on Text Mining for Twitter
  • Will include reference to SOPHI, UNCC Data Science Initiative’s data warehouse that includes GNIP access to historical Twitter data.
  • If you are planning on attending, please register for Twitter API credentials before the workshop.
  • Follow these instructions (to set up with R connector): http://www.r-bloggers.com/setting-up-the-twitter-r-package-for-text-analytics/

Proprietary Text Mining Tools
• AlchemyAPI
  • limited free use
• Taste Analytics Signals
  • Two week free trial
• SAS Enterprise Miner
  • student version available via UNCC
• SAS Sentiment Analysis
  • available on some UNCC computers

Hamilton Soundtrack Amazon Reviews

Open Source Text Mining Tools
• R tm package
  • Great for simple analysis but difficult for more complex analysis.
• Python nltk package
  • Probably one of the best open source text mining packages
• Python gensim package
  • Another fantastic Python package – focuses on Topic modeling
• Mallet
  • Great NLP toolkit but requires background in Java and command line

Online Text / NLP Courses
• Introductory / Intermediate:
  • Dan Jurafsky (Stanford), Introductory Text Mining Class
  • Chris Manning / Dan Jurafsky (Stanford), Coursera Natural Language Processing Class
  • ChengXiang Zhai (Univ. of Illinois Urbana-Champaign), Coursera Text Mining & Analytics Course
• Advanced (but way cool, cutting edge stuff):
  • Richard Socher (Stanford), Deep Learning for Natural Language Processing

Blogs

• https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words

• http://www.alchemyapi.com/developers/getting-started-guide/twitter-sentiment-analysis

• https://eight2late.wordpress.com/2015/09/29/a-gentle-introduction-to-topic-modeling-using-r/

• http://www.r-bloggers.com/sentiment-analysis-on-donald-trump-using-r-and-tableau/

• Want more? Follow this link for all R “text” blogs on the R-bloggers website
