Text Mining with R for Social Science Research
Ryan Wesslen
Outline
Hour 1: Fundamentals of Text Mining with R
• Why text, examples and AlchemyAPI demo (10 min)
• “Bag of Words” overview (10 min)
• Text preprocessing & visualizations using RStudio (30 min)
• 10 minute break

Hour 2: Applications
• Federalist Papers (History/Political Science) – 30 min
  • Naïve Bayes classifier using word occurrences
  • k-nearest neighbor classifier using topic modeling
• Federal Reserve Beige Book (Economics) – 30 min
  • Lexicon-based sentiment analysis
Objectives
• Learn the basics of text preprocessing using R.
• Learn the “bag of words” approach to text analytics (tokenizing, cleaning, word clouds, associations, visualizations).
• Run three text mining applications for the social sciences: text classification (Naïve Bayes & k-nearest neighbors), topic modeling (LDA) and lexicon-based sentiment analysis, using the Federalist Papers (Political Science/History) and the Federal Reserve Beige Book (Economics).
• Learn resources (papers, textbooks, lecture notes, blogs) to encourage further research in text mining and natural language processing (NLP).
Why analyze text?
• Growing
• Interesting
• Untapped
Language Technology

Mostly solved:
• Spam detection (classification): “Let’s go to Agra!” ✓ vs. “Buy V1AGRA …” ✗
• Part-of-speech (POS) tagging: “Colorless green ideas sleep furiously.” → ADJ ADJ NOUN VERB ADV
• Named entity recognition (NER): “Einstein met with UN officials in Princeton” → PERSON, ORG, LOC

Making good progress:
• Sentiment analysis: “Best roast chicken in San Francisco!” vs. “The waiter ignored us for 20 minutes.”
• Coreference resolution: “Carter told Mubarak he shouldn’t run again.”
• Word sense disambiguation (WSD): “I need new batteries for my mouse.”
• Parsing: “I can see Alcatraz from the window!”
• Machine translation (MT): “第13届上海国际电影节开幕…” → “The 13th Shanghai International Film Festival…”
• Information extraction (IE): “You’re invited to our dinner party, Friday May 27 at 8:30” → add “Party” to calendar on May 27

Still really hard:
• Question answering (QA): “Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?”
• Paraphrase: “XYZ acquired ABC yesterday” ≈ “ABC has been taken over by XYZ”
• Summarization: “The Dow Jones is up”, “Housing prices rose”, “The S&P500 jumped” → “Economy is good”
• Dialog: “Where is Citizen Kane playing in SF?” → “Castro Theatre at 7:30. Do you want a ticket?”

Source: Dan Jurafsky
Why else is text mining difficult?
• Non-standard English: “Great job @justinbieber! Were SOO PROUD of what youve accomplished! U taught us 2 #neversaynever & you yourself should never give up either ♥”
• Segmentation issues: “the New York-New Haven Railroad”
• Idioms: “dark horse”, “get cold feet”, “lose face”, “throw in the towel”
• Neologisms: “unfriend”, “retweet”, “bromance”
• Tricky entity names: “Where is A Bug’s Life playing …”, “Let It Be was recorded …”, “… a mutation on the for gene …”
• Sarcasm: A: “I love Justin Bieber. Do you like him too?” B: “Yeah. Sure. I absolutely love him.”
Source: Dan Jurafsky (modified)
AlchemyAPI Example
• Go to http://www.alchemyapi.com/ and click through from the homepage.
• As an introduction, copy/paste Federalist Paper #10: https://www.congress.gov/resources/display/content/The+Federalist+Papers#TheFederalistPapers-10
• Click and explore!
Federalist Papers & Text Classification
1:10pm
Alexander Hamilton
Federalist Paper setup
• Not so true story, bro (about how many papers each wrote)
• Reality: the authorship of twelve papers is disputed.
• Hamilton claimed authorship before he was killed; Madison disputed those claims eight years later.
• Prior studies: Adair (1944), Mosteller & Wallace (1963), Fung (2003), Collins et al. (2004)
• Three tasks:
  • Pre-processing and exploratory analysis (word cloud, associations, etc.)
  • Naïve Bayes classification to predict the author of the 12 disputed papers based on word counts.
  • Topic modeling to identify key themes, and k-nearest neighbors to predict author based on the papers’ topics.
Basic Text Terminology
• Corpus: the full collection of documents under analysis
• Document: a single unit of text (e.g. one Federalist paper)
• Term: a single word (or token/n-gram) within a document
“Bag of Words” Approach
• Simplest way to quantify text
• Counts each term’s occurrences per document: the Document-Term Matrix
• Ignores word order
• N-grams (uni-, bi-, tri-, etc.) recover some local order
• Good at classification (like a spam filter)
• Bad at semantic meaning
Source: Chris Manning
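The document-term matrix can be sketched in a few lines of base R. This is a toy illustration with made-up sentences; the workshop itself builds the DTM with the tm package.

```r
# Toy corpus: three one-line "documents" (hypothetical text, not from the papers)
docs <- c("the union of the states",
          "the powers of the union",
          "state powers and the people")

# Tokenize on whitespace after lowercasing
tokens <- lapply(tolower(docs), function(d) strsplit(d, "\\s+")[[1]])

# Vocabulary = all unique terms across the corpus
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per term,
# cell = how often the term occurs in that document (word order is ignored)
dtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
rownames(dtm) <- paste0("doc", seq_along(docs))

dtm
```

Note how “the” appears twice in doc1: the matrix keeps counts but discards all ordering, which is exactly the bag-of-words simplification.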
Preprocessing
• Tokenization
• Cleaning: lower case, white space, punctuation
• Stemming, lemmatization and/or collocations
• Filter: remove stop words

Pipeline: Tokenize → Clean → Stem → Filter

Example: “Then a hurricane came, and devastation reigned”
• Tokenize: Then | a | hurricane | came, | and | devastation | reigned
• Clean: then a hurricane came and devastation reigned
• Stem (Porter): then a hurrican came and devast reign
• Filter (stop words): hurrican came devast reign
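The tokenize, clean and filter steps above can be sketched in base R. The stop-word list here is a tiny hypothetical stand-in for tm’s full list, and stemming (which would need a package such as SnowballC) is only noted in a comment.

```r
text <- "Then a hurricane came, and devastation reigned"

# Tokenize: split on whitespace
tokens <- strsplit(text, "\\s+")[[1]]

# Clean: lower case, strip punctuation
tokens <- gsub("[[:punct:]]", "", tolower(tokens))

# Filter: drop common stop words (tiny hypothetical list; tm ships a fuller one)
stop.words <- c("then", "a", "and", "the")
tokens <- tokens[!tokens %in% stop.words]

tokens
# Stemming (e.g. SnowballC::wordStem) would further reduce
# "devastation" to "devast" and "reigned" to "reign"
```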
Part 1: R Studio, Working Directory & R Packages•Open R Studio and FederalistPapers.R (see GitHub site)
1:20pm Code Lines: 1 - 49
Part 2: Load csv (text) file & view
Code Lines: 50-79
Part 3a: Pre-processing
Federalist Paper 1: Before
Federalist Paper 1: After
Code Lines: 71-88
Part 3b: Additional Pre-processing
Federalist Paper 1: After
Code Lines: 89-104
Part 4a: Document-Term Matrices
Code Lines: 142-149
Part 4b: Word Cloud
Code Lines: 151-165
1:30pm
Part 4c: Term Frequencies
Code Lines: 167-171
Part 4d: Word Associations
Code Lines: 173-188
Part 4e: Word Clustering: Hierarchical
Code Lines: 189-201
Part 4f: K-Means Word Clustering
Code Lines: 202-207
Part 5: Redo with Stemming, Bigrams and additional stop words
Uncomment (CTRL + SHIFT + C) and run lines 107-139, then rerun lines 141-206.
Code Lines: 107-139
Classification Models (Overview)
• Classification models predict class labels.
• Class labels = categories; for example, binary (yes or no), ordinal (high, medium, low) or nominal (dog, cat, kangaroo).
• Classification models are a type of supervised learning, as the class labels (“y variables”) are known (observed).
• Determining the authorship of the disputed Federalist papers is a binary classification problem, as the author of each disputed paper is one of two: Hamilton or Madison.
1:50pm - 2pm
Types of Classification Models
• There are many different models (algorithms) that can be used for classification problems.
• Examples: logistic regression, decision trees, support vector machines, neural networks.
• We are going to use Naïve Bayes and k-nearest neighbors, with different feature variables (X variables):
  • Naïve Bayes = word presence (1/0) as X variables
  • k-nearest neighbors = topic probabilities as X variables
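As a sketch of the second setup, here is a tiny hand-rolled k-nearest-neighbors vote on hypothetical topic-probability features. The numbers are invented for illustration; the real analysis uses topic proportions estimated by LDA from the papers.

```r
# Hypothetical topic-probability features for labeled papers (rows sum to 1)
train <- rbind(h1 = c(0.70, 0.20, 0.10),   # Hamilton
               h2 = c(0.60, 0.30, 0.10),   # Hamilton
               m1 = c(0.15, 0.25, 0.60),   # Madison
               m2 = c(0.10, 0.30, 0.60))   # Madison
labels <- c("Hamilton", "Hamilton", "Madison", "Madison")

# A disputed paper's topic mixture (also hypothetical)
disputed <- c(0.20, 0.25, 0.55)

# k-nearest neighbors: Euclidean distance to every training paper,
# then majority vote among the k closest (k = 3 here)
dists <- sqrt(rowSums((train - matrix(disputed, nrow(train), 3, byrow = TRUE))^2))
k <- 3
vote <- names(which.max(table(labels[order(dists)[1:k]])))
vote
```

The disputed paper sits closest to the two Madison papers in topic space, so the 3-nearest-neighbor vote assigns it to Madison.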
Naïve Bayes
• Naïve Bayes is an algorithm based on Bayes’ theorem (conditional probability).
• It updates the probability of the predicted class (e.g. who is the author) based on words found in the class (the author’s papers).
• Example in spam filters: the word “Viagra” increases the odds an email is spam.
• Pro: simple; can handle many features (x variables).
• Con: difficult to interpret; subject to assumptions (e.g. independence of x variables).
• See these slides for a deeper overview of Naïve Bayes and its assumptions.
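The update step can be illustrated with Bayes’ theorem directly. The counts below are hypothetical stand-ins (though “whilst” really is one of the marker words Mosteller & Wallace studied, since Madison favored it and Hamilton rarely used it):

```r
# Hypothetical word-presence rates: fraction of each author's known
# papers that contain the word "whilst"
p_word_given_H <- 1/51   # Hamilton papers containing the word
p_word_given_M <- 12/14  # Madison papers containing the word
p_H <- 0.5; p_M <- 0.5   # flat prior over the two authors

# Bayes' theorem: P(Madison | word) = P(word | M) * P(M) / P(word)
p_word <- p_word_given_H * p_H + p_word_given_M * p_M
p_M_given_word <- p_word_given_M * p_M / p_word
round(p_M_given_word, 3)
```

Seeing the single word pushes the probability of Madison from 0.5 to nearly 0.98; the “naïve” part is that the model multiplies such updates across words as if they were independent.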
Basics of Predictive Modeling
• In predictive modeling, datasets are divided into training and test (sometimes called validation) sets.
• Federalist Papers:
  • Training dataset = 65 papers* with known author (known label)
  • Test dataset = 12 papers with disputed author (missing label)
  * Excludes the five papers written by John Jay and the three written jointly by Madison & Hamilton.
• Our objective is to build a model that accurately predicts the authors in the training dataset.
• After building the model, we apply it to the test dataset to predict the authors of the 12 disputed papers.
Part 6: Naïve Bayes Pre-Processing
Code Lines: 208-219
Conditional Probabilities
Update
Code Lines: 231-241
Odds Ratios
Code Lines: 242-248
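An odds ratio for a single word can be computed from presence counts like so (the counts here are hypothetical, not taken from the actual papers):

```r
# Hypothetical presence counts for one word across the training papers
hamilton <- c(present = 3, absent = 48)   # 51 Hamilton papers
madison  <- c(present = 10, absent = 4)   # 14 Madison papers

# Odds of seeing the word in a paper by each author, and their ratio;
# an odds ratio far from 1 marks the word as a useful author discriminator
odds_h <- hamilton["present"] / hamilton["absent"]
odds_m <- madison["present"] / madison["absent"]
odds_ratio <- unname(odds_m / odds_h)
round(odds_ratio, 1)
```

With these counts the word is 40 times more likely (in odds terms) in a Madison paper, exactly the kind of feature the classifier leans on.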
Train Naïve Bayes and Predict Training Dataset
Code Lines: 250-273
Predict Test (Disputed) Dataset
Code Lines: 275-290
Part 7: Running Topic Modeling
This will take about 4 minutes, depending on your computer.
Code Lines: 295-308
Topic Modeling Overview
Source: David Blei (link to article)
Create LDAVis Tool & Label Topics
Code Lines: 295-308
LDAVis Package to Visualize Topics
Open the index.html file in the “Federalist” folder in your working directory with Firefox; it is not supported by Chrome or IE.
Topic Clustering Heatmap via R Shiny
Code Lines: 321-349
Topic Clustering (Plot)
Nearest Neighbors
Code Lines: 350-370
Predictions
• Naïve Bayes predicts 9 of the 12 papers as written by Madison.
• k-NN predicts only 4 of the 12 papers as written by Madison.
• Why? How stable are these results?
Code Lines: 371-373
Beige Book & Sentiment Analysis
2:30pm
Sentiment Analysis
• Textual information comes in two main types: facts and opinions.
• Search engines are optimized for facts.
• Sentiment analysis is a growing (not completely solved) attempt to optimize the discovery of opinions.
• Opinion mining, or sentiment analysis, attempts to recognize the opinion or sentiment a person holds toward an object.
Source: Richard Heimann
Where do we find sentiment?•Movie / Books: Are the reviews on this movie/book positive/negative?
• Product Sales: What is thought of the new iPhone?
• Public Sentiment: How do consumers feel about the economy? How is consumer sentiment affecting sales by sector?
• Politics: How are voters polarized, if at all around a candidate or policy?
• Prediction: Stock Prices, Election Outcomes, Market Trends, Product Sales
Source: Richard Heimann
Three Types of Sentiment Analysis
• Dictionary-based sentiment analysis
  • i.e. Is an attitude toward an object positive or negative?
  • Build a dictionary of positive/negative words and count net occurrences.
• Supervised learning for sentiment analysis
  • i.e. Given data we have seen in the past, can we predict class assignment for our polarity measure (positive/neutral/negative)?
  • e.g. Naïve Bayes, MaxEnt, SVM
• Unsupervised sentiment analysis
  • i.e. No dictionaries, no labeled data, no training algorithms; scale words (often bi-grams) and users on a single dimension.
  • e.g. latent variable models – Item Response Theory (IRT)
Source: Richard Heimann
Federal Reserve Beige Book
• The Beige Book is a report published by the United States Federal Reserve Board (FRB) eight times a year.
• The Beige Book has been in publication since 1985 and is now published online.
• The report is published by each (n=12) of the Federal Reserve Bank districts.
• The content is rather anecdotal: the report interviews key business contacts, economists, market experts, and others to get their opinion about the economy.
• The data used in this case study can be found on GitHub, along with the Python code for all the scraping and parsing.
Source: Richard Heimann
Beige Book Case Study: Initial Steps• Step 1: Download SentimentBeigeBook.zip from https://github.com/wesslen/BeigeBookSentimentAnalysis
• Step 2: Save into a local directory. Open R Studio
• Step 3: Open “sentiment_analysis.R” and “sentiment.R”
Step 1: Working Directory
• Run “sentiment.R”. It defines a function that counts the net number of positive words minus negative words in a document, given the sentiment (lexicon) dictionary. It will be used later on.
• Set working directory based on where you downloaded the zip file contents.
• Note: on Windows, use forward slashes (“C:/Directory/Folder/”) or escaped backslashes in the path, since a single “\” is an escape character in R strings.
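A minimal sketch of the kind of scoring sentiment.R performs, with tiny hypothetical word lists standing in for the full lexicon:

```r
# Minimal sketch of dictionary-based scoring: net count of positive
# minus negative dictionary matches per document
pos.words <- c("gain", "growth", "strong", "improved")   # hypothetical list
neg.words <- c("decline", "weak", "slowdown", "loss")    # hypothetical list

score.sentiment <- function(text) {
  # lowercase, strip punctuation, split into words
  words <- strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+")[[1]]
  # net sentiment = positive matches minus negative matches
  sum(words %in% pos.words) - sum(words %in% neg.words)
}

score.sentiment("Retail sales showed strong growth despite a slowdown in housing")
```

The example sentence has two positive matches (“strong”, “growth”) and one negative (“slowdown”), so it scores +1.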
Step 2: Import in Dictionaries
Step 3: Import Corpus / Text
Step 4: Pre-Processing
Step 5: Create corpus & tokenize
Step 6: Stemming & Stop Words
Step 7: Term-Document Matrix
Step 8: Explore Common Words
Step 9: Add more pos/neg words
Step 10: Word Associations
Step 11: Word Cloud
Step 12: Sentiment Scoring
First six records of BB.sentiment
Step 13: Normalizing Scores
First six records of BB.sentiment (updated)
Step 14: Score Histograms
Raw Scored Sentiment
Scaled Scored Sentiment
Step 15: Plot Historical Sentiment
Step 16: Run beigebookplots.R
Concluding Thoughts
• The bag of words approach is a simple text mining framework.
  • Works well for exploratory analysis, classification and basic sentiment analysis.
  • Deeper models are needed to capture semantic meaning (e.g. GloVe, recurrent neural networks; see the Stanford Deep Learning NLP class materials).
• R is a great tool for simple, visual-based text mining.
  • However, R has limitations (scale, functions, etc.); Python (nltk) and Java are better for large-scale, PhD-dissertation research.
• Text mining is an iterative process.
  • No single model or method always works – it depends on context!
  • If your initial results are vague, enhance pre-processing (e.g. remove more stop words, try bi/trigrams, try stemming or lemmas, customize the lexicon).
• Text mining is an art, not a science.
  • You need domain experience AND algorithms.
  • If you’re a social scientist, make friends with a computer scientist (and vice versa).
Project Mosaic & Next Workshop
• Project Mosaic offers consulting, workshops and other collaborative research opportunities.
• Upcoming workshops: https://projectmosaic.uncc.edu/events-list/
• Next month’s workshop is on Text Mining for Twitter.
  • It will include reference to SOPHI, the UNCC Data Science Initiative’s data warehouse, which includes GNIP access to historical Twitter data.
  • If you plan to attend, please register for Twitter API credentials before the workshop.
  • Follow these instructions (to set up the R connector): http://www.r-bloggers.com/setting-up-the-twitter-r-package-for-text-analytics/
Proprietary Text Mining Tools
• AlchemyAPI: limited free use
• Taste Analytics Signals: two-week free trial
• SAS Enterprise Miner: student version available via UNCC
• SAS Sentiment Analysis: available on some UNCC computers
Hamilton Soundtrack Amazon Reviews
Open Source Text Mining Tools
• R tm package: great for simple analysis but difficult for more complex analysis.
• Python nltk package: probably one of the best open source text mining packages.
• Python gensim package: another fantastic Python package, focused on topic modeling.
• Mallet: a great NLP toolkit, but it requires background in Java and the command line.
Online Text / NLP Courses
• Introductory / Intermediate:
  • Dan Jurafsky (Stanford), Introductory Text Mining Class
  • Chris Manning / Dan Jurafsky (Stanford), Coursera Natural Language Processing Class
  • ChengXiang Zhai (Univ. of Illinois Urbana-Champaign), Coursera Text Mining & Analytics Course
• Advanced (but way cool, cutting edge stuff):
  • Richard Socher (Stanford), Deep Learning for Natural Language Processing
Blogs• https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words
• http://www.alchemyapi.com/developers/getting-started-guide/twitter-sentiment-analysis
• https://eight2late.wordpress.com/2015/09/29/a-gentle-introduction-to-topic-modeling-using-r/
• http://www.r-bloggers.com/sentiment-analysis-on-donald-trump-using-r-and-tableau/
• Want more? Follow this link for all R “text” blogs on the R-bloggers website.