DESCRIPTION
Sunday 9:55 a.m.–10:45 a.m.
Why Twitter Is All the Rage: A Data Miner's Perspective
Presenter: Matthew Russell
Audience level: Novice

Description: In order to be successful, technology must amplify a meaningful aspect of our human experience, and Twitter’s success has largely depended on its ability to do this quite well. Although you could describe Twitter as just a “free, high-speed, global text-messaging service,” that would be to miss the much larger point that Twitter scratches some of the most fundamental itches of our humanity.

Abstract: This talk explains why Twitter is "all the rage" by examining Twitter in light of fundamental questions about our humanity:
* We want to be heard
* We want to satisfy our curiosity
* We want it easy
* We want it now

This session examines Twitter's ability to address these questions and presents its underlying conceptual architecture as an interest graph. Even if you have minimal programming skills, you'll come away empowered with the ability to think about data mining on Twitter in more effective ways and apply a powerful collection of easily adaptable recipes to fully exploit the 5 kilobytes of metadata that decorates those 140 characters that you commonly think of as a tweet. Learn how to access Twitter's API, search for tweets, discover trending topics, process tweets in real-time from the firehose, and much more.
Why Twitter Is All The Rage: A Data Miner's Perspective
Matthew A. Russell - @ptwobrussell - http://MiningTheSocialWeb.com
PyTN - 23 February 2014
Overview
Intro
Twitter as a Platform for Data Science
Applications of Firehose Analysis (#Syria circa last)
Understanding the Amazon Prime Air Reaction (IPython Notebook Walk Through)
Q&A
Intro
Hello, My Name Is ... Matthew
Background in Computer Science
Data mining & machine learning
CTO @ Digital Reasoning Systems
Data mining; machine learning
Author @ O'Reilly Media
5 published books on technology
Principal @ Zaffra
Selective boutique consulting
Transforming Curiosity Into Insight
An open source software (OSS) project
http://bit.ly/MiningTheSocialWeb2E
A book
http://bit.ly/135dHfs
Accessible to (virtually) everyone
Virtual machine with turn-key coding templates for data science experiments
Think of the book as "premium" support for the OSS project
Mining the Social Web ToC
Chapter 1 - Mining Twitter
Chapter 2 - Mining Facebook
Chapter 3 - Mining LinkedIn
Chapter 4 - Mining Google+
Chapter 5 - Mining Web Pages
Chapter 6 - Mining Mailboxes
Chapter 7 - Mining GitHub
Chapter 8 - Mining the Semantically Marked-Up Web
Chapter 9 - Twitter Cookbook
Anatomy of Each Chapter
Brief Intro
Objectives
API Primer
Analysis Technique(s)
Data Visualization
Recap
Suggested Exercises
Recommended Resources
Opportunities for Data Alchemy
A model for the world: signal and sinks
Growth in data exhaust is accelerating
Digital fingerprints of the "real world" are accumulating
Lots of opportunities for motivated Python hackers
"Software is eating the world"
Social Media Is All the Rage
World population: 7B people
Facebook: 1B+ users
Twitter: 650M users
Google+: 500M users
LinkedIn: 260M users
250M+ blogs (conservatively?)
But what does it all mean, Basil?
It's a platform for data science and the frontier for predictive analytics
Understanding world events
Swaying political elections
Modeling human behavior
Analyzing sentiment
Making intelligent recommendations
Twitter & Data Science
Data Science
Data => Actionable information
Highly interdisciplinary
Nascent
Necessary
http://wikipedia.org/wiki/Data_science
Another View of Data Science
Twitter Is All the Rage
It satisfies fundamental human desires
We want to be heard
We want to satisfy our curiosity
We want it easy
We want it now
Accessible, rich, and (mostly) "open" data
RESTful APIs and JSON responses
Great proving ground for predictive analytics about the real world
Twitter's Network Dynamics
~650M curious users
A collective consciousness
Real-time communication
Short, sweet, ... and fast
Asymmetric Following Model
An interest graph
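The asymmetric following model can be sketched as a directed graph. This is a minimal illustration, assuming a made-up set of accounts and follow relationships; the helper names `followers_of` and `mutual_follows` are hypothetical and not part of any Twitter library.

```python
# Hypothetical accounts: each key follows the accounts in its set.
# Following is one-way, so "ana" following "u2" says nothing about
# whether "u2" follows "ana" back.
following = {
    "ana": {"u2", "jorge"},
    "jorge": {"ana"},
    "nina": {"u2"},
    "u2": set(),
}


def followers_of(account, following):
    """Return the set of accounts that follow `account`."""
    return {src for src, targets in following.items() if account in targets}


def mutual_follows(following):
    """Return the pairs of accounts that follow each other."""
    pairs = set()
    for src, targets in following.items():
        for dst in targets:
            if src in following.get(dst, set()):
                pairs.add(tuple(sorted((src, dst))))
    return pairs


print(sorted(followers_of("u2", following)))
print(mutual_follows(following))
```

In this toy data, "u2" is followed without following anyone back, which is the kind of one-way edge that expresses interest rather than mutual acquaintance.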
Twitter Primitives
Account Types: "Anything"
"Following" Relationships
Favorites
Retweets
Replies
(Almost) No Privacy Controls
Twitter and Facebook Compared
Twitter                        | Facebook
Account Types: "Anything"      | Account Types: People & Pages
"Following" Relationships      | Mutual Connections
Favorites                      | "Likes"
Retweets                       | "Shares"
Replies                        | "Comments"
(Almost) No Privacy Controls   | Extensive Privacy Controls
What's in a Tweet?
140 Characters ...
... Plus ~5KB of metadata!
Authorship
Time & location
Tweet "entities"
Replying, retweeting, favoriting, etc.
What are Tweet Entities?
Essentially, the "easy to get at" data in the 140 characters
@usermentions
#hashtags
URLs
multiple variations
(financial) symbols
stock tickers
media
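As a rough sketch of how the entities listed above can be pulled out of a tweet, the function below reads the `entities` field of a tweet that has already been parsed from JSON into a dict. The field names follow the version 1.1 tweet format as best understood here; the sample tweet is invented for illustration, and `extract_entities` is a hypothetical helper name.

```python
def extract_entities(tweet):
    """Collect the 'easy to get at' entities from one parsed tweet dict."""
    entities = tweet.get("entities", {})
    return {
        "screen_names": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
        "symbols": [s["text"] for s in entities.get("symbols", [])],
        # "media" is only present when the tweet carries media, hence .get().
        "media": [m["media_url"] for m in entities.get("media", [])],
    }


# Invented sample, trimmed to the fields this sketch reads.
sample_tweet = {
    "text": "Talking #PyTN with @ptwobrussell http://t.co/example $AMZN",
    "entities": {
        "user_mentions": [{"screen_name": "ptwobrussell"}],
        "hashtags": [{"text": "PyTN"}],
        "urls": [{"expanded_url": "http://MiningTheSocialWeb.com"}],
        "symbols": [{"text": "AMZN"}],
    },
}

print(extract_entities(sample_tweet))
```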
API Requests
RESTful requests
Everything is a "resource"
You GET, PUT, POST, and DELETE resources
Standard HTTP "verbs"
Example: GET https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=SocialWebMining
Streaming API filters
JSON responses
Cursors (not quite pagination)
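The request and cursor ideas above can be sketched without touching the network. The first helper rebuilds the example URL from a resource name and query parameters; the second walks a cursored result set, assuming the convention that the first request uses cursor -1 and a `next_cursor` of 0 marks the end. `build_url`, `iterate_cursor`, and the `FAKE_PAGES` stand-in are hypothetical names, and real requests would additionally need OAuth credentials.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.twitter.com/1.1"


def build_url(resource, **params):
    """Build the GET URL for a resource such as 'statuses/user_timeline'."""
    query = urlencode(params)
    return f"{API_ROOT}/{resource}.json" + (f"?{query}" if query else "")


def iterate_cursor(fetch_page):
    """Yield every id from a cursored resource, one page at a time.

    `fetch_page(cursor)` is assumed to return a dict with "ids" and
    "next_cursor" keys; a next_cursor of 0 means there are no more pages.
    """
    cursor = -1
    while cursor != 0:
        page = fetch_page(cursor)
        yield from page["ids"]
        cursor = page["next_cursor"]


# Stand-in for real API responses, keyed by the cursor that requests them.
FAKE_PAGES = {
    -1: {"ids": [1, 2], "next_cursor": 10},
    10: {"ids": [3], "next_cursor": 0},
}

print(build_url("statuses/user_timeline", screen_name="SocialWebMining"))
print(list(iterate_cursor(FAKE_PAGES.__getitem__)))
```

Unlike page numbers, a cursor is an opaque pointer handed back by the previous response, which is why the loop cannot jump ahead to an arbitrary page.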
Data Mining: Low Hanging Fruit
"Know thy data..."
Start with simple stats:
Count
Compare
Filter
Rank
Then, apply more complex analyses
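A minimal sketch of the four simple stats, using only the standard library and an invented list of hashtags (the data and the threshold of 2 are assumptions for illustration):

```python
from collections import Counter

# Invented hashtags, as they might be collected from a batch of tweets.
hashtags = ["Syria", "syria", "PyTN", "Syria", "python", "PyTN", "Syria"]

# Count: frequency of each hashtag, ignoring case.
counts = Counter(tag.lower() for tag in hashtags)

# Compare: ratio of distinct hashtags to total hashtags (lexical diversity).
lexical_diversity = len(counts) / len(hashtags)

# Filter: keep only hashtags seen at least twice.
frequent = {tag: n for tag, n in counts.items() if n >= 2}

# Rank: the two most common hashtags, most frequent first.
ranked = counts.most_common(2)

print(counts)
print(round(lexical_diversity, 2))
print(frequent)
print(ranked)
```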
A Starting Point: Histograms
A chart that is handy for frequency analysis
They look like bar charts...except they're not bar charts
Each value on the x-axis is a range (or "bin") of values
Not categorical data
Each value on the y-axis is the combined frequency of values in each range
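The binning described above can be sketched in a few lines. The retweet counts and the bin width of 10 are made up for illustration, and `histogram` is a hypothetical helper; note that this simple version omits empty bins rather than showing them with a frequency of zero.

```python
from collections import Counter


def histogram(values, bin_width):
    """Map each bin's lower edge to the number of values falling in that bin."""
    bins = Counter((value // bin_width) * bin_width for value in values)
    return sorted(bins.items())


BIN_WIDTH = 10
retweet_counts = [0, 3, 7, 12, 15, 18, 40, 41, 95]  # invented sample data

rows = histogram(retweet_counts, BIN_WIDTH)
for start, freq in rows:
    # x-axis: the range covered by the bin; y-axis: how many values fell in it.
    print(f"{start:>3}-{start + BIN_WIDTH - 1:<3} {'#' * freq}")
```

Because each x-axis value is a numeric range rather than a category, changing `BIN_WIDTH` changes the shape of the chart, which is not true of a bar chart over categorical data.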
Example: Histogram of Retweets
Social Network Mechanics
[Figure: network diagram with nodes Roberto Mercedes, Jorge, Ana, Nina]
Interest Graph Mechanics
[Figure: the same nodes plus U2 and Juan Luis Guerra]
A (Social) Interest Graph
[Figure: the same nodes plus U2 and Juan Luis Guerra]
A (Political) Interest Graph
[Figure: the same nodes plus Johnny Araya and Rodolfo Hernández]
Measuring Influence Is Trickier Than It Looks
Spam bot accounts that effectively are zombies and can’t be harnessed for any utility at all
Inactive or abandoned accounts that can’t influence or be influenced since they are not in use
Accounts that follow so many other accounts that the likelihood of getting noticed (and thus influencing) is practically zero
The network effects of retweets by accounts that are active and can be influenced to spread a message
See also http://wp.me/p3QiJd-2a
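One way to make these caveats concrete is to discount a raw follower count before treating it as reach. The sketch below is only an illustration under stated assumptions: the record fields (`suspected_spam`, `idle_days`, `following`), the thresholds, and the sample accounts are all hypothetical, and it does not model the network effects of retweets.

```python
def reachable_followers(followers, max_idle_days=90, max_following=2000):
    """Keep followers that are plausibly able to notice and spread a message.

    Drops suspected spam bots, accounts idle for longer than `max_idle_days`,
    and accounts that follow more than `max_following` others. Both
    thresholds are arbitrary placeholders, not recommended values.
    """
    return [
        f for f in followers
        if not f["suspected_spam"]
        and f["idle_days"] <= max_idle_days
        and f["following"] <= max_following
    ]


# Invented follower records for illustration.
followers = [
    {"name": "ana", "suspected_spam": False, "idle_days": 2, "following": 150},
    {"name": "bot123", "suspected_spam": True, "idle_days": 0, "following": 50},
    {"name": "jorge", "suspected_spam": False, "idle_days": 400, "following": 80},
    {"name": "nina", "suspected_spam": False, "idle_days": 5, "following": 90000},
]

reachable = reachable_followers(followers)
print(f"{len(reachable)} of {len(followers)} followers are plausibly reachable")
```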
Justin Bieber vs Tea Party
Realtime Analysis: #Syria
Monitor Twitter's firehose for realtime data using filters such as #Syria
Keep in mind the sheer volume of data can be considerable
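Given the volume involved, it helps to process a stream one tweet at a time instead of loading it all into memory. The generator below is a local stand-in for that idea, assuming the stream arrives as one JSON document per line; `filter_stream` and the sample lines are hypothetical, and in practice the filtering would normally be requested from the streaming API itself rather than done client-side.

```python
import json


def filter_stream(lines, track):
    """Lazily yield parsed tweets whose text contains `track` (case-insensitive)."""
    track = track.lower()
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines instead of failing to parse them.
            continue
        tweet = json.loads(line)
        if track in tweet.get("text", "").lower():
            yield tweet


# Invented newline-delimited JSON standing in for a live stream.
sample_lines = [
    '{"text": "Latest news on #Syria"}',
    "",
    '{"text": "Enjoying PyTN today"}',
    '{"text": "More #syria coverage tonight"}',
]

matches = [tweet["text"] for tweet in filter_stream(sample_lines, "#Syria")]
print(matches)
```

Because `filter_stream` is a generator, only one line is held in memory at a time, so the same loop works whether the source is a short list or a long-running connection.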
Fuller analysis at http://wp.me/p3QiJd-1I
#Syria: Who?
See http://wp.me/p3QiJd-1I
#Syria: What?
#Syria: Where?
#Syria: When?
#Syria: Why?
That's for you (as the data scientist) to decide
Quantitative automation can amplify human intelligence
Qualitative analysis still requires human intelligence
Twitter Firehose Analysis with pandas
MTSW Virtual Machine Experience
Goal: Make it easy to transform curiosity into insight
Vagrant-based virtual machine
Virtualbox or AWS
IPython Notebook User Experience
Point-and-click GUI
100+ turn-key examples and templates
Social web mining for the masses
Social Media Analysis Framework
A memorable four-step process to guide data science experiments:
Aspire
Acquire
Analyze
Summarize
Goals
To understand how to capture data from Twitter's firehose
To understand basic pandas usage for tweets
To work through a data science experiment with a systematic 4-step process
To better understand the emotional reaction to the Amazon Prime Air announcement
To introduce some tools for data science
Useful Links
Website
http://MiningTheSocialWeb.com
Twitter Data Mining Round Up
http://wp.me/p3QiJd-5H
All Source Code in IPython Notebook format (GitHub)
http://bit.ly/MiningTheSocialWeb2E
Q&A