Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)

Joe Olson
Data Architect
Smart Chicago Collaborative
27 Mar [email protected]

(All the cool buzzwords in one place!)


Social Media - Twitter

• What can we learn from Twitter?

• 400 million tweets per day

source: http://articles.washingtonpost.com/2013-03-21/business/37889387_1_tweets-jack-dorsey-twitter

• 218 million users

source: http://techcrunch.com/2013/10/03/bweeting/

• Excellent source of sentiment

• Excellent source of big data

• Prototyping

• Modeling natural language

• Resume padding



• How do we get at the data?

• Twitter provided APIs:

• https://dev.twitter.com/docs

• Streaming

• Set up a real time data stream (json) based on keywords

• REST (v1.1)

• Make REST requests, and get results

• Possible parameters:

• Geospatial bounding box

• By time

• By user, hashtag, retweets etc

• Fire hose

• Big $$$. Big data
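The geospatial bounding box parameter can also be mimicked on the client side once tweets arrive. The following is a minimal Python sketch, not the API itself: the `CHICAGO_BBOX` values are rough illustrative numbers, the helper names are hypothetical, and it assumes a tweet's `geo` field, when present, holds a point whose `coordinates` are in `[latitude, longitude]` order.

```python
# Approximate, illustrative bounding box around Chicago (an assumption, not from the talk).
CHICAGO_BBOX = {"south": 41.64, "west": -87.95, "north": 42.03, "east": -87.52}


def in_bbox(lat, lon, bbox):
    """Return True when the point lies inside the box (edges included)."""
    return bbox["south"] <= lat <= bbox["north"] and bbox["west"] <= lon <= bbox["east"]


def tweet_in_bbox(tweet, bbox):
    """Check a tweet dict; tweets without coordinates never match."""
    geo = tweet.get("geo")
    if not geo:
        return False
    lat, lon = geo["coordinates"]  # assumed [latitude, longitude] order
    return in_bbox(lat, lon, bbox)


# Made-up sample tweets for illustration only.
tweets = [
    {"text": "Flu :(", "geo": None},
    {"text": "Lunch in the Loop", "geo": {"type": "Point", "coordinates": [41.88, -87.63]}},
    {"text": "Hello from NYC", "geo": {"type": "Point", "coordinates": [40.71, -74.01]}},
]

local = [t["text"] for t in tweets if tweet_in_bbox(t, CHICAGO_BBOX)]
print(local)
```

Tweets with no coordinates are dropped by this filter, which is why a filter like this only sees the small geotagged fraction of the stream.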



• Information & Obstacles

• Who

• What

• At best: Plain English (!)

• Worse: (Spanish or Arabic or Portuguese...)

• Worst: “Textspeak” symbols :-0, UTF8 chars, etc.

• Absolute Worst: combination of all of them

• Where

• 1-2% with latitude / longitude

• Geocode

• When


JSON Tweet example:

    "created_at":"Sun Oct 27 13:57:40 +0000 2013",
    "id":394462908261740540,
    "text":"Flu :(",
    "source":"<a href=\"http://twitmania.com\" rel=\"nofollow\">TwitMania™</a>",
    "user":{
        "id":594141140,
        "name":"Yultiana Farida N",
        "screen_name":"yultiana",
        "followers_count":231,
        "friends_count":252,
        "created_at":"Tue May 29 23:58:25 +0000 2012",
        "statuses_count":2397
    },
    "geo":null,
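As a rough illustration of pulling the who / what / where / when out of a tweet, here is a small Python sketch using only the standard library. The `RAW_TWEET` string is a trimmed copy of the example above with braces added so it parses; the `summarize` helper is hypothetical.

```python
import json
from datetime import datetime

# Trimmed copy of the example tweet (outer braces added so it is valid JSON).
RAW_TWEET = """
{
  "created_at": "Sun Oct 27 13:57:40 +0000 2013",
  "id": 394462908261740540,
  "text": "Flu :(",
  "user": {"id": 594141140, "screen_name": "yultiana", "followers_count": 231},
  "geo": null
}
"""


def summarize(raw):
    """Extract who, what, where and when from one tweet's JSON text."""
    tweet = json.loads(raw)
    # Twitter's timestamp layout: weekday, month, day, time, UTC offset, year.
    when = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return {
        "who": tweet["user"]["screen_name"],
        "what": tweet["text"],
        "where": tweet["geo"],  # None for most tweets
        "when": when.isoformat(),
    }


summary = summarize(RAW_TWEET)
print(summary)
```

Note that "where" comes back empty here, as it does for the large majority of tweets.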

Cloud Computing

• What does cloud computing bring to the table?

• Amazon’s EC2:

• Commoditized hardware

• Low cost

• Only charged for resources you use

• No long term commitments

• Scalable

• "Throwaway" mentality

**IF** you play by their rules!

Cloud Computing – AWS

• Tools

• Virtual Machines

• # of processors, RAM, OS, disk capacity and I/O – all configurable

• Price range: $.02/hr - $4.60/hr

• Licensed OSes cost 50% more than Linux OSes

• Archive Storage

• S3 / Glacier

• Work Queues

• SQS

• Data Stores

• DynamoDB (key-value store), Redshift (analysis store)

• Virtual Networking

• Routers, VPN gateways, access control lists, etc.

• APIs

• Command line

• HTTPS REST

• Native programming languages (Python, bash, PHP, Java etc.)

Ideal for rapid prototyping / proof of concepts

Cloud Computing – AWS

• APIs

• Basic

• Start an instance (and start billing)

• Stop an instance (stop billing)

• Insert item into queue

• Remove item from queue

• Write to backup store

• Ultra advanced

• Reserved vs. on demand vs. spot instances

• Price can drop as much as 80% due to market demand

• Instance can disappear at any time

Big Data Analytics

• Can we skirt the “big data” problem by distilling millions and millions of “noise” tweets down into a smaller, more desirable data set?

• Enrich in real time, rather than on archived data, and avoid the overhead of map/reduce?

• Possible enrichment of raw data:

• Classification – separate tweets into “relevant” and “irrelevant”

• Geocoding – improve on the 1-2% ?

• Aggregation -> map reduce

• Mapping -> Reduce Function -> Output

• AWS – Elastic Map Reduce

• Clustering
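The Mapping -> Reduce Function -> Output flow can be sketched in a few lines of plain Python. This is a single-process illustration of the idea, not Elastic Map Reduce; the sample tweets, the `KEYWORDS` set, and the function names are made up for the example.

```python
from collections import defaultdict
from functools import reduce

# Made-up sample data for illustration.
tweets = ["flu season again", "got the flu :(", "cold and flu", "nice weather today"]
KEYWORDS = {"flu", "cold"}


def map_tweet(text):
    """Map step: emit a (keyword, 1) pair for every tracked keyword in a tweet."""
    return [(word, 1) for word in text.lower().split() if word in KEYWORDS]


def reduce_counts(totals, pair):
    """Reduce step: fold one (keyword, count) pair into the running totals."""
    key, count = pair
    totals[key] += count
    return totals


pairs = [pair for text in tweets for pair in map_tweet(text)]
counts = dict(reduce(reduce_counts, pairs, defaultdict(int)))
print(counts)
```

In a real map/reduce framework the map calls would run in parallel across machines and the pairs would be grouped by key before reduction; the shape of the two functions stays the same.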

Machine Learning

• Classification: relevant, or irrelevant?

• Human trained model

• Once model is established, bounce new data off it for classification

• Validation of model

• Accuracy = (Total # of classifications – Mismatches between machine / human) / Total # of classifications

• Crowdsourcing – AWS Mechanical Turk

• Improve model by feeding disagreements back into the model

• Our best text classification model to date: accuracy in the low 90% range
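The validation step above can be sketched as follows. This is a hedged Python illustration of the accuracy formula and of collecting disagreements for retraining; the labels, texts, and function names are invented for the example and are not the talk's data.

```python
def accuracy(human, machine):
    """(total classifications - mismatches) / total classifications."""
    if len(human) != len(machine) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    total = len(human)
    mismatches = sum(1 for h, m in zip(human, machine) if h != m)
    return (total - mismatches) / total


def disagreements(texts, human, machine):
    """Items where machine and human differ, paired with the human label,
    so they can be fed back into the training set."""
    return [(t, h) for t, h, m in zip(texts, human, machine) if h != m]


# Made-up validation sample.
texts = ["got food poisoning", "new paper on food poisoning", "great tacos", "feel sick after lunch"]
human = ["relevant", "irrelevant", "irrelevant", "relevant"]
machine = ["relevant", "relevant", "irrelevant", "relevant"]

score = accuracy(human, machine)
retrain = disagreements(texts, human, machine)
print(score, retrain)
```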

Open Source

• Friendly to the commoditized computing paradigm

• Don’t have to worry about licensing issues

• Contributes to the “throwaway” discipline

• Don’t have to re-invent the wheel (collaboration)

• Solutions applicable to all parts of the architecture

• Acquire data: Node.js – non blocking

• Analyze data: R – statistical engine

• Store and query data: MongoDB (document store) or Riak (key-value database)

Architecture

• We know Twitter is providing a mountain of data from all parts of the world

• We know Amazon is providing a framework of low cost, on-demand, no commitment computing

• Open source is providing a rich tool set

• Goals:

• Architect with cost in mind!

• Enrichment - Real time and after-the-fact enrichment (open data)

• Scalable

• Decoupled

• Service based

• Rapid development

• Prove the concepts

Architecture - Acquire

• Acquire the data from Twitter

• If classifying in real time:

• Store then classify?

• Classify then store?

• Tools

• Twitter streaming API

• Keywords

• Node.js

• Several different packages to interface with Twitter APIs

• Amazon

• EC2

• SQS (?) Extremely useful, but drives the cost up

Architecture - Analyze

• Classification interface

• Service based – HTTP REST

• Push or pull?

• Push – classifiers listen on port 80

• Pull – classifier starts pulling from an established work queue

• Both highly scalable and flexible with respect to cost.

• Stateless

• R

• Human trained machine learning packages available

• Cloud friendly – no licenses

• Automatable – from install, configuration, execution
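To make the pull model concrete, here is a minimal Python sketch of a stateless worker draining a work queue. Python's in-process `queue.Queue` stands in for a service such as SQS, and `classify` is a hypothetical keyword rule used as a placeholder, not the R model described in the talk.

```python
import queue


def classify(text):
    """Stand-in classifier (hypothetical keyword rule, not the talk's R model)."""
    return "relevant" if "food poisoning" in text.lower() else "irrelevant"


def drain(work_queue):
    """Pull items until the queue is empty; no state is kept between items."""
    results = []
    while True:
        try:
            text = work_queue.get_nowait()
        except queue.Empty:
            break
        results.append((text, classify(text)))
        work_queue.task_done()
    return results


work = queue.Queue()
for tweet_text in ["Ugh! I got food poisoning", "Nice day in Chicago"]:
    work.put(tweet_text)

results = drain(work)
print(results)
```

Because each worker holds no state, more workers can be started against the same queue when volume rises and stopped when it falls, which is where the cost flexibility comes from.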

Architecture - Store

• Store JSON as an object (document store) or normalize (relational database)?

• Relational databases

• disk I/O intensive – not cloud friendly

• allow complex indexing

• Easy to get a business intelligence front end on them

• Requires a schema / ETL

• Key-value document stores

• Designed to be scalable – doesn’t need fast disks

• Indexing is not nearly as flexible as RDBMS

• More difficult to front a UI – no “drag and drop” tools

• No schema / ETL needed.

• Not as mature

• MongoDB / Riak
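The trade-off between storing the JSON whole and normalizing it can be illustrated with a short Python sketch. It uses an in-memory SQLite database for both options purely for convenience (an assumption of this example; it is not MongoDB or Riak), and the table names are hypothetical.

```python
import json
import sqlite3

tweet = {
    "id": 394462908261740540,
    "text": "Flu :(",
    "created_at": "Sun Oct 27 13:57:40 +0000 2013",
    "user": {"id": 594141140, "screen_name": "yultiana"},
}

db = sqlite3.connect(":memory:")

# Option 1: document style. Keep the JSON whole; no schema or ETL needed.
db.execute("CREATE TABLE tweet_docs (id INTEGER PRIMARY KEY, body TEXT)")
db.execute("INSERT INTO tweet_docs VALUES (?, ?)", (tweet["id"], json.dumps(tweet)))

# Option 2: normalized. A schema and an ETL step, but flexible indexing and joins.
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, screen_name TEXT)")
db.execute(
    "CREATE TABLE tweets (id INTEGER PRIMARY KEY, "
    "user_id INTEGER REFERENCES users(id), text TEXT, created_at TEXT)"
)
db.execute("INSERT INTO users VALUES (?, ?)", (tweet["user"]["id"], tweet["user"]["screen_name"]))
db.execute(
    "INSERT INTO tweets VALUES (?, ?, ?, ?)",
    (tweet["id"], tweet["user"]["id"], tweet["text"], tweet["created_at"]),
)

doc = json.loads(db.execute("SELECT body FROM tweet_docs").fetchone()[0])
row = db.execute(
    "SELECT u.screen_name, t.text FROM tweets t JOIN users u ON u.id = t.user_id"
).fetchone()
print(doc["user"]["screen_name"], row)
```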

Architecture – Presentation

• Least need for cloud friendly scalability here?

• Options

• Licensed BI software – Tableau, Endeca, Jaspersoft, Pentaho

• Open source BI software – SpagoBI

• Roll your own - PHP, Ruby, Visual Basic, Javascript, etc

• Connect to an existing system instead?

Costs – Real Time Classification

• Number of tweets collected per day: 1,000,000 (comfortable - .25%)

• Machine used on EC2 to acquire (node.js): micro

• $.02/hr * 24 hrs = $.48/day

• Machine used on EC2 to classify (R): small (x2)

• $.06/hr * 24 hrs = $1.44/day * 2 = $2.88/day

• Machine used on EC2 to store (MongoDB): large

• $.24/hr * 24 hrs = $5.76/day

• Machine used on EC2 for GUI (Apache): small

• $.06/hr * 24 hrs = $1.44/day

$0.48 + $2.88 + $5.76 + $1.44 = $10.56/day; $10.56 / 1,000,000 tweets = $.00001056/tweet (about .001 cents/tweet)

Can add more zeros if you relax real-time classification (spot instances)
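The arithmetic on this slide can be reproduced with a short Python sketch. The hourly rates and machine counts are the ones quoted above; the variable names are only for illustration. The result works out to roughly $0.0000106 per tweet, i.e. about a thousandth of a cent.

```python
HOURS_PER_DAY = 24
TWEETS_PER_DAY = 1_000_000

# (role, hourly rate in dollars, number of machines), as quoted on the slide.
machines = [
    ("acquire (Node.js, micro)", 0.02, 1),
    ("classify (R, small)", 0.06, 2),
    ("store (MongoDB, large)", 0.24, 1),
    ("GUI (Apache, small)", 0.06, 1),
]

daily = {role: rate * HOURS_PER_DAY * count for role, rate, count in machines}
total_per_day = sum(daily.values())
dollars_per_tweet = total_per_day / TWEETS_PER_DAY

for role, cost in daily.items():
    print(f"{role}: ${cost:.2f}/day")
print(f"total: ${total_per_day:.2f}/day, ${dollars_per_tweet:.8f}/tweet")
```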

Costs - Archive

• Size of average tweet: 2.5 KB

• Cost to archive:

• S3: $.095 per GB/month

• $0.0000002 per tweet per month

• Glacier: $.01 per GB/month

• $0.00000002 per tweet per month

• Compression will add even more zeros, but will require more computing power, and mean more latency for post collection data analysis. Can be automated.
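The per-tweet archive figures follow directly from the average tweet size and the per-GB prices. A small Python sketch of that calculation, assuming decimal units (1 GB = 1,000,000 KB) and, for the monthly total, a hypothetical 30 days of collection at 1,000,000 tweets per day:

```python
TWEET_KB = 2.5
KB_PER_GB = 1_000_000  # decimal units, an assumption of this sketch

# Dollars per GB per month, as quoted on the slide.
PRICES = {"S3": 0.095, "Glacier": 0.01}


def per_tweet_month(price_per_gb):
    """Monthly cost in dollars of archiving one average-sized tweet."""
    return price_per_gb * TWEET_KB / KB_PER_GB


costs = {name: per_tweet_month(price) for name, price in PRICES.items()}

# Hypothetical volume: 30 days at 1,000,000 tweets per day.
tweets_archived = 30 * 1_000_000
monthly = {name: cost * tweets_archived for name, cost in costs.items()}

print(costs)
print(monthly)
```

Uncompressed, a month of collection at that rate is about 75 GB, so the monthly S3 bill for it stays in single-digit dollars.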

Use Cases

• Foodborne Chicago (http://foodborne.smartchicagoapps.org/)

• Public-private partnership with City of Chicago Dept. of Public Health and Smart Chicago Collaborative

• Reach out to city residents on Twitter tweeting about food poisoning symptoms, in an attempt to get them to log information in the City’s 311 database (via the Open311 API)

• Once in the 311 database, it follows established City workflows, and becomes actionable

• Numbers (1 year):

• 2,390 tweets classified as related to food poisoning

• 282 tweets responded to

• 205 reports submitted

• 145 inspections

• Real time classification examples:

• “Ugh! I got food poisoning from the McDonald’s on Halstead!”

http://184.73.52.31/cgi-bin/R/fp_classifier?text=Ugh!%20I%20got%20food%20poisoning%20from%20McDonalds%20on%20Halstead

• “U of Chicago releases a new paper on the effects of food poisoning”

http://184.73.52.31/cgi-bin/R/fp_classifier?text=U%20of%20Chicago%20releases%20new%20paper%20on%20the%20effects%20of%20food%20poisoning

• Video: http://www.youtube.com/watch?v=RNf9XQ_25Yw&feature=youtu.be
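The classifier requests above are plain HTTP GETs with the tweet text percent-encoded into the `text` query parameter. A minimal Python sketch of building such a URL follows; it only constructs the string and does not contact the server, and the helper name is hypothetical.

```python
from urllib.parse import quote

# Endpoint as shown in the examples above.
CLASSIFIER_URL = "http://184.73.52.31/cgi-bin/R/fp_classifier"


def classifier_request_url(text):
    """Percent-encode the tweet text into the classifier's `text` parameter.
    '!' is left unescaped to match the URLs shown on the slide."""
    return f"{CLASSIFIER_URL}?text={quote(text, safe='!')}"


url = classifier_request_url("Ugh! I got food poisoning from McDonalds on Halstead")
print(url)
```

Keeping the interface this simple is what lets any component in the pipeline, in any language, call the classifier as a service.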


Use Cases

• Disease Tracker

• Large scale attempt to track disease occurrences in the United States.

• Sponsored by the Dept. of HHS

• Approximately 1 million tweets a day (cold, flu) classified in real time

• EC2 scalable instances

• Geolocation

• Cost to run for 6 months: $850

Future Directions

• Turnkey service

• Can all this functionality be abstracted down to a pushbutton service?

• Open data

• Can you advertise the data collected and how you enriched it, and allow others to come along and enrich it as well?

• General purpose bridge between Twitter and issue tracking databases

• Big industry problem

Github Sources

• Tweet Collector

• https://github.com/smartchicago/TweetCollector

• Classifier Code

• https://github.com/corynissen/foodborne_classifier

