open-analytics
Joe Olson
Data Architect
Smart Chicago Collaborative
27 Mar [email protected]
(All the cool buzzwords in one place!)
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics
Social Media - Twitter
• What can we learn from Twitter?
• 400 million tweets per day
source: http://articles.washingtonpost.com/2013-03-21/business/37889387_1_tweets-jack-dorsey-twitter
• 218 million users
source: http://techcrunch.com/2013/10/03/bweeting/
• Excellent source of sentiment
• Excellent source of big data
• Prototyping
• Modeling natural language
• Resume padding
Social Media - Twitter
• How do we get at the data?
• Twitter provided APIs:
• https://dev.twitter.com/docs
• Streaming
• Set up a real time data stream (json) based on keywords
• REST (v1.1)
• Make REST requests, and get results
• Possible parameters:
• Geospatial bounding box
• By time
• By user, hashtag, retweets etc
• Fire hose
• Big $$$. Big data
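The keyword and bounding-box parameters listed above can be sketched as a small helper that assembles the request parameters for a filtered stream. This is a minimal sketch, not the deck's collector code: it assumes the v1.1 streaming filter's `track` (comma-separated keywords) and `locations` (longitude/latitude pairs, southwest corner first) parameters, and the function name and the Chicago coordinates are illustrative assumptions.

```python
from urllib.parse import urlencode


def build_filter_params(keywords, bounding_box=None):
    """Build form parameters for a keyword / bounding-box stream filter.

    bounding_box is (sw_lng, sw_lat, ne_lng, ne_lat); this helper is
    hypothetical and only assembles the parameters, it does not connect.
    """
    params = {"track": ",".join(keywords)}
    if bounding_box is not None:
        sw_lng, sw_lat, ne_lng, ne_lat = bounding_box
        params["locations"] = f"{sw_lng},{sw_lat},{ne_lng},{ne_lat}"
    return params


# Rough, illustrative box around Chicago (assumed values, not from the deck).
CHICAGO_BOX = (-87.94, 41.64, -87.52, 42.02)

params = build_filter_params(["flu", "food poisoning"], CHICAGO_BOX)
encoded_body = urlencode(params)  # form-encoded body for the request
print(params)
print(encoded_body)
```

Authentication (OAuth) and the long-lived HTTP connection are deliberately left out; a client library would normally handle both.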
Social Media - Twitter
• Information & Obstacles
• Who
• What
• At best: Plain English (!)
• Worse: (Spanish or Arabic or Portuguese...)
• Worst: “Textspeak” symbols :-0, UTF8 chars, etc.
• Absolute Worst: combination of all of them
• Where
• 1-2% with latitude / longitude
• Geocode
• When
Social Media - Twitter
JSON Tweet example:
• "created_at":"Sun Oct 27 13:57:40 +0000 2013",
• "id":394462908261740540,
• "text":"Flu :(",
• "source":"<a href=\"http://twitmania.com\" rel=\"nofollow\">TwitMania™</a>",
• "user":{
• "id":594141140,
• "name":"Yultiana Farida N",
• "screen_name":"yultiana",
• "followers_count":231,
• "friends_count":252,
• "created_at":"Tue May 29 23:58:25 +0000 2012",
• "statuses_count":2397,
• },
• "geo":null,
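To connect the example tweet to the who / what / where / when questions, here is a minimal sketch that parses a trimmed copy of the JSON above with the Python standard library. The `summarize` function and the choice of fields are assumptions made for illustration; a real tweet carries many more fields.

```python
import json
from datetime import datetime

# Trimmed copy of the example tweet, made valid JSON for parsing.
RAW_TWEET = """
{
  "created_at": "Sun Oct 27 13:57:40 +0000 2013",
  "id": 394462908261740540,
  "text": "Flu :(",
  "user": {"id": 594141140, "screen_name": "yultiana"},
  "geo": null
}
"""

# Timestamp layout used by the created_at field in the example.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def summarize(raw):
    """Pull the who / what / where / when fields out of one tweet."""
    tweet = json.loads(raw)
    return {
        "who": tweet["user"]["screen_name"],
        "what": tweet["text"],
        "where": tweet.get("geo"),  # None for most tweets
        "when": datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT),
    }


summary = summarize(RAW_TWEET)
print(summary)
```

The `where` value is `None` here, which is the common case the "1-2% with latitude / longitude" bullet describes.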
Cloud Computing
• What does cloud computing bring to the table?
• Amazon’s EC2:
• Commoditized hardware
• Low cost
• Only charged for resources you use
• No long term commitments
• Scalable
• "Throwaway" mentality
**IF** you play by their rules!
Cloud Computing – AWS
• Tools
• Virtual Machines
• # of Processors, RAM, OS, disk capacity and I/O – all configurable
• Price range: $.02/hr - $4.60/hr
• Licensed OSes cost 50% more than Linux OSes
• Archive Storage
• S3 / Glacier
• Work Queues
• SQS
• Data Stores
• DynamoDB (key-value store), Redshift (analysis store)
• Virtual Networking
• Routers, VPN gateways, access control lists, etc
• APIs
• Command line
• HTTPS REST
• Native programming languages (Python, bash, PHP, Java etc.)
Ideal for rapid prototyping / proof of concepts
Cloud Computing – AWS
• APIs
• Basic
• Start an instance (and start billing)
• Stop an instance (stop billing)
• Insert item into queue
• Remove item from queue
• Write to backup store
• Ultra advanced
• Reserved vs. on demand vs. spot instances
• Price can drop as much as 80% due to market demand
• Instance can disappear at any time
Big Data Analytics
• Can we skirt the “big data” problem by distilling millions and millions of “noise” tweets down into a smaller, more desirable data set?
• Enrich in real time, rather than on archived data, and avoid the overhead of map/reduce?
• Possible Enrichment of raw data:
• Classification – separate tweets into “relevant” and “irrelevant”
• Geocoding – improve on the 1-2% ?
• Aggregation -> map reduce
• Mapping -> Reduce Function -> Output
• AWS – Elastic Map Reduce
• Clustering
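The Mapping -> Reduce Function -> Output flow above can be illustrated in-process, without a cluster. This is a minimal sketch under stated assumptions: the sample tweets, the keyword list, and the function names are made up for illustration, and a service such as Elastic Map Reduce would distribute the same two steps across machines.

```python
from functools import reduce

# Made-up sample tweets and keywords, for illustration only.
TWEETS = [
    {"text": "Flu :("},
    {"text": "I have a cold"},
    {"text": "flu shot today"},
    {"text": "nice weather"},
]
KEYWORDS = ("flu", "cold")


def map_tweet(tweet):
    """Map step: emit one (keyword, 1) pair per keyword found in the text."""
    text = tweet["text"].lower()
    return [(keyword, 1) for keyword in KEYWORDS if keyword in text]


def reduce_counts(totals, pair):
    """Reduce step: fold one (keyword, count) pair into the running totals."""
    keyword, count = pair
    totals[keyword] = totals.get(keyword, 0) + count
    return totals


mapped = [pair for tweet in TWEETS for pair in map_tweet(tweet)]
counts = reduce(reduce_counts, mapped, {})
print(counts)
```

Because the map step looks at one tweet at a time, it can also run as each tweet arrives, which is the real-time enrichment idea raised earlier in this slide.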
Machine Learning
• Classification: relevant, or irrelevant?
• Human trained model
• Once model is established, bounce new data off it for classification
• Validation of model
• Accuracy = (Total # of classifications – Mismatches between machine and human) / Total # of classifications
• Crowdsourcing – AWS Mechanical Turk
• Improve model by feeding disagreements back into the model
• Our best text classification model to date: accuracy in the low 90% range
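The accuracy definition above translates directly into code. The sketch below assumes two equal-length lists of labels, one from human reviewers and one from the model; the function names and the sample labels are hypothetical. The second helper collects the disagreements that would be fed back into training.

```python
def accuracy(human_labels, machine_labels):
    """(total classifications - mismatches) / total classifications."""
    if len(human_labels) != len(machine_labels):
        raise ValueError("label lists must have the same length")
    total = len(human_labels)
    if total == 0:
        raise ValueError("need at least one classification")
    mismatches = sum(1 for h, m in zip(human_labels, machine_labels) if h != m)
    return (total - mismatches) / total


def disagreements(texts, human_labels, machine_labels):
    """Return (text, human label) pairs where the model disagreed."""
    return [
        (text, h)
        for text, h, m in zip(texts, human_labels, machine_labels)
        if h != m
    ]


# Hypothetical validation sample.
texts = ["t1", "t2", "t3", "t4", "t5"]
human = ["relevant", "irrelevant", "relevant", "relevant", "irrelevant"]
machine = ["relevant", "irrelevant", "irrelevant", "relevant", "irrelevant"]

score = accuracy(human, machine)
to_retrain = disagreements(texts, human, machine)
print(score, to_retrain)
```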
Open Source
• Friendly to the commoditized computing paradigm
• Don’t have to worry about licensing issues
• Contributes to the “throwaway” discipline
• Don’t have to re-invent the wheel (collaboration)
• Solutions applicable to all parts of the architecture
• Acquire data: Node.js – non blocking
• Analyze data: R – statistical engine
• Store and query data: MongoDB (document store) or Riak (key-value database)
Architecture
• We know Twitter is providing a mountain of data from all parts of the world
• We know Amazon is providing a framework of low cost, on-demand, no commitment computing
• Open source is providing a rich tool set
• Goals:
• Architect with cost in mind!
• Enrichment - Real time and after-the-fact enrichment (open data)
• Scalable
• Decoupled
• Service based
• Rapid development
• Prove the concepts
Architecture - Acquire
• Acquire the data from Twitter
• If classifying in real time:
• Store then classify?
• Classify then store?
• Tools
• Twitter streaming API
• Keywords
• Node.js
• Several different packages to interface with Twitter APIs
• Amazon
• EC2
• SQS (?) Extremely useful, but drives the cost up
Architecture - Analyze
• Classification interface
• Service based – HTTP REST
• Push or pull?
• Push – classifiers listen on port 80
• Pull – classifier starts pulling from an established work queue
• Both highly scalable and flexible with respect to cost.
• Stateless
• R
• Human trained machine learning packages available
• Cloud friendly – no licenses
• Automatable – from install, configuration, execution
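The pull model above can be sketched with an in-process queue standing in for a hosted work queue such as SQS. Everything here is an assumption for illustration: the keyword rule in `classify` is a placeholder for the trained model, and `queue.Queue` only mimics the behaviour of a remote queue.

```python
import queue


def classify(text):
    """Placeholder for the trained classifier: a simple keyword rule."""
    return "relevant" if "food poisoning" in text.lower() else "irrelevant"


def drain(work_queue):
    """Pull items until the queue is empty, classifying each one.

    The worker keeps no state between items, so any number of identical
    workers could pull from the same queue.
    """
    results = []
    while True:
        try:
            text = work_queue.get_nowait()
        except queue.Empty:
            break
        results.append((text, classify(text)))
        work_queue.task_done()
    return results


work = queue.Queue()
for tweet_text in (
    "Ugh! I got food poisoning from McDonalds on Halstead",
    "Nice weather in Chicago today",
):
    work.put(tweet_text)

results = drain(work)
print(results)
```

In the push model the same `classify` function would sit behind an HTTP endpoint instead; the stateless design is what lets either variant scale by adding instances.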
Architecture - Store
• Store JSON as an object (document store) or normalize (relational database)?
• Relational databases
• disk I/O intensive – not cloud friendly
• allow complex indexing
• Easy to get a business intelligence front end on them
• Requires a schema / ETL
• Key-value document stores
• Designed to be scalable – doesn’t need fast disks
• Indexing is not nearly as flexible as RDBMS
• More difficult to front a UI – no “drag and drop” tools
• No schema / ETL needed.
• Not as mature
• MongoDB / Riak
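The document-versus-relational trade-off above can be made concrete with the standard library alone. This is a rough sketch, not either database's API: `json.dumps` stands in for writing to a document store, an in-memory SQLite database stands in for the relational side, and the table layout is an assumption.

```python
import json
import sqlite3

tweet = {
    "id": 394462908261740540,
    "text": "Flu :(",
    "created_at": "Sun Oct 27 13:57:40 +0000 2013",
    "user": {"id": 594141140, "screen_name": "yultiana"},
}

# Document style: store the tweet as-is, no schema or ETL step.
document = json.dumps(tweet)

# Relational style: define a schema, then split the tweet across tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, screen_name TEXT)")
conn.execute(
    "CREATE TABLE tweets ("
    "id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES users(id), "
    "text TEXT, created_at TEXT)"
)
conn.execute(
    "INSERT INTO users VALUES (?, ?)",
    (tweet["user"]["id"], tweet["user"]["screen_name"]),
)
conn.execute(
    "INSERT INTO tweets VALUES (?, ?, ?, ?)",
    (tweet["id"], tweet["user"]["id"], tweet["text"], tweet["created_at"]),
)

# The payoff of the schema: flexible joins and indexes.
row = conn.execute(
    "SELECT u.screen_name, t.text FROM tweets t JOIN users u ON u.id = t.user_id"
).fetchone()
print(document)
print(row)
```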
Architecture – Presentation
• Least need for cloud friendly scalability here?
• Options
• Licensed BI software – Tableau, Endeca, Jaspersoft, Pentaho
• Open source BI software – SpagoBI
• Roll your own - PHP, Ruby, Visual Basic, Javascript, etc
• Connect to an existing system instead?
Costs – Real Time Classification
• Number of tweets collected per day: 1,000,000 (comfortable; roughly .25% of the 400 million daily tweets)
• Machine used on EC2 to acquire (node.js): micro
• $.02/hr * 24 hrs = $.48/day
• Machine used on EC2 to classify (R): small (x2)
• $.06/hr * 24 hrs = $1.44/day * 2 = $2.88/day
• Machine used on EC2 to store (MongoDB): large
• $.24/hr * 24 hrs = $5.76/day
• Machine used on EC2 for GUI (Apache): small
• $.06/hr * 24 hrs = $1.44/day
$0.48 + $2.88 + $5.76 + $1.44 = $10.56/day; $10.56 / 1,000,000 tweets = $.00001056 per tweet (about .001 cents)
Can add more zeros if you relax real-time classification (spot instances)
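The arithmetic on this slide can be checked with a few lines of code. The hourly rates and instance counts are the ones quoted above; the dictionary layout and variable names are assumptions made for the sketch.

```python
HOURS_PER_DAY = 24
TWEETS_PER_DAY = 1_000_000

# role: (hourly rate in dollars, number of instances), as quoted on the slide
INSTANCES = {
    "acquire (node.js, micro)": (0.02, 1),
    "classify (R, small)": (0.06, 2),
    "store (MongoDB, large)": (0.24, 1),
    "GUI (Apache, small)": (0.06, 1),
}

daily_cost = sum(
    rate * HOURS_PER_DAY * count for rate, count in INSTANCES.values()
)
dollars_per_tweet = daily_cost / TWEETS_PER_DAY
cents_per_tweet = dollars_per_tweet * 100

print(f"daily cost:      ${daily_cost:.2f}")
print(f"cost per tweet:  ${dollars_per_tweet:.8f}")
print(f"cents per tweet: {cents_per_tweet:.6f}")
```

Swapping in a lower hourly rate, for example a spot price, shows directly how much relaxing the real-time requirement would save.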
Costs - Archive
• Size of average tweet: 2.5 KB
• Cost to archive:
• S3: $.095 per GB per month
• About $0.0000002 per tweet per month
• Glacier: $.01 per GB per month
• About $0.00000002 per tweet per month
• Compression will add even more zeros, but will require more computing power, and mean more latency for post collection data analysis. Can be automated.
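The per-tweet archive figures follow from the average tweet size and the per-GB prices quoted above. The sketch below assumes decimal units (1 GB = 1,000,000 KB); the function and constant names are illustrative.

```python
AVG_TWEET_KB = 2.5
KB_PER_GB = 1_000_000  # decimal units, an assumption

# Dollars per GB per month, as quoted on the slide.
S3_PRICE = 0.095
GLACIER_PRICE = 0.01


def monthly_cost_per_tweet(price_per_gb):
    """Dollars per month to keep one average-sized tweet archived."""
    return AVG_TWEET_KB / KB_PER_GB * price_per_gb


s3_cost = monthly_cost_per_tweet(S3_PRICE)
glacier_cost = monthly_cost_per_tweet(GLACIER_PRICE)

print(f"S3:      ${s3_cost:.10f} per tweet per month")
print(f"Glacier: ${glacier_cost:.10f} per tweet per month")
```

Compression would shrink `AVG_TWEET_KB` and lower both figures proportionally, at the cost of extra compute and retrieval latency.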
Use Cases
• Foodborne Chicago (http://foodborne.smartchicagoapps.org/)
• Public-private partnership with City of Chicago Dept. of Public Health and Smart Chicago Collaborative
• Reach out to city residents on Twitter tweeting about food poisoning symptoms, in an attempt to get them to log information in the City’s 311 database (via the Open311 API)
• Once in the 311 database, it follows established City workflows, and becomes actionable
• Numbers (1 year):
• 2,390 tweets classified as related to food poisoning
• 282 tweets responded to
• 205 reports submitted
• 145 inspections
• Real time classification examples:
• “Ugh! I got food poisoning from the McDonald’s on Halstead!”
http://184.73.52.31/cgi-bin/R/fp_classifier?text=Ugh!%20I%20got%20food%20poisoning%20from%20McDonalds%20on%20Halstead
• “U of Chicago releases a new paper on the effects of food poisoning”
http://184.73.52.31/cgi-bin/R/fp_classifier?text=U%20of%20Chicago%20releases%20new%20paper%20on%20the%20effects%20of%20food%20poisoning
• Video: http://www.youtube.com/watch?v=RNf9XQ_25Yw&feature=youtu.be
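The classifier URLs in the examples above are plain HTTP GET requests with the tweet text URL-encoded into a `text` query parameter. The sketch below only builds such a URL; it does not call the endpoint, since the server may no longer be reachable and its response format is not described here. The function name is an assumption.

```python
from urllib.parse import quote

# Endpoint as it appears in the slide's example URLs.
CLASSIFIER_URL = "http://184.73.52.31/cgi-bin/R/fp_classifier"


def classifier_request_url(text):
    """Build the GET URL for one piece of text.

    safe="!" keeps the exclamation mark literal, matching the slide's
    example; spaces become %20.
    """
    return f"{CLASSIFIER_URL}?text={quote(text, safe='!')}"


url = classifier_request_url(
    "Ugh! I got food poisoning from McDonalds on Halstead"
)
print(url)
```

Because the interface is a stateless REST call, any client that can issue an HTTP request can use the classifier, which is the service-based design described in the Architecture slides.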
Use Cases
• Disease Tracker
• Large scale attempt to track disease occurrences in the United States.
• Sponsored by the Dept. of HHS
• Approximately 1 million tweets a day (cold, flu) classified in real time
• EC2 scalable instances
• Geolocation
• Cost to run for 6 months: $850
Future Directions
• Turnkey service
• Can all this functionality be abstracted down to a pushbutton service?
• Open data
• Can you advertise the data collected and how you enriched it, and allow others to come along and enrich it as well?
• General purpose bridge between Twitter and issue tracking databases
• Big industry problem
Github Sources
• Tweet Collector
• https://github.com/smartchicago/TweetCollector
• Classifier Code
• https://github.com/corynissen/foodborne_classifier