Upload
javier-ramirez
View
358
Download
2
Embed Size (px)
DESCRIPTION
This is the story of how https://teowaki.com added bigdata analytics using a very economic approach. Bigdata is amazing. You can get insights from your users, find interesting patterns and have lots of geek fun. Problem is big data usually means many servers, a complex set up, intensive monitoring and a steep learning curve. All those things cost money. If you don't have the money, you are losing all the fun. In my talk I will show you how you can use Redis, Google Bigquery and Apps Script to manage big data from your application for under $1 per month. Don't you feel like running a RegExp over 300 million rows in just 5 seconds?
Citation preview
javier ramirez@supercoco9
BigData for small pocketsRedis, BigQuery and Apps Scripts
RubyC Kiev 2014
moral of the story
you can do big, if you know how
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
REST API (Ruby on Rails) +
Web on top (AngularJS)
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
data that’s an order of magnitude greater than data you’re accustomed to
javier ramirez @supercoco9 https://teowaki.com rubyc kiev2014
Doug Laney VP Research, Business Analytics and Performance Management at Gartner
data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures.
Ed Dumbill program chair for the O’Reilly Strata Conference
javier ramirez @supercoco9 https://teowaki.com rubyc kiev2014
bigdata is doing a fullscan to 330MM rows, matching them against a regexp, and getting the result (223MM rows) in just 5 seconds
javier ramirez @supercoco9 https://teowaki.com rubyc kiev2014
Javier Ramirezimpresionable teowaki founder
1. non intrusive metrics2. keep the history3. avoid vendor lock-in4. interactive queries5. cheap6. extra ball: real time
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
javier ramirez @supercoco9 https://teowaki.com rubyc kiev2014
open source, BSD licensed, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets, sorted sets and hyperloglogs.
http://redis.io
started in 2009 by Salvatore Sanfilippo @antirez
111 contributors at https://github.com/antirez/redis
javier ramirez @supercoco9 https://teowaki.com rubyc kiev2014
twitterstackoverflowpinterestbooking.comWorld of WarcraftYouPornHipChatSnapchat
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
ntopngLogStash
Intel(R) Xeon(R) CPU E5520 @ 2.27GHz (with pipelining)
$ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -P 16 -q
SET: 552,028 requests per secondGET: 707,463 requests per secondLPUSH: 767,459 requests per secondLPOP: 770,119 requests per second
Intel(R) Xeon(R) CPU E5520 @ 2.27GHz (without pipelining)$ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -q
SET: 122,556 requests per secondGET: 123,601 requests per secondLPUSH: 136,752 requests per secondLPOP: 132,424 requests per second
javier ramirez @supercoco9 https://teowaki.com rubyc kiev2014
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
Non intrusive metrics
Capture data really fast.
Then send the data on the background
javier ramirez @supercoco9 https://teowaki.com rubyc kiev2014
Redis keeps
everything in memory all the time
javier ramirez @supercoco9 https://teowaki.com rubyc kiev2014
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
Gzip to AWS S3/Glacier
orGoogle Cloud Storage
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
Hadoop (map/reduce)
javier ramirez @supercoco9 https://teowaki.com nosqlmatters 2013
http://hadoop.apache.org/
started in 2005 by Doug Cutting and Mike Cafarella
cassandra
javier ramirez @supercoco9 https://teowaki.com nosqlmatters 2013
http://cassandra.apache.org/
released in 2008 by facebook.
other big data solutions:
hadoop+voldemort+kafka
hbase
javier ramirez @supercoco9 https://teowaki.com nosqlmatters 2013
http://engineering.linkedin.com/projects
http://hbase.apache.org/
Amazon Redshift
javier ramirez @supercoco9 https://teowaki.com nosqlmatters 2013
http://aws.amazon.com/redshift/
but...
not interactive enough
hard to set up and monitor
expensive cluster
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
Our choice:
Google BigQuery
Data analysis as a service
http://developers.google.com/bigquery
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
Based on “Dremel”
Specifically designed for interactive queries over petabytes of real-time data
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
loading data
You just send the data intext (or JSON) format
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
SQL
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
select name from USERS order by date;
select count(*) from users;
select max(date) from USERS;
select sum(total) from ORDERS group by user;
specific extensions for analytics
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
withinflattennest
stddev
topfirstlastnth
variance
var_popvar_samp
covar_popcovar_samp
quantiles
Things you always wanted to try but were too scared to
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
select count(*) from publicdata:samples.wikipedia
where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0;
223,163,387Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32¢)
columnar storage
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
highly distributed execution using a tree
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
web console screenshot
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
window functions
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
our most active user
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
country segmented traffic
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
10 request we should be caching
correlations.
not to mistake with causality
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
javier ramirez @supercoco9 http://teowaki.com rubyc kiev2014
5 most created resources
select uri, count(*) total from stats where method = 'POST' group by URI;
javier ramirez @supercoco9 http://teowaki.com rubyc kiev2014
...but
/users/javier/shouts/users/rgo/shouts/teams/javier-community/links/teams/nosqlmatters-cgn/links
javier ramirez @supercoco9 http://teowaki.com rubyc kiev2014
5 most created resources
new users per month
SELECT repository_name, repository_language, repository_description, COUNT(repository_name) as cnt,repository_urlFROM github.timelineWHERE type="WatchEvent"AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC("#{yesterday} 20:00:00")AND repository_url IN (
SELECT repository_urlFROM github.timelineWHERE type="CreateEvent"AND PARSE_UTC_USEC(repository_created_at) >= PARSE_UTC_USEC('#{yesterday} 20:00:00')AND repository_fork = "false"AND payload_ref_type = "repository"GROUP BY repository_url
)GROUP BY repository_name, repository_language, repository_description, repository_urlHAVING cnt >= 5ORDER BY cnt DESCLIMIT 25
gem install bigbroda
https://github.com/michelson/BigBroda
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
Automation with Apps Script
Read from bigquery
Create a spreadsheet on Drive
E-mail it everyday as a PDF
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
bigquery pricing
$26 per stored TB1000000 rows => $0.00416 / month
£0.00243 / month
$5 per processed TB1 full scan = 160 MB1 count = 0 MB1 full scan over 1 column = 5.4 MB100 GB => $0.05 / month £0.03
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
£0.054307 / month*
per 1MM rows
*the 1st 100GB every month are free of charge
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
1. non intrusive metrics2. keep the history3. avoid vendor lock-in4. interactive queries5. cheap6. extra ball: real time
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
ig
Find related links at
https://teowaki.com/teams/javier-community/link-categories/bigquery-talk
Thanks!спасибі
Javier Ramírez@supercoco9
rubyc kiev 14