70
javier ramirez @supercoco9 BigData for small pockets Redis, BigQuery and Apps Scripts RubyC Kiev 2014

Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

Embed Size (px)

DESCRIPTION

This is the story of how https://teowaki.com added bigdata analytics using a very economic approach. Bigdata is amazing. You can get insights from your users, find interesting patterns and have lots of geek fun. Problem is big data usually means many servers, a complex set up, intensive monitoring and a steep learning curve. All those things cost money. If you don't have the money, you are losing all the fun. In my talk I will show you how you can use Redis, Google Bigquery and Apps Script to manage big data from your application for under $1 per month. Don't you feel like running a RegExp over 300 million rows in just 5 seconds?

Citation preview

Page 1: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

javier ramirez@supercoco9

BigData for small pocketsRedis, BigQuery and Apps Scripts

RubyC Kiev 2014

Page 2: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014
Page 3: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014
Page 4: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014
Page 5: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014
Page 6: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014
Page 7: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014
Page 8: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

moral of the story

you can do big, if you know how

Page 9: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014
Page 10: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014
Page 11: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

REST API (Ruby on Rails) +

Web on top (AngularJS)

Page 12: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

Page 13: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

data that’s an order of magnitude greater than data you’re accustomed to

javier ramirez @supercoco9 https://teowaki.com rubyc kiev2014

Doug Laney VP Research, Business Analytics and Performance Management at Gartner

Page 14: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures.

Ed Dumbill program chair for the O’Reilly Strata Conference

javier ramirez @supercoco9 https://teowaki.com rubyc kiev2014

Page 15: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

bigdata is doing a fullscan to 330MM rows, matching them against a regexp, and getting the result (223MM rows) in just 5 seconds

javier ramirez @supercoco9 https://teowaki.com rubyc kiev2014

Javier Ramirezimpresionable teowaki founder

Page 16: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

1. non intrusive metrics2. keep the history3. avoid vendor lock-in4. interactive queries5. cheap6. extra ball: real time

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

Page 17: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

javier ramirez @supercoco9 https://teowaki.com rubyc kiev2014

Page 18: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

open source, BSD licensed, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets, sorted sets and hyperloglogs.

http://redis.io

started in 2009 by Salvatore Sanfilippo @antirez

111 contributors at https://github.com/antirez/redis

javier ramirez @supercoco9 https://teowaki.com rubyc kiev2014

Page 19: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

twitterstackoverflowpinterestbooking.comWorld of WarcraftYouPornHipChatSnapchat

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

ntopngLogStash

Page 20: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

Intel(R) Xeon(R) CPU E5520 @ 2.27GHz (with pipelining)

$ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -P 16 -q

SET: 552,028 requests per secondGET: 707,463 requests per secondLPUSH: 767,459 requests per secondLPOP: 770,119 requests per second

Intel(R) Xeon(R) CPU E5520 @ 2.27GHz (without pipelining)$ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -q

SET: 122,556 requests per secondGET: 123,601 requests per secondLPUSH: 136,752 requests per secondLPOP: 132,424 requests per second

javier ramirez @supercoco9 https://teowaki.com rubyc kiev2014

Page 21: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

Non intrusive metrics

Capture data really fast.

Then send the data on the background

Page 22: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

javier ramirez @supercoco9 https://teowaki.com rubyc kiev2014

Page 23: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

Redis keeps

everything in memory all the time

javier ramirez @supercoco9 https://teowaki.com rubyc kiev2014

Page 24: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

Page 25: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

Gzip to AWS S3/Glacier

orGoogle Cloud Storage

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

Page 26: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

Page 27: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

Hadoop (map/reduce)

javier ramirez @supercoco9 https://teowaki.com nosqlmatters 2013

http://hadoop.apache.org/

started in 2005 by Doug Cutting and Mike Cafarella

Page 28: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

cassandra

javier ramirez @supercoco9 https://teowaki.com nosqlmatters 2013

http://cassandra.apache.org/

released in 2008 by facebook.

Page 29: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

other big data solutions:

hadoop+voldemort+kafka

hbase

javier ramirez @supercoco9 https://teowaki.com nosqlmatters 2013

http://engineering.linkedin.com/projects

http://hbase.apache.org/

Page 30: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

Amazon Redshift

javier ramirez @supercoco9 https://teowaki.com nosqlmatters 2013

http://aws.amazon.com/redshift/

Page 31: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

but...

not interactive enough

hard to set up and monitor

expensive cluster

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

Page 32: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

Our choice:

Google BigQuery

Data analysis as a service

http://developers.google.com/bigquery

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

Page 33: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

Based on “Dremel”

Specifically designed for interactive queries over petabytes of real-time data

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

Page 34: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

loading data

You just send the data intext (or JSON) format

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

Page 35: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

SQL

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

select name from USERS order by date;

select count(*) from users;

select max(date) from USERS;

select sum(total) from ORDERS group by user;

Page 36: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

specific extensions for analytics

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

withinflattennest

stddev

topfirstlastnth

variance

var_popvar_samp

covar_popcovar_samp

quantiles

Page 37: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

Things you always wanted to try but were too scared to

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

select count(*) from publicdata:samples.wikipedia

where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0;

223,163,387Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32¢)

Page 38: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

columnar storage

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

Page 39: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

highly distributed execution using a tree

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

Page 40: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

web console screenshot

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

Page 41: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

window functions

Page 42: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

our most active user

Page 43: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

country segmented traffic

Page 44: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

10 request we should be caching

Page 45: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

correlations.

not to mistake with causality

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

Page 46: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

javier ramirez @supercoco9 http://teowaki.com rubyc kiev2014

5 most created resources

select uri, count(*) total from stats where method = 'POST' group by URI;

Page 47: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

javier ramirez @supercoco9 http://teowaki.com rubyc kiev2014

...but

/users/javier/shouts/users/rgo/shouts/teams/javier-community/links/teams/nosqlmatters-cgn/links

Page 48: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

javier ramirez @supercoco9 http://teowaki.com rubyc kiev2014

5 most created resources

Page 49: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

new users per month

Page 51: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

SELECT repository_name, repository_language, repository_description, COUNT(repository_name) as cnt,repository_urlFROM github.timelineWHERE type="WatchEvent"AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC("#{yesterday} 20:00:00")AND repository_url IN (

SELECT repository_urlFROM github.timelineWHERE type="CreateEvent"AND PARSE_UTC_USEC(repository_created_at) >= PARSE_UTC_USEC('#{yesterday} 20:00:00')AND repository_fork = "false"AND payload_ref_type = "repository"GROUP BY repository_url

)GROUP BY repository_name, repository_language, repository_description, repository_urlHAVING cnt >= 5ORDER BY cnt DESCLIMIT 25

Page 52: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

gem install bigbroda

https://github.com/michelson/BigBroda

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

Page 53: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014
Page 54: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014
Page 55: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014
Page 56: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014
Page 57: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014
Page 58: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014
Page 59: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014
Page 60: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

Automation with Apps Script

Read from bigquery

Create a spreadsheet on Drive

E-mail it everyday as a PDF

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

Page 61: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014
Page 62: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014
Page 63: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014
Page 64: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014
Page 65: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014
Page 66: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

bigquery pricing

$26 per stored TB1000000 rows => $0.00416 / month

£0.00243 / month

$5 per processed TB1 full scan = 160 MB1 count = 0 MB1 full scan over 1 column = 5.4 MB100 GB => $0.05 / month £0.03

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

Page 67: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

£0.054307 / month*

per 1MM rows

*the 1st 100GB every month are free of charge

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

Page 68: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

1. non intrusive metrics2. keep the history3. avoid vendor lock-in4. interactive queries5. cheap6. extra ball: real time

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

Page 69: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

ig

Page 70: Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

Find related links at

https://teowaki.com/teams/javier-community/link-categories/bigquery-talk

Thanks!спасибі

Javier Ramírez@supercoco9

rubyc kiev 14