Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012

DESCRIPTION

Session presented at the Big Data Spain 2012 Conference, 16th Nov 2012, ETSI Telecomunicacion UPM, Madrid (www.bigdataspain.org). More info: http://www.bigdataspain.org/es-2012/conference/crunching-data-with-google-bigquery/jordan-tigani


Crunching Data with BigQuery

Fast analysis of Big Data

Jordan Tigani, Software Engineer


01000001011011100111001101110111011001010111001

00010000001110100011011110010000001110100011010

00011001010010000001010101011011000111010001101

00101101101011000010111010001100101001000000101

00010111010101100101011100110111010001101001011

01111011011100010000001101111011001100010000001

00110001101001011001100110010100101100001000000

11101000110100001100101001000000101010101101110

01101001011101100110010101110010011100110110010

10010110000100000011000010110111001100100001000

00010001010111011001100101011100100111100101110

100101110011001000000011010000110010...........


Big Data at Google

72 hours

100 million gigabytes

SELECT
  kick_ass_product_plan AS strategy,
  AVG(kicking_factor) AS awesomeness
FROM
  lots_of_data
GROUP BY
  strategy

+-------------+-------------+
| strategy    | awesomeness |
+-------------+-------------+
| "Forty-two" | 1000000.01  |
+-------------+-------------+

1 row in result set (10.2 s)

Scanned 100GB


Regular expressions on 13 billion rows...

13 Billion rows

1 TB of data in 4 tables

FAST!


Google's Internal Technology:

Dremel

MapReduce is Flexible but Heavy

• Master constructs the plan and begins spinning up workers
• Mappers read and write to distributed storage
• Map => Shuffle => Reduce
• Reducers read and write to distributed storage

[Diagram: Master; Mapper, Mapper; Reducer; Distributed Storage]

MapReduce is Flexible but Heavy

[Diagram: two MapReduce stages (Stage 1, Stage 2), each with a Master, two Mappers, and a Reducer, connected through Distributed Storage]


Dremel vs MapReduce

• MapReduce

o Flexible batch processing

o High overall throughput

o High latency

• Dremel

o Optimized for interactive SQL queries

o Very low latency

Dremel Architecture

• Columnar Storage
• Long lived shared serving tree
• Partial Reduction
• Diskless data flow

[Diagram: Mixer 0 at the root, two Mixer 1 nodes below it, four Leaf nodes below those, Distributed Storage at the bottom]

Simple Query

SELECT
  state, COUNT(*) count_babies
FROM [publicdata:samples.natality]
WHERE
  year >= 1980 AND year < 1990
GROUP BY state
ORDER BY count_babies DESC
LIMIT 10

[Diagram: Mixer 0 at the root, two Mixer 1 nodes below it, four Leaf nodes below those, Distributed Storage at the bottom]

• Distributed Storage to Leaf: SELECT state, year; O(Rows ~140M)
• Leaf: WHERE year >= 1980 and year < 1990; COUNT(*) GROUP BY state; output O(50 states)
• Mixer 1: COUNT(*) GROUP BY state; output O(50 states)
• Mixer 0: COUNT(*) GROUP BY state; ORDER BY count_babies DESC; LIMIT 10
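
The partial reduction in the serving tree can be sketched in Python. This is a toy simulation, not Dremel itself: the sample rows, the helper names (`leaf_scan`, `mixer_merge`, `root_finish`), and the two-leaves-per-mixer layout are assumptions made for illustration. The point is that each level forwards at most one count per state, however many rows the leaves scanned.

```python
from collections import Counter

def leaf_scan(rows):
    """Leaf: apply the WHERE filter and emit partial COUNT(*) per state."""
    counts = Counter()
    for state, year in rows:
        if 1980 <= year < 1990:
            counts[state] += 1
    return counts

def mixer_merge(partials):
    """Mixer: add up partial counts; output size is bounded by the number of states."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return merged

def root_finish(merged, limit=10):
    """Root mixer: ORDER BY count DESC (state name breaks ties), then LIMIT."""
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]

# Hypothetical (state, year) rows, one list per leaf.
shards = [
    [("CA", 1981), ("CA", 1985), ("TX", 1979)],
    [("TX", 1982), ("NY", 1989), ("CA", 1990)],
    [("NY", 1980), ("NY", 1984), ("TX", 1983)],
    [("CA", 1988), ("WA", 1986), ("WA", 1995)],
]
leaf_results = [leaf_scan(shard) for shard in shards]
mixer1_a = mixer_merge(leaf_results[:2])
mixer1_b = mixer_merge(leaf_results[2:])
top_states = root_finish(mixer_merge([mixer1_a, mixer1_b]))
print(top_states)
```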


Modeling Data

Example: Daily Weather Station Data

weather_station_data

station  lat    long    mean_temp  humidity  timestamp   year  month  day
9384     33.57  86.75   89.3       .35       1351005129  2011  04     19
2857     36.77  119.72  78.5       .24       1351005135  2011  04     19
3475     40.77  73.98   68         .35       1351015930  2011  04     19
etc...

Example: Daily Weather Station Data (CSV)

station, lat, long, mean_temp, year, mon, day
999999, 36.624, -116.023, 63.6, 2009, 10, 9
911904, 20.963, -156.675, 83.4, 2009, 10, 9
916890, -18133, 178433, 76.9, 2009, 10, 9
943320, -20678, 139488, 73.8, 2009, 10, 9


Organizing BigQuery Tables

Your Source

Data

October 22

October 23

October 24
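
Splitting source data into one table per day can be sketched as follows. This is only an illustration under stated assumptions: the helper names are made up, day boundaries are taken in UTC, and the table naming pattern is modeled on the `oct_24_2012_song_activities` table that appears in the event-data slides.

```python
from collections import defaultdict
from datetime import datetime, timezone

# English month abbreviations, spelled out to avoid locale-dependent strftime output.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

def daily_table_name(unix_ts, suffix="song_activities"):
    """Map a Unix timestamp to a per-day table name (UTC day boundaries assumed)."""
    day = datetime.fromtimestamp(unix_ts, tz=timezone.utc)
    return "{}_{}_{}_{}".format(MONTHS[day.month - 1], day.day, day.year, suffix)

def shard_by_day(events):
    """Group events by the daily table they would be loaded into."""
    tables = defaultdict(list)
    for event in events:
        tables[daily_table_name(event["timestamp"])].append(event)
    return dict(tables)

# Hypothetical events spanning three days.
events = [
    {"user": "Michael", "timestamp": 1350946800},  # 2012-10-22 23:00 UTC
    {"user": "Jim", "timestamp": 1351033200},      # 2012-10-23 23:00 UTC
    {"user": "Michael", "timestamp": 1351065562},  # 2012-10-24 07:59 UTC
]
daily_tables = shard_by_day(events)
print(sorted(daily_tables))
```

A query that only concerns one day then reads only that day's table, which keeps the amount of data scanned down.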


Modeling Event Data: Social Music Store

logs.oct_24_2012_song_activities

USERNAME  ACTIVITY  Cost  SONG           ARTIST      TIMESTAMP
Michael   LISTEN          Too Close      Alex Clare  1351065562
Michael   LISTEN          Gangnam Style  PSY         1351105150
Jim       LISTEN          Complications  Deadmau5    1351075720
Michael   PURCHASE  0.99  Gangnam Style  PSY         1351115962


Users Who Listened to More than 10 Songs/Day

SELECT
  UserId, COUNT(*) AS ListenActivities
FROM
  [logs.oct_24_2012_song_activities]
GROUP EACH BY
  UserId
HAVING
  ListenActivities > 10

How Many Songs Listened to Total by Listeners of PSY?

SELECT
  UserId, COUNT(*) AS ListenActivities
FROM
  [logs.oct_24_2012_song_activities]
WHERE UserId IN (
  SELECT
    UserId
  FROM
    [logs.oct_24_2012_song_activities]
  WHERE artist = 'PSY')
GROUP EACH BY UserId
HAVING
  ListenActivities > 10


Modeling Event Data: Nested and Repeated Values

{"UserID" : "Michael",
 "Listens": [
   {"TrackId":1234,"Title":"Gangnam Style",
    "Artist":"PSY","Timestamp":1351075700},
   {"TrackId":1234,"Title":"Too Close",
    "Artist":"Alex Clare","Timestamp":1351075700}
 ],
 "Purchases": [
   {"Track":2345,"Title":"Gangnam Style",
    "Artist":"PSY","Timestamp":1351075700,"Cost":0.99}
 ]}

JSON
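
A record shaped like the one above can be written out as newline-delimited JSON, one record per line, and its repeated fields can be aggregated per record. The sketch below uses only the Python standard library; the helper names and the sample values are illustrative assumptions, not part of any BigQuery client.

```python
import json

# One user record with repeated (list-valued) Listens and Purchases fields.
record = {
    "UserID": "Michael",
    "Listens": [
        {"TrackId": 1234, "Title": "Gangnam Style", "Artist": "PSY",
         "Timestamp": 1351075700},
        {"TrackId": 1234, "Title": "Too Close", "Artist": "Alex Clare",
         "Timestamp": 1351075700},
    ],
    "Purchases": [
        {"Track": 2345, "Title": "Gangnam Style", "Artist": "PSY",
         "Timestamp": 1351075700, "Cost": 0.99},
    ],
}

def to_ndjson(records):
    """Serialize each record onto a single line, as newline-delimited JSON requires."""
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records)

def listens_by_artist(rec, artist):
    """Count the repeated Listens entries for one artist within a single record."""
    return sum(1 for listen in rec["Listens"] if listen["Artist"] == artist)

ndjson = to_ndjson([record])
psy_listens = listens_by_artist(record, "PSY")
print(ndjson)
print(psy_listens)
```

Counting inside one record, as `listens_by_artist` does, is the kind of per-record aggregation that nested and repeated fields make possible without a join.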

Which Users Have Listened to Beyonce?

SELECT
  UserID,
  COUNT(ListenActivities.artist) WITHIN RECORD
    AS song_count
FROM
  [logs.oct_24_2012_songactivities]
WHERE
  UserID IN (SELECT UserID
             FROM [logs.oct_24_2012_songactivities]
             WHERE ListenActivities.artist = 'Beyonce');


What Position are PSY songs in our Users' Daily Playlists?

SELECT

UserID,

POSITION(ListenActivities.artist)

FROM

[sample_music_logs.oct_24_2012_songactivities]

WHERE

ListenActivities.artist = 'PSY';

Average Position of Songs by PSY in All Daily Playlists?

SELECT
  AVG(POSITION(ListenActivities.artist))
FROM
  [sample_music_logs.oct_24_2012_songactivities],
  [sample_music_logs.oct_23_2012_songactivities],
  /* etc... */
WHERE
  ListenActivities.artist = 'PSY';

Summary: Choosing a BigQuery Data Model

• "Shard" your Data Using Multiple Tables
• Source Data Files
  o CSV format
  o Newline-delimited JSON
• Using Nested and Repeated Records
  o Simplify Some Types of Queries
  o Often Matches Document Database Models


Developing with BigQuery


Google Cloud Storage

Upload Your Data

BigQuery


Load your Data into BigQuery

POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs

"jobReference":{
  "projectId":"605902584318"},
"configuration":{
  "load":{
    "destinationTable":{
      "projectId":"605902584318",
      "datasetId":"my_dataset",
      "tableId":"widget_sales"},
    "sourceUris":[
      "gs://widget-sales-data/2012080100.csv"],
    "schema":{
      "fields":[{
        "name":"widget",
        "type":"string"},
        ...
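
The load request above can also be assembled programmatically. The sketch below only builds and serializes the request body with the Python standard library; it does not send it, since a real call also needs OAuth credentials. The `build_load_job` helper is hypothetical, and the schema lists only the single `widget` field visible on the slide, because the remaining fields are elided there.

```python
import json

def build_load_job(project_id, dataset_id, table_id, source_uris, fields):
    """Build a load-job body mirroring the structure shown on the slide."""
    return {
        "jobReference": {"projectId": project_id},
        "configuration": {
            "load": {
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
                "sourceUris": list(source_uris),
                "schema": {
                    "fields": [{"name": name, "type": kind} for name, kind in fields]
                },
            }
        },
    }

job = build_load_job(
    project_id="605902584318",
    dataset_id="my_dataset",
    table_id="widget_sales",
    source_uris=["gs://widget-sales-data/2012080100.csv"],
    fields=[("widget", "string")],  # only the field shown; the rest are elided in the slide
)
jobs_url = "https://www.googleapis.com/bigquery/v2/projects/{}/jobs".format(
    job["jobReference"]["projectId"]
)
payload = json.dumps(job, indent=2)
print(jobs_url)
print(payload)
```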


Query Away!

POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs

"jobReference":{
  "projectId":"605902584318",
  "query":"SELECT TOP(widget, 50), COUNT(*) AS sale_count FROM widget_sales",
  "maxResults":100,
  "apiVersion":"v2"
}


Libraries

• Python

• Java

• .NET

• Ruby

• JavaScript

• Go

• PHP

• Objective-C

Libraries - Example JavaScript Query

var request = gapi.client.bigquery.jobs.query({
  'projectId': project_id,
  'timeoutMs': '30000',
  // Concatenate the pieces: a quoted JavaScript string cannot span lines.
  'query': 'SELECT state, AVG(mother_age) AS theav ' +
           'FROM [publicdata:samples.natality] ' +
           'WHERE year=2000 AND ever_born=1 ' +
           'GROUP BY state ' +
           'ORDER BY theav DESC;'
});

request.execute(function(response) {
  console.log(response);
  $.each(response.result.rows, function(i, item) {
    ...


Custom Code and the Google Chart Tools API


Google Spreadsheets


Commercial Visualization Tools


Demo: Using BigQuery on BigQuery

BigQuery - Aggregate Big Data Analysis in Seconds

• Full table scans FAST
• Aggregate Queries on Massive Datasets
• Supports Flat and Nested/Repeated Data Models
• It's an API

Get started now:

http://developers.google.com/bigquery/


SELECT questions FROM audience

SELECT 'Thank You!'

FROM jordan

http://developers.google.com/bigquery

Schema definition

birth_record
  parent_id_mother
  parent_id_father
  plurality
  is_male
  race
  weight

parents
  id
  race
  age
  cigarette_use
  state

Schema definition

birth_record
  mother_race
  mother_age
  mother_cigarette_use
  mother_state
  father_race
  father_age
  father_cigarette_use
  father_state
  plurality
  is_male
  race
  weight

Tools to prepare your data

• App Engine MapReduce
• Commercial ETL tools
  o Pervasive
  o Informatica
  o Talend
• UNIX command-line

Schema definition - sharding

birth_record_2011, birth_record_2012 (identical columns)
  mother_race
  mother_age
  mother_cigarette_use
  mother_state
  father_race
  father_age
  father_cigarette_use
  father_state
  plurality
  is_male
  race
  weight

birth_record_2013
birth_record_2014
birth_record_2015
birth_record_2016


Visualizing your Data


BigQuery architecture


“ If you do a table scan over a 1TB table,

you're going to have a bad time. ”

Anonymous

16th century Italian Philosopher-Monk

Goal: Perform a 1 TB table scan in 1 second

Parallelize Parallelize Parallelize!

• Reading 1 TB / second from disk:
  o 10k+ disks
• Processing 1 TB / sec:
  o 5k processors
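
The per-device rates implied by these numbers are simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and an even spread of work across devices; since the slide says "10k+" disks, the per-disk figure is an upper bound.

```python
TB = 10**12  # decimal terabyte, an assumption; the slide does not say which unit
MB = 10**6

scan_bytes_per_second = 1 * TB
disks = 10_000       # "10k+ disks"
processors = 5_000   # "5k processors"

# Throughput each device must sustain if the work is spread evenly.
mb_per_disk = scan_bytes_per_second / disks / MB
mb_per_processor = scan_bytes_per_second / processors / MB

print(mb_per_disk, mb_per_processor)
```

Roughly 100 MB/s per disk and 200 MB/s per processor are modest rates for a single device, which is why the goal becomes reachable once the scan is parallelized widely enough.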


Data access: Column Store

Record Oriented Storage Column Oriented Storage
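
The difference between the two layouts can be sketched with plain Python structures. This is a conceptual illustration, not BigQuery's storage format: the helper names are made up, the sample rows borrow a few columns from the weather example, and "cost" is counted in values touched rather than bytes.

```python
def to_columns(rows):
    """Convert record-oriented rows (list of dicts) to column-oriented lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def values_read(num_rows, num_columns, columns_needed, column_oriented):
    """Values a scan touches: only the needed columns if column-oriented, else every column."""
    if column_oriented:
        return num_rows * columns_needed
    return num_rows * num_columns

# Hypothetical rows; all rows are assumed to share the same keys.
rows = [
    {"station": 9384, "lat": 33.57, "mean_temp": 89.3},
    {"station": 2857, "lat": 36.77, "mean_temp": 78.5},
    {"station": 3475, "lat": 40.77, "mean_temp": 68.0},
]
columns = to_columns(rows)

# An aggregate over one column only needs that column's list.
avg_temp = sum(columns["mean_temp"]) / len(columns["mean_temp"])
row_cost = values_read(len(rows), 3, 1, column_oriented=False)
col_cost = values_read(len(rows), 3, 1, column_oriented=True)
print(avg_temp, row_cost, col_cost)
```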

BigQuery Architecture

[Diagram: serving tree, top to bottom]

• Mixer 0
• Mixer 1 (Shard 0-8), Mixer 1 (Shard 9-16), Mixer 1 (Shard 17-24)
• Shard 0, Shard 10, Shard 12, Shard 20, Shard 24
• Distributed Storage (e.g. GFS)


Running your Queries

BigQuery SQL Example: Simple aggregates

SELECT COUNT(foo), MAX(foo), STDDEV(foo)
FROM ...

BigQuery SQL Example: Complex Processing

SELECT ... FROM ....
WHERE REGEXP_MATCH(url, "\.com$")
  AND user CONTAINS 'test'

BigQuery SQL Example: Nested SELECT

SELECT COUNT(*) FROM
  (SELECT foo ..... )
GROUP BY foo


BigQuery SQL Example: Small JOIN

SELECT huge_table.foo

FROM huge_table

JOIN small_table

ON small_table.foo = huge_table.foo

BigQuery Architecture: Small Join

[Diagram: serving tree, top to bottom]

• Mixer 0
• Mixer 1 (Shard 0-8), Mixer 1 (Shard 17-24)
• Shard 0, Shard 20, Shard 24
• Distributed Storage (e.g. GFS)
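
One common way to run a join against a small table in a sharded system is to send a copy of the small table to every shard and join locally. The slides do not spell out the mechanism, so treat the sketch below as an assumption about how a "small join" could work, with made-up helper names and data. It also assumes `foo` is unique in the small table, so a set lookup gives the same rows as the SQL join.

```python
def shard_join(huge_shard, small_lookup):
    """Join one shard of the huge table against the broadcast copy of the small table."""
    return [row["foo"] for row in huge_shard if row["foo"] in small_lookup]

def broadcast_join(huge_shards, small_table):
    """Every shard receives the same small lookup; results are concatenated upstream."""
    small_lookup = {row["foo"] for row in small_table}
    results = []
    for shard in huge_shards:
        results.extend(shard_join(shard, small_lookup))
    return results

# Hypothetical data: the huge table is split across three shards.
huge_shards = [
    [{"foo": 1}, {"foo": 2}],
    [{"foo": 3}, {"foo": 2}],
    [{"foo": 4}],
]
small_table = [{"foo": 2}, {"foo": 4}]

joined = broadcast_join(huge_shards, small_table)
print(joined)
```

No rows of the huge table move between shards in this scheme, which is what keeps it cheap as long as the small table really is small.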


Other new features!


Batch queries!

• Don't need interactive queries for some jobs?

• priority: "BATCH"
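
As a sketch of where the priority setting goes, the snippet below builds a query job body with the Python standard library. The `build_query_job` helper is hypothetical, and placing `priority` under `configuration.query` reflects my understanding of the v2 jobs resource; check the API reference before relying on it. The request is only constructed here, not sent.

```python
import json

def build_query_job(project_id, sql, batch=False):
    """Build a query-job body; batch=True asks for non-interactive scheduling."""
    return {
        "jobReference": {"projectId": project_id},
        "configuration": {
            "query": {
                "query": sql,
                # Assumed placement of the flag shown on the slide.
                "priority": "BATCH" if batch else "INTERACTIVE",
            }
        },
    }

batch_job = build_query_job(
    "605902584318",
    "SELECT TOP(widget, 50), COUNT(*) AS sale_count FROM widget_sales",
    batch=True,
)
payload = json.dumps(batch_job, indent=2)
print(payload)
```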

That's it

• API
• Column-based datastore
• Full table scans FAST
• Aggregates
• Commercial tool support
• Use cases


SELECT questions FROM audience

SELECT 'Thank You!'

FROM ryan

http://developers.google.com/bigquery

@ryguyrg http://profiles.google.com/ryan.boyd


A Little Later ...

Row  wp_namespace  Revs
1    0             53697002
2    1             6151228
3    3             5519859
4    4             4184389
5    2             3108562
6    10            1052044
7    6             877417
8    14            838940
9    5             651749
10   11            192534
11   100           148135

Underlying table:

• Wikipedia page revision records
• Rows: 314 million
• Byte size: 35.7 GB

Query Stats:

• Scanned 7 GB of data
• <5 seconds
• ~ 100M rows scanned / second

[Diagram: Mixer 0 at the root, two Mixer 1 nodes below it, four Leaf nodes below those, Distributed Storage at the bottom]

• Distributed Storage to Leaf: SELECT wp_namespace, revision_id; 10 GB / s
• Leaf: WHERE timestamp > CUTOFF; COUNT (revision_id) GROUP BY wp_namespace
• Mixer 1: COUNT (revision_id) GROUP BY wp_namespace
• Mixer 0: COUNT (revision_id) GROUP BY wp_namespace; ORDER BY Revs DESC

"Multi-stage" Query

SELECT
  LogEdits, COUNT(contributor_id) Contributors
FROM (
  SELECT
    contributor_id,
    INTEGER(LOG10(COUNT(*))) LogEdits
  FROM [publicdata:samples.wikipedia]
  GROUP EACH BY contributor_id)
GROUP BY LogEdits
ORDER BY LogEdits DESC
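
The two stages of this query can be mirrored in a few lines of Python to make the data flow concrete. This is an in-memory illustration with made-up revision data and a hypothetical helper name; `int(math.log10(n))` stands in for `INTEGER(LOG10(...))`, which is equivalent here because every count is at least 1.

```python
import math
from collections import Counter

def log_edit_histogram(revisions):
    """Stage 1: edits per contributor. Stage 2: contributors per log10 bucket."""
    edits_per_contributor = Counter(contributor for contributor, _rev in revisions)
    histogram = Counter(int(math.log10(n)) for n in edits_per_contributor.values())
    # ORDER BY LogEdits DESC
    return sorted(histogram.items(), reverse=True)

# Hypothetical (contributor_id, revision_id) pairs.
revisions = (
    [("a", i) for i in range(12)]     # 12 edits  -> bucket 1
    + [("b", i) for i in range(3)]    # 3 edits   -> bucket 0
    + [("c", 0)]                      # 1 edit    -> bucket 0
    + [("d", i) for i in range(100)]  # 100 edits -> bucket 2
)
result = log_edit_histogram(revisions)
print(result)
```

The inner stage produces one row per contributor, so its output can be nearly as large as its input; that is the situation `GROUP EACH BY` is meant for.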

[Diagram: Mixer 0 at the root, two Mixer 1 nodes below it, then two Leaf nodes and two Shuffler nodes above Distributed Storage]

• Distributed Storage to Leaf: SELECT contributor_id; GB/s
• Leaf to Shuffler: Shuffle by contributor_id; N^2
• Shuffler: COUNT(*) GROUP BY contributor_id; SELECT LE, Id
• Mixer 1: COUNT(contributor_id) GROUP BY LogEdits
• Mixer 0: COUNT(contributor_id) GROUP BY LogEdits; ORDER BY LogEdits DESC
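
The shuffle step can be sketched as hash partitioning: every leaf routes each key to the shuffler chosen by a hash of that key, so all rows for one contributor meet on the same shuffler and can be counted there. The helper names, the sample data, and the use of CRC32 as the hash are assumptions for illustration only.

```python
import zlib
from collections import Counter

def shuffle(leaf_outputs, num_shufflers):
    """Route every key to one shuffler, chosen by a deterministic hash of the key."""
    buckets = [[] for _ in range(num_shufflers)]
    for rows in leaf_outputs:
        for contributor_id in rows:
            target = zlib.crc32(str(contributor_id).encode("utf-8")) % num_shufflers
            buckets[target].append(contributor_id)
    return buckets

def count_per_key(bucket):
    """Shuffler: COUNT(*) GROUP BY contributor_id over its own disjoint key set."""
    return Counter(bucket)

# Hypothetical contributor_id values read by three leaves.
leaf_outputs = [["u1", "u2", "u1"], ["u2", "u3"], ["u1", "u3", "u3"]]
partials = [count_per_key(bucket) for bucket in shuffle(leaf_outputs, 2)]

# Key sets are disjoint, so the final counts are a plain union of the partials.
totals = Counter()
for partial in partials:
    totals.update(partial)
print(dict(totals))
```

Because any leaf may send rows to any shuffler, the number of leaf-to-shuffler connections grows with the product of the two node counts, which is one plausible reading of the N^2 label in the diagram.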

When to use EACH

• Shuffle definitely adds some overhead
• Poor query performance if used incorrectly
• GROUP BY
  o Groups << Rows: a shuffle would produce an unbalanced load
  o Example: GROUP BY state
• GROUP EACH BY
  o Groups ~ Rows
  o Example: GROUP EACH BY user_id