Big Data Analytics with Google BigQuery. By Javier Ramirez. All your base Conference 2014

Big data: analytics for your API with Redis and Google. Javier Ramirez

javier ramirez @supercoco9

Big Data Analytics with Google BigQuery

javier ramirez @supercoco9 https://teowaki.com nosqlmatters 2013

REST API + AngularJS web as an API client

nobody doubts that your API is technically very good, but...


obvious conclusion

this is going to be a big data problem

the problem is that we knew nothing about big data. We had heard of map/reduce, hadoop, cassandra... but we lacked data

big data is doing a full scan over 330MM rows, matching them against a regexp, and getting the result (223MM rows) in just 5 seconds


Javier Ramirez, impressionable teowaki founder

two years ago this was impossible. we live in the future

to run queries over our data, the first step is to extract it from our system, which in our case means extracting the information from user requests as they happen

Apache Hadoop
Apache Cassandra

Apache Spark
Apache Storm

Amazon Redshift


bigdata is cool but...

expensive cluster

hard to set up and monitor

not interactive enough

Our choice:

Google BigQuery

Data analysis as a service

http://developers.google.com/bigquery


Based on Dremel

Specifically designed for interactive queries over petabytes of real-time data


Apache Drill is the open source equivalent. It is not offered as a service.

bigquery is a REST wrapper on top of dremel. Usable from any platform that supports REST. APIs available for different languages

Insert only! No deletes or updates. Often used alongside map/reduce or hadoop. In-place analysis, with no prior loading, no indexes and no need to plan queries in advance

What Dremel is used for in Google

Analysis of crawled web documents
Tracking install data for applications on Android Market
Crash reporting for Google products
OCR results from Google Books
Spam analysis
Debugging of map tiles on Google Maps
Tablet migrations in managed Bigtable instances
Results of tests run on Google's distributed build system
Disk I/O statistics for hundreds of thousands of disks
Resource monitoring for jobs run in Google's data centers
Symbols and dependencies in Google's codebase


in BigQuery everything is a full-scan*

*Over a ridiculously fast distributed filesystem.
Dremel design goal: 1TB/sec. It was exceeded

BigQuery delivers ~ 50Gb/Sec.

next: full scan regexp

Columnar storage


Column data is of uniform type; therefore, there are some opportunities for storage size optimizations available in column-oriented data that are not available in row-oriented data.

also less I/O
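
As a rough illustration of why a column layout means less I/O, the sketch below stores the same records row-wise and column-wise and counts how many values a query touching a single field has to read. The records and field names are made up for the example, and it says nothing about how Dremel actually lays data out on disk.

```python
# Toy comparison of row-oriented vs column-oriented layouts.
# The records and field names are hypothetical.
rows = [
    {"path": "/links", "status": 200, "ms": 12},
    {"path": "/teams", "status": 404, "ms": 7},
    {"path": "/links", "status": 200, "ms": 31},
]

# Column-oriented copy: one list per field, each of uniform type.
columns = {name: [r[name] for r in rows] for name in rows[0]}


def values_read_row_store(rows, field):
    # A row store has to touch every field of every row,
    # even when the query only needs `field`.
    return sum(len(r) for r in rows)


def values_read_column_store(columns, field):
    # A column store only touches the requested column.
    return len(columns[field])


row_reads = values_read_row_store(rows, "ms")
col_reads = values_read_column_store(columns, "ms")
avg_ms = sum(columns["ms"]) / len(columns["ms"])
print(row_reads, col_reads, avg_ms)
```

With three fields per record, the column layout reads a third of the values for a single-field aggregate; the gap widens as tables get wider.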

Dremel also provides a tree structure for dispatching queries

highly distributed execution using a tree
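
A minimal sketch of the tree idea, under the assumption that leaves scan their own shard and parent nodes only merge partial results. The shard contents, the fan-out and the function names are hypothetical; the real serving tree is far more involved.

```python
# Toy model of tree-shaped query execution: leaves scan shards,
# intermediate nodes merge partial counts, the root returns the total.
import re


def leaf_count(shard, pattern):
    # Each leaf scans only its own shard of titles.
    return sum(1 for title in shard if pattern.search(title))


def merge(partials, fan_out=2):
    # Combine partial counts level by level until one value remains.
    while len(partials) > 1:
        partials = [sum(partials[i:i + fan_out])
                    for i in range(0, len(partials), fan_out)]
    return partials[0]


shards = [["Area 51", "Madrid"], ["1984", "Kiev"], ["Apollo 11"], ["Redis"]]
pattern = re.compile(r"[0-9]+")

total = merge([leaf_count(s, pattern) for s in shards])
print(total)
```

Because counts are associative, each level of the tree can sum whatever its children return without seeing the underlying rows.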

javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14

batch and real time, both on data input (files or stream) and on output (interactive or batch)

you pay for what you use

loading data

You can feed flat CSV-like files or nested JSON objects
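
To make the two input shapes concrete, here is a small sketch that serialises one request record both ways: as a flat tab-separated row and as a nested JSON object on a single line. The record and its field names are invented for the example, not taken from the talk's actual schema.

```python
import csv
import io
import json

# Hypothetical request record; field names are assumptions.
request = {
    "ts": "2013-10-14T11:42:05Z",
    "path": "/links",
    "status": 200,
    "user": {"id": 42, "team": "javier-community"},
}

# Flat: one tab-separated row, nested values flattened by hand.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerow([request["ts"], request["path"],
                 request["status"], request["user"]["id"]])
flat_line = buf.getvalue()

# Nested: one JSON object per line keeps the structure intact.
json_line = json.dumps(request, sort_keys=True) + "\n"

print(flat_line, json_line, sep="")
```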




bq cli

bq load --nosynchronous_mode --encoding UTF-8 --field_delimiter 'tab' --max_bad_records 100 --source_format CSV api.stats 20131014T11-42-05Z.gz

data can be loaded from a flat file (TSV to avoid quoting problems) or from JSON if you need structure. Import from the web console, REST or the command line

compressed files can be imported
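
A sketch of producing the kind of gzip-compressed, tab-separated file that a load like the one above would take. The rows, column order and file name are hypothetical; only the standard library is used.

```python
import gzip
import os
import tempfile

# Hypothetical request stats: (timestamp, path, status).
rows = [
    ("2013-10-14T11:42:05Z", "/links", 200),
    ("2013-10-14T11:42:06Z", "/teams", 404),
]

path = os.path.join(tempfile.mkdtemp(), "stats.tsv.gz")

# Write one tab-separated line per row, compressed on the fly.
with gzip.open(path, "wt", encoding="utf-8", newline="") as fh:
    for row in rows:
        fh.write("\t".join(str(v) for v in row) + "\n")

# Read it back to check the round trip.
with gzip.open(path, "rt", encoding="utf-8") as fh:
    lines = fh.read().splitlines()

print(lines)
```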

data can also be imported in streaming mode for real-time use

NEXT: BIGQUERY CONCEPTS

web console screenshot


web console
REST API
command line

Notice the validate button to avoid expenses


analytical SQL functions
correlations
window functions
views
JSON fields
timestamped tables

next: full scan regexp

Things you always wanted to try but were too scared to


select count(*) from publicdata:samples.wikipedia where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0;

223,163,387
Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32)

total: 313,797,035
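
One thing worth checking before reading too much into the count: the pattern `[0-9]*` also matches the empty string, so under partial-match semantics it matches every non-null title. The sketch below shows this with Python's `re.search`. Assuming BigQuery's REGEXP_MATCH behaves the same way (an assumption, not verified here), most of the filtering in the query above would come from `wp_namespace = 0`, and `[0-9]+` would be the pattern that selects titles containing a digit. The titles are made up.

```python
import re

# Hypothetical page titles, including an empty one.
titles = ["Madrid", "Area 51", "1984", ""]

# `*` allows zero digits, so the pattern matches at position 0 of any string.
star = [t for t in titles if re.search(r"[0-9]*", t)]

# `+` requires at least one digit somewhere in the title.
plus = [t for t in titles if re.search(r"[0-9]+", t)]

print(len(star), len(plus))
```

Either way the scan still reads the whole `title` column, so the timing on the slide is unaffected.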

Global Database of Events, Language and Tone

quarter billion rows
30 years
updated daily

http://gdeltproject.org/data.html#googlebigquery


SELECT Year, Actor1Name, Actor2Name, Count FROM (
  SELECT Actor1Name, Actor2Name, Year, COUNT(*) Count,
         RANK() OVER(PARTITION BY YEAR ORDER BY Count DESC) rank
  FROM
    (SELECT Actor1Name, Actor2Name, Year
     FROM [gdelt-bq:full.events]
     WHERE Actor1Name < Actor2Name
       and Actor1CountryCode != '' and Actor2CountryCode != ''
       and Actor1CountryCode!=Actor2CountryCode),
    (SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year
     FROM [gdelt-bq:full.events]
     WHERE Actor1Name > Actor2Name
       and Actor1CountryCode != '' and Actor2CountryCode != ''
       and Actor1CountryCode!=Actor2CountryCode),
  WHERE Actor1Name IS NOT null
    AND Actor2Name IS NOT null
  GROUP EACH BY 1, 2, 3
  HAVING Count > 100
)
WHERE rank=1
ORDER BY Year
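
The core of that query is: treat (A, B) and (B, A) as the same pair, count pairs per year, and keep the most frequent pair for each year. Here is a small in-memory sketch of that logic with invented events. It leaves out the country-code filters and the `Count > 100` threshold, and on a tie it keeps one pair where `RANK()` would return all of them.

```python
from collections import Counter

# Hypothetical (year, actor1, actor2) events.
events = [
    (1979, "USA", "IRAN"),
    (1979, "IRAN", "USA"),
    (1979, "USA", "CHINA"),
    (1980, "IRAQ", "IRAN"),
    (1980, "IRAN", "IRAQ"),
    (1980, "IRAN", "IRAQ"),
    (1980, "USA", "USA"),
]

counts = Counter()
for year, a1, a2 in events:
    if a1 == a2:
        # The SQL's `<` and `>` branches both drop identical names.
        continue
    # Order the names so both directions land on the same key.
    pair = tuple(sorted((a1, a2)))
    counts[(year,) + pair] += 1

# Keep the most frequent pair per year (the `rank = 1` rows).
top = {}
for (year, a1, a2), n in counts.items():
    if year not in top or n > top[year][2]:
        top[year] = (a1, a2, n)

for year in sorted(top):
    print(year, *top[year])
```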





Automation with Apps Script

Read from bigquery

Create a spreadsheet on Drive

E-mail it every day as a PDF





what is it being used for?

Analysing weather information

Finding patterns in e-commerce

Match online/offline behaviour

Log analysis

Analysing inventory/booking data...

bigquery pricing

$80 per stored TB
1,000,000 rows => $0.02288 / month

$35 per processed TB
1 full scan ~ 240 MB
1 count = 0 MB
1 full scan over 1 column ~ 13 MB
10 GB => $0.35 / month*
*the 1st TB processed every month is free of charge
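
The arithmetic behind those figures can be sketched as below. The two prices come from the slide; the decimal conversion (1 TB = 1000 GB) is an assumption, the function names are invented, and the free first TB per month is deliberately ignored to keep the numbers comparable with the slide.

```python
PRICE_PER_STORED_TB = 80.0      # USD per TB per month, from the slide
PRICE_PER_PROCESSED_TB = 35.0   # USD per TB processed, from the slide
GB_PER_TB = 1000                # assumption: decimal units


def processing_cost(gb_processed):
    # Cost of the bytes scanned by queries, ignoring the free first TB.
    return gb_processed / GB_PER_TB * PRICE_PER_PROCESSED_TB


def storage_cost(gb_stored):
    # Monthly cost of keeping data in the table.
    return gb_stored / GB_PER_TB * PRICE_PER_STORED_TB


ten_gb = processing_cost(10)        # the slide's "10 GB" example
full_scan = processing_cost(0.240)  # one ~240 MB full scan
one_column = processing_cost(0.013) # one ~13 MB single-column scan

print(round(ten_gb, 4), round(full_scan, 4), round(one_column, 6))
```

Since only the columns a query references are scanned, narrowing the select list is the most direct way to bring the processed-bytes cost down.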


Find related links at

https://teowaki.com/teams/javier-community/link-categories/bigquery-talk

Thanks

Javier Ramírez @supercoco9
