Big data: analytics for your API with Redis and Google. javier ramirez
javier ramirez @supercoco9
Big Data Analytics with Google BigQuery
javier ramirez @supercoco9 https://teowaki.com nosqlmatters 2013
REST API + AngularJS web as an API client
nobody doubts that your API is technically very good, but...
obvious conclusion
this is going to be a big data problem
the problem is that we didn't know about big data. We had heard of map/reduce, hadoop, cassandra... but we were missing data
big data is doing a full scan over 330MM rows, matching them against a regexp, and getting the result (223MM rows) in just 5 seconds
Javier Ramirez, impressionable teowaki founder
this was impossible two years ago. we live in the future
to be able to run queries on our data, the first step is to extract it from our system, which in our case means extracting the information from user requests as they come in
Apache Hadoop
Apache Cassandra
Apache Spark
Apache Storm
Amazon Redshift
bigdata is cool but...
expensive cluster
hard to set up and monitor
not interactive enough
Our choice:
Google BigQuery
Data analysis as a service
http://developers.google.com/bigquery
Based on Dremel
Specifically designed for interactive queries over petabytes of real-time data
Apache Drill is the open source equivalent. It is not offered as a service.
bigquery is a REST wrapper on top of dremel. Usable from any platform that supports REST. APIs available for different languages
Inserts only! No deletes or updates. Often used alongside map/reduce or hadoop. In-place analysis, with no prior loading, no indexes, and no need to plan queries in advance
What Dremel is used for in Google
Analysis of crawled web documents
Tracking install data for applications on Android Market
Crash reporting for Google products
OCR results from Google Books
Spam analysis
Debugging of map tiles on Google Maps
Tablet migrations in managed Bigtable instances
Results of tests run on Google's distributed build system
Disk I/O statistics for hundreds of thousands of disks
Resource monitoring for jobs run in Google's data centers
Symbols and dependencies in Google's codebase
in BigQuery everything is a full-scan*
*Over a ridiculously fast distributed filesystem.
Dremel design goal: 1TB/sec. It was exceeded
BigQuery delivers ~ 50Gb/Sec.
Columnar storage
Column data is of uniform type; therefore, there are some opportunities for storage size optimizations available in column-oriented data that are not available in row-oriented data.
also less I/O
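The storage and I/O points above can be illustrated with a small sketch. The request log, its column names, and the run-length encoding are illustrative assumptions, not a description of how Dremel actually lays out data on disk.

```python
from itertools import groupby

# Hypothetical API request log, stored row by row.
rows = [
    {"path": "/links", "status": 200, "ms": 12},
    {"path": "/links", "status": 200, "ms": 15},
    {"path": "/teams", "status": 200, "ms": 9},
    {"path": "/teams", "status": 404, "ms": 3},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# A query that only needs "status" reads one column instead of every row.
status_column = columns["status"]

# A uniformly typed, repetitive column compresses well.
encoded_paths = run_length_encode(columns["path"])
```

Reading a single column is also why a scan over one column processes far fewer bytes than a scan over the whole table.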
In addition, Dremel provides a tree structure for dispatching queries
highly distributed execution using a tree
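A minimal sketch of the tree idea, assuming a simple count query: leaf servers scan their own shard and return a partial count, and each parent adds up what its children report. The tree shape, the shard contents, and the function names are hypothetical; the real system is considerably more involved.

```python
def leaf_count(shard, predicate):
    """A leaf scans its own shard and returns a partial count."""
    return sum(1 for value in shard if predicate(value))


def tree_count(node, predicate):
    """Aggregate partial counts up the tree.

    A node is either a list of values (a leaf shard) or a tuple of child nodes.
    """
    if isinstance(node, tuple):
        return sum(tree_count(child, predicate) for child in node)
    return leaf_count(node, predicate)


def has_digit(text):
    """Return True when the string contains at least one digit."""
    return any(ch.isdigit() for ch in text)


# Hypothetical layout: one root, two intermediate servers, four leaf shards.
tree = (
    (["a1", "b"], ["c2", "d3"]),
    (["e"], ["f4", "g", "h5"]),
)

total = tree_count(tree, has_digit)
```

Each level only handles small partial results, so the expensive scanning stays spread across the leaves.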
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
batch and real time, both for data input (files or stream) and for output (interactive or batch)
you pay for what you use
loading data
You can feed flat CSV-like files or nested JSON objects
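For the nested JSON case, a load file is typically written as newline-delimited JSON, with one object per line. The sketch below only prepares such a payload; the event fields and the nested `user` record are made up for illustration.

```python
import json

# Hypothetical API request events with a nested "user" record.
events = [
    {"path": "/links", "status": 200, "user": {"id": 1, "team": "javier-community"}},
    {"path": "/teams", "status": 404, "user": {"id": 2, "team": "other"}},
]


def to_ndjson(records):
    """Serialize records as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records) + "\n"


ndjson = to_ndjson(events)
```

Flat data without nesting can stay in the simpler CSV-like form.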
bq cli
bq load --nosynchronous_mode --encoding UTF-8 --field_delimiter 'tab' --max_bad_records 100 --source_format CSV api.stats 20131014T11-42-05Z.gz
loading can be from a flat file (TSV to avoid quoting problems) or JSON if you need structure. Import from the web console, REST, or the command line
compressed files can be imported
information can also be imported in real time in streaming mode
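As a sketch of how a file like the one passed to `bq load` above could be produced, the snippet below writes tab-separated rows and gzips them. The column layout (timestamp, path, status, elapsed milliseconds) is an assumption; the actual schema of `api.stats` is not shown in the slides.

```python
import csv
import gzip
import io


def write_tsv_gz(rows):
    """Return gzip-compressed, tab-separated bytes for the given rows."""
    text = io.StringIO()
    writer = csv.writer(text, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return gzip.compress(text.getvalue().encode("utf-8"))


# Hypothetical columns: timestamp, path, status, elapsed milliseconds.
rows = [
    ["2013-10-14T11:42:05Z", "/links", 200, 12],
    ["2013-10-14T11:42:06Z", "/teams", 404, 3],
]

payload = write_tsv_gz(rows)
# In practice these bytes would be saved to a .gz file and loaded with
# --field_delimiter 'tab' and --source_format CSV.
```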
NEXT: BIGQUERY CONCEPTS
web console screenshot
web console
REST API
command line
Notice the validate button to avoid expenses
analytical SQL functions
correlations
window functions
views
JSON fields
timestamped tables
next: full scan regexp
Things you always wanted to try but were too scared to
select count(*) from publicdata:samples.wikipedia where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0;
223,163,387
Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32)
total 313,797,035
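One detail worth noting about the pattern in the query above: `[0-9]*` also matches the empty string, so under partial-match semantics it accepts every non-null title, and the reduction from the total to the result would come from `wp_namespace = 0`. The sketch uses Python's `re.search` as a stand-in for partial matching; that `REGEXP_MATCH` behaves the same way is an assumption here, and the titles are made up.

```python
import re


def count_matches(titles, pattern):
    """Count titles where the pattern matches anywhere in the string."""
    compiled = re.compile(pattern)
    return sum(1 for title in titles if compiled.search(title))


# Made-up titles, two of which contain digits.
titles = ["Madrid", "1984", "Apollo 11", "Redis"]

# "*" allows zero digits, so every title matches.
zero_or_more_digits = count_matches(titles, r"[0-9]*")

# "+" requires at least one digit.
one_or_more_digits = count_matches(titles, r"[0-9]+")
```

Either way the engine still has to evaluate the expression against every row, which is the point of the demo.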
Global Database of Events, Language and Tone
quarter billion rows
30 years
updated daily
http://gdeltproject.org/data.html#googlebigquery
SELECT Year, Actor1Name, Actor2Name, Count FROM (
  SELECT Actor1Name, Actor2Name, Year, COUNT(*) Count,
    RANK() OVER(PARTITION BY YEAR ORDER BY Count DESC) rank
  FROM
    (SELECT Actor1Name, Actor2Name, Year FROM [gdelt-bq:full.events]
     WHERE Actor1Name < Actor2Name and Actor1CountryCode != '' and Actor2CountryCode != '' and Actor1CountryCode!=Actor2CountryCode),
    (SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year FROM [gdelt-bq:full.events]
     WHERE Actor1Name > Actor2Name and Actor1CountryCode != '' and Actor2CountryCode != '' and Actor1CountryCode!=Actor2CountryCode),
  WHERE Actor1Name IS NOT null
  AND Actor2Name IS NOT null
  GROUP EACH BY 1, 2, 3
  HAVING Count > 100
)
WHERE rank=1
ORDER BY Year
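The logic of the query above can be sketched on a small in-memory sample: order each actor pair so that (A, B) and (B, A) count together, count pairs per year, drop infrequent pairs, and keep the top pair for each year. The function name, the tuple layout, and the sample events are hypothetical, and unlike `RANK()` this sketch keeps a single pair when several tie for first place.

```python
from collections import Counter


def top_pair_per_year(events, min_count=100):
    """Return {year: (actor_a, actor_b, count)} for the most frequent pair.

    `events` is an iterable of (year, actor1, country1, actor2, country2).
    """
    counts = Counter()
    for year, actor1, country1, actor2, country2 in events:
        # Skip missing or identical actors, as the name filters do.
        if not actor1 or not actor2 or actor1 == actor2:
            continue
        # Skip missing country codes and pairs from the same country.
        if not country1 or not country2 or country1 == country2:
            continue
        # Order the pair so (A, B) and (B, A) count as the same pair.
        pair = tuple(sorted((actor1, actor2)))
        counts[(year, pair)] += 1

    best = {}
    for (year, pair), count in counts.items():
        if count > min_count and (year not in best or count > best[year][2]):
            best[year] = (pair[0], pair[1], count)
    return best


# Made-up sample events; min_count is lowered to fit the tiny sample.
sample = [
    (1979, "UNITED STATES", "USA", "RUSSIA", "RUS"),
    (1979, "RUSSIA", "RUS", "UNITED STATES", "USA"),
    (1979, "UNITED STATES", "USA", "CHINA", "CHN"),
]
result = top_pair_per_year(sample, min_count=1)
```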
Automation with Apps Script
Read from bigquery
Create a spreadsheet on Drive
E-mail it every day as a PDF
what is it being used for?
Analysing weather information
Finding patterns in e-commerce
Match online/offline behaviour
Log analysis
Analysing inventory/booking data...
bigquery pricing
$80 per stored TB
1,000,000 rows => $0.02288 / month
$35 per processed TB
1 full scan ~ 240 MB
1 count = 0 MB
1 full scan over 1 column ~ 13 MB
10 GB => $0.35 / month*
*the 1st TB processed every month is free of charge
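The arithmetic behind these figures can be sketched as below. The rates are the ones on the slide; treating a TB as 10^12 bytes is an assumption, chosen because it reproduces the $0.35 figure for 10 GB. With the same rate, the 9.13 GB processed by the Wikipedia regexp query comes to roughly $0.32.

```python
STORAGE_USD_PER_TB_MONTH = 80.0  # from the slide
QUERY_USD_PER_TB = 35.0          # from the slide
FREE_QUERY_TB_PER_MONTH = 1.0    # from the slide footnote
BYTES_PER_TB = 10 ** 12          # assumption: decimal terabytes


def query_cost_usd(bytes_processed, apply_free_tier=False):
    """Cost of the bytes processed by queries during one month."""
    terabytes = bytes_processed / BYTES_PER_TB
    if apply_free_tier:
        # The first TB processed each month is not billed.
        terabytes = max(0.0, terabytes - FREE_QUERY_TB_PER_MONTH)
    return terabytes * QUERY_USD_PER_TB


def storage_cost_usd(bytes_stored):
    """Monthly cost of keeping the given number of bytes stored."""
    return bytes_stored / BYTES_PER_TB * STORAGE_USD_PER_TB_MONTH


ten_gb_cost = query_cost_usd(10 * 10 ** 9)
regexp_scan_cost = query_cost_usd(9.13 * 10 ** 9)
ten_gb_with_free_tier = query_cost_usd(10 * 10 ** 9, apply_free_tier=True)
```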
Find related links at
https://teowaki.com/teams/javier-community/link-categories/bigquery-talk
Thanks
Javier Ramírez @supercoco9