GoDataDriven, proudly part of the Xebia Group
Real time data driven applications and SQL vs NoSQL databases
Giovanni Lanzani, Data Whisperer
Real-time, data driven app?
• Not just store and retrieve;
• Store, {transform, enrich, analyse} and retrieve;
• Real-time: retrieve is not a batch process;
• App: something your mother could use:
SELECT attendees FROM NoSQLMatters WHERE password = '1234';
Is it Big Data?
Everybody talks about it.
Nobody knows how to do it.
Everyone thinks everyone else is doing it, so everyone claims they’re doing it…
Dan Ariely
4. Real-Time Retrieval
• Harder than it looks;
• Large data;
• Retrieval is by giving date, center location + radius.
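The "center location + radius" part of a retrieval can be sketched in plain Python. This is a minimal illustration, not the method used in the talk: it assumes each postcode has a known centroid, and the `CENTROIDS` coordinates, `haversine_km` and `postcodes_within` are hypothetical names and values made up for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def postcodes_within(centroids, center, radius_km):
    """Return the sorted postcodes whose centroid lies within radius_km of center."""
    lat0, lon0 = center
    return sorted(
        pc for pc, (lat, lon) in centroids.items()
        if haversine_km(lat0, lon0, lat, lon) <= radius_km
    )


# Hypothetical centroids, for illustration only (not real postcode locations).
CENTROIDS = {
    "1234AB": (52.370, 4.895),
    "1234AC": (52.380, 4.900),
    "5555ZB": (51.440, 5.470),
}

nearby = postcodes_within(CENTROIDS, center=(52.373, 4.893), radius_km=10)
```

The resulting postcode list is what a query then has to match against, together with the requested dates.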
Data Example
date        hour  id_activity  postcode  hits  delta  sbi
2013-01-01  12    1234         1234AB    35    22     1
2013-01-08  12    1234         1234AB    45    35     1
2013-01-01  11    2345         5555ZB    2     1      2
2013-01-08  11    2345         5555ZB    55    2      2
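As a rough sketch of what retrieval over these rows looks like in memory, the sample can be parsed into typed records and filtered by a set of dates and a set of postcodes. The `Row`, `parse` and `select` names are hypothetical, and treating every column except `date` and `postcode` as an integer is an assumption based only on the four rows shown.

```python
from collections import namedtuple

# Column names are taken from the sample's header row.
Row = namedtuple("Row", "date hour id_activity postcode hits delta sbi")

SAMPLE = """\
2013-01-01 12 1234 1234AB 35 22 1
2013-01-08 12 1234 1234AB 45 35 1
2013-01-01 11 2345 5555ZB 2 1 2
2013-01-08 11 2345 5555ZB 55 2 2
"""


def parse(text):
    """Parse whitespace-separated sample lines into Row records."""
    rows = []
    for line in text.splitlines():
        date, hour, id_activity, postcode, hits, delta, sbi = line.split()
        # Assumption: all columns except date and postcode are integers.
        rows.append(Row(date, int(hour), int(id_activity), postcode,
                        int(hits), int(delta), int(sbi)))
    return rows


def select(rows, dates, postcodes):
    """Keep the rows whose date and postcode are both in the requested sets."""
    dates, postcodes = set(dates), set(postcodes)
    return [r for r in rows if r.date in dates and r.postcode in postcodes]


rows = parse(SAMPLE)
matches = select(rows, dates=["2013-01-08"], postcodes=["1234AB", "5555ZB"])
```

Set membership keeps each row check cheap, but the whole dataset still has to be loaded and scanned.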
Who has my data?
• First iteration was a (pre-)POC, with less data (3 GB vs 500 GB);
• Time constraints;
• Oops: everything is a pandas df!
Advantage of “everything is a df”
Pro:
• Fast!!
• Use what you know
• NO DBAs!
• We all love CSVs!
Contra:
• Doesn’t scale;
• Huge startup time;
• NO DBAs!
• We all hate CSVs!
Issues?!
• With a radius of 10 km, in Amsterdam, you get 10k postcodes. You need to do this in your SQL:
SELECT * FROM datapoints WHERE date IN date_array AND postcode IN postcode_array;
• Index on date and postcode, but single queries running more than 20 minutes.
Postgres + PostGIS (2.x)
PostGIS is a spatial database extender for PostgreSQL. Supports geographic objects allowing location queries:
SELECT * FROM datapoints
WHERE ST_DWithin(lon, lat, 1500)
AND dates IN ('2013-02-30', '2013-02-31');
-- every point within 1.5km
-- from (lat, lon) on imaginary dates
How we solved it
1. Align data on disk by date;
2. Use the temporary table trick:
CREATE TEMPORARY TABLE tmp (postcodes TEXT NOT NULL PRIMARY KEY);
INSERT INTO tmp (postcodes) VALUES postcode_array;
SELECT * FROM tmp JOIN datapoints d ON d.postcode = tmp.postcodes WHERE d.dt IN dates_array;
3. Lose precision: 1234AB→1234
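The temporary table trick can be tried end to end with a small sketch. It makes several assumptions: SQLite (via Python's `sqlite3`) stands in for PostgreSQL so the example runs anywhere, the `datapoints` schema is guessed from the sample rows, and the `fetch` helper is a hypothetical name. It shows the shape of the queries only, not their performance on 500 GB.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema, modelled on the sample rows; `dt` holds the date.
conn.execute(
    "CREATE TABLE datapoints (dt TEXT, hour INTEGER, id_activity INTEGER, "
    "postcode TEXT, hits INTEGER, delta INTEGER, sbi INTEGER)"
)
conn.execute("CREATE INDEX idx_dt_postcode ON datapoints (dt, postcode)")
conn.executemany(
    "INSERT INTO datapoints VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("2013-01-01", 12, 1234, "1234AB", 35, 22, 1),
        ("2013-01-08", 12, 1234, "1234AB", 45, 35, 1),
        ("2013-01-01", 11, 2345, "5555ZB", 2, 1, 2),
        ("2013-01-08", 11, 2345, "5555ZB", 55, 2, 2),
    ],
)


def fetch(conn, postcodes, dates):
    """Join datapoints against a temporary postcode table instead of a long IN list."""
    conn.execute("CREATE TEMPORARY TABLE tmp (postcodes TEXT NOT NULL PRIMARY KEY)")
    try:
        # De-duplicate first: the primary key rejects repeated postcodes.
        conn.executemany(
            "INSERT INTO tmp (postcodes) VALUES (?)", [(p,) for p in set(postcodes)]
        )
        placeholders = ", ".join("?" for _ in dates)
        sql = (
            "SELECT d.dt, d.postcode, d.hits "
            "FROM tmp JOIN datapoints d ON d.postcode = tmp.postcodes "
            f"WHERE d.dt IN ({placeholders}) "
            "ORDER BY d.dt, d.postcode"
        )
        return conn.execute(sql, list(dates)).fetchall()
    finally:
        # Drop the table so the next call can recreate it.
        conn.execute("DROP TABLE tmp")


result = fetch(conn, ["1234AB"], ["2013-01-01", "2013-01-08"])
```

The postcode list becomes a keyed table the planner can join on, while the (much shorter) date list stays an ordinary `IN` clause.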
Take home messages
1. Geospatial problems are hard and queries can be really slow;
2. Not everybody has infinite resources: be smart and KISS!
3. SQL or NoSQL? (Size, schema)
We’re hiring / Questions? / Thank you!
@gglanzani [email protected]
Giovanni Lanzani Data Whisperer