Geo Data Analytics

Page 1: Geo data analytics

Geo Data Analytics

Page 2: Geo data analytics

@dmarcous

● DBA (@IDF)

● Big Data Professional (@IDF)

● Data Wizard - Magic with Data (@Google - Waze)

Page 3: Geo data analytics

● Pure professional
● Best practices
● Tools
● Tips & Tricks
● Free Advice!

Page 4: Geo data analytics

Agenda

● Why?
● Common Language
● Problems at scale
● Solutions at scale
● Tips & Tricks for scientists (/Wizards)
● Art
● Keep an eye out for…
● Dog Pictures

Page 5: Geo data analytics

Why Does Geo Data Matter?

Page 6: Geo data analytics
Page 7: Geo data analytics
Page 8: Geo data analytics

● C/C++, GEOS: http://trac.osgeo.org/geos

● C#, NTS: http://code.google.com/p/nettopologysuite/

● Java, JTS:

○ http://tsusiatsoftware.net/jts/main.html

○ http://www.vividsolutions.com/jts/JTSHome.htm

● Python, shapely: https://github.com/Toblerity/Shapely

● Ruby, ffi-geos: https://github.com/dark-panda/ffi-geos

● Javascript, JSTS: http://github.com/bjornharrtell/jsts

Page 9: Geo data analytics

Geometry Object Model

Page 10: Geo data analytics

Geospatial Operations

Page 11: Geo data analytics

Formats

● WKT / WKB - Well-Known Text / Well-Known Binary, markup for vector geometries
  ○ POLYGON((34.807841777801514 32.164333053441936,34.81168270111084 32.164859820966136,34.81337785720825 32.1613540349589,34.80865716934204 32.16046394346568,34.807841777801514 32.164333053441936))
  ○ http://arthur-e.github.io/Wicket/sandbox-gmaps3.html
● GeoJSON
  ○ { "type": "FeatureCollection", "features": [{ "type": "Feature", "properties": { "Name": "Verint", "Guest": "dmarcous", "Accomodations": "Beer; Pizza" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 34.807841777801514, 32.164333053441936 ], [ 34.81168270111084, 32.164859820966136 ], [ 34.81337785720825, 32.1613540349589 ], [ 34.80865716934204, 32.16046394346568 ], [ 34.807841777801514, 32.164333053441936 ]]]}}]}
  ○ http://geojson.io/#map=17/32.16267/34.81061
● Shape Files - ESRI vector format
● GML - The Geography Markup Language (GML) is an XML grammar for expressing geographical features.
● Raster - Grid of cells (pixels) georeferenced to coordinates
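The WKT polygon and the GeoJSON geometry on this slide describe the same ring, so converting between the two is mostly a matter of re-bracketing coordinates. The following is a minimal sketch using only the Python standard library; the function name `wkt_polygon_to_geojson` is hypothetical, and it only handles a single-ring `POLYGON` (no holes, no other geometry types). For real work a library such as shapely is the better choice.

```python
import json
import re


def wkt_polygon_to_geojson(wkt: str) -> dict:
    """Convert a single-ring WKT POLYGON into a GeoJSON Polygon geometry."""
    # [^()]* rejects polygons with holes, which this sketch does not support.
    match = re.fullmatch(r"\s*POLYGON\s*\(\(([^()]*)\)\)\s*", wkt, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("expected a single-ring POLYGON((x y, x y, ...))")
    ring = []
    for pair in match.group(1).split(","):
        x, y = pair.split()  # WKT order is "x y", i.e. longitude then latitude
        ring.append([float(x), float(y)])
    if ring[0] != ring[-1]:
        raise ValueError("polygon ring must be closed (first point == last point)")
    return {"type": "Polygon", "coordinates": [ring]}


if __name__ == "__main__":
    wkt = ("POLYGON((34.807841777801514 32.164333053441936,"
           "34.81168270111084 32.164859820966136,"
           "34.81337785720825 32.1613540349589,"
           "34.80865716934204 32.16046394346568,"
           "34.807841777801514 32.164333053441936))")
    print(json.dumps(wkt_polygon_to_geojson(wkt)))
```

Both formats put longitude before latitude, which is the opposite of the "lat, lng" order many mapping APIs expect.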

Page 12: Geo data analytics

Databases

● RDBMS
  ○ Postgres (PostGIS)
  ○ MS-SQL / DB2 / Oracle
● NoSQL
  ○ MongoDB
  ○ IBM Cloudant
  ○ Lucene spatial module (elastic / solr)
● Pure Geospatial Database
  ○ CartoDB (open source / hosted)
  ○ GeoMesa (Accumulo)
    ■ GeoTrellis - Scala framework for processing raster data

Page 13: Geo data analytics

GIS Systems

List of most popular ones - http://en.wikipedia.org/wiki/List_of_geographic_information_systems_software

QGIS, TileMill, GRASS

Page 14: Geo data analytics
Page 15: Geo data analytics

Problem?

● Non scalar data types
  ○ Aggregating
  ○ Sharding
  ○ Unordered
● Speed & Accuracy
  ○ The physical world is non-Euclidean

http://www.jandrewrogers.com/2015/03/02/geospatial-databases-are-hard/
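To make the "non-Euclidean" point concrete, here is a small sketch comparing a naive planar distance on raw degrees with the great-circle (haversine) distance. The function names are hypothetical, and the haversine formula itself assumes a perfectly spherical Earth with a mean radius of 6371.0088 km, so it is also an approximation.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical-Earth assumption


def naive_degree_distance(lat1, lon1, lat2, lon2):
    """Euclidean distance on raw degrees: treats the globe as a flat grid."""
    return sqrt((lat2 - lat1) ** 2 + (lon2 - lon1) ** 2)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    # min() guards against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))


if __name__ == "__main__":
    # One degree of longitude, measured at the equator and at 60 degrees north.
    for lat in (0.0, 60.0):
        print(lat,
              naive_degree_distance(lat, 0.0, lat, 1.0),
              round(haversine_km(lat, 0.0, lat, 1.0), 1))
```

The naive distance is identical in both cases, while the real distance at 60 degrees north is roughly half of the distance at the equator. This is one reason a fixed "radius in degrees" gives different physical radii at different latitudes.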

Page 16: Geo data analytics

Solution

Page 17: Geo data analytics

Data Structures

● R-Tree (PostGIS, actually R+Tree)
● Quad Tree (DB2)
● Hyperdimensional Hashing
● Space Filling Curves
  ○ Z Order Curve (MS-SQL)
  ○ Hilbert Curve
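As an illustration of the space-filling-curve idea, the sketch below builds a Z-order (Morton) key by quantizing longitude and latitude onto a grid and interleaving the bits of the two cell indices. The function names and the 16-bits-per-axis default are assumptions made for the example; this is not how any particular database lays out its index.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y: x in even positions, y in odd."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def z_order_key(lon: float, lat: float, bits: int = 16) -> int:
    """Map a lon/lat pair to a single integer along a Z-order curve."""
    cells = 1 << bits
    # Quantize each axis to a cell index, clamping the upper edge into range.
    col = min(cells - 1, int((lon + 180.0) / 360.0 * cells))
    row = min(cells - 1, int((lat + 90.0) / 180.0 * cells))
    return interleave_bits(col, row, bits)


if __name__ == "__main__":
    print(z_order_key(34.8078, 32.1643))
    print(z_order_key(34.8117, 32.1649))  # a nearby point
```

Sorting by the resulting key keeps many (not all) nearby points close together in a one-dimensional index, which is what makes ordinary B-trees and key-value stores usable for spatial data.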

Page 18: Geo data analytics

The Curse of Dimensionality

Page 19: Geo data analytics

Dimension Reduction

● GeoHash - The mainstream way
  ○ Linear (non-tangent) projection; cell area can differ by up to 5x
  ○ Same prefix - close areas (sort of…)
  ○ http://geohash.org/
  ○ https://github.com/google/open-location-code/blob/master/docs/comparison.adoc
● S2 - The Google way
  ○ Quadratic projection; cells at the same level have similar area
  ○ Faces of a projected cube, each divided into levels by a quadtree, with cells referenced to their position on the face by a Hilbert curve
  ○ https://code.google.com/p/s2-geometry-library/
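The "same prefix, close areas" property of GeoHash follows directly from how the hash is built: the bounding box is repeatedly halved, alternating between longitude and latitude, and every 5 bits become one base-32 character. A minimal standard-library sketch follows; the function name `geohash_encode` is hypothetical, and production code would normally use an existing geohash library.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 9) -> str:
    """Encode a lat/lon pair as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    value, bit_count = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bit = 1 if lon >= mid else 0
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = 1 if lat >= mid else 0
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        use_lon = not use_lon
        value = (value << 1) | bit
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            value, bit_count = 0, 0
    return "".join(chars)


if __name__ == "__main__":
    print(geohash_encode(57.64911, 10.40744, 6))
    print(geohash_encode(32.1643, 34.8078, 7))
```

Truncating a hash gives the enclosing, coarser cell. The "sort of…" caveat is that two points on opposite sides of a cell boundary can be very close and still share no prefix.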

Page 20: Geo data analytics

Near Real Time Answers

● MongoDB Geospatial Indexing
● elastic / solr spatial indexing
● GeoMesa
● Build your own - Store the bytes in a fast key-value store with reduced keys (HBase / Cassandra)

Page 21: Geo data analytics

Big Processing - It’s a UDF World

● ESRI - Hive UDFs - https://github.com/Esri/spatial-framework-for-hadoop/wiki/UDF-Documentation
● Pigeon - Pig UDFs - https://github.com/aseldawy/pigeon
● Spark
  ○ SpatialSpark
  ○ GeoTrellis

Page 22: Geo data analytics

Graph Representation

● Use Cases
  ○ Routing
  ○ Supply chains
  ○ User networks
● Tools
  ○ GraphX (Spark!) / Giraph (MR)
  ○ Dato SGraph (formerly known as GraphLab)
  ○ Gephi (on small parts, for exploration)
● Algorithms
  ○ Shortest path - Dijkstra / A*
  ○ Communities - Triangle counting
  ○ Importance - Centrality / PageRank
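For the routing use case, the shortest-path idea can be sketched in a few lines. The example below is a plain single-machine Dijkstra over an adjacency list, assuming non-negative edge weights (for instance travel time between junctions); the graph layout and function name are assumptions for illustration and have nothing to do with the GraphX or Giraph APIs.

```python
import heapq


def dijkstra(graph, source):
    """Return the cheapest known cost from `source` to every reachable node.

    `graph` maps node -> list of (neighbor, weight) pairs; weights must be >= 0.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a cheaper route was already found
        for neighbor, weight in graph.get(node, []):
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


if __name__ == "__main__":
    # Hypothetical road segments with travel times in minutes.
    roads = {
        "A": [("B", 4), ("C", 1)],
        "C": [("B", 2), ("D", 7)],
        "B": [("D", 3)],
    }
    print(dijkstra(roads, "A"))
```

A* follows the same loop but orders the heap by cost so far plus a heuristic, for example great-circle distance to the destination, which usually explores far fewer nodes on road networks.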

Page 23: Geo data analytics

Tips & Tricks

Page 24: Geo data analytics

Approximation

Page 25: Geo data analytics
Page 26: Geo data analytics

Timezones

● tz_world
  ○ http://efele.net/maps/tz/world/
  ○ What do we do with shapefiles?
● APIs
  ○ Geonames
  ○ http://www.earthtools.org/
  ○ Google Timezone API
● UDFs?
  ○ Hive - from_utc_timestamp(timestamp, string timezone)

Page 27: Geo data analytics
Page 28: Geo data analytics

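Good Old Word Count

// Word Count (`spark` is assumed to be the SparkContext)
val textFile = spark.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

// Take that from a library! Sketched here with Google's S2 Java library (an assumption).
// S2 cell ids are 64-bit, so the result is a Long rather than an Int.
import com.google.common.geometry.{S2CellId, S2LatLng}
def coord2S2Cell(longitude: Double, latitude: Double, lvl: Int = 14): Long =
  S2CellId.fromLatLng(S2LatLng.fromDegrees(latitude, longitude)).parent(lvl).id()

// Modified Word Count: count points per S2 cell instead of words
// (assumes CSV columns 1 and 2 hold longitude and latitude)
val pointsFile = spark.textFile("hdfs://...")
val cellCounts = pointsFile.map(line => line.split(","))
  .map(point => (coord2S2Cell(point(1).toDouble, point(2).toDouble), 1))
  .reduceByKey(_ + _)
cellCounts.saveAsTextFile("hdfs://...")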

Page 29: Geo data analytics

Advanced - Precision is of the Essence

● Density Based Clustering
  ○ DBSCAN
    ■ Minimum cluster size (> Noise)
    ■ Epsilon (Spatial Radius)
  ○ R - MASS - kde2d
    ■ RGoogleMaps for the map
    ■ http://www.everydayanalytics.ca/2014/04/heatmap-of-toronto-traffic-signals.html
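To show how the two DBSCAN parameters interact, here is a compact sketch of the algorithm in plain Python. It is a teaching version under stated assumptions: distances are planar Euclidean (so inputs should already be projected, or cover a small area), the neighbor search is a brute-force O(n²) scan, and `min_pts` counts the point itself. The names `dbscan`, `eps`, and `min_pts` are chosen for the example; in R the slide's `fpc` package provides a real implementation.

```python
from math import hypot

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each (x, y) point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)

    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but does not expand
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11),
           (50, 50)]
    print(dbscan(pts, eps=1.5, min_pts=3))
```

Raising `min_pts` pushes sparse groups into noise, while raising `eps` merges neighboring clusters, which is why the two are usually tuned together.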

Page 30: Geo data analytics

rJava

● Wrap geospatial functions of your choice
● Call them from R
● Use apply on an entire Dataframe!
● Use as features!
● Visualize??? (in 5 minutes)

Page 31: Geo data analytics

R Packs for Geospatial Analysis

● geonames
  ○ Timezone
  ○ Weather
  ○ Nearby places
● RGoogleMaps
  ○ Download + paint maps
  ○ getGeoCode
● sp / maps / maptools
  ○ OGC object abstractions
  ○ Manipulate / display geo data
● rgdal - spTransform
  ○ Convert formats / coordinate systems
● geosphere - distances / circles / centroids
● fpc - DBSCAN
● Coverage
  ○ http://cran.r-project.org/web/views/Spatial.html

Page 32: Geo data analytics
Page 33: Geo data analytics

Engineered Geo features

● LOCAL
  ○ time
  ○ is_early / is_late
  ○ day of week
  ○ is_workday / is_weekend
  ○ is_day_light (sunrise / sunset, tz_world)
● Weather
  ○ Temperature
  ○ is_rain / is_fog / is_hail / is_snow
● Squared (s2cell / geohash) statistics
  ○ Probability of users in square to predict X
● Address - is_residence / is_business
● News - GDELT
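The local-time features in this list are cheap to derive once a timestamp has been converted to the local time of the point. The sketch below is one possible shape for them; the function name, the hour thresholds for "early" and "late", and the default weekend days are all assumptions that should be tuned per dataset and per country (holidays are ignored here).

```python
from datetime import datetime


def local_time_features(local_dt, weekend_days=(5, 6), early_before=7, late_from=22):
    """Derive simple calendar features from an already-localized datetime.

    weekend_days uses datetime.weekday() numbering (Monday = 0 ... Sunday = 6).
    """
    hour = local_dt.hour
    day_of_week = local_dt.weekday()
    is_weekend = day_of_week in weekend_days
    return {
        "hour": hour,
        "day_of_week": day_of_week,
        "is_early": hour < early_before,
        "is_late": hour >= late_from,
        "is_weekend": is_weekend,
        "is_workday": not is_weekend,  # ignores public holidays
    }


if __name__ == "__main__":
    # A Sunday morning, with a Friday-Saturday weekend convention.
    print(local_time_features(datetime(2015, 6, 7, 6, 0), weekend_days=(4, 5)))
```

The same timestamp can be a workday in one country and a weekend in another, so the weekend convention is best looked up from the point's location rather than hard-coded.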

Page 34: Geo data analytics

WOW!

Page 35: Geo data analytics

Data Art

Page 36: Geo data analytics

Google Sheets

Page 37: Geo data analytics

Frontend = Javascript?

● Google Maps API
  ○ https://developers.google.com/maps/documentation/javascript/examples/layer-heatmap
● Leaflet

Page 38: Geo data analytics

R for Visualisation

● ggplot2 + geospatial packs
  ○ http://uce.uniovi.es/mundor/howtoplotashapemap.html
  ○ http://stackoverflow.com/questions/9558040/ggplot-map-with-l
  ○ http://spatial.ly/2012/02/great-maps-ggplot2/
● RGoogleMaps
  ○ http://rforwork.info/tag/rgooglemaps/

Page 39: Geo data analytics
Page 40: Geo data analytics
Page 41: Geo data analytics
Page 42: Geo data analytics
Page 43: Geo data analytics

R For Interactive

● Shiny
  ○ Leaflet
    ■ http://rstudio.github.io/leaflet/
    ■ http://shiny.rstudio.com/gallery/superzip-example.html
    ■ http://shiny.rstudio.com/gallery/bus-dashboard.html
  ○ Globe
    ■ https://github.com/trestletech/shinyGlobe

Page 44: Geo data analytics
Page 45: Geo data analytics

R Animation

● http://rmaps.github.io/blog/posts/animated-choropleths/

Page 46: Geo data analytics

@aaronkoblin

Page 47: Geo data analytics

Keep an Eye Out!

https://locationtech.org/list-of-projects

Page 48: Geo data analytics
Page 49: Geo data analytics
Page 50: Geo data analytics

Contact

● Daniel Marcous● [email protected]