
Big Data and Analytics with ArcGIS

Canserina Kurnia

Technical Manager – Esri Global Asia Pacific

Agenda

• What is Big Data?

• What is Hadoop?

• How does Spatial integrate with Big Data and Hadoop?

• How do I get started?

Story Time…

U.S. Demographic Data

FOR EACH LOCATION

FOR EACH DEMOGRAPHIC

⬇50 MILE HEATMAP

Traditional Means…

14 Days

850 GB Raster Files

Better Way ?

What is BigData ?

7 BILLION

50% LIVE IN CITIES !

~70% By 2050 ! ! !

http://www.who.int/gho/urban_health/situation_trends/urban_population_growth_text/en/

Academics

Volume

Velocity Variety

But then I’ve seen…

Volume → data at rest

Velocity → data in motion

Variety → many types

Veracity → data in doubt

Validity → data that is correct

Visualization → data in patterns

Vulnerability → data at risk

Value → data that is meaningful

“When the traditional means are failing you” - Anonymous

What are the new means?

http://hadoop.apache.org

What’s in a name ?

http://blog.pivotal.io/pivotal/products/demystifying-hadoop-in-5-pictures

What Is Hadoop ?

• Library / Framework

• Very Very Large Un/Structured Dataset

• Multi Node Distributed Processing

• Resilient To Commodity Hardware Failure

Hadoop Basic Stack

Hadoop Distributed File System (HDFS)

Yet Another Resource Negotiator (YARN)

Commodity Servers

MapReduce Hive HBase

Other Hadoop Projects

• Avro - Serialization / RPC System

• HBase - Distributed Columnar Database

• Hive - Ad Hoc “SQL” Interface

• Pig - Data Flow Parallel Execution (AML)

• ZooKeeper - Coordination Service

• More….

HDFS

• Distributed File System

• Lots and Lots of Commodity Drives

• Fault Tolerant

• Loves Big Files

• “POSIX” Like Interface

HDFS

NameNode

DataNode DataNode DataNode

HDFS Client

HDFS Resilience !

HDFS

DataNode DataNode DataNode
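
A rough sketch of why losing one DataNode need not lose data: every block is stored on several nodes, so a single failure still leaves other copies to read from. The node names, the round-robin placement, and both helper functions are illustrative assumptions; real HDFS placement is rack-aware and is decided by the NameNode. The replication factor of 3 matches the HDFS default.

```python
from itertools import cycle


def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (simplified)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    ring = cycle(range(len(nodes)))
    placement = {}
    for block in range(num_blocks):
        start = next(ring)
        # Consecutive nodes on the ring hold the replicas of this block.
        placement[block] = [nodes[(start + i) % len(nodes)] for i in range(replication)]
    return placement


def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live node."""
    return [b for b, holders in placement.items() if any(n not in failed for n in holders)]


nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
placement = place_blocks(num_blocks=6, nodes=nodes)
survivors = readable_blocks(placement, failed={"dn2"})
```

With one node down, every block in `survivors` is still readable; only when all holders of a block fail does it drop out of the list.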

BigData

Program


MapReduce

http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html

What Is MapReduce ?

• Parallel Fault Tolerant Framework

• Splits Large Input

• Invoke User Defined “Map” Function

• Shuffle and Sort

• Invoke User Defined “Reduce” Function

MapReduce & HDFS

[Diagram: Client, .jar, NameNode with JobTracker, and three DataNode / TaskTracker pairs]

Thinking In MR

Map: K1,V1 → list(K2,V2) (filter & transform)

Shuffle/Sort: list(K2,V2) → K2,list(V2)

Reduce: K2,list(V2) → list(K3,V3) (group & aggregate)
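
The three stages above can be sketched in memory on a single machine. This is only an illustration of the key/value flow, not how Hadoop schedules work; `run_map_reduce`, `map_status`, `reduce_count`, and the tab-separated sample lines are all made up for the example.

```python
from collections import defaultdict


def run_map_reduce(records, map_fn, reduce_fn):
    """Run map, shuffle/sort, and reduce in memory on one machine."""
    # Map: each (k1, v1) yields zero or more (k2, v2) pairs.
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)  # shuffle: gather values by key
    # Sort the keys, then reduce each (k2, list(v2)) to (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output


def map_status(line_no, line):
    # Filter and transform: skip lines without two fields, emit (status, 1).
    fields = line.split("\t")
    if len(fields) >= 2:
        yield fields[1], 1


def reduce_count(status, ones):
    # Group and aggregate: total the ones for each status.
    yield status, sum(ones)


lines = ["/a\t200", "/b\t404", "/c\t200", ""]
result = run_map_reduce(enumerate(lines), map_status, reduce_count)
```

Here `result` holds one `(status, count)` pair per distinct status, in key order.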

Geo MapReduce

DensityMap

ID1,X1,Y1

ID2,X2,Y2

ID3,X3,Y3

ID4,X4,Y4

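DensityMap

// Map: one input line "ID,X,Y" becomes (cell, 1)
function map(lineno, text)
{
  tokens = text.split(',')
  cell = toCell(tokens[1], tokens[2])
  emit(cell, 1)
}

// Convert a coordinate to the grid cell that contains it
function toCell(x, y)
{
  // some math !!
  return cell
}

// Reduce: add up the ones emitted for each cell
function reduce(cell, iterator)
{
  sum = 0
  for (one : iterator)
    sum += one
  emit(cell, sum)
}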
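
For readers who want to run the idea locally, here is a minimal Python sketch of the same density map. The slide leaves the `toCell` math open, so the square grid, the `CELL_SIZE` value, and the floor-based snapping below are assumptions chosen for illustration, and a `Counter` stands in for the shuffle/sort and reduce stages.

```python
import math
from collections import Counter

CELL_SIZE = 0.5  # assumed cell width/height, in the same units as x and y


def to_cell(x, y, cell_size=CELL_SIZE):
    """Snap a coordinate to the (column, row) index of its square grid cell."""
    return (math.floor(x / cell_size), math.floor(y / cell_size))


def map_point(line_no, text):
    """Emit (cell, 1) for one 'ID,X,Y' record."""
    tokens = text.split(",")
    yield to_cell(float(tokens[1]), float(tokens[2])), 1


def density(lines):
    """Sum the ones per cell; Counter stands in for shuffle/sort plus reduce."""
    counts = Counter()
    for line_no, text in enumerate(lines):
        for cell, one in map_point(line_no, text):
            counts[cell] += one
    return dict(counts)


# Made-up sample points in the ID,X,Y layout shown on the slide.
points = ["ID1,0.1,0.2", "ID2,0.4,0.3", "ID3,0.6,0.2", "ID4,-0.1,0.9"]
cells = density(points)
```

Using `math.floor` rather than `int()` keeps negative coordinates in the correct cell, since `int()` truncates toward zero.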

http://thunderheadxpler.blogspot.com/2013/03/bigdata-kernel-density-analysis-on.html

Writing MapReduce Is Hard…

http://www.cascading.org

Think of Data as Water In Pipes

Cascading pipeline

⬇MapReduce Job

[Diagram: X,Y Collection → To Cell → GroupBy count → Cell Count]

Workflow Pipeline

RM

Source

Sink

Filter

Cascading Pipe

// Pipe the X,Y input fields into the spatial function
Pipe pipe = new Each("start", new Fields("X", "Y"), new SpatialDensity());

// Group by the emitted 'cell' value
pipe = new GroupBy(pipe, new Fields("cell"));

// Count by group and name the count 'POPULATION'
pipe = new Every(pipe, Fields.GROUP, new Count(new Fields("POPULATION")));

http://thunderheadxpler.blogspot.com/2014/01/cascading-workflow-for-spatial-binning.html

How About….

No Programming ???

Apache HIVE

“SQL”

⬇MapReduce Job

HQL

drop table if exists logs;

create external table if not exists logs (
  ip string,
  method string,
  uri string,
  status string,
  bytes int,
  time_taken int,
  referrer string,
  user_agent string
)
partitioned by (year int, month int, day int, hour int)
row format delimited
  fields terminated by '\t'
  lines terminated by '\n'
stored as textfile
location 'hdfs://hadoop:8020/logs/';
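
Because the table is external and partitioned, Hive does not see any data under the location until each partition is registered. A small sketch of generating the `alter table ... add partition` statement for one hour of logs follows; the helper name and the zero-padded `year=/month=/day=/hour=` directory layout are assumptions, since the slide does not show how the log files are laid out in HDFS.

```python
def add_partition_statement(table, base_location, year, month, day, hour):
    """Build a HiveQL statement registering one hourly partition."""
    # Assumed directory layout: <base>/year=YYYY/month=MM/day=DD/hour=HH
    location = (
        f"{base_location.rstrip('/')}/year={year}/month={month:02d}"
        f"/day={day:02d}/hour={hour:02d}"
    )
    return (
        f"alter table {table} add if not exists "
        f"partition (year={year}, month={month}, day={day}, hour={hour}) "
        f"location '{location}';"
    )


stmt = add_partition_statement("logs", "hdfs://hadoop:8020/logs/", 2014, 1, 5, 9)
```

Queries that filter on `year`, `month`, `day`, or `hour` then only read the matching directories instead of scanning the whole log set.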

Other AdHoc Engines

• Cloudera Impala

• Facebook Presto

• SparkSQL

• Bypass MR generation / Direct HDFS Access

What About Spatial ?

GIS Tools For Hadoop

• Computational Geometry Library

• Hive Spatial UDF Functions

• GeoProcessing Extensions to ArcMap

Geometry Library

• Points / Lines / Polygons

• I/O (GeoJSON, WKT, WKB, Shape)

• Spatial Relations (inside, touches, intersects,…)

• Spatial Operations (buffer, cut, convex hull,…)

• In-Memory Spatial Index

API Usage in BigData

• Map-only jobs - GeoEnrichment

- Given set of locations

- Given demographic area

- Augment location with demographic attributes
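
The enrichment steps above can be sketched as a map-only job: each location is processed independently, so no shuffle or reduce is needed. The sketch below is plain Python with a hand-written ray-casting test rather than the Geometry API; the record layout, the `median_age` attribute, and the sample squares are made up for illustration.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test; `ring` is a list of (x, y) vertices."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Toggle when a horizontal ray from (x, y) crosses edge (j, i).
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside


def enrich(location, areas):
    """Map step: copy attributes of the first containing area onto the location."""
    for area in areas:
        if point_in_polygon(location["x"], location["y"], area["ring"]):
            return {**location, **area["attributes"]}
    return dict(location)  # no match: pass the record through unchanged


# Hypothetical demographic areas (two adjacent squares) and locations.
areas = [
    {"ring": [(0, 0), (4, 0), (4, 4), (0, 4)], "attributes": {"median_age": 34}},
    {"ring": [(4, 0), (8, 0), (8, 4), (4, 4)], "attributes": {"median_age": 41}},
]
locations = [{"id": 1, "x": 1, "y": 1}, {"id": 2, "x": 6, "y": 2}, {"id": 3, "x": 9, "y": 9}]
enriched = [enrich(loc, areas) for loc in locations]
```

In a real job the list of areas would be loaded once per mapper, and a spatial index would replace the linear scan over `areas`.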

BigData Binning


Hive Spatial UDF

• Uses Geometry API

• Constructor

- ST_POINT / ST_GeomFromGeoJSON

• Relations

- ST_Contains / ST_Buffer

• Accessor

- ST_Distance, ST_Area

Hive Spatial UDF

SELECT counties.name, count(*) total
FROM counties
JOIN earthquakes
WHERE ST_Contains(counties.boundaryshape, ST_Point(earthquakes.longitude, earthquakes.latitude))
GROUP BY counties.name
ORDER BY total desc;
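
To make the query's logic concrete, here is a small Python sketch of what it computes: pair every earthquake with every county, keep the pairs where the county contains the point, then count per county and sort by the count in descending order. Axis-aligned boxes stand in for the real county polygons and for `ST_Contains`, and the county names and coordinates are made up.

```python
from collections import Counter


def contains(bbox, lon, lat):
    """Stand-in for ST_Contains using an axis-aligned (xmin, ymin, xmax, ymax) box."""
    xmin, ymin, xmax, ymax = bbox
    return xmin <= lon < xmax and ymin <= lat < ymax


def count_by_county(counties, earthquakes):
    """Join every quake to its containing county, then group, count, and sort."""
    totals = Counter()
    for lon, lat in earthquakes:
        for name, bbox in counties.items():
            if contains(bbox, lon, lat):
                totals[name] += 1
    return totals.most_common()  # (name, total) pairs, largest total first


counties = {"A": (0, 0, 10, 10), "B": (10, 0, 20, 10)}  # hypothetical counties
earthquakes = [(1, 1), (12, 3), (15, 8), (30, 30)]  # (longitude, latitude)
ranking = count_by_county(counties, earthquakes)
```

The nested loop mirrors the join without an `ON` clause in the query: every county is tested against every earthquake, which is why this kind of job benefits from running in parallel.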

GP Extensions

ArcMap

HDFS

Hive/MapReduce

Workflow

PROCESSING EVOLUTION

• Transaction - Batch

• Operational - Dashboard

• Analytics - Exploration

• Intelligent - Realtime / Predictive

Fixed Schema

Variable Schema

Big Data Partners

And More….

Blog Post: http://thunderheadxpler.blogspot.com

Thank you