Spatial Analytics Workshop
Pete Skomoroch, LinkedIn (@peteskomoroch)
Kevin Weil, Twitter (@kevinweil)
Sean Gorman, FortiusOne (@seangorman)
#spatialanalytics
Introduction
‣ The Rise of Spatial Analytics
‣ Spatial Analysis Techniques
‣ Hadoop, Pig, and Big Data
‣ Bringing the Two Together
‣ Conclusion
‣ Q&A
Spatial Analysis
Analytical techniques to determine the spatial
distribution of a variable, the relationship between
the spatial distribution of variables, and the
association of the variables in an area.
Pattern Analysis
Spatial Analysis Types
1. Spatial autocorrelation
2. Spatial interpolation
3. Spatial interaction
4. Simulation and modeling
5. Density mapping
Spatial Autocorrelation
Spatial autocorrelation statistics measure and analyze
the degree of dependency among observations in a
geographic space.
First law of geography: “everything is related to everything
else, but near things are more related than distant things.”
-- Waldo Tobler
Moran’s I - Random Variable
Moran’s I = .012
Moran’s I - Per Capita Income in Monroe County
Moran’s I = .66
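The statistic behind both maps can be sketched in a few lines. The example below is a minimal, from-scratch Moran's I with a hypothetical four-region line and binary adjacency weights; it is not the Monroe County computation from the slide.

```python
# A minimal sketch of Moran's I for a handful of regions, assuming a
# simple binary adjacency weight matrix (example data is hypothetical).
import numpy as np

def morans_i(x, w):
    """Moran's I: spatial autocorrelation of values x under weights w."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = x - x.mean()                      # deviations from the mean
    num = n * (w * np.outer(z, z)).sum()  # spatially weighted cross-products
    den = w.sum() * (z ** 2).sum()
    return num / den

# Four regions in a line; neighbors share a border.
w = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

print(morans_i([1, 2, 8, 9], w))  # similar values cluster -> positive I
print(morans_i([1, 9, 2, 8], w))  # alternating values -> negative I
```

Values near 0 (like the random variable's .012) indicate no spatial pattern; values toward 1 (like income's .66) indicate clustering.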
Spatial Interpolation
Spatial interpolation methods estimate the variables
at unobserved locations in geographic space based
on the values at observed locations.
Natural Gas Demand in Response to February 21, 2003 Alberta Clipper cold front
$7.55
$14.00
$14.00
Henry
NYC
Chicago
Natural Gas Demand in Response to February 24, 2003 Alberta Clipper cold front
$16.00
$30.00
$18.50
Henry
NYC
Chicago
Natural Gas Demand in Response to February 25, 2003 Alberta Clipper cold front
$22.00
$37.00
$20.00
Henry
NYC
Chicago
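A common interpolation method is inverse-distance weighting: unobserved locations take a distance-weighted average of observed values. The sketch below uses the February 25 hub prices from the slides with rough, illustrative coordinates; the estimation point is hypothetical.

```python
# A sketch of inverse-distance-weighted (IDW) interpolation, estimating a
# gas price at an unobserved location from the three observed hubs.
# Coordinates are approximate (lon, lat); prices are the Feb 25 values.
import math

def idw(points, target, power=2):
    """Estimate a value at `target` from (x, y, value) observations."""
    num = den = 0.0
    for x, y, v in points:
        d = math.hypot(x - target[0], y - target[1])
        if d == 0:
            return v          # exact hit on an observation
        w = 1.0 / d ** power  # nearer observations get more weight
        num += w * v
        den += w
    return num / den

hubs = [(-92.0, 30.0, 22.00),   # Henry Hub
        (-74.0, 40.7, 37.00),   # NYC
        (-87.6, 41.9, 20.00)]   # Chicago

# Estimate the price at a point between Chicago and NYC.
print(round(idw(hubs, (-80.0, 40.4)), 2))
```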
Spatial Interaction
Spatial interaction or “gravity models” estimate
the flow of people, material, or information
between locations in geographic space.
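In its simplest form, a gravity model says flow is proportional to the product of the two places' "masses" divided by a power of the distance between them. The constants and city figures below are illustrative, not taken from the oil model that follows.

```python
# A toy gravity-model sketch: flow ~ k * M_a * M_b / d^beta.
# k and beta are calibration parameters; data here is illustrative.
def gravity_flow(mass_a, mass_b, distance, k=1.0, beta=2.0):
    return k * mass_a * mass_b / distance ** beta

# Two cities, populations in millions, distance in km (rough numbers).
print(gravity_flow(8.4, 2.7, 1140))
```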
Introduction
‣ Motivation
‣ Execution
‣ Prototype
‣ Service
‣ API
‣ Operations
‣ UX
Global Oil Supply and Demand Gravity Model
Simulation and Modeling
Simple interactions among proximal entities can
lead to intricate, persistent, and functional spatial
entities at aggregate levels (complex adaptive
systems).
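A classic illustration of this idea (not from the slides) is Conway's Game of Life: each cell interacts only with its eight neighbors, yet stable and oscillating structures emerge at the aggregate level.

```python
# A minimal sketch of emergence from purely local rules: one step of
# Conway's Game of Life over a sparse set of live (x, y) cells.
from collections import Counter

def step(live):
    """Advance one generation of the Game of Life."""
    counts = Counter((x + dx, y + dy)
                     for x, y in live
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    # A cell is alive next step if it has 3 neighbors, or 2 and was alive.
    return {c for c, n in counts.items()
            if n == 3 or (n == 2 and c in live)}

blinker = {(0, 1), (1, 1), (2, 1)}     # oscillates with period 2
print(step(step(blinker)) == blinker)  # True
```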
Spatial Interdependency Analysis of the San Francisco Failure Simulation
Infrastructure               Total Links   Links Congested   % Links Congested   % Volume Delay
Refined Products (National)        3,197                 1               0.03%            0.05%
Refined Products (MSA)                 8                 1              12.50%              93%
Power Grid (Regional)              1,942                 4                  0%              N/A
Power Grid (MSA)                      16                 2                 13%              N/A
Density Mapping
Calculating the proximity and frequency of a
spatial phenomenon by creating a probabilistic
surface.
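One way to build such a surface is a kernel density estimate: each observation contributes a Gaussian bump, and the bumps are summed over a grid. The point coordinates below are made up for illustration.

```python
# A sketch of density mapping: a Gaussian kernel density surface over a
# grid from a handful of point observations (coordinates are invented).
import math

def kde_grid(points, xs, ys, bandwidth=1.0):
    """Return a grid of summed Gaussian kernels centered on each point."""
    grid = []
    for y in ys:
        row = []
        for x in xs:
            d2 = ((x - px) ** 2 + (y - py) ** 2 for px, py in points)
            row.append(sum(math.exp(-d / (2 * bandwidth ** 2)) for d in d2))
        grid.append(row)
    return grid

pts = [(2, 2), (2.5, 2.2), (7, 7)]  # a small cluster and an outlier
surface = kde_grid(pts, range(10), range(10))
# The densest grid cell sits near the cluster, not the lone outlier.
```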
New York City Fiber Density Map
Standard GIS Architectures
Distributed Analytics
Analysis tasks from disparate data sources are queued for agents running
across distributed servers, and the results are collated back to the
user as answers.
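The queue-and-agents pattern can be sketched in miniature with a shared queue and worker threads; a real deployment would use separate servers and a durable queue service rather than threads, and the squaring step stands in for an actual analysis task.

```python
# A minimal sketch of the queue/agent pattern: tasks go on a shared
# queue, worker "agents" pull and process them, results are collated.
import queue
import threading

tasks, results = queue.Queue(), queue.Queue()

def agent():
    while True:
        item = tasks.get()
        if item is None:          # sentinel: shut this agent down
            break
        results.put(item * item)  # stand-in for a real analysis step
        tasks.task_done()

workers = [threading.Thread(target=agent) for _ in range(4)]
for w in workers:
    w.start()
for n in range(10):
    tasks.put(n)
tasks.join()                      # wait until every task is processed
for w in workers:
    tasks.put(None)
for w in workers:
    w.join()

collated = sorted(results.get() for _ in range(10))
print(collated)                   # squares of 0..9, collated in order
```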
[Diagram: User → Request Queue → Disparate Data → Agents → Distributed Servers → Analysis]
[Diagram: User → Request Queue → Agent → Amazon S3 / Amazon EC2]
1. Rasterize
2. Kernel density calc
3. Color map
(http://finder.geocommons.com/overlays/20148)
Vector Density Mapping Demo
Data is Getting Big
‣ NYSE: 1 TB/day
‣ Facebook: 20+ TB compressed/day
‣ CERN/LHC: 40 TB/day (15 PB/year!)
‣ And growth is accelerating
‣ Need multiple machines, horizontal scalability
Hadoop
‣ Distributed file system (hard to store a PB)
‣ Fault-tolerant, handles replication, node failure, etc.
‣ MapReduce-based parallel computation (even harder to process a PB)
‣ Generic key-value based computation interface allows for wide applicability
‣ Open source, top-level Apache project
‣ Scalable: Y! has a 4000-node cluster
‣ Powerful: sorted a TB of random integers in 62 seconds
MapReduce?
‣ Challenge: how many tweets per
county, given tweets table?
‣ Input: key=row, value=tweet info
‣ Map: output key=county, value=1
‣ Shuffle: sort by county
‣ Reduce: for each county, sum
‣ Output: county, tweet count
‣ With 2x machines, runs close to 2x faster.
cat file | grep geo | sort | uniq -c > output
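The same county-count job can be sketched as Hadoop Streaming-style map and reduce steps in Python. The field layout of the tweets table (county in the first tab-separated column) is an assumption for illustration.

```python
# The county-count job above, sketched as map and reduce functions.
# Assumes each input row is tab-separated with the county first.
from itertools import groupby

def mapper(lines):
    """Map: emit (county, 1) for each tweet row."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield fields[0], 1

def reducer(pairs):
    """Reduce: sum the 1s per county (input must be sorted by key)."""
    for county, group in groupby(pairs, key=lambda kv: kv[0]):
        yield county, sum(v for _, v in group)

rows = ["king\t...", "cook\t...", "king\t..."]
shuffled = sorted(mapper(rows))      # the shuffle phase: sort by county
print(list(reducer(shuffled)))       # [('cook', 1), ('king', 2)]
```

This is exactly the `grep | sort | uniq -c` pipeline above, with each stage made explicit.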
But...
‣ Analysis typically done in Java
‣ Single-input, two-stage data flow is rigid
‣ Projections, filters: custom code
‣ Joins: lengthy, error-prone
‣ n-stage jobs: Hard to manage
‣ Prototyping/exploration requires compilation
‣ analytics in Eclipse? ur doin it wrong...
Enter Pig
‣ High level language
‣ Transformations on sets of records
‣ Process data one step at a time
‣ Easier than SQL?
Why Pig?
‣ Because I bet you can read the following script.
A Real Pig Script
‣ Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
No, seriously.
Pig Simplifies Analysis
‣ The Pig version is:
‣ 5% of the code, 5% of the time
‣ Within 50% of the execution time.
‣ Pig ♥ Geo:
‣ Programmable: fuzzy matching, custom filtering
‣ Easily link multiple datasets, regardless of size/structure
‣ Iterative, quick
A Real Example
‣ Fire up your EMR.
‣ ... or follow along at http://bit.ly/whereanalytics
‣ Pete used Twitter’s streaming API to store some tweets
‣ Simplest thing: group by location and count with Pig
‣ http://bit.ly/where20pig
‣ Here comes some code!
tweets = LOAD 's3://where20demo/sample-tweets' AS (
    user_screen_name:chararray,
    tweet_id:chararray,
    ...
    user_friends_count:int,
    user_statuses_count:int,
    user_location:chararray,
    user_lang:chararray,
    user_time_zone:chararray,
    place_id:chararray,
    ...);
tweets_with_location = FILTER tweets BY user_location != 'NULL';
normalized_locations = FOREACH tweets_with_location GENERATE LOWER(user_location) as user_location;
grouped_tweets = GROUP normalized_locations BY user_location PARALLEL 10;
location_counts = FOREACH grouped_tweets GENERATE $0 as location, SIZE($1) as user_count;
sorted_counts = ORDER location_counts BY user_count DESC;
STORE sorted_counts INTO 'global_location_tweets';
hadoop@ip-10-160-113-142:~$ hadoop dfs -cat /global_location_counts/part* | head -30
brasil 37985
indonesia 33777
brazil 22432
london 17294
usa 14564
são paulo 14238
new york 13420
tokyo 10967
singapore 10225
rio de janeiro 10135
los angeles 9934
california 9386
chicago 9155
uk 9095
jakarta 9086
germany 8741
canada 8201
東京都 7696
東京 7121
jakarta, indonesia 6480
nyc 6456
new york, ny 6331
Neat, but...
‣ Wow, that data is messy!
‣ brasil, brazil at #1 and #3
‣ new york, nyc, and new york ny all in the top 30
‣ Pete to the rescue.
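One hedged approach to the mess: lowercase, collapse whitespace, and fold known aliases into a canonical name. The alias table below is illustrative, not the one actually used for the cleanup.

```python
# A sketch of location-string normalization for the messy counts above.
# The alias table is a tiny illustrative sample, not a real gazetteer.
import re

ALIASES = {
    "brasil": "brazil",
    "nyc": "new york",
    "new york, ny": "new york",
}

def normalize(location):
    """Lowercase, trim, collapse runs of whitespace, fold aliases."""
    loc = re.sub(r"\s+", " ", location.strip().lower())
    return ALIASES.get(loc, loc)

print(normalize("  NYC "))   # new york
print(normalize("Brasil"))   # brazil
```

Re-grouping on the normalized key would merge brasil/brazil and the three New York variants before counting.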
Users by County
Lady Gaga
Tea Party
Dallas
Colbert
Questions? Follow us at
twitter.com/peteskomoroch
twitter.com/kevinweil
twitter.com/seangorman