
Using Cascalog to build an app based on City of Palo Alto Open Data


Page 1: Using Cascalog to build an app based on City of Palo Alto Open Data

Paco Nathan, Concurrent, Inc., San Francisco, CA (@pacoid)

(flow diagram, word count: Document Collection, Tokenize, Scrub token, Regex token, Stop Word List, HashJoin, GroupBy token, Count)

Copyright ©2013, Concurrent, Inc.

Monday, 28 January 13

Page 2: Using Cascalog to build an app based on City of Palo Alto Open Data

This project began as a machine learning workshop for a graduate seminar at CMU West

Many thanks to:

Stuart Evans, CMU Distinguished Service Professor

Jonathan Reichental, City of Palo Alto CIO

We use Cascalog to develop a Big Data workflow

Open Source: github.com/Cascading/CoPA/wiki


Page 3: Using Cascalog to build an app based on City of Palo Alto Open Data

Palo Alto is generally quite a pleasant place

• temperate weather

• lots of parks, enormous trees

• great coffeehouses

• walkable downtown

• not particularly crowded

• friendly VCs (sort of)

On a nice summer day, who wants to be stuck indoors on a phone call?

Instead, take it outside – go for a walk


Page 4: Using Cascalog to build an app based on City of Palo Alto Open Data

Surely, there must be an app for that…

But wait, there isn’t?

So let’s build one!

source: Apple


Page 5: Using Cascalog to build an app based on City of Palo Alto Open Data

process

source: algaelab.org


Page 6: Using Cascalog to build an app based on City of Palo Alto Open Data

1. unstructured data about municipal infrastructure (GIS data: trees, roads, parks)

2. unstructured data about where people like to walk (smartphone GPS logs)

3. a wee bit o’ curated metadata

⇒ 4. personalized recommendations:

“Find a shady spot on a summer day in which to walk near downtown Palo Alto. While on a long conference call. Sippin’ a latte or enjoying some fro-yo.”


Page 7: Using Cascalog to build an app based on City of Palo Alto Open Data

“unstructured” vs. “structured” data is actually quite a Big Debate

refer back to Edgar Codd 1969 to learn about the Relational Model

relational != SQL, but I digress…

Page 8: Using Cascalog to build an app based on City of Palo Alto Open Data

Data Science work must focus on the process of structuring data

which must occur long before the large-scale joins, predictive models, visualizations, etc.

So, the process of structuring data is what we examine here:

i.e., how to build workflows for Big Data

thank you Dr. Codd

“A relational model of data for large shared data banks” dl.acm.org/citation.cfm?id=362685


Page 9: Using Cascalog to build an app based on City of Palo Alto Open Data

references

by DJ Patil

Data Jujitsu, O’Reilly, 2012

amazon.com/dp/B008HMN5BE

Building Data Science Teams, O’Reilly, 2011

amazon.com/dp/B005O4U3ZE

Page 10: Using Cascalog to build an app based on City of Palo Alto Open Data

references

by Leo Breiman

Statistical Modeling: The Two Cultures, Statistical Science, 2001

bit.ly/eUTh9L

also check out RStudio: rstudio.org/ and rpubs.com/

Page 11: Using Cascalog to build an app based on City of Palo Alto Open Data

Generally speaking, we could approach the matter of developing an Open Data app through these steps:

• clean up the raw, unstructured data from CoPA download (ETL)

• before modeling, perform visualization and analysis in RStudio

• spend time on ideation and research for potential use cases

• iterate on business process for the app workflow

• integrate with use cases represented by the workflow taps

• apply best practices and TDD at scale

• …PROFIT!

source: South Park


Page 12: Using Cascalog to build an app based on City of Palo Alto Open Data

In terms of actual process used in Data Science, here’s how my teams have worked:

• discovery: help people ask the right questions

• modeling: allow automation to place informed bets

• integration: deliver products at scale to customers

• apps: build smarts into product features

• systems: keep infrastructure running, cost-effective

Page 13: Using Cascalog to build an app based on City of Palo Alto Open Data

For the process used with this Open Data app, we chose to use Cascalog

by Nathan Marz, Sam Ritchie, et al., 2010

a DSL in Clojure which implements Datalog, backed by Cascading

Some aspects of CS theory:

• Functional Relational Programming

• mitigates Accidental Complexity

• has been compared with Codd 1969

github.com/nathanmarz/cascalog/wiki


Page 14: Using Cascalog to build an app based on City of Palo Alto Open Data

Q:

Who uses Cascalog, other than Twitter?

A:

• Climate Corp (they’re hiring, ask for Crea)

• Factual

• Nokia Maps

• Harvard School of Public Health

• YieldBot (PDX)

• uSwitch (London)

• etc.


Page 15: Using Cascalog to build an app based on City of Palo Alto Open Data

pro:

• 10:1 reduction in code volume compared to SQL

• most advanced uses of Cascading

• Leiningen build: simple, no surprises, in Clojure itself

• test-driven development (TDD) for Big Data

• fault-tolerant workflows which are simple to follow

• machine learning, map-reduce, etc., started in LISP years ago anywho

con:

• learning curve, limited number of Clojure developers

• aggregators are the magic, those take effort to learn


Page 16: Using Cascalog to build an app based on City of Palo Alto Open Data

Accidental Complexity:

Not O(N^2) complexity, but the costs of software engineering at scale over time

What happens when you build recommenders, then go work on other projects for six months? What does it cost others to maintain your apps?

Cascalog allows for leveraging the same framework, same code base, from Discovery phase through to Systems phase

It focuses on the process of structuring data: specify what you require, not how it must be achieved

Huge implications for software engineering

Page 17: Using Cascalog to build an app based on City of Palo Alto Open Data

discovery

source: 2001 A Space Odyssey


Page 18: Using Cascalog to build an app based on City of Palo Alto Open Data

The City of Palo Alto recently began to support Open Data to give the local community greater visibility into how their city government operates

This effort is intended to encourage students, entrepreneurs, local organizations, etc., to build new apps which contribute to the public good

paloalto.opendata.junar.com/dashboards/7576/geographic-information/

discovery


Page 19: Using Cascalog to build an app based on City of Palo Alto Open Data

GIS about trees in Palo Alto:

discovery

Page 20: Using Cascalog to build an app based on City of Palo Alto Open Data

GIS about roads in Palo Alto:

discovery

Page 21: Using Cascalog to build an app based on City of Palo Alto Open Data

Geographic_Information,,,
"Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl"," Private: -1 Tree ID: 29 Street_Name: ADDISON AV Situs Number: 203 Tree Site: 2 Species: Celtis australis Source: davey tree Protected: Designated: Heritage: Appraised Value: Hardscape: None Identifier: 40 Active Numeric: 1 Location Feature ID: 13872 Provisional: Install Date: ","37.4409634615283,-122.15648458861,0.0 ","Point"
"Wilkie Way from West Meadow Drive to Victoria Place"," Sequence: 20 Street_Name: Wilkie Way From Street PMMS: West Meadow Drive To Street PMMS: Victoria Place Street ID: 598 (Wilkie Wy, Palo Alto) From Street ID PMMS: 689 To Street ID PMMS: 567 Year Constructed: 1950 Traffic Count: 596 Traffic Index: residential local Traffic Class: local residential Traffic Date: 08/24/90 Paving Length: 208 Paving Width: 40 Paving Area: 8320 Surface Type: asphalt concrete Surface Thickness: 2.0 Base Type Pvmt: crusher run base Base Thickness: 6.0 Soil Class: 2 Soil Value: 15 Curb Type: Curb Thickness: Gutter Width: 36.0 Book: 22 Page: 1 District Number: 18 Land Use PMMS: 1 Overlay Year: 1990 Overlay Thickness: 1.5 Base Failure Year: 1990 Base Failure Thickness: 6 Surface Treatment Year: Surface Treatment Type: Alligator Severity: none Alligator Extent: 0 Block Severity: none Block Extent: 0 Longitude and Transverse Severity: none Longitude and Transverse Extent: 0 Ravelling Severity: none Ravelling Extent: 0 Ridability Severity: none Trench Severity: none Trench Extent: 0 Rutting Severity: none Rutting Extent: 0 Road Performance: UL (Urban Local) Bike Lane: 0 Bus Route: 0 Truck Route: 0 Remediation: Deduct Value: 100 Priority: Pavement Condition: excellent Street Cut Fee per SqFt: 10.00 Source Date: 6/10/2009 User Modified By: mnicols Identifier System: 21410 ","-122.1249640794,37.4155803115645,0.0 -122.124661859039,37.4154224594993,0.0 -122.124587720719,37.4153758330704,0.0 -122.12451895942,37.4153242300888,0.0 -122.124456098457,37.4152680432944,0.0 -122.124399616238,37.4152077003122,0.0 -122.124374937753,37.4151774433318,0.0 ","Line"

discovery

(um, bokay…)


Page 22: Using Cascalog to build an app based on City of Palo Alto Open Data

(defn parse-gis
  "leverages parse-csv for complex CSV format in GIS export"
  [line]
  (first (csv/parse-csv line)))

(defn etl-gis
  "subquery to parse data sets from the GIS source tap"
  [gis trap]
  (<- [?blurb ?misc ?geo ?kind]
      (gis ?line)
      (parse-gis ?line :> ?blurb ?misc ?geo ?kind)
      (:trap (hfs-textline trap))))

discovery

(specify what you require, not how to achieve it…addressing the 80%)
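The parse-and-trap pattern in the query above can be sketched outside Cascalog. The following plain Python is only an illustration under stated assumptions, not the project's code: `parse_gis` and `etl_gis` are hypothetical names, and the four-field check is an assumption based on the sample export shown earlier.

```python
import csv


def parse_gis(line):
    """Split one raw GIS export line into (blurb, misc, geo, kind)."""
    row = next(csv.reader([line]))
    # Assumption: every usable record has exactly four CSV fields.
    if len(row) != 4:
        raise ValueError(f"expected 4 fields, got {len(row)}")
    blurb, misc, geo, kind = row
    return blurb, misc, geo, kind


def etl_gis(lines):
    """Return (parsed_rows, trapped_lines).

    Lines that fail to parse are set aside rather than aborting the run,
    which loosely mirrors what a failure trap does in the workflow.
    """
    parsed, trapped = [], []
    for line in lines:
        try:
            parsed.append(parse_gis(line))
        except (ValueError, csv.Error):
            trapped.append(line)
    return parsed, trapped
```

Keeping the bad lines in a separate list means they can be inspected later, instead of silently disappearing or stopping the whole job.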


Page 23: Using Cascalog to build an app based on City of Palo Alto Open Data

discovery

(convert ad-hoc queries into logical propositions)


Page 24: Using Cascalog to build an app based on City of Palo Alto Open Data

Identifier: 474
Tree ID: 412
Tree: 412 site 1 at 115 HAWTHORNE AV
Tree Site: 1
Street_Name: HAWTHORNE AV
Situs Number: 115
Private: -1
Species: Liquidambar styraciflua
Source: davey tree
Hardscape: None
37.446001565119,-122.167713417554,0.0
Point

discovery

(obtain recognizable results)


Page 25: Using Cascalog to build an app based on City of Palo Alto Open Data

discovery

(curate valuable metadata)


Page 26: Using Cascalog to build an app based on City of Palo Alto Open Data

(defn get-trees
  "subquery to parse/filter the tree data"
  [src trap tree_meta]
  (<- [?blurb ?tree_id ?situs ?tree_site
       ?species ?wikipedia ?calflora ?avg_height
       ?tree_lat ?tree_lng ?tree_alt ?geohash]
      (src ?blurb ?misc ?geo ?kind)
      (re-matches #"^\s+Private.*Tree ID.*" ?misc)
      (parse-tree ?misc :> _ ?priv ?tree_id ?situs ?tree_site ?raw_species)
      ((c/comp s/trim s/lower-case) ?raw_species :> ?species)
      (tree_meta ?species ?wikipedia ?calflora ?min_height ?max_height)
      (avg ?min_height ?max_height :> ?avg_height)
      (geo-tree ?geo :> _ ?tree_lat ?tree_lng ?tree_alt)
      (read-string ?tree_lat :> ?lat)
      (read-string ?tree_lng :> ?lng)
      (geohash ?lat ?lng :> ?geohash)
      (:trap (hfs-textline trap))))
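The deck does not show the body of `parse-tree`, so the following is a hedged Python sketch of what the regex-parse and species-scrub steps might look like. The pattern `TREE_PATTERN` and the function name `parse_tree` are assumptions inferred from the sample record, not the project's actual regex.

```python
import re

# Hypothetical pattern, inferred from the field order in the sample tree record.
TREE_PATTERN = re.compile(
    r"^\s+Private:\s+(\S+)\s+Tree ID:\s+(\d+)\s+.*"
    r"Situs Number:\s+(\d+)\s+Tree Site:\s+(\d+)\s+"
    r"Species:\s+(.*?)\s+Source:.*"
)


def parse_tree(misc):
    """Extract tree fields from the misc column, or return None if not a tree."""
    match = TREE_PATTERN.match(misc)
    if match is None:
        return None
    priv, tree_id, situs, tree_site, raw_species = match.groups()
    return {
        "priv": priv,
        "tree_id": tree_id,
        "situs": situs,
        "tree_site": tree_site,
        # Scrub step: trim and lower-case so the species can act as a join key.
        "species": raw_species.strip().lower(),
    }
```

Normalizing the species string matters because it is the key used to join against the curated tree metadata.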

discovery


Page 27: Using Cascalog to build an app based on City of Palo Alto Open Data

?blurb       Tree: 412 site 1 at 115 HAWTHORNE AV, on HAWTHORNE AV 22 from pl
?tree_id     412
?situs       115
?tree_site   1
?species     liquidambar styraciflua
?wikipedia   http://en.wikipedia.org/wiki/Liquidambar_styraciflua
?calflora    http://calflora.org/cgi-bin/species_query.cgi?where-calrecnum=8598
?avg_height  27.5
?tree_lat    37.446001565119
?tree_lng    -122.167713417554
?tree_alt    0.0
?geohash     9q9jh0

discovery

(et voilà, a data product)


Page 28: Using Cascalog to build an app based on City of Palo Alto Open Data

# run some analysis and visualization in R
library(ggplot2)

dat_folder <- '~/src/concur/CoPA/out/tree'
data <- read.table(file=paste(dat_folder, "part-00000", sep="/"),
                   sep="\t", quote="", na.strings="NULL", header=FALSE, encoding="UTF8")

summary(data)

# top 20 species by count
t <- head(sort(table(data$V5), decreasing=TRUE), n=20)
trees <- as.data.frame.table(t)
colnames(trees) <- c("species", "count")

# histogram and density of estimated tree height
m <- ggplot(data, aes(x=V8))
m <- m + ggtitle("Estimated Tree Height (meters)")
m + geom_histogram(aes(y = ..density.., fill = ..count..)) + geom_density()

# species counts, with rotated axis labels
par(mar = c(7, 4, 4, 2) + 0.1)
plot(trees, xaxt="n", xlab="")
axis(1, labels=FALSE)
text(1:nrow(trees), par("usr")[3] - 0.25, srt=45, adj=1, labels=trees$species, xpd=TRUE)
grid(nx=nrow(trees))

discovery


Page 29: Using Cascalog to build an app based on City of Palo Alto Open Data

discovery

sweetgum


Page 30: Using Cascalog to build an app based on City of Palo Alto Open Data

discovery

(flow diagram, gis ⇒ tree)

Page 31: Using Cascalog to build an app based on City of Palo Alto Open Data

The conceptual flow diagram shows a directed, acyclic graph (DAG) of taps, tuple streams, functions, joins, aggregations, assertions, etc.

Cascading is formally a pattern language – patterns of “plumbing” fit together to ensure best practices for large-scale parallel processing in risk-averse environments – hard requirements of Enterprise IT

In other words, Cascading forces functional programming through an API for JVM-based languages such as Java, Scala, Clojure

Through this approach, we define Enterprise Data Workflows

definitions


Page 32: Using Cascalog to build an app based on City of Palo Alto Open Data

pattern language: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices

amazon.com/dp/0195019199

design patterns: originated in consensus negotiation for architecture, later used in OOP software engineering

amazon.com/dp/0201633612

definitions


Page 33: Using Cascalog to build an app based on City of Palo Alto Open Data

(defn get-roads
  "subquery to parse/filter the road data"
  [src trap road_meta]
  (<- [?blurb ?bike_lane ?bus_route ?truck_route ?albedo
       ?min_lat ?min_lng ?min_alt ?geohash
       ?traffic_count ?traffic_index ?traffic_class
       ?paving_length ?paving_width ?paving_area ?surface_type]
      (src ?blurb ?misc ?geo ?kind)
      (re-matches #"^\s+Sequence.*Traffic Count.*" ?misc)
      (parse-road ?misc :> _
                  ?traffic_count ?traffic_index ?traffic_class
                  ?paving_length ?paving_width ?paving_area ?surface_type
                  ?overlay_year ?bike_lane ?bus_route ?truck_route)
      (road_meta ?surface_type ?albedo_new ?albedo_worn)
      (estimate-albedo ?overlay_year ?albedo_new ?albedo_worn :> ?albedo)
      (bigram ?geo :> ?pt0 ?pt1)
      (midpoint ?pt0 ?pt1 :> ?lat ?lng ?alt)
      ;; why filter for min? because there are geo duplicates..
      (c/min ?lat :> ?min_lat)
      (c/min ?lng :> ?min_lng)
      (c/min ?alt :> ?min_alt)
      (geohash ?min_lat ?min_lng :> ?geohash)
      (:trap (hfs-textline trap))))

discovery


Page 34: Using Cascalog to build an app based on City of Palo Alto Open Data

?blurb          Hawthorne Avenue from Alma Street to High Street
?traffic_count  3110
?traffic_class  local residential
?surface_type   asphalt concrete
?albedo         0.12
?min_lat        37.446140860599854
?min_lng        -122.1674652295435
?min_alt        0.0
?geohash        9q9jh0

discovery

(another data product)


Page 35: Using Cascalog to build an app based on City of Palo Alto Open Data

discovery

The road data provides:

• traffic class (arterial, truck route, residential, etc.)

• traffic counts distribution

• surface type (asphalt, cement; age)

This leads to estimators for noise, reflection, etc.


Page 36: Using Cascalog to build an app based on City of Palo Alto Open Data

discovery

(flow diagram, gis ⇒ road)

Page 37: Using Cascalog to build an app based on City of Palo Alto Open Data

modeling

source: America’s Next Top Model


Page 38: Using Cascalog to build an app based on City of Palo Alto Open Data

GIS data from Palo Alto provides us with geolocation about each item in the export: latitude, longitude, altitude

Geo data is great for managing municipal infrastructure as well as for mobile apps

Predictive modeling in our Open Data example focuses on leveraging geolocation

We use spatial indexing by creating a grid of geohash values, for efficient parallel processing

Cascalog queries collect items with the same geohash values – using them as keys for large-scale joins (Hadoop)

modeling

Page 39: Using Cascalog to build an app based on City of Palo Alto Open Data

9q9jh0

geohash with 6-digit resolution

approximates a 5-block square

centered lat: 37.445, lng: -122.162
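To make the grid cell concrete, here is a minimal sketch of the standard geohash encoding in plain Python. The project itself relies on a geohash library listed in its dependencies; `geohash_encode` is a hypothetical name used only for this illustration.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lng, precision=6):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    chars = []
    bits, value = 0, 0
    use_lng = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lng:
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                value = (value << 1) | 1
                lng_lo = mid
            else:
                value <<= 1
                lng_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        use_lng = not use_lng
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)
```

With six characters, the tree and road coordinates shown in the earlier data products both land in cell 9q9jh0, which is what allows them to be joined on the geohash key.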

modeling


Page 40: Using Cascalog to build an app based on City of Palo Alto Open Data

Each road in the GIS export is listed as a block between two cross roads, and each may have multiple road segments to represent turns:

-122.161776959558,37.4518836690781,0.0
-122.161390381489,37.4516410983794,0.0
-122.160786011735,37.4512589903357,0.0
-122.160531178368,37.4510977281699,0.0

modeling

( lat0, lng0, alt0 )

( lat1, lng1, alt1 )

( lat2, lng2, alt2 )

( lat3, lng3, alt3 )

NB: segments in the raw GIS have the order of geo coordinates scrambled: (lng, lat, alt)

Page 41: Using Cascalog to build an app based on City of Palo Alto Open Data

Our app analyzes each road segment as a data tuple, calculating the center point for each:

modeling

( lat, lng, alt )
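The segment midpoint calculation can be sketched in plain Python as follows. This is an illustration under assumptions: `parse_points` and `segment_midpoints` are hypothetical names standing in for the `bigram` and `midpoint` operations named in the road query, whose bodies the deck does not show. The parser also swaps the raw (lng, lat, alt) order into (lat, lng, alt).

```python
def parse_points(geo):
    """Parse 'lng,lat,alt lng,lat,alt ...' into a list of (lat, lng, alt) tuples."""
    points = []
    for token in geo.split():
        lng, lat, alt = (float(part) for part in token.split(","))
        points.append((lat, lng, alt))  # reorder to lat, lng, alt
    return points


def segment_midpoints(points):
    """Pair consecutive points (bigrams) and return the midpoint of each pair."""
    midpoints = []
    for p0, p1 in zip(points, points[1:]):
        midpoints.append(tuple((a + b) / 2 for a, b in zip(p0, p1)))
    return midpoints
```

A road listed with four points therefore yields three segments, each represented by a single center point.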


Page 42: Using Cascalog to build an app based on City of Palo Alto Open Data

Then uses a geohash to define a grid cell, as a boundary (or “canopy”):

modeling

9q9jh0


Page 43: Using Cascalog to build an app based on City of Palo Alto Open Data

9q9jh0

Query to join a road segment tuple with all the trees within its geohash boundary:

modeling


Page 44: Using Cascalog to build an app based on City of Palo Alto Open Data


Use distance-to-midpoint to filter trees which are too far away to provide shade:

modeling


Page 45: Using Cascalog to build an app based on City of Palo Alto Open Data

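Calculate a sum of moments from tree height and distance to the road segment, as an estimator for shade. As implemented in the get-shade query, each moment is height divided by distance, so taller and nearer trees count for more:

modeling

∑( h / d )

We also calculate estimators for traffic frequency and noise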

Page 46: Using Cascalog to build an app based on City of Palo Alto Open Data

(defn get-shade
  "subquery to join tree and road estimates, maximize for shade"
  [trees roads]
  (<- [?road_name ?geohash ?road_lat ?road_lng
       ?road_alt ?road_metric ?tree_metric]
      (roads ?road_name _ _ _
             ?albedo ?road_lat ?road_lng ?road_alt ?geohash
             ?traffic_count _ ?traffic_class _ _ _ _)
      (road-metric ?traffic_class ?traffic_count ?albedo :> ?road_metric)
      (trees _ _ _ _ _ _ _ ?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash)
      (read-string ?avg_height :> ?height)
      ;; limit to trees which are higher than people
      (> ?height 2.0)
      (tree-distance ?tree_lat ?tree_lng ?road_lat ?road_lng :> ?distance)
      ;; limit to trees within a one-block radius (not meters)
      (<= ?distance 25.0)
      (/ ?height ?distance :> ?tree_moment)
      (c/sum ?tree_moment :> ?sum_tree_moment)
      ;; magic number 200000.0 used to scale tree moment
      ;; based on median
      (/ ?sum_tree_moment 200000.0 :> ?tree_metric)))
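The filtering and aggregation inside get-shade can be followed step by step in a plain Python sketch. Assumptions: the function name `tree_metric` is hypothetical, distances are taken as given (the deck does not show `tree-distance`, and notes that its units are not meters), and the guard against a zero distance is an addition for safety rather than something in the query.

```python
def tree_metric(trees, min_height=2.0, max_distance=25.0, scale=200000.0):
    """Estimate shade for one road segment.

    trees: iterable of (height, distance) pairs for trees sharing the
    segment's geohash cell. Thresholds and scale follow the query above.
    """
    total = 0.0
    for height, distance in trees:
        if height <= min_height:
            continue  # keep only trees which are higher than people
        if distance <= 0.0 or distance > max_distance:
            continue  # too far away (zero guard is an added assumption)
        total += height / distance  # the per-tree moment
    return total / scale
```

Passing `scale=1.0` returns the raw sum of moments, which is convenient when choosing a scaling constant for a different data set.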

modeling


Page 47: Using Cascalog to build an app based on City of Palo Alto Open Data

?road_name    Hawthorne Avenue from Alma Street to High Street
?geohash      9q9jh0
?road_lat     37.446140860599854
?road_lng     -122.1674652295435
?road_alt     0.0
?road_metric  [1.0 0.5488121277250486 0.88]
?tree_metric  4.36321007861036

(another data product)

modeling


Page 48: Using Cascalog to build an app based on City of Palo Alto Open Data

(flow diagram, shade)

modeling

Page 49: Using Cascalog to build an app based on City of Palo Alto Open Data

modeling


Page 50: Using Cascalog to build an app based on City of Palo Alto Open Data

modeling


Page 51: Using Cascalog to build an app based on City of Palo Alto Open Data

(defn get-gps
  "subquery to aggregate and rank GPS tracks per user"
  [gps_logs trap]
  (<- [?uuid ?geohash ?gps_count ?recent_visit]
      (gps_logs ?date ?uuid ?gps_lat ?gps_lng ?alt ?speed ?heading ?elapsed ?distance)
      (read-string ?gps_lat :> ?lat)
      (read-string ?gps_lng :> ?lng)
      (geohash ?lat ?lng :> ?geohash)
      (c/count :> ?gps_count)
      (date-num ?date :> ?visit)
      (c/max ?visit :> ?recent_visit)))
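The aggregation in get-gps groups GPS points by user and grid cell, then takes a count (frequency) and a maximum timestamp (recency). A minimal Python sketch of that grouping, assuming the geohash and numeric visit time are already computed for each point; `aggregate_gps` is a hypothetical name:

```python
from collections import defaultdict


def aggregate_gps(points):
    """Aggregate (uuid, geohash, visit) records per user and grid cell.

    Returns sorted rows of (uuid, geohash, gps_count, recent_visit).
    """
    counts = defaultdict(int)
    recent = {}
    for uuid, geohash, visit in points:
        key = (uuid, geohash)
        counts[key] += 1  # frequency
        recent[key] = max(visit, recent.get(key, visit))  # recency
    return [(u, g, counts[(u, g)], recent[(u, g)]) for (u, g) in sorted(counts)]
```

In the Cascalog query the grouping key is implicit: every output variable that is not an aggregate becomes part of the key.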

modeling

(behavioral targeting: aggregate GPS tracks by recency, frequency)

Page 52: Using Cascalog to build an app based on City of Palo Alto Open Data

(flow diagram, gps)

modeling

Page 53: Using Cascalog to build an app based on City of Palo Alto Open Data

?uuid                             ?geohash  ?gps_count  ?recent_visit
cf660e041e994929b37cc5645209c8ae  9q8yym    7           1972376866448
342ac6fd3f5f44c6b97724d618d587cf  9q9htz    4           1972376690969
32cc09e69bc042f1ad22fc16ee275e21  9q9hv3    3           1972376670935
342ac6fd3f5f44c6b97724d618d587cf  9q9hv3    3           1972376691356
342ac6fd3f5f44c6b97724d618d587cf  9q9hv6    1           1972376691180
342ac6fd3f5f44c6b97724d618d587cf  9q9hv8    18          1972376691028
342ac6fd3f5f44c6b97724d618d587cf  9q9hv9    7           1972376691101
342ac6fd3f5f44c6b97724d618d587cf  9q9hvb    22          1972376691010
342ac6fd3f5f44c6b97724d618d587cf  9q9hwn    13          1972376690782
342ac6fd3f5f44c6b97724d618d587cf  9q9hwp    58          1972376690965
482dc171ef0342b79134d77de0f31c4f  9q9jh0    15          1972376952532
b1b4d653f5d9468a8dd18a77edcc5143  9q9jh0    18          1972376945348

(GPS personalization)

modeling


Page 54: Using Cascalog to build an app based on City of Palo Alto Open Data

(defn get-reco
  "subquery to recommend road segments based on GPS tracks"
  [tracks shades]
  (<- [?uuid ?road ?geohash ?lat ?lng ?alt
       ?gps_count ?recent_visit ?road_metric ?tree_metric]
      (tracks ?uuid ?geohash ?gps_count ?recent_visit)
      (shades ?road ?geohash ?lat ?lng ?alt ?road_metric ?tree_metric)))
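In the query above, the join happens simply because `?geohash` appears in both the tracks and the shades predicates. The same inner join can be spelled out in plain Python as a rough sketch; `get_reco` here is a hypothetical stand-in that uses a hash index on the smaller side, and the tuple layouts follow the field order in the query.

```python
from collections import defaultdict


def get_reco(tracks, shades):
    """Inner-join GPS tracks with shaded road segments on geohash.

    tracks: (uuid, geohash, gps_count, recent_visit)
    shades: (road, geohash, lat, lng, alt, road_metric, tree_metric)
    """
    by_geohash = defaultdict(list)
    for shade in shades:
        by_geohash[shade[1]].append(shade)

    recos = []
    for uuid, geohash, gps_count, recent_visit in tracks:
        for road, _, lat, lng, alt, road_metric, tree_metric in by_geohash.get(geohash, []):
            recos.append((uuid, road, geohash, lat, lng, alt,
                          gps_count, recent_visit, road_metric, tree_metric))
    return recos
```

A user who has never visited a cell gets no rows for it, which is the expected behavior of an inner join.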

modeling

(finally, the recommender)


Page 55: Using Cascalog to build an app based on City of Palo Alto Open Data

Recommenders combine multiple signals, generally via weighted averages, to rank personalized results:

• GPS of person ∩ road segment

• frequency and recency of visit

• traffic class and rate

• road albedo (sunlight reflection)

• tree shade estimator

Adjusting the mix allows for further personalization at the end use
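As a rough illustration of combining signals via a weighted average, consider the Python sketch below. Everything in it is an assumption for illustration: the signal names, the idea that each signal is already normalized to the range 0 to 1, and the function names `weighted_score` and `rank`. The deck does not specify the actual weighting.

```python
def weighted_score(signals, weights):
    """Weighted average of named signals, each assumed to lie in [0, 1]."""
    total_weight = sum(weights[name] for name in signals)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(signals[name] * weights[name] for name in signals) / total_weight


def rank(candidates, weights):
    """Sort (label, signals) pairs by descending weighted score."""
    return sorted(candidates,
                  key=lambda candidate: weighted_score(candidate[1], weights),
                  reverse=True)
```

Changing the weights changes the ranking without touching the upstream workflow, which is one way the mix could be adjusted per user.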

modeling


Page 56: Using Cascalog to build an app based on City of Palo Alto Open Data

integration

source: Wolfram


Page 57: Using Cascalog to build an app based on City of Palo Alto Open Data

integration

Hadoop is rarely ever used in isolation

System integration is a hard problem in Big Data, especially social aspects: breaking down silos

Cascading was built for this purpose:

• taps across many data frameworks: HBase, Cassandra, MongoDB, etc.

• support for a variety of data serialization: Avro, Thrift, Kryo, JSON, etc.

• planning on multiple topologies: MapReduce, in-memory, tuple spaces, etc.

• test-driven development (TDD) at scale

• ANSI SQL-92 integration, PMML, etc.


Page 58: Using Cascalog to build an app based on City of Palo Alto Open Data

integration

This example focuses on the batch workflow to examine best practices for parallel processing

Integrating with a mobile app requires next steps:

• push “reco” output to a Redis cluster (caching layer) via a Cascading tap

• leverage Redis “sorted sets” for ranking personalized results

• create lightweight API in Node.js + Nginx for low-latency access at scale

• collect social interactions in Splunk

• instrument via Nagios, New Relic, Flurry, etc.

That provides a data service, but it doesn’t even begin to address design, user experience, marketing, implementation, etc., for a complete app…

Page 59: Using Cascalog to build an app based on City of Palo Alto Open Data

integration

Batch workflow plus a data service:

(architecture diagram: Cascading app on a Hadoop cluster, with source, sink, and trap taps connecting customer profile DBs, web logs, GPS tracks, the GIS export, the recommender, a Redis cluster, the mobile API, the web app, support review, and Splunk)

Page 60: Using Cascalog to build an app based on City of Palo Alto Open Data

integration

In terms of deploying a batch workflow, there are several considerations:

• build package for a “fat jar” (lein uberjar)

• continuous integration

• JAR repository

• cluster scheduling (e.g., EMR)

• instrumentation (Concurrent)

• troubleshooting from app layer


Page 61: Using Cascalog to build an app based on City of Palo Alto Open Data

apps

source: Apple


Page 62: Using Cascalog to build an app based on City of Palo Alto Open Data

apps

We work on discovery, modeling, integration – long before coding an app. In a linear-logical sense, one might prefer a “waterfall” approach; however, that would undermine core values – mitigating Accidental Complexity – TDD, scalability, fault-tolerance, etc.

In lieu of SQL queries, we define a composable set of logical propositions which can be executed, instrumented, tested, etc., independently for best practices at scale in parallel

Back to functional relational programming, particularly Datalog’s logic programming, we use subqueries as logical propositions… within a functional context… to leverage the relational model

• scalability: specify what you require, not how

• testability: disprove the opposites of propositions, to validate

Taken together in the context of Cascalog, now let’s build the app…


Page 63: Using Cascalog to build an app based on City of Palo Alto Open Data

apps

(defproject cascading-copa "0.1.0-SNAPSHOT"
  :description "City of Palo Alto Open Data recommender in Cascalog"
  :url "https://github.com/Cascading/CoPA"
  :license {:name "Apache License, Version 2.0"
            :url "http://www.apache.org/licenses/LICENSE-2.0"
            :distribution :repo}
  :uberjar-name "copa.jar"
  :aot [copa.core]
  :main copa.core
  :source-paths ["src/main/clj"]
  :dependencies [[org.clojure/clojure "1.4.0"]
                 [cascalog "1.10.0"]
                 [cascalog-more-taps "0.3.1-SNAPSHOT"]
                 [clojure-csv/clojure-csv "1.3.2"]
                 [org.clojars.sunng/geohash "1.0.1"]
                 [org.clojure/clojure-contrib "1.2.0"]
                 [date-clj "1.0.1"]]
  :profiles {:dev {:dependencies [[midje-cascalog "0.4.0"]]}
             :provided {:dependencies [[org.apache.hadoop/hadoop-core "0.20.2-dev"]]}})

Page 64: Using Cascalog to build an app based on City of Palo Alto Open Data

apps


Page 65: Using Cascalog to build an app based on City of Palo Alto Open Data

‣ addr: 115 HAWTHORNE AVE
‣ lat/lng: 37.446, -122.168
‣ geohash: 9q9jh0
‣ tree: 413 site 2
‣ species: Liquidambar styraciflua
‣ est. height: 23 m
‣ shade metric: 4.363
‣ traffic: local residential, light traffic
‣ recent visit: 1972376952532
‣ a short walk from my train stop ✔

apps

(results)


Page 66: Using Cascalog to build an app based on City of Palo Alto Open Data

apps

(flow diagram, for the whole enchilada)

Page 67: Using Cascalog to build an app based on City of Palo Alto Open Data

Design principles in the Cascading API pattern language, which help ensure best practices for Big Data apps in an Enterprise context:

• specify what is required, not how it must be achieved

• provide the “glue” for system integration

• same JAR, any scale

• users want no surprises

• fail the same way twice

• plan far ahead

These points echo arguments about functional relational programming (FRP) and Accidental Complexity from Moseley/Marks 2006

definitions


Page 68: Using Cascalog to build an app based on City of Palo Alto Open Data

systems

source: Wired


Page 69: Using Cascalog to build an app based on City of Palo Alto Open Data

principle: same JAR, any scale

Your Laptop: MBs of data; Hadoop standalone mode; passes unit tests, or not; runtime: seconds – minutes

Staging Cluster: GBs of data; EMR + a few Spot Instances; CI shows red or green lights; runtime: minutes – hours

Production Cluster: TBs of data; EMR w/ many HPC Instances; Ops monitors results; runtime: hours – days

MegaCorp Enterprise IT: PBs of data; 1000+ node private cluster; EVP calls you when app fails; runtime: days+

Page 70: Using Cascalog to build an app based on City of Palo Alto Open Data

systems

#!/bin/bash -ex
# edit the `BUCKET` variable to use one of your S3 buckets:
BUCKET=temp.cascading.org/copa
SINK=out

# clear previous output (required by Apache Hadoop)
s3cmd del -r s3://$BUCKET/$SINK

# load built JAR + input data
s3cmd put target/copa.jar s3://$BUCKET/
s3cmd put -r data s3://$BUCKET/

# launch cluster and run
elastic-mapreduce --create --name "CoPA" \
  --debug --enable-debugging --log-uri s3n://$BUCKET/logs \
  --jar s3n://$BUCKET/copa.jar \
  --arg s3n://$BUCKET/data/copa.csv \
  --arg s3n://$BUCKET/data/meta_tree.tsv \
  --arg s3n://$BUCKET/data/meta_road.tsv \
  --arg s3n://$BUCKET/data/gps.csv \
  --arg s3n://$BUCKET/$SINK/trap \
  --arg s3n://$BUCKET/$SINK/park \
  --arg s3n://$BUCKET/$SINK/tree \
  --arg s3n://$BUCKET/$SINK/road \
  --arg s3n://$BUCKET/$SINK/shade \
  --arg s3n://$BUCKET/$SINK/gps \
  --arg s3n://$BUCKET/$SINK/reco


Page 71: Using Cascalog to build an app based on City of Palo Alto Open Data

systems


Page 72: Using Cascalog to build an app based on City of Palo Alto Open Data

Apache

Wikipedia

‣ name node / data node

‣ job tracker / task tracker

‣ submit queue

‣ task slots

‣ HDFS

‣ distributed cache

(under the hood)

systems


Page 73: Using Cascalog to build an app based on City of Palo Alto Open Data

bucketlist



Page 74: Using Cascalog to build an app based on City of Palo Alto Open Data

Could combine this with a variety of data APIs:

• Trulia neighborhood data, housing prices

• Factual local business (FB Places, etc.)

• CommonCrawl open source full web crawl

• Wunderground local weather data

• WalkScore neighborhood data, walkability

• Data.gov US federal open data

• Data.NASA.gov NASA open data

• DBpedia datasets derived from Wikipedia

• GeoWordNet semantic knowledge base

• Geolytics demographics, GIS, etc.

• Foursquare, Yelp, CityGrid, Localeze, YP

• various photo sharing


Page 75: Using Cascalog to build an app based on City of Palo Alto Open Data

Data Quality: some species names have spelling errors or misclassifications – could be cleaned up and provided back to CoPA to improve municipal services

Assumptions have been made about missing data – were these appropriate for the intended use case?

There are better ways to handle spatial indexing: k-d trees, etc.
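As an illustration of the k-d tree alternative mentioned above, here is a minimal Python sketch of a 2-d tree with nearest-neighbor search. It is not part of the CoPA workflow. The function names and sample coordinates are hypothetical, and treating lat/lng as planar coordinates with Euclidean distance is a simplifying assumption that is only roughly reasonable at city scale.

```python
import math


def build_kdtree(points, depth=0):
    """Build a 2-d tree from (lat, lng) tuples, splitting on alternating axes."""
    if not points:
        return None
    axis = depth % 2
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2  # median point becomes this node
    return {
        "point": pts[mid],
        "axis": axis,
        "left": build_kdtree(pts[:mid], depth + 1),
        "right": build_kdtree(pts[mid + 1:], depth + 1),
    }


def nearest(node, target, best=None):
    """Return the stored point closest to `target` (planar Euclidean distance)."""
    if node is None:
        return best
    point = node["point"]
    if best is None or math.dist(point, target) < math.dist(best, target):
        best = point
    axis = node["axis"]
    diff = target[axis] - point[axis]
    # Search the side of the splitting line that contains the target first.
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = nearest(near, target, best)
    # Visit the other side only if it could hold a closer point.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, best)
    return best


# Hypothetical tree locations, for illustration only.
trees = [
    (37.446, -122.168),
    (37.440, -122.160),
    (37.450, -122.150),
    (37.430, -122.170),
    (37.455, -122.165),
]
index = build_kdtree(trees)
print(nearest(index, (37.445, -122.167)))
```

Unlike a geohash grid, where a point near a cell boundary can miss close neighbors in the adjacent cell, a k-d tree query returns the true nearest stored point under the chosen distance.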

The tree data product needs: photos, toxicity, natives vs. invasives, common names, etc.


Page 76: Using Cascalog to build an app based on City of Palo Alto Open Data

Arguably, this is not a “large” data set:

• Palo Alto has 65K population

• great location for a POC

• prior to deploying in large metro areas

• CoPA is a leader in e-gov

• app is simpler to study on a laptop

Could extend to other cities with Open Data initiatives: SF, SJ, PDX, Seattle, VanBC…

Let’s get coverage for all of Ecotopia!


Page 77: Using Cascalog to build an app based on City of Palo Alto Open Data

Trulia: optimize sales leads using estimated allergy zones, based on buyers’ real estate preferences

Calflora: report new observations of invasives, endangered species, etc.; infer regions of affinity for releasing beneficial insects

City of Palo Alto: assess zoning impact, e.g., oleanders near day care centers; monitor outbreaks of tree diseases (big impact on property values)

start-ups: some invasive species are valuable in Chinese medicine while others can be converted to biodiesel – potential win-win for targeted harvest services


Page 78: Using Cascalog to build an app based on City of Palo Alto Open Data

summary points

• geo data is great for municipal infrastructure and for mobile apps

• Cascading as a pattern language for Enterprise Data Workflows

• design principles in the API/pattern language ensure best practices

• focus on the process of structuring data; not un/structured

• Cascalog subqueries as composable logical propositions

• FRP mitigates the engineering costs of Accidental Complexity

• Data Science process: discovery, modeling, integration, apps, systems

• Hadoop is rarely, if ever, used in isolation; breaking down silos is the hard problem, which must be socialized to resolve


Page 80: Using Cascalog to build an app based on City of Palo Alto Open Data

references

by Paco Nathan

Enterprise Data Workflows with Cascading

O’Reilly, 2013
amazon.com/dp/1449358721

Santa Clara, Feb 28, 1:30pm
strataconf.com/strata2013


Page 81: Using Cascalog to build an app based on City of Palo Alto Open Data

blog, code/wiki/gists, maven repo, community, products:

cascading.org

github.com/Cascading

conjars.org

meetup.com/cascading

goo.gl/KQtUL

concurrentinc.com

drill-down

we are hiring! Copyright @2013, Concurrent, Inc.
