MapMyCab Preetika Kulshrestha Insight Data Engineering, Feb 2015

MapMyCab Presentation


Motivation

• Tool for data scientists and cab dispatchers to analyze (by time of day or day of week):

    • cab occupancy

    • miles travelled

    • pickups and drop-offs

• An app for city dwellers to view real-time cab status for unoccupied cabs in a given area

Demo

Pipeline

[Pipeline diagram. Components: Cab Data, Message Broker, Real-Time Streaming, HDFS, MrJob, HBase, UI]

11 million rows

Data Aggregation

Input columns: CabID, Lat, Long, Occ, Timestamp

Aggregate Metrics (per cab), computed with MrJob

Output columns: year, month, day, hour, avocc, pickup, drop off

• Drop-off event: occupancy changes from 1 to 0

• Pickup event: occupancy changes from 0 to 1
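The pickup and drop-off definitions above can be sketched in plain Python. This is a minimal illustration, not the MrJob code behind the slides: the `(timestamp, occupancy)` record layout and the `detect_events` name are assumptions made for the example.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout for a single cab: (timestamp, occupancy flag)
Record = Tuple[int, int]


def detect_events(records: Iterable[Record]) -> List[Tuple[int, str]]:
    """Return (timestamp, event) pairs for each occupancy transition."""
    events = []
    prev_occ = None
    # Transitions only make sense on time-ordered readings
    for ts, occ in sorted(records):
        if prev_occ == 0 and occ == 1:
            events.append((ts, "pickup"))    # 0 -> 1
        elif prev_occ == 1 and occ == 0:
            events.append((ts, "dropoff"))   # 1 -> 0
        prev_occ = occ
    return events


sample = [(100, 0), (160, 1), (220, 1), (280, 0), (340, 0), (400, 1)]
events = detect_events(sample)
```

In a MapReduce setting, the same loop would typically run in a reducer keyed by cab ID, so that each cab's readings are compared only with its own previous reading.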

Computing Trip Durations and Shift Times

• Used a windowing function in Hive to calculate idle times

• The maximum idle time in a day points to a potential shift

• 1 million trips

[Table columns: tripId, hour, idle (s), idle (h). Label: idle/shift time (hours)]
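The idle-time idea can be sketched in Python as a stand-in for the Hive windowing query, which the slides do not show. The sketch assumes that idle time is the gap between one trip's drop-off and the next trip's pickup for the same cab; the `(pickup_ts, dropoff_ts)` trip layout and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical trip layout for a single cab: (pickup_ts, dropoff_ts) in seconds
Trip = Tuple[int, int]


def idle_gaps(trips: List[Trip]) -> List[Tuple[int, int]]:
    """Return (idle_start_ts, idle_seconds) for each gap between consecutive trips."""
    ordered = sorted(trips)
    gaps = []
    # Pair each trip with the next one, like a LAG/LEAD window over ordered rows
    for (_, prev_drop), (next_pick, _) in zip(ordered, ordered[1:]):
        gaps.append((prev_drop, next_pick - prev_drop))
    return gaps


def longest_idle(trips: List[Trip]) -> Optional[Tuple[int, int]]:
    """Return the longest gap, a candidate shift boundary, or None if there is no gap."""
    gaps = idle_gaps(trips)
    return max(gaps, key=lambda g: g[1]) if gaps else None


trips = [(0, 600), (900, 1500), (30300, 31000), (31200, 32000)]
gaps = idle_gaps(trips)
longest = longest_idle(trips)
idle_hours = longest[1] / 3600  # 28800 s is 8 hours
```

Here the 8-hour gap stands out against gaps of a few minutes, which is the kind of signal the slide describes as pointing to a potential shift.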

Occupancy Profile

[Chart: occ (%) by hour of day (0 to 23), y-axis from 0 to 0.7, with an annotation marking the potential shift time]

Tables

• Hourly data organized by day of week

• Aggregate metrics stored in the same table for fast retrieval

Hourly Aggregates by Day of Week

y_m_dow     | c:0                                  | c:1  | c:2  | c:3  | c:4  | … | c:23  | c:Totals
Day of Week | Hour 0 Attributes                    | hr 1 | hr 2 | hr 3 | hr 4 | … | hr 23 | ..
2008_01_Mon | pickups, dropoffs, avg_occ, avg_dist | ..   | ..   | ..   | ..   | … | ..    | sum(pickups), sum(drop offs), avg(occ), avg(dist)
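The row and column layout can be illustrated with a few helper functions. This is a sketch under assumptions: the `y_m_dow` key format and the `c:<hour>` qualifiers come from the slide, while the function names and the comma-separated cell encoding are hypothetical, since the slides do not say how values are serialized.

```python
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def row_key(ts: datetime) -> str:
    """Build a year_month_dayofweek key such as '2008_01_Mon'."""
    return f"{ts.year}_{ts.month:02d}_{DAYS[ts.weekday()]}"


def column(ts: datetime) -> str:
    """Build the column qualifier for the hour, in family 'c' (c:0 to c:23)."""
    return f"c:{ts.hour}"


def cell_value(pickups: int, dropoffs: int, avg_occ: float, avg_dist: float) -> str:
    """Pack the hourly attributes into one cell (encoding is an assumption)."""
    return f"{pickups},{dropoffs},{avg_occ:.2f},{avg_dist:.2f}"


ts = datetime(2008, 1, 7, 14, 30)   # a Monday afternoon
key = row_key(ts)                   # '2008_01_Mon'
col = column(ts)                    # 'c:14'
val = cell_value(12, 11, 0.45, 3.2)
```

Keeping all 24 hours and the totals in one row means a single row read returns everything needed for one day-of-week view.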

Takeaways

• HBase row-level atomicity can be leveraged for transactional operations

• A keyed producer in Kafka ensures in-order delivery of messages that share a key

• Starting with simple operations for tool integration, then adding complexity incrementally, streamlines the development process
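The keyed-producer point rests on a simple mechanism: messages with the same key are routed to the same partition, and a partition preserves append order. The sketch below simulates that routing without a Kafka client. The CRC32 hash is a stand-in chosen for the example (Kafka's Java client hashes keys with murmur2), and the `partition_for` and `route` names are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Message = Tuple[str, str]  # (key, value); here the key is a cab ID


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition deterministically (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(messages: List[Message], num_partitions: int) -> Dict[int, List[Message]]:
    """Append each message to the partition chosen by its key, in send order."""
    partitions: Dict[int, List[Message]] = defaultdict(list)
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


messages = [
    ("cab_1", "t=1"), ("cab_2", "t=1"),
    ("cab_1", "t=2"), ("cab_2", "t=2"), ("cab_1", "t=3"),
]
partitions = route(messages, 4)
```

Because every reading for a given cab lands in one partition, a consumer of that partition sees the cab's occupancy changes in the order they were sent, which is what pickup and drop-off detection needs.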

About Me

• Previous life: Senior Energy Analyst (EnerNOC Inc.)

• M.S. Electrical Engineering, North Carolina State University (focus on robotics, control systems and smart grid)

• https://github.com/PreetikaKuls


Batch Views
