Upload
frederica-gibbs
View
229
Download
0
Tags:
Embed Size (px)
Citation preview
PORTAL Data Quality & Aggregation
Sue Ahn, Kristin Tufte
Road Map Initial Data Cleaning (more cleaning later)
Set null values as appropriate Set Data Quality Flags based on row-level data
integrity violations
Aggregate Calculations Calculate 5-minute aggregates Calculate 15-min, 1-hour aggregates
Portal Output Calculations Timeseries Grouped Data Performance
Road Map: Initial Data Cleaning
Set NULL values Data Quality Flags
Set Null Values as Appropriate Case 1: No Data Received from Detector
ODOT indicates vol = occ = speed = -1, status = 0 means no data received from detector ** (status 0 is inhibited)
We set volume, occupancy and speed to NULL in this case
Case 2: No Activity at Detector ODOT indicates vol = occ = speed = 0, status = 0 means no activity at
detector ** We set speed to NULL in this case
Case 3: Disabled (?) Also get readings: vol = occ = speed = -1, status = 1 (status 1 is disabled) Similar to Case 1, except status = 1, so assume this does mean disabled We set volume, occupancy and speed to NULL in this case Should verify the meaning of vol = occ = speed = -1, status = 1 with ODOT
** as of 2-14-05; fields were different before No -999’s in data received live; there are -999s in 2001, 2002 data
Data Quality Flags
Detect data inconsistencies and erroneous values
Row-level/Reading-level flags Flags are set for each (vol, occ, speed) reading we
get For each reading (row), a flag is set to indicate if that
row fails a particular criteria Examples
Max Speed, set if speed > 100 mph Max Occ, set if occ > 95%
Adapted from TTI (see wiki for link)
Data Quality Flags – Single Value Flags Maximum Volume (DQ_MAXVOL)
Condition: volume > 17 (20-sec data) Comments: Equivalent to 3060 vph; may be caused by
motorcycles or lane changes for example Maximum occupancy (DQ_MAXOCC)
Condition: occupancy > 95% Comments: detectors may be “stuck-on”
Maximum speed (DQ_MAXSPD) Condition: speed > 100 mph Comments: based on empirical evidence
Minimum speed (DQ_MINSPD) Condition: speed < 5 mph Comments: based on empirical evidence
Data Quality Flags – Multi-variate Consistency Multi-variate consistency #1 (DQ_MVC_S0VGT0)
Speed = 0 and Volume > 0 Indicates potential problem with: speed Is this necessary? We already eliminate speed < 5 mph
Multi-variate consistency #2 (DQ_MVC_SGT0V0) Speed > 0 and Volume = 0 Indicates potential problem with: speed
Multi-variate consistency #3 (DQ_MVC_OGT0V0) Occupancy > 0 and Volume = 0 Note: alternative: set occupancy = 0 Indicates potential problem with: occupancy
Road Map: Aggregate Calculations
5-minute aggregates 15-min, 1-hr aggregates
Missing Values in 5-min Aggregates of Vol, Speed, Occ
Raw AggregatesSum/avg data with no imputation, rescaling or
cleaning Processed Aggregates
Two issues:#1 Is there enough data to calculate an
aggregate?#2 If there is enough data, how to handle the
fact that we are missing values?
Missing Values in 5-min Aggregates of Vol, Speed, Occ, Cont’d
Issue #1: Determining if a time period has enough valid data to calculate aggregates for the time periodAggregate vol, speed, occ set to null if less
than X% of the readings are validDo sensitivity analysis to pick X (start with
75% - set 5-min aggregate to null if we have less than 75% of the data)
Bad Data Threshold Rob - In general, throughout the development of the aggregation process,
we consistently encountered the issue of how to determine when there is enough data to create an aggregate vs. when an aggregate value should be declared null because of lack of data.
For starters, we’ve chosen a threshold of 75%. If 75% or more of the 20-second readings in a 5-minute interval are valid, we calculate an aggregate for that 5-minute interval.
The same issue arises in the following situations: Calculating 5-min aggregates from 20-second data Calculating 15-min aggregates from 5-min data Calculating 1-hr aggregates from 15-min data Grouped Data: grouping over multiple days (group by hour) Grouped Data: aggregating 1-hr data to daily data for group by day of week and
week of year Related issues:
Aggregating over lanes (how to handle the situation when data is missing from one lane, note that data may be missing from a lane for only a portion of a day)
We wonder if studying this percentage (data sensitivity) would make a good paper topic.
#2 Handling Missing Values
Two basic methods for dealing with missing values 1) Use preceding/following values (replace missing volume with
preceding volume reading for that detector) 2) Aggregate values in a 5 minute range and scale based on # of
readings Since we are aggregating, suggest method 2
We are aggregating, so interpolation may not be necessary Think 2 will perform similarly to 1 in cases when only a few
readings are missing, when lots of readings missing, should report that data is not available (null)
Method 2 is well-defined (simple to define and express/explain) Method 2 is easier to implement than 1
5-min Aggregates (vol, spd, occ)
Volume For each detector, sum all valid volume readings in 5-
minute period (eliminate any readings with non-zero flags) and rescale by # of missing readings
Speed, Occupancy For each detector, calculate the volume-weighted
average of all valid readings in a 5-minute period Assume 20-second readings are space-mean speed Weight by volume (to be technically correct, we need to
weight by density, but that requires estimating density)
5-Min Aggregate – Other Measures
Vehicle Miles Traveled (VMT) 5-min VMT = 5-min volume * length of midpoint
influence area of associated detector
Vehicle Hours Traveled (VHT) 5-min VHT = 5-min volume * (length of midpoint
influence area / 5-min space-mean speed)
Travel Time 5-min travel time = length of midpoint influence area /
(5-min space-mean speed)
5-min Aggregates – Other – Cont’d
Total Delay (in veh-mins) 5-min delay = 5-min volume * (travel time – (influence
area / free-flow speed)) Where: free-flow speed = min {assumed free-flow
speed, actual 5-min speed} Note: Free-flow speed is computed in this way in
order to avoid negative delay Average Delay/Vehicle (in minutes)
5-min delay = Total Delay/5-min volume Should we report Average or Total Delay or
both?
Summary: Which values/measures affect which calculations?
20-sec volume
20-sec speed
20-sec occupancy
5-min volume
5-min speed
5-min occupancy
5-min VMT
5-min Travel Time
5-min Delay
5-min VHT
**
* May be needed, depends on method used
** Need 5-min speed to calculate free-flow speed
*
Road Map: 15-min, 1-hr Aggregates
Issues:Calculated from 5-min aggregates or from 20-
second data? We chose to use 5-min data
What should the threshold be for % of bad data?
Chose to toss 15-min reading if we are missing more than one 5-min reading
For now, toss 1 hr reading if we have less than 75% of 5-min readings (or less than 9 readings)
Note the “bad data” threshold issue arises here, also see next page.
Aggregation over Larger Intervals
15-min aggregates are calculated from 5-min aggregates, set aggregate value to null if missing more than one 5-minute reading
1-hr aggregates are calculated from 5-min aggregates, set aggregate value to null if missing more than three 5-minute readings
Aggregation over Time, Cont’d Volume
Sum 5-min volumes Occupancy
Volume-weighted average of 5-min occupancies Speed
Volume-weighted average of 5-min speeds Vehicle Miles Traveled
sum of 5-min VMTs Vehicle Hours Traveled
sum of 5-min VHTs Travel Time
Volume-weighted average of 5-min travel times Delay
Total Delay: Sum of 5-min delays Average Delay: Volume-weighted average of 5-min delays
Road Map: PORTAL Output
We have described how to do the 5-min, 15-min and 1-hr aggregations
For each page in PORTAL, analyze what calculations are needed and how those calculations should be done
Timeseries – Line Graphs Timeseries
Options: graph by station or lane, 7 basic measures, 5-min, 15-min, 1-hr granularity of input data
For Lane Graph These values are in aggregate table (calculation described in
previous slides) Requires 5-min (15-min, 1-hr) aggregates of each value for
each detector If data is missing, put blank space in the graph
For Station Graph: Requires aggregation over lanes Calculation described in subsequent slides If data is missing from one or more lanes, show blank space
on graph
Timeseries – Contour Plots Contour Plot (Surface Plot?) for a Highway
Requires same calculations as for station plot on previous slide
Aggregation over Lanes Volume
Sum Occupancy
Volume-weighted average of by-lane occupancies Speed
Volume-weighted average of by-lane speeds Vehicle Miles Traveled
Sum over lanes Vehicle Hours Traveled
Sum over lanes Travel Time
Volume-weighted average of by-lane travel times Average Delay
Volume-weighted average of by-lane delays Total Delay
Sum of by-lane delays
Grouped Data – Line Plots Line Plots by Station/Detector
Group by: 5 minute (minute of day) Input is 5-minute aggregations Simple arithmetic average, min, max, and std dev of data
from appropriate 5-min period Result shows variation between days
Group by: Hour of day Input is 1-hr aggregations As above, simple arithmetic average, min, max and std dev
of data from appropriate 1-hr periods Result shows variation between days
What to do if we are missing data? What if we are missing 50% of the values for 8:00-9:00, for example? We may be missing different %s for different hour intervals??
Grouped Data – Line Plots, Cont’d Line Plots by Station/Detector
Group by: Day of Week Input must be 1-hr aggregations Aggregate 1-hr data to daily data What if one or more hours in a day is/are missing?? (What if
we are missing a peak hour?) Then simple arithmetic avg, min, max, std dev of aggregates Result shows variation within day of week
Group by: Week of Year Similar to day of week
Note: “bad data threshold” issue again. Seems reasonable to rescale when going from 20-sec data to 5-min data, and again 5-min to 1-hr seems OK. But neither interpolating or rescaling seem really acceptable in this situation.
Aggregation to Day of Week Volume
Sum 1-hr volumes Occupancy
?? Volume-weighted average of 5-min occupancies Speed
( 1-hr VMTs)/( 1-hr VHTs) ensure for VMT and VHT - either are both null or neither are null
Vehicle Miles Traveled 1-hr VMTs Rescale???
Vehicle Hours Traveled 1-hr VHTs
Travel Time Volume-weighted average of 5-min travel times
Delay Total Delay: Sum of 5-min delays Average Delay: Volume-weighted average of 5-min delays
WARNING: STOP HERE
Following slides are under construction (hard hats required)
Grouped Data – Highway Plots Highway Plots
What does this do anyway??? Group by: 5 minute (minute of day)
Requires input to be 5-minute aggregate data Need to group 5-minute periods from different days together Use appropriate aggregation-by-time method
Group by: Hour of day Input can be 5-min, 15-min or 1-hr For 1hr input need to group 1-hr periods from different days
together Use appropriate aggregation-by-time method
5-min, 15-min input – Should we restrict this to 1-hr input?? Just use appropriate aggregation-by-time method??
Grouped Data – Highway Plots, Cont’d Highway Plots
Group by: Day of Week Input is 5-min, 15-min, 1-hr Need to group 5-min data into 5-min groups, 15-
min data into 15-min groups, etc. Use appropriate aggregation by time method??
Group by: Week of Year Same as Day of Week, I think
Grouped Data – All Highways
All Highways Plots
Performance Reports MONTHLY FIGURES
Avg TT Avg TS Avg VS FFTT FFTS 95th% TT 95th% TS 166.11 Mins 52.05 MPH 53.79 MPH 145.81 Mins 59.29 MPH 232.16
Mins 37.24 MPH VMT VHT % Cngst Hwy Len Days Used 211,298,304 4,118,289 15.09% 144.09 Miles 31 WEATHER NUMBER OF DAYS Rain ≥ 0.5" Rain ≥ 1.0" Frzn Precip Fog 8 2 0 1
TIME WINDOW FIGURES (primarily for daily report – BI, TTI are new) Avg TT Avg TS 95th% TT 95th% TS 166.11 Mins 52.05 MPH 232.16 Mins 37.24 MPH VMT VHT % Cngst TTI BI 211,298,304 4,118,289 15.09% 1.59 1.97
Future Flags
Calibration Error (not in this presentation, see paper) Multi-variate consistency
Volume > nmax when Occupancy = 0 This should be tested after correcting for calibration error
Maximum reduction in speed Speedn+1 > 0.45*Speedn
Identical records If more than 8 consecutive records (volume, occupancy, and
speed) are identical
Extra Slides
Questions
No vehicles present Speed = 0 when Volume = 0Set Speed = null
Am not doing this because of the -1, -1, -1, and 0, 0, 0 tests
At what point do we call a 5-minute interval invalid?? 5 missing readings? 8? 9? 10?
Aggregation over space – we need something for vol and
Aggregation over space – need to deal with missing data Need aggregation over time (i.e. combining data from
different days together)
ODOT SWARM Flags
0 - inhibited 1 - disabled 2 - ok 3 - suspect 4 - softfail 5 - hardfail
Data Notes Status 0 Received Data As of Feb 14, 2005, ODOT reports vol -1, occ -1,
speed -1, status 0, for detectors with no controller communications (no data received from detector), and vol 0, occ 0, speed 0, status 0 for detectors reporting no traffic data (no activity) (info from David Gaarsoe). See Data Fidelity and minus 1 and minus 999 for more info.
Data Cleaning The nightly scripts "clean" the -1 and -999 data as follows: Sets volume = 0, speed = NULL, occupancy = 0 where volume = -999 Sets volume = NULL, speed = NULL, occupancy = NULL, where volume
= -1 I don’t belive we receive -999 data anymore (KT)
Unknown We seem to also get vol = 0, speed = 0, status = 2 (or other status), with
occ > 0. What does this mean? What was that weird thing I noted in the file on badger?
Steve H’s notes from emails from ODOT -999
When we received blocks of data from 2001 and 2002, the 20-second data included records where volume=-999. This indicated that no cars passed. As of July 14, 2004 when PORTAL went live, we no longer received volume= -999. -999 are no longer an issue.
-1 From July 14, 2004 to November 17, 2004, we were receiving records with volume=-1. In all cases, when volume=-1, the SWARM algorithm variable status=2, which is the code for "OK." As I understand it, this is supposed to represent a communication error.
From November 18, 2004 to February 13, 2005, we did not receive any records where volume = -1. This is due to "Transport ATMS build 2.1, distributed on Nov 17, included many internal changes to traffic sensor and detector station data, (including station level data smoothing). Among other changes it started sending a 0 detector volume during Communications Failure with the Wapiti controllers, as noted by Robert Bertini." -David Gaarsoe
During this time period, it is impossible to distinguish between communication errors and periods where no cars pass.
Beginning on February 14, 2005, it appears that Transport ATMS build 2.3 was released. "Transport ATMS build 2.3 will ... once again report a -1 vol, -1 occupancy, and -1 speed, Inhibited status (0) for Traffic Detectors with no controller communications, and a 0 vol, 0 occupancy, 0 speed, Inhibited status for detectors of reporting controllers with no traffic data (no activity)." -David Gaarsoe
It appears that, beginning on February 14, 2005 we are now able to accurately distinguish between communication errors and periods where no vehicles pass a detector.
http://wiki.its.pdx.edu/wiki/portal/moin.cgi/ODOT_-1_Email
Data Quality Flags - Implementation
Queries or scripts analyze the data and set flags when criteria are met
We will be able to tell which data values failed which criteria
Aggregation over Lanes/Corridors
Volume Over lanes: Sum (average volume over detector stations doesn’t seem useful)
Occupancy Over lanes: Weighted average of by-lane occupancies by
volumes (average occupancy over detector stations doesn’t seem useful)
Speed Over lanes: Weighted average of by-lane speeds by volumes Over detector stations: Weighted average of speeds by volumes
(?)
Aggregation over Space, Cont’d
Vehicle Miles Traveled Sum over lanes or loop detector stations
Vehicle Hours Traveled Sum over lanes or loop detector stations
Travel Time Over lanes: Weighted average of travel time by volumes Over loop detector stations: Sum of segment-wide travel times
Delay Over lanes: Weighted average of travel time by volumes Over loop detector stations: Sum of segment-wide travel times