Colorado State Address Dataset Data Quality. Nathan Lowry, GIS Outreach Coordinator, State of Colorado. September 24, 2014


Page 1: Lowry colorado state address dataset data quality

Colorado State Address Dataset Data Quality

Nathan Lowry, GIS Outreach Coordinator, State of Colorado

September 24, 2014

Page 2: Lowry colorado state address dataset data quality

Data Quality

Two Tracks:

1. Develop criteria and measure quality
•Develop quality measures in relation to ISO standards
•Draw from measures in standards and practice

2. Compare for potential corrective actions
•Master Street Address Guide (MSAG) and ALI
•US Postal Service Address Quality Improvement (CASS)
•Statewide Voter Registration System (SCORE)
•Motorist Insurance Identification Database (MIIDB)

Page 3: Lowry colorado state address dataset data quality

ISO 19157 Geographic information - Data quality

Provides comprehensive definitions and testing guidance for measuring data quality:

completeness: presence and absence of features, their attributes and relationships

•commission: excess data present in a dataset

•omission: data absent from a dataset

logical consistency: degree of adherence to logical rules of data structure, attribution and relationships (data structure can be conceptual, logical or physical)

•conceptual consistency: adherence to rules of the conceptual schema

•domain consistency: adherence of values to the value domains

•format consistency: degree to which data is stored in accordance with the physical structure of the dataset

•topological consistency: correctness of the explicitly encoded topological characteristics of a dataset

positional accuracy: accuracy of the position of features

•absolute (or external) accuracy: closeness of reported coordinate values to values accepted as or being true

•relative (or internal) accuracy: closeness of the relative positions of features in a dataset to their respective relative positions accepted as or being true

•gridded data position accuracy: closeness of gridded data position values to values accepted as or being true.

temporal quality: accuracy of the temporal attributes and temporal relationships of features

•accuracy of a time measurement: correctness of the temporal references of an item (reporting of error in time measurement)

•temporal consistency: correctness of ordered events or sequences, if reported

•temporal validity: validity of data with respect to time

thematic accuracy: accuracy of quantitative attributes and the correctness of non-quantitative attributes and of the classifications of features and their relationships.

•classification correctness: comparison of the classes assigned to features or their attributes to a universe of discourse (e.g. ground truth or reference dataset)

•non-quantitative attribute correctness: correctness of non-quantitative attributes

•quantitative attribute accuracy: accuracy of quantitative attributes

Governor's Office of Information Technology ~ Executive Leadership Team

Page 4: Lowry colorado state address dataset data quality

Determining Sampling Size

Sample Size and Confidence Interval Tutorial

● The confidence interval (commonly referred to as the margin of error or error rate) is the plus-or-minus figure you hear mentioned relative to surveys or opinion polls. For example, if you use a confidence interval of 4 and 47% of your sample picks an answer, you can be "sure" that if you had asked the question of the entire relevant population, between 43% (47-4) and 51% (47+4) would have picked that answer. Most researchers prefer a confidence interval of less than 4 percentage points.

● The confidence level tells you how sure you can be. Expressed as a percentage, it represents how often the true percentage of the population who would pick an answer lies within the confidence interval. The 95% confidence level means you can be 95% certain; the 99% confidence level means you can be 99% certain. Most researchers use the 95% confidence level.

● When you put the confidence level and the confidence interval together, you can say (for example) that you are 95% sure that the true percentage of the population is between 43% and 51%.

● The wider the confidence interval (higher margin of error) you are willing to accept, the more certain you can be that the whole population answers would be within that range. For example, if you asked a sample of 1000 people in a city which brand of cola they preferred, and 60% said Brand A, you can be very certain that between 40% and 80% (a wide interval of plus or minus 20 points) of all the people in the city actually do prefer that brand. However, you cannot be so sure that between 59% and 61% (a narrow interval of plus or minus 1 point) of the people in the city prefer the brand.
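As a rough illustration of the cola example, the sketch below computes the margin of error for a sample proportion using the normal approximation. The function name `margin_of_error` and the value z = 1.96 (the usual multiplier for a 95% confidence level) are assumptions added here, and the calculation assumes a simple random sample.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Half-width of the normal-approximation interval for a proportion.

    p: observed sample proportion (0..1)
    n: sample size
    z: z-score for the confidence level (1.96 is assumed for 95%)
    """
    return z * math.sqrt(p * (1 - p) / n)

# Cola example from the slide: 60% of 1000 people prefer Brand A.
moe = margin_of_error(p=0.60, n=1000)
low, high = 0.60 - moe, 0.60 + moe
print(f"95% interval: {low:.1%} to {high:.1%}")
```

Under these assumptions the margin comes out at roughly plus or minus 3 percentage points, which sits between the very wide and very narrow ranges discussed in the example.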


Page 5: Lowry colorado state address dataset data quality

Data Quality - Sampling Size

Page 6: Lowry colorado state address dataset data quality

Data Quality - Sampling Size

Page 7: Lowry colorado state address dataset data quality

Data Quality - Sampling Size

With a confidence interval of 3 percentage points and a 95% confidence level:
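One common way to turn these two parameters into a sample size is Cochran's formula with a finite population correction. The sketch below is an illustration under that assumption, not necessarily the calculation used for the presentation. The function name, the z-score of 1.96, and the worst-case proportion p = 0.5 are all assumptions.

```python
import math

def sample_size(population, margin=0.03, z=1.96, p=0.5):
    """Sample size for estimating a proportion (assumed method: Cochran).

    population: number of records in the dataset being sampled
    margin:     confidence interval half-width (0.03 = 3 percentage points)
    z:          z-score for the confidence level (1.96 assumed for 95%)
    p:          expected proportion; 0.5 is the most conservative choice
    """
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction shrinks the requirement for small datasets.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

for size in (1_000, 10_000, 100_000, 1_000_000):
    print(size, sample_size(size))
```

The required sample grows slowly with the population and levels off just above 1,000 records, which is why a fixed target works for both small and large address datasets.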

Page 8: Lowry colorado state address dataset data quality

Data Quality - Sampling Method

1. Randomly select 5 address points

2. Select road segments associated with address points

3. Select adjacent connected road segments

4. Select the address points associated with the selected road segments

5. Repeat steps 3 & 4 until sample size is exceeded
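The five steps above can be sketched as a small graph-growing routine. This is a minimal illustration, assuming the address-to-segment association and the segment adjacency are already available as plain dictionaries. The names `cluster_sample`, `addr_to_segment`, and `segment_neighbors` are hypothetical, and a real workflow would derive these relationships from the GIS layers.

```python
import random

def cluster_sample(addr_to_segment, segment_neighbors, target_size, seeds=5, rng=None):
    """Grow a spatially clustered sample outward from random seed points.

    addr_to_segment:   dict mapping address point id -> road segment id
    segment_neighbors: dict mapping road segment id -> adjacent segment ids
    target_size:       sample size that must be exceeded before stopping
    """
    rng = rng or random.Random()

    # Reverse lookup: which address points sit on each segment.
    segment_to_addrs = {}
    for addr, seg in addr_to_segment.items():
        segment_to_addrs.setdefault(seg, set()).add(addr)

    # Step 1: randomly select the seed address points.
    all_addrs = list(addr_to_segment)
    sampled = set(rng.sample(all_addrs, min(seeds, len(all_addrs))))
    # Step 2: select the road segments associated with those points.
    segments = {addr_to_segment[a] for a in sampled}

    # Step 5: repeat steps 3 and 4 until the sample size is exceeded.
    while len(sampled) <= target_size:
        # Step 3: add adjacent connected road segments.
        grown = segments | {n for s in segments for n in segment_neighbors.get(s, ())}
        # Step 4: add the address points on all selected segments.
        new_sampled = sampled | {a for s in grown for a in segment_to_addrs.get(s, ())}
        if grown == segments and new_sampled == sampled:
            break  # connected network exhausted before reaching the target
        segments, sampled = grown, new_sampled
    return sampled, segments
```

Because whole segments are added at a time, the routine returns somewhat more points than the target, which matches the "until sample size is exceeded" wording of step 5.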

Page 9: Lowry colorado state address dataset data quality

Data Quality - Sampling Method

1. Randomly select 5 address points

Page 10: Lowry colorado state address dataset data quality

Data Quality - Sampling Method

2. Select road segments associated with address points

Page 11: Lowry colorado state address dataset data quality

Data Quality - Sampling Method

3. Select adjacent connected road segments

Page 12: Lowry colorado state address dataset data quality

Data Quality - Sampling Method

4. Select the address points associated with the selected road segments

Page 13: Lowry colorado state address dataset data quality

Data Quality - Sampling Method

5. Repeat steps 3 & 4 until sample size is exceeded

Page 14: Lowry colorado state address dataset data quality

Data Quality - Sampling Method

5. Repeat steps 3 & 4 until sample size is exceeded

Page 15: Lowry colorado state address dataset data quality

Data Quality - Sampling Method


Page 16: Lowry colorado state address dataset data quality

Data Quality: The DPS-1 Universe


Page 17: Lowry colorado state address dataset data quality

Data Quality: DPS-1 Sample Sites


Page 18: Lowry colorado state address dataset data quality

Data Quality: DPS-1 Sample Sites 1 & 2


Page 19: Lowry colorado state address dataset data quality

Data Quality: DPS-1 Sample Sites 3, 4, and 5


Page 20: Lowry colorado state address dataset data quality

Data Quality - Completeness

•Omissions: correct location which is missed (a point present in OIT data but missing in DPS data)
•Commission: a location point created in error (a point present in DPS data which does not exist in OIT data)
•Omissions and Commissions are defined based on the assumption that OIT data is correct

Results

•We weight the omissions and commissions equally using this formula:

0.5(OmissionPct) + 0.5(CommissionPct) = Overall Percent Score

•Apartments Only = 70.91%
•Houses and Commercial = 89.59%
•All = 75.68%
•Not reflective of all DPS addresses
•Apartment inaccuracies sway aggregate percentage heavily
•Apartments greatest area of concern
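The weighted score can be sketched as follows. The slides do not define OmissionPct and CommissionPct precisely. Since higher overall scores are reported as better, this sketch assumes each is the percentage of points free of that error type. It also assumes points can be matched by a shared identifier, whereas the actual comparison may have been spatial. The function name `completeness_score` is hypothetical.

```python
def completeness_score(reference_ids, tested_ids):
    """Equal-weight completeness score, treating the reference data as correct.

    reference_ids: identifiers of points in the reference data (OIT in the slides)
    tested_ids:    identifiers of points in the data under test (DPS in the slides)
    """
    reference, tested = set(reference_ids), set(tested_ids)

    omissions = reference - tested      # in the reference, missing from the tested data
    commissions = tested - reference    # in the tested data, absent from the reference

    # Assumption: each percentage is the share of points WITHOUT that error.
    omission_pct = 100 * (1 - len(omissions) / len(reference))
    commission_pct = 100 * (1 - len(commissions) / len(tested))

    overall = 0.5 * omission_pct + 0.5 * commission_pct
    return omission_pct, commission_pct, overall

# Toy data: two reference points are missing and one extra point was created in error.
reference = range(1, 11)                 # points 1..10
tested = [1, 2, 3, 4, 5, 6, 7, 8, 11]
print(completeness_score(reference, tested))
```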

Page 21: Lowry colorado state address dataset data quality

Data Quality - Positional Accuracy

•OIT locations and DPS locations compared spatially
•Line segments are created to link each DPS location to its correlating OIT point
•Severe errors primarily present in apartment locations
•Few true house errors; most are inconsistencies with OIT points due to use of laser range finders

Issues with Apartments

•Stacking: many apartments stacked on top of each other in one location
•Consequently, lack of spatial differentiation is present
•Spatial inaccuracy is significant

$$1.7308 \times \sqrt{\frac{\Delta_1^2 + \Delta_2^2 + \Delta_3^2 + \dots + \Delta_n^2}{n}}$$

Where:
● 1.7308 = multiplier that converts the horizontal root-mean-square error to accuracy at the 95% confidence level
● ∆ = Distance (Feet)
● n = Number of Distances

Houses and Commercial = 38 feet horizontal accuracy at 95% confidence

Apartments = 125 feet horizontal accuracy at 95% confidence

All = 105 feet horizontal accuracy at 95% confidence

•We again see the apartments swaying the overall results, while houses feature far less error
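A minimal sketch of the accuracy calculation, assuming the link-line lengths between paired points are already available as a list of distances in feet. The function name `horizontal_accuracy_95` and the sample distances are illustrative only and are not taken from the study.

```python
import math

def horizontal_accuracy_95(distances_ft):
    """Horizontal accuracy at 95% confidence from point-to-point distances.

    distances_ft: distance (feet) between each tested point and its reference point
    """
    n = len(distances_ft)
    # Root-mean-square error of the horizontal distances.
    rmse = math.sqrt(sum(d * d for d in distances_ft) / n)
    # 1.7308 scales the RMSE to the 95% confidence level, as in the slide's formula.
    return 1.7308 * rmse

# Hypothetical link-line lengths, in feet.
sample_distances = [12.0, 8.5, 30.2, 4.1, 19.7]
print(round(horizontal_accuracy_95(sample_distances), 1))
```

Because the distances are squared before averaging, a handful of large errors (such as stacked apartment points) pull the result up sharply, which is consistent with the gap between the apartment and house figures.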

Page 22: Lowry colorado state address dataset data quality

Data Quality - Logical Consistency

•DPS points are geocoded to Denver Public Road data.
•Lines are used to link geocoded points to respective actual DPS points
•Goal is to identify logical consistency errors in DPS data; however, points are geocoded to Denver Public Roads data, thus errors could arise from either side

•Sequential Error: errors in the sequence/order of address numbers
•Parity Error: errors in odd/even positioning of address locations and address numbers
•Out-Of-Range: errors in the placement of the address point well beyond the range allowed in the road centerline data for the same section of road sampled
•Anomaly: an inconsistency in the sequence, parity, or range of an address point, but which is not inconsistent with verified field values
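The first three error types lend themselves to simple automated checks. The sketch below is an illustration only. It assumes each address point carries a house number and a distance along its road segment, and that each side of a segment has a low/high address range and an expected parity. The `AddressPoint` class, the `check_segment_side` function, and the strict range test are assumptions, and a real check might allow a tolerance before flagging a point as "well beyond" the range.

```python
from dataclasses import dataclass

@dataclass
class AddressPoint:
    number: int      # house number
    measure: float   # distance along the road segment from its start

def check_segment_side(points, low, high, parity):
    """Flag parity, out-of-range and sequence problems on one side of a segment.

    low, high: address range from the road centerline data
    parity:    "even" or "odd", the expected parity for this side
    Assumes house numbers increase with distance along the segment.
    """
    expected_remainder = 0 if parity == "even" else 1
    errors = []
    previous = None
    for p in sorted(points, key=lambda pt: pt.measure):
        if p.number % 2 != expected_remainder:
            errors.append((p.number, "parity"))
        if not low <= p.number <= high:
            errors.append((p.number, "out-of-range"))
        if previous is not None and p.number < previous:
            errors.append((p.number, "sequence"))
        previous = p.number
    return errors

side = [AddressPoint(100, 0), AddressPoint(104, 10), AddressPoint(102, 20),
        AddressPoint(107, 30), AddressPoint(250, 40)]
print(check_segment_side(side, low=100, high=198, parity="even"))
```

Flags raised this way are candidates rather than confirmed errors: a point that fails a check but agrees with verified field values would fall under the anomaly category instead.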


Page 23: Lowry colorado state address dataset data quality

Data Quality - Temporal Quality

•Temporal quality assesses the frequency and types of modifications and updates made to the data set
•11-16-2010 to 10-24-2013
•Improvements can still be made
•It is important to begin tracking timing difference between data collection and data updating


Page 24: Lowry colorado state address dataset data quality

Data Quality - Thematic Accuracy

•Thematic errors are errors present in the attribution of each point
•Example: 310 Blake Street is in fact 312 Blake Street
•9 errors were found in private home neighborhoods of which there are 436
•Selected sample of 25 errors from apartment locations
•15 Duplicates
•6 Incomplete Addresses
•4 Does Not Exist
•Apartments again the biggest culprit
•Further investigation into the level of thematic error in apartment complexes may be necessary to sufficiently characterize the quality, but may also be significantly ambiguous
•Best guess in determining whether it is a thematic error (wrong address number) or another type of error (positional accuracy error, logical consistency anomaly) is suspect, esp. w/apartments
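The counts above can be turned into shares with a few lines of arithmetic. The breakdown of the 25 sampled apartment errors follows directly from the slide. The error rate for private homes is shown only under the assumption that 436 is the number of address points checked, which the slide does not state explicitly.

```python
from collections import Counter

# Error categories in the sample of 25 apartment errors (counts from the slide).
apartment_errors = Counter({
    "duplicate": 15,
    "incomplete address": 6,
    "does not exist": 4,
})
total = sum(apartment_errors.values())
shares = {kind: 100 * count / total for kind, count in apartment_errors.items()}

# Assumption: 436 counts the address points checked in private home neighborhoods.
home_error_rate = 100 * 9 / 436

for kind, share in shares.items():
    print(f"{kind}: {share:.0f}% of sampled apartment errors")
print(f"private homes: {home_error_rate:.1f}% thematic error rate (assumed denominator)")
```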

Page 25: Lowry colorado state address dataset data quality

Data Quality - Thematic Accuracy


Page 26: Lowry colorado state address dataset data quality

Questions?

Thank You!