Moon Landing or Safari?
A Study of Systematic Errors and their Causes in Geographic Linked Data
Krzysztof Janowicz1, Yingjie Hu1, Grant McKenzie2, Song Gao1, Blake Regalia1,
Gengchen Mai1, Rui Zhu1, Benjamin Adams3, and Kerry Taylor4
2016/10/01
1STKO Lab, University of California, Santa Barbara, USA
2Department of Geographical Sciences, University of Maryland, USA
3Centre for eResearch, The University of Auckland, New Zealand
4Australian National University, Australia
Linked Data
Linked Data represents data as collections of intra- and inter-linked graphs.
The nodes and edges of these graphs are identified by Internationalized Resource
Identifiers (IRIs). Linked Data is built upon the Resource Description Framework
(RDF), enabling Web documents and services to share structured data about anything.
Linked Data Significance
Linked Data is already in very wide use; it powers many ‘smart’ query services.
It is revolutionizing data publishing and retrieval.
Linked Data Significance
The Linked Data cloud grows every year, but it suffers from data quality
issues, limited availability, and a lack of data persistence. Data quality and
maintenance are known to be the most difficult issues facing data publishers.
Geographic Linked Data
Geographic data is one of the primary nexuses for structured data on the
world-wide web.
Data Scientists
As Geographic Information Scientists, it is our responsibility to:
• assess the quality of structured geo-data on the web
• discover systematic errors
• identify their root causes
• and publish our recommendations for best practices
Our motivation is only to improve data quality, not to criticize others for falling
victim to these errors.
Most of these errors are common. They tend to arise from easily overlooked
qualities of geographic information.
Errors
We have broken down systematic errors into the following categories:
1. Triplification and Extraction
2. Improper use of ontologies / Limited understanding of domain
3. Designing new ontologies / Oversimplified conceptual models
4. Data accuracy / Lack of ‘uncertainty’ framework
Triplification
(1) “Triplification” typically refers to the transformation of flat data into RDF.
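As a minimal sketch of what triplification of flat data can look like, the snippet below turns one CSV row into N-Triples lines using the W3C Basic Geo predicates. The `http://example.org/place/` namespace, the column names, and the sample row are assumptions made for illustration; they do not come from any particular dataset or tool.

```python
import csv
import io

# Hypothetical subject namespace, chosen only for illustration.
BASE = "http://example.org/place/"
# W3C Basic Geo vocabulary namespace.
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"


def row_to_ntriples(row):
    """Turn one flat record into a list of N-Triples lines."""
    subject = f"<{BASE}{row['id']}>"
    triples = []
    for column, predicate in (("lat", GEO + "lat"), ("long", GEO + "long")):
        # float() raises ValueError on malformed input instead of guessing.
        value = float(row[column])
        triples.append(f'{subject} <{predicate}> "{value}"^^<{XSD_DECIMAL}> .')
    return triples


# Illustrative flat input (assumed column layout: id, lat, long).
flat_data = "id,lat,long\nSanta_Barbara,34.4258,-119.7142\n"
for record in csv.DictReader(io.StringIO(flat_data)):
    for line in row_to_ntriples(record):
        print(line)
```

Each column is mapped to its predicate explicitly, so a value cannot silently end up under the wrong property.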
Natural Language Processing
(2) Extraction of semantically-rich semi-structured or unstructured data using
natural language processing and machine learning; e.g., DBpedia, FRED1.
Anakin Skywalker was a male human born on Tatooine who became a Jedi
Knight, and later served the Galactic Empire as Darth Vader.
1 http://wit.istc.cnr.it/stlab-tools/fred/demo
Triplification Errors
(1) and (2) are both liable to the same types of errors that can occur during
the extraction & conversion of the source data from its original format.
Time to investigate for errors! How does one begin searching for systematic
errors in world-wide geographic data? By using a map!
World Map Image
(Regarding the previous slide):
There is no base-map on this image, yet you can clearly recognize it as a map of
the world. The Linked Data cloud has high spatial coverage!
The large “X” in the center of the map can be blamed on a parsing error. This
can happen when one of a coordinate’s decimal values is reused for both the
latitude and the longitude, which places the point at (X, X) or (Y, Y).
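One way to look for this pattern is to flag points whose latitude and longitude have the same magnitude. This is a heuristic sketch, not the method used in the study: the function name and tolerance are hypothetical, and comparing absolute values (to catch both diagonals of the “X”) is an added assumption. Genuine places can fall on these diagonals, so flagged points are candidates for review rather than confirmed errors.

```python
def looks_like_reused_value(lat, lon, tolerance=1e-9):
    """Return True when latitude and longitude share the same magnitude."""
    return abs(abs(lat) - abs(lon)) <= tolerance


# Illustrative points as (latitude, longitude) pairs.
points = [(34.4258, -119.7142), (23.7078, 23.7078), (12.5, -12.5)]
suspects = [p for p in points if looks_like_reused_value(*p)]
print(suspects)
```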
World Map Image
(Regarding the previous slide):
Notice the grid-like structure in Russia, where points snap to regularly spaced
positions. This is the result of decimal truncation, a process that forces a
floating-point value into an integer.
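A simple screen for truncation is to measure how many points in a dataset have whole-number coordinates on both axes. The sketch below assumes coordinates are already parsed into floats; the function names and sample values are hypothetical.

```python
def is_truncated(lat, lon):
    """Return True when both coordinates are whole numbers."""
    return float(lat).is_integer() and float(lon).is_integer()


def truncated_share(points):
    """Fraction of (lat, lon) points whose coordinates are both integers."""
    if not points:
        return 0.0
    return sum(is_truncated(lat, lon) for lat, lon in points) / len(points)


# Illustrative sample: one truncated point, one with decimals intact.
sample = [(55.0, 37.0), (55.75, 37.62)]
print(truncated_share(sample))
```

An unusually high share for one region or one source suggests that decimals were dropped during conversion rather than that the places really sit on a one-degree grid.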
Lastly, see those ghostly images of land masses where there shouldn’t be land
masses? These are reflections of New Zealand and Australia mirrored about the
Equator; we also found evidence of horizontal mirroring. There are two
explanations for this: (1) negative signs (or a lack thereof); (2) improper
parsing of quadrant identifiers; e.g., Oeste (the Spanish word for West) starts
with an ‘O’, but the parser throws this out and the longitude gets flipped onto
the other side of the globe.
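The quadrant problem can be sketched as a parser that maps hemisphere letters to signs and refuses to guess when it meets a letter it does not know. The letter table, the accepted input pattern, and the function name below are assumptions for illustration. Treating ‘O’ as West is only valid for Spanish-language sources; in German, for example, ‘O’ (Ost) means East, so the source language has to be known.

```python
import re

# Assumed hemisphere letters: English N/S/E/W plus Spanish 'O' (Oeste = West).
SIGN_BY_LETTER = {"N": 1, "S": -1, "E": 1, "W": -1, "O": -1}

# Accepts inputs such as "70.5 O" or "41.3° S" (magnitude, optional degree sign, letter).
PATTERN = re.compile(r"\s*(\d+(?:\.\d+)?)\s*°?\s*([A-Za-z])\s*")


def parse_coordinate(text):
    """Parse a magnitude followed by a hemisphere letter into a signed value."""
    match = PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognized coordinate: {text!r}")
    magnitude, letter = match.groups()
    letter = letter.upper()
    if letter not in SIGN_BY_LETTER:
        # Fail loudly rather than dropping the letter and keeping a positive value.
        raise ValueError(f"unknown hemisphere letter: {letter!r}")
    return SIGN_BY_LETTER[letter] * float(magnitude)


print(parse_coordinate("70.5 O"))
print(parse_coordinate("41.3° S"))
```

Raising an error on an unknown letter is the design choice that matters here: silently discarding the identifier is what mirrors the point across the globe.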
Problem Essentials
Lessons Learned:
If triplification software does not account for the full range of variations in
its input, unexpected geometries may occur.
Coordinate discrepancy rectangularization2
2 http://dbpedia.org/page/Solar_Star
Ontology Use & Domain Errors
Ontology Fertility
Apparently, the Moon Landing took place in Algeria. So what’s the deal? Was it
a Moon Landing or a Safari?
dbr:Tranquility_Base geo:lat 0.713889 ; geo:long 23.7078 .
The W3C Basic Geo specification declares WGS84 as the coordinate reference system, but
this is not enforced through axiomatization. Nothing prevents geo:lat and geo:long from
being used to represent locations on any celestial body, not just Earth: the Moon,
Mars, Tatooine, etc.
The oversimplification of vocabularies or schemas (for making publication easier)
can lead to the incorrect usage of an ontology.
Domain Error
Let’s perform a simple, typical, spatial query using Linked Data:
How many people live around the Gulf of Guinea?
Population = 7.6 billion
According to our query results, the Gulf of Guinea has the highest population density
in the world... How can this be? Well, because we didn’t expect planet Earth to be
located in its own reference system! Earth has a population value, so it gets counted
in our results as if it were just another populated place.
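The effect can be reproduced with a toy bounding-box query, sketched below. Everything in it is hypothetical: the `Resource` record, the type labels, the bounding box, the town, and the assumption that the planet resource carries the coordinates (0, 0). Only the 7.6 billion population figure is taken from the slide.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    kind: str          # hypothetical type label, e.g. "City" or "Planet"
    lat: float
    lon: float
    population: int


def population_in_box(resources, south, west, north, east, allowed_kinds=None):
    """Sum population inside a bounding box, optionally restricted by type."""
    total = 0
    for r in resources:
        inside = south <= r.lat <= north and west <= r.lon <= east
        if inside and (allowed_kinds is None or r.kind in allowed_kinds):
            total += r.population
    return total


# Invented records; the planet is assumed to sit at (0, 0).
data = [
    Resource("Town A", "City", 4.0, 7.0, 1_000),
    Resource("Earth", "Planet", 0.0, 0.0, 7_600_000_000),
]
box = dict(south=-5.0, west=-5.0, north=10.0, east=12.0)

naive = population_in_box(data, **box)
filtered = population_in_box(data, **box, allowed_kinds={"City"})
print(naive, filtered)
```

Restricting the query by type guards against this particular surprise, but it only works if the types in the data are themselves reliable.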
Data Quality via Ontology Tradeoff
Lessons Learned:
It is critical for data publishers to fully understand an ontology’s intended uses
when selecting one to construct their Linked Data.
Lifting data is not trivial; it needs to involve both domain experts and
experienced Linked Data developers.
All spatial data should have a CRS, but this imposes another hurdle-to-entry
for data publishers. Too little restriction threatens data quality; too much
deters data publishers.
Discrepancies among data sources and a lack of provenance information are toxic
to researchers, who cannot ascertain the data’s reliability.
Modeling Errors
Modeling Errors
DBpedia shows 1.8k 0-degree persons, 371k 1-degree persons, and 31k
2-degree persons. Higher-degree persons may result from a lack of information
about their birth / death place, or may be fictitious characters identified as
type Person. 0-degree persons indicate modeling errors.
Terry Fox
“Terry Fox” is one of those 0-degree persons; his resource includes spatial
coordinates. It looks like the person Terry Fox was accidentally matched to
the statue of Terry Fox.
Terry Fox
Plotting the coordinates on a map reveals a place called “Mt. Terry Fox
Provincial Park”. This clearly demonstrates the consequences of a modeling
error.
Data Accuracy and an Uncertainty Framework
Accuracy
There are 136,964 combinations of geometries3 among places with cardinal
direction relations on DBpedia. According to our analysis, using 8 equal
divisions (π/4 each) of the compass rose, nearly 1/3 of these relations are
inaccurate.
3 Formatted in Well-Known Text: Geographic coordinates
Accuracy
Part of the blame for inaccurate cardinal direction relations can be placed on
using point geometries for regions, which makes the relation true in only a
portion of the cases.
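A check of this kind can be sketched by computing the bearing from one point to another and assigning it to one of 8 compass sectors of 45° (π/4) each. The use of the great-circle initial bearing and the centering of each sector on its direction are assumptions of this sketch; the slides do not say how the analysis computed directions.

```python
import math

DIRECTIONS = ["north", "northeast", "east", "southeast",
              "south", "southwest", "west", "northwest"]


def initial_bearing(lat1, lon1, lat2, lon2):
    """Great-circle initial bearing from point 1 to point 2, in degrees [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.degrees(math.atan2(y, x)) % 360.0


def cardinal_direction(lat1, lon1, lat2, lon2):
    """Direction of point 2 as seen from point 1, using 8 sectors of 45 degrees."""
    bearing = initial_bearing(lat1, lon1, lat2, lon2)
    # Shift by half a sector so that each sector is centered on its direction.
    sector = int((bearing + 22.5) // 45) % 8
    return DIRECTIONS[sector]


print(cardinal_direction(0.0, 0.0, 1.0, 0.0))
print(cardinal_direction(0.0, 0.0, 0.0, 1.0))
```

A stated relation can then be compared with the computed sector. Because both places are reduced to points, a mismatch may reflect the point-for-region simplification rather than a wrong statement.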
Uncertainty
Decimal and coordinate values can be misleading; their precision implies
accuracy to the degree of the least significant digit; e.g., the centroid of Santa
Barbara is accurate to 1.1 microns:
POINT(-119.71416473389 34.425834655762)
Also, it has an area of 108.69662101458125 km², which is accurate to a few
hundred femtometers (on the order of 10^−13 m).
Clearly, there is a need for an uncertainty framework when it comes to providing
measurement data.
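The micron figure can be checked with a short calculation that converts the number of decimal places in a coordinate into a distance on the ground. The sketch assumes roughly 111.32 km per degree of arc along the Equator; the constant and function name are illustrative, and a degree of longitude is shorter away from the Equator, so the result is an upper bound for longitudes.

```python
from decimal import Decimal

# Assumed approximate length of one degree of arc along the Equator, in meters.
METERS_PER_DEGREE = 111_320.0


def implied_resolution_m(coordinate_text):
    """Ground distance implied by the last decimal digit of a coordinate string."""
    exponent = Decimal(coordinate_text).as_tuple().exponent
    decimals = max(0, -exponent)
    return METERS_PER_DEGREE * 10.0 ** (-decimals)


# Longitude of the Santa Barbara centroid quoted above (11 decimal places).
print(implied_resolution_m("-119.71416473389"))
```

Working from the original string rather than a parsed float matters here, because the count of written digits is exactly the information that implies the precision.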
Conclusion
Conclusions:
Geographic Information plays a key role in interlinking structured data on the
Web. Improving geo-data quality is pivotal to improving the functionality and
reliability of Linked Data for science, research, applications, etc.
We identified systematic errors in geographic Linked Data, discussed their
causes, and suggested ways to improve its quality and reliability.
Striking the balance between (a) keeping models simple and easy to use so that
they enable streamlined data publishing processes and (b) avoiding hazardous
oversimplifications remains a major challenge to be addressed in future work.