STANFORD UNIVERSITY LIBRARIES
The Linked Data Snowball or Why We Need Reconciliation
April 4th, 2016
THE AAC / GETTY WORKSHOP ON RECONCILIATION OF LINKED OPEN DATA
Rob Sanderson / [email protected] / @azaroth42
web.stanford.edu/~azaroth/#me
[email protected] / +azaroth42
orcid: 0000-0003-4441-6852
http://www.informatik.uni-trier.de/~ley/pers/hd/s/Sanderson:Robert
http://academic.research.microsoft.com/Author/2765999
http://www.scopus.com/authid/detail.url?authorId=8988953600
www.researchgate.net/profile/Rob_Sanderson
facebook.com/rob.sanderson / linkedin.com/pub/robert-sanderson/1/172/5a6/
[email protected] / [email protected]
public.lanl.gov/rsanderson / gondolin.hist.liv.ac.uk/~azaroth
[email protected] / [email protected]
Linked Data?
1. Use URIs as names for things
2. Use HTTP URIs so that people can look up those names
3. When someone looks up a URI, provide useful information, using the standards
4. Include links to other URIs, so they can discover more things
5. Link your data to other people's data to provide context
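The principles above can be sketched with a toy, in-memory "web". This is only an illustration: the `example.org` and `example.net` URIs, the labels, and the `lookup` and `discover` helpers are all hypothetical, and a dictionary lookup stands in for a real HTTP request.

```python
# Toy in-memory "web": each URI (principle 1) maps to the description
# that would be returned when it is looked up (principles 2 and 3).
# All URIs and descriptions here are hypothetical.
WEB = {
    "http://example.org/person/1": {
        "label": "Example Painter",
        "links": ["http://example.org/place/1"],
    },
    "http://example.org/place/1": {
        "label": "Example City",
        # Principle 5: a link into someone else's data.
        "links": ["http://example.net/country/1"],
    },
    "http://example.net/country/1": {"label": "Example Country", "links": []},
}


def lookup(uri):
    """Stand-in for an HTTP GET on the URI."""
    return WEB[uri]


def discover(start):
    """Follow links (principle 4) and return every reachable URI in visit order."""
    seen, queue = [], [start]
    while queue:
        uri = queue.pop(0)
        if uri in seen:
            continue
        seen.append(uri)
        queue.extend(lookup(uri)["links"])
    return seen


reachable = discover("http://example.org/person/1")
```

Starting from one URI, following links reaches data published under a different domain, which is the "context" that principle 5 asks for.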
Why So Many?
• Do I know the URI, or can I find it? No → new URI
• Understand and agree with the model used? No → new URI
• Understand and agree with the description? No → new URI
• Agree the URI identifies the same entity? No → new URI
• Agree description is complete? No → new URI
• Yes to all → Hooray, you reused a URI! Now start again with the next entity :(
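The checklist on this slide can be read as a cascade in which any single "No" ends in a newly minted URI. The sketch below encodes that reading; the function names and the sample answers are hypothetical, and treating every "No" as "mint a new URI" is an interpretation of the flowchart rather than something the slide states in words.

```python
# The five questions from the slide, in order.
QUESTIONS = [
    "Do I know the URI, or can I find it?",
    "Understand and agree with the model used?",
    "Understand and agree with the description?",
    "Agree the URI identifies the same entity?",
    "Agree description is complete?",
]


def decide(answers):
    """Return ("reuse", None) if every answer is True, otherwise
    ("mint", question) naming the first question answered "No"."""
    if len(answers) != len(QUESTIONS):
        raise ValueError("one answer per question is required")
    for question, answer in zip(QUESTIONS, answers):
        if not answer:
            return ("mint", question)
    return ("reuse", None)


def count_new_uris(entities):
    """Count how many entities end up with a newly minted URI."""
    return sum(1 for answers in entities if decide(answers)[0] == "mint")


# Hypothetical answers for three entities.
entities = [
    [True, True, True, True, True],       # every check passes: reuse
    [True, False, True, True, True],      # disagrees with the model
    [False, False, False, False, False],  # could not even find a URI
]
new_uris = count_new_uris(entities)
```

Because reuse needs five consecutive "Yes" answers per entity, even modest disagreement at each step makes new URIs the common outcome.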
Many Special and Unique Snowflakes
Become a Huge Snowball of Technical Debt
Option 1: Balance the Equation
Cost(Create URI) + Cost(Maintain URI)
<=
Cost(Find Good URI) + Cost(Understand Model) + Cost(Understand Content)
+ min( Risk(Reliability) + Cost(Network Latency), Risk(Out of Date) + Cost(Cache Content) )
- Value(Connected Graph)
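To make the inequality concrete, here is a minimal sketch that evaluates both sides. It assumes the reading in which creating and maintaining a URI sits on the left of `<=` and the cost of reuse on the right. The function names and every number are hypothetical, chosen only to show the arithmetic.

```python
def cost_of_new_uri(create, maintain):
    """Left-hand side: Cost(Create URI) + Cost(Maintain URI)."""
    return create + maintain


def cost_of_reuse(find, understand_model, understand_content,
                  risk_reliability, network_latency,
                  risk_out_of_date, cache_content, graph_value):
    """Right-hand side of the inequality."""
    # Either dereference live (reliability risk + latency) or keep a
    # cached copy (staleness risk + caching cost), whichever is cheaper.
    access = min(risk_reliability + network_latency,
                 risk_out_of_date + cache_content)
    return find + understand_model + understand_content + access - graph_value


# Hypothetical unit costs.
new = cost_of_new_uri(create=1, maintain=2)
reuse = cost_of_reuse(find=3, understand_model=4, understand_content=2,
                      risk_reliability=2, network_latency=1,
                      risk_out_of_date=1, cache_content=1, graph_value=5)
mints_new_uri = new <= reuse
```

With these made-up numbers the left side is 3 and the right side is 6, so minting a new URI is the cheaper choice. Balancing the equation means lowering the terms on the right or raising the value of the connected graph.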
Option 1 Likelihood
Botticelli: http://vocab.getty.edu/ulan/500015254 :)
Botticelli: http://vocab.getty.edu/ulan/500015254 :(
Option 2: Reconciliation
YCBA's URIs / Princeton's URIs
YCBA's Entities / Princeton's Entities
Shared Entities but not Shared URIs
1. Algorithmically discover this intersection given the descriptions of the entities
2. Assert that the entity which two URIs identify is actually the same entity
Option 2a: Reconciliation (distributed authority)
Option 2b: Reconciliation (centralized authority)
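The two reconciliation steps (discover the intersection from the descriptions, then assert that two URIs identify the same entity) can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used by any institution. The URIs, names, birth years, the `normalize` and `similarity` helpers, and the 0.9 threshold are all hypothetical. Only `difflib.SequenceMatcher` from the Python standard library is real.

```python
from difflib import SequenceMatcher

# Hypothetical records: the URIs, names and years are invented.
ycba = {
    "http://example.org/ycba/person/1": {"name": "Smith, Jane", "born": 1800},
    "http://example.org/ycba/person/2": {"name": "Doe, John", "born": 1750},
}
princeton = {
    "http://example.org/princeton/agent/9": {"name": "Jane Smith", "born": 1800},
    "http://example.org/princeton/agent/7": {"name": "Roe, Richard", "born": 1750},
}


def normalize(name):
    """Lowercase, drop commas, and sort the name parts so that
    "Smith, Jane" and "Jane Smith" compare equal."""
    return " ".join(sorted(name.lower().replace(",", " ").split()))


def similarity(a, b):
    """Score in [0, 1]: name similarity, zeroed if birth years differ."""
    if a["born"] != b["born"]:
        return 0.0
    return SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()


def reconcile(left, right, threshold=0.9):
    """Step 1: find candidate pairs. Step 2: assert they are the same."""
    links = []
    for left_uri, left_desc in left.items():
        for right_uri, right_desc in right.items():
            if similarity(left_desc, right_desc) >= threshold:
                links.append((left_uri, "owl:sameAs", right_uri))
    return links


links = reconcile(ycba, princeton)
```

The output is a list of assertions, which could be published by either institution (distributed authority) or by a shared service (centralized authority).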
Benefits of Reconciliation
End User:
• Has access to more information, more easily, improving research, discovery and navigation
• Potential for new UIs, new research questions, reasoning
Institution:
• Efficiency (= reduced cost) and improved quality of description
• Increased prestige when descriptions are reused
• Usage across the network is valuable business intelligence
Community:
• Network effects spread faster and further, increasing awareness of cultural heritage
• Gives easier access to other communities' data
Real Benefit of Reconciliation
Reconciliation is a damage-limiting step for the network, on the way to balancing the equation in Option 1
By linking entity descriptions together:
• the cost of discovery and understanding is reduced
• the costs of creating and maintaining the resources are shared across the community, not duplicated
• the value of the connected graph is increased
• the likelihood of new entities (requiring reconciliation) is reduced
But How Can A Machine Know??
Algorithms won't be perfect, but can be good enough.
• What use cases will the reconciled data be used to fulfill?
• What is the cost of a false positive for those use cases?
Precision: What % of matches are correct?
Recall: What % of the possible matches were found?
Can make trade-offs of precision vs recall for different use cases.
Machine can record its certainty, and policy can provide a threshold.
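The precision and recall definitions, and the idea of a policy threshold applied to the machine's recorded certainty, can be sketched like this. The pairs, certainty scores and thresholds are hypothetical.

```python
def precision_recall(predicted, truth):
    """predicted, truth: sets of (uri_a, uri_b) pairs.
    Precision: share of predicted matches that are correct.
    Recall: share of the true matches that were found."""
    correct = predicted & truth
    precision = len(correct) / len(predicted) if predicted else 1.0
    recall = len(correct) / len(truth) if truth else 1.0
    return precision, recall


# Hypothetical matcher output: (pair, certainty recorded by the machine).
scored = [
    (("a1", "b1"), 0.95),
    (("a2", "b2"), 0.80),
    (("a3", "b9"), 0.70),  # a false positive
    (("a4", "b4"), 0.60),
]
# Hypothetical ground truth: four real matches, one of which is never proposed.
truth = {("a1", "b1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")}


def accepted(scored, threshold):
    """Policy: keep only matches at or above the certainty threshold."""
    return {pair for pair, certainty in scored if certainty >= threshold}


strict = precision_recall(accepted(scored, 0.75), truth)
lenient = precision_recall(accepted(scored, 0.50), truth)
```

In this made-up data the strict threshold gives precision 1.0 with recall 0.5, while the lenient one gives 0.75 for both. A use case where false positives are costly would choose the strict policy.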
How Can We Improve It?
Several different relationships to express similarity:
• owl:sameAs – always exactly the same (transitive)
• skos:exactMatch – the same for most purposes (transitive)
• skos:closeMatch – the same for some purposes (not transitive)
The context of a resource in the network is important
• Starting simple with high precision gives a better context, so the results can be used to bootstrap iteratively and incrementally
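The practical effect of transitivity is that links made with a transitive predicate can be chained into groups, while skos:closeMatch links cannot. A minimal union-find sketch, assuming hypothetical `ex:` identifiers and a hypothetical `equivalence_classes` helper:

```python
def equivalence_classes(links, predicate):
    """Group URIs connected by `predicate`, which must be symmetric and
    transitive (owl:sameAs or skos:exactMatch). Links using any other
    predicate, such as skos:closeMatch, are never chained."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for subject, link_predicate, obj in links:
        # Register both ends so unlinked URIs still appear as singletons.
        root_s, root_o = find(subject), find(obj)
        if link_predicate == predicate:
            parent[root_s] = root_o

    groups = {}
    for uri in parent:
        groups.setdefault(find(uri), set()).add(uri)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical links between four concepts.
links = [
    ("ex:a", "skos:exactMatch", "ex:b"),
    ("ex:b", "skos:exactMatch", "ex:c"),
    ("ex:c", "skos:closeMatch", "ex:d"),
]
classes = equivalence_classes(links, "skos:exactMatch")
```

Here `ex:a`, `ex:b` and `ex:c` end up in one group because the two exactMatch links chain, while `ex:d` stays on its own: being close to `ex:c` says nothing about its relationship to `ex:a`.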
Trust and Community
"Efficiency (= reduced cost) and improved quality of description"
• Efficiency comes from not duplicating descriptive effort...
• Which requires trusting other institutions in the community
• We need to work together, not...
Entities to Reconcile
As a community, we need to pick where to start.
Suggest starting with least controversial / most unique:
• Physical objects
• People
• Places
• Events (specific, like Exhibitions)
A small sub-domain (by time?) to make overlap more likely
Q. Can I Reconcile a String?
Named Entity Recognition: strings to things ("snowflake" = thing)
Reconciliation: things to things (thing = thing)
The Hard Question
How can we be more useful than DBpedia for our own entities?
• Focus on unique selling points
• Demonstrate value early, both internally and to the broader community
• By working together to increase the value of the network
STANFORD UNIVERSITY LIBRARIES
Thank You!
April 4th, 2016
Rob Sanderson / [email protected] / @azaroth42 web.stanford.edu/~azaroth/#me
[email protected] / +azaroth42 orcid: 0000-0003-4441-6852
http://www.informatik.uni-trier.de/~ley/pers/hd/s/Sanderson:Robert http://academic.research.microsoft.com/Author/2765999
http://www.scopus.com/authid/detail.url?authorId=8988953600 www.researchgate.net/profile/Rob_Sanderson
facebook.com/rob.sanderson / linkedin.com/pub/robert-sanderson/1/172/5a6/ [email protected] / [email protected]
public.lanl.gov/rsanderson / gondolin.hist.liv.ac.uk/~azaroth [email protected] / [email protected]
Thank You!
rsanderson@getty.edu
April 25th, 2016
Based on my slides from the Andrew W. Mellon Foundation Reconciliation Workshop
With recognition and thanks to all of the participants