Presentation given at a seminar in Yahoo.
Exploring Linked Data content through network analysis
Christophe Guéret (@cgueret)
Free University Amsterdam
Co-explorers: Stefan Schlobach, Shenghui Wang, Paul Groth, Frank van Harmelen
http://latc-project.eu http://www.vu.nl
November 23, 2011 Analysis of Linked Data 2/35
Outline of the talk
What is Linked Data?
What is there to be analysed?
Do we miss something?
New research directions and first results
Linked Data
http://www.flickr.com/photos/erikcharlton/3337465138
Linked Data (aka Semantic Web)
What is the problem?
Frank and Christophe publish some open data
Roi wants to combine and enrich it
Marvel icons: mermer, DeviantArt
Frank's table (on the WWW):
Kennissen    Staad
Christophe   Amsterdam
Peter        Barcelona
David        Parijs
Christophe's table (on the WWW):
Ville        Pays
Barcelone    Espagne
Paris        France
Amsterdam    Pays-Bas
What is the problem?
Data integration issue
“Kennissen”, “Staad”, “Ville”, “Pays”?
“Paris” = “Parijs”?
“Amsterdam” = “Amsterdam”?
A lot of work, and it must be redone on every update
Frank's table (Kennissen, Staad) + Christophe's table (Ville, Pays) = ?
A solution
Do data integration at the data level
Use, and re-use, unambiguous identifiers
Use meta-level descriptions of the identifiers
Proposal: use the Web as a platform
Identifiers = URIs
Descriptions = de-referenced documents
Frank publishes his data
Kennissen    Staad
Christophe   Amsterdam
Peter        Barcelona
David        Parijs
As a graph:
ex:Christophe rdf:type ex:Acquaintance
ex:Christophe ex:worksIn dbpedia:Amsterdam
ex:Peter rdf:type ex:Acquaintance
ex:Peter ex:worksIn dbpedia:Barcelona
ex:David rdf:type ex:Acquaintance
ex:David ex:worksIn dbpedia:Paris
Each of these lines is a “triple”
Use of compact URIs
dbpedia = http://dbpedia.org/resource/
ex = http://example.org/
rdf = http://www.w3.org/1999/02/22-rdf-syntax-ns#
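As a minimal sketch, Frank's table and the compact URIs on this slide can be written down as plain Python data. The prefixes come from the slide; the `expand` helper and the tuple representation are illustrative assumptions, not a specific RDF library's API.

```python
# Prefixes as listed on the slide.
PREFIXES = {
    "dbpedia": "http://dbpedia.org/resource/",
    "ex": "http://example.org/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(curie):
    """Expand a compact URI such as 'ex:Peter' into a full URI (hypothetical helper)."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local


# Frank's table as (subject, predicate, object) triples.
frank = [
    ("ex:Christophe", "rdf:type", "ex:Acquaintance"),
    ("ex:Christophe", "ex:worksIn", "dbpedia:Amsterdam"),
    ("ex:Peter", "rdf:type", "ex:Acquaintance"),
    ("ex:Peter", "ex:worksIn", "dbpedia:Barcelona"),
    ("ex:David", "rdf:type", "ex:Acquaintance"),
    ("ex:David", "ex:worksIn", "dbpedia:Paris"),
]

# Every term of every triple becomes a full, globally unambiguous URI.
expanded = [tuple(expand(term) for term in triple) for triple in frank]
print(expanded[1])
```

Because "Parijs" and "Paris" both end up as the same dbpedia URI, the matching problem from the earlier slides disappears at the data level.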
Christophe re-uses part of Frank's data to publish his own
Ville        Pays
Barcelone    Espagne
Paris        France
Amsterdam    Pays-Bas
Added to Frank's graph:
dbpedia:Amsterdam ex:isIn dbpedia:Netherlands
dbpedia:Barcelona ex:isIn dbpedia:Spain
dbpedia:Paris ex:isIn dbpedia:France
Roi adds some more information
Added to the combined graph of Frank and Christophe:
dbpedia:Netherlands ex:isIn dbpedia:Europe
dbpedia:Spain ex:isIn dbpedia:Europe
dbpedia:France ex:isIn dbpedia:Europe
ex:Acquaintance rdf:label “Conocido”@es
dbpedia:Amsterdam
Reasoning with Semantics (Bonus!)
dbpedia:Amsterdam ex:isIn dbpedia:Netherlands
dbpedia:Netherlands ex:isIn dbpedia:Europe
+
ex:isIn rdf:type owl:TransitiveProperty
=
dbpedia:Amsterdam ex:isIn dbpedia:Europe
Example usage
Materialize implicit information
Check for consistency
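Materializing implicit information for a transitive property can be sketched as a small fixed-point loop. This is an illustrative toy, not how a production reasoner works; the function name `materialize_transitive` is hypothetical and the triples are the ones from the slide.

```python
def materialize_transitive(triples, predicate):
    """Return the triples plus every fact implied by treating `predicate` as transitive."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        # All (subject, object) pairs currently linked by the predicate.
        pairs = [(s, o) for s, p, o in facts if p == predicate]
        for s1, o1 in pairs:
            for s2, o2 in pairs:
                # s1 -> o1 and o1 -> o2 imply s1 -> o2.
                if o1 == s2 and (s1, predicate, o2) not in facts:
                    facts.add((s1, predicate, o2))
                    changed = True
    return facts


explicit = [
    ("dbpedia:Amsterdam", "ex:isIn", "dbpedia:Netherlands"),
    ("dbpedia:Netherlands", "ex:isIn", "dbpedia:Europe"),
]
inferred = materialize_transitive(explicit, "ex:isIn")
print(sorted(inferred - set(explicit)))
```

The only new fact here is that dbpedia:Amsterdam is in dbpedia:Europe. The loop repeats until nothing new is added, so longer chains are covered as well.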
Rough estimate of size
295 data sets and 31 billion facts in the LOD Cloud
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
Lots of Data to analyze! :-)
http://www.flickr.com/photos/argonne/3323018571
But analyzing what exactly?
Table of facts published at different locations
A distributed Knowledge Base
Subject Predicate Object
ex:Christophe rdf:type ex:Acquaintance
ex:Christophe ex:worksIn dbpedia:Amsterdam
ex:Peter rdf:type ex:Acquaintance
... ... ...
Subject Predicate Object
dbpedia:Amsterdam ex:isIn dbpedia:Netherlands
dbpedia:Netherlands ex:isIn dbpedia:Europe
... ... ...
Subject Predicate Object
ex:Acquaintance rdf:label “Conocido”@es
... ... ...
Analysis workflow
1. Gather a snapshot of triples
2. Compute descriptive statistics
Top resources (subject, predicate, object)
Frequency of cross-link types (SP, SO, PO, ...)
Connected components
Path frequency
…
=> Already tricky, because the data is really big!
=> We should be able to get more out of the data
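On a toy snapshot, two of the statistics from this workflow can be sketched with the standard library alone. The triples are the ones from the example tables; the helper names are hypothetical, and a real snapshot would need a streaming or distributed approach rather than in-memory sets.

```python
from collections import Counter, defaultdict

snapshot = [
    ("ex:Christophe", "rdf:type", "ex:Acquaintance"),
    ("ex:Christophe", "ex:worksIn", "dbpedia:Amsterdam"),
    ("ex:Peter", "rdf:type", "ex:Acquaintance"),
    ("dbpedia:Amsterdam", "ex:isIn", "dbpedia:Netherlands"),
    ("dbpedia:Netherlands", "ex:isIn", "dbpedia:Europe"),
    ("ex:Acquaintance", "rdf:label", '"Conocido"@es'),
]


def top_resources(triples, position, n=3):
    """Most frequent terms at a position (0 = subject, 1 = predicate, 2 = object)."""
    return Counter(t[position] for t in triples).most_common(n)


def connected_components(triples):
    """Components of the undirected graph that links each subject to its object."""
    neighbours = defaultdict(set)
    for s, _, o in triples:
        neighbours[s].add(o)
        neighbours[o].add(s)
    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        # Depth-first traversal from an unvisited node.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components


print(top_resources(snapshot, 0, 1))
print(len(connected_components(snapshot)))
```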
Can we explain that?
Suggestions
Started the graph
General knowledge
Very well known
or that?
Suggestions
All published by Bio2RDF
Well aware of each other
Overlapping domain
Could we predict the impact of ...
DBpedia being down for a while?
SIOC renaming “User” to “UserAccount”?
Creating a dataset that turns out to be popular?
Analysing a set of triples is not enough
Are we overlooking something?
It's not only about the resources
Several entities related to the data
Data publishers/consumers, resources (such as ex:something), Web servers
Interactions between all of them
There are different scales
Triples level versus Resource groups level
Different data complexity at each scale
[Figure: the example graph of ex: and dbpedia: resources linked by ex:worksIn, ex:isIn, rdf:type and rdf:label]
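One way to move from the triple level to a resource-group level is to collapse every resource onto its namespace prefix and count the links between groups. Treating a prefix as a "resource group" is an assumption made for this sketch, and the helper names are hypothetical.

```python
from collections import Counter

triples = [
    ("ex:Christophe", "rdf:type", "ex:Acquaintance"),
    ("ex:Christophe", "ex:worksIn", "dbpedia:Amsterdam"),
    ("ex:Peter", "rdf:type", "ex:Acquaintance"),
    ("dbpedia:Amsterdam", "ex:isIn", "dbpedia:Netherlands"),
    ("dbpedia:Netherlands", "ex:isIn", "dbpedia:Europe"),
    ("ex:Acquaintance", "rdf:label", '"Conocido"@es'),
]


def group_of(term):
    """Prefix of a compact URI; literals are put in a group of their own."""
    if term.startswith('"'):
        return "literal"
    return term.split(":", 1)[0]


def group_level(triples):
    """Weighted links between groups: (subject group, object group) -> triple count."""
    links = Counter()
    for s, _, o in triples:
        links[(group_of(s), group_of(o))] += 1
    return links


for (source, target), weight in sorted(group_level(triples).items()):
    print(source, "->", target, weight)
```

Six triples shrink to four weighted links, which shows how the data complexity differs between the two scales.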
It is not a static network
Size and topology evolve over time
[Figure: snapshots of the network in 2007, 2008 and 2010]
Linked Data is a Complex System
Multiple scales of observation
Emergence of properties
The whole is more than the sum of the parts
=> Interactions/relations are important for understanding the system's behavior
=> We can benefit from a large body of research results from the study of Complex Systems
Initial findings and future work
Ya3hs3/2531493704 on Flickr
New analysis workflow
1. Gather a snapshot of triples
2. Gather information about other types of interaction
3. Create specific networks related to the research questions at hand
4. Run metrics, interpret results
The LOD is not what we think it is
LOD Cloud 2009/2010 vs BTC 2009 crawl
The crawled sample differs from the community-based view
LOD Cloud has lumpy structure
Evolution of LOD Cloud
Centrality changes
Increased density and connectivity
Christophe Guéret, Shenghui Wang, Paul Groth et al. (2011)
Multi-scale Analysis of the Web Of Data: A Challenge to the Complex System's Community
Advances in Complex Systems 14 (04)
The tools we need don't exist
We need to flatten the networks to study them
Some specific aspects of the system
Existence of implicit links
Multi-relational and dynamic
Distributed
Hypergraph of relations
Christophe Guéret, Shenghui Wang, Paul Groth et al. (2011)
Multi-scale Analysis of the Web Of Data: A Challenge to the Complex System's Community
Advances in Complex Systems 14 (04)
Influence between content and social networks
Generate and bind two networks
Measure evolution of degree, betweenness, clustering over time
Predict evolution
Shenghui Wang, Paul Groth (2010)
Measuring the dynamic bi-directional influence between content and social networks
Proceedings of the 9th International Semantic Web Conference (ISWC2010)
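Tracking how a node's metrics evolve over time can be sketched by recomputing them on one network snapshot per period. The snapshots below are invented for illustration and only one network is measured, so this shows the measurement step only, not the binding of a content network to a social network described in the cited paper.

```python
from itertools import combinations


def adjacency(edges):
    """Undirected adjacency sets built from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def clustering(adj, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    nbrs = adj.get(node, set())
    if len(nbrs) < 2:
        return 0.0
    pairs = list(combinations(sorted(nbrs), 2))
    linked = sum(1 for u, v in pairs if v in adj[u])
    return linked / len(pairs)


# Hypothetical yearly snapshots of a small network.
snapshots = {
    2008: [("ex:a", "ex:b")],
    2009: [("ex:a", "ex:b"), ("ex:a", "ex:c")],
    2010: [("ex:a", "ex:b"), ("ex:a", "ex:c"), ("ex:b", "ex:c")],
}

# For each year: (degree, clustering coefficient) of ex:a.
series = {}
for year, edges in sorted(snapshots.items()):
    adj = adjacency(edges)
    series[year] = (len(adj.get("ex:a", ())), clustering(adj, "ex:a"))

print(series)
```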
Result for conferences
Shenghui Wang, Paul Groth (2010)
Measuring the dynamic bi-directional influence between content and social networks
Proceedings of the 9th International Semantic Web Conference (ISWC2010)
Centrality to measure robustness
Map the BTC2010 crawl to two networks
Semantic network based on namespaces
Host network based on hostnames
Measure robustness as the variance in betweenness centrality
Find weak spots
Optimize networks to increase robustness
Christophe Guéret, Paul Groth, Frank Van Harmelen et al. (2010)
Finding the Achilles Heel of the Web of Data: using network analysis for link-recommendation
Proceedings of the 9th International Semantic Web Conference (ISWC2010)
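The robustness measure named on this slide can be sketched in pure Python: compute betweenness centrality with Brandes' algorithm, then take the variance over all nodes. The two toy networks are hypothetical, and reading a lower variance as "more robust" (no single node carries most shortest paths) is our interpretation of the slide, not a detail taken from the paper.

```python
from collections import deque
from statistics import pvariance


def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph given as adjacency sets."""
    score = {v: 0.0 for v in adj}
    for source in adj:
        stack = []
        preds = {v: [] for v in adj}   # predecessors on shortest paths
        sigma = {v: 0 for v in adj}    # number of shortest paths from source
        dist = {v: -1 for v in adj}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:  # breadth-first search
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:  # accumulate dependencies, farthest nodes first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}


def robustness(adj):
    """Variance of betweenness centrality over all nodes."""
    return pvariance(betweenness(adj).values())


# Hypothetical host networks: a star around one hub versus a ring.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
ring = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "a"}}

print(betweenness(star)["hub"], robustness(star), robustness(ring))
```

In the star, every shortest path between two leaves runs through the hub, which makes the hub a weak spot. In the ring the load is spread evenly and the variance drops to zero.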
Results on hostnames
Christophe Guéret, Paul Groth, Frank Van Harmelen et al. (2010)
Finding the Achilles Heel of the Web of Data: using network analysis for link-recommendation
Proceedings of the 9th International Semantic Web Conference (ISWC2010)
Results on namespaces
Christophe Guéret, Paul Groth, Frank Van Harmelen et al. (2010)
Finding the Achilles Heel of the Web of Data: using network analysis for link-recommendation
Proceedings of the 9th International Semantic Web Conference (ISWC2010)
Improving the network
Christophe Guéret, Paul Groth, Frank Van Harmelen et al. (2010)
Finding the Achilles Heel of the Web of Data: using network analysis for link-recommendation
Proceedings of the 9th International Semantic Web Conference (ISWC2010)
Conclusion
Take home message
Linked Data is not a simple knowledge base
Network analysis tools give new insights into the data
Results can be used to improve the network
Future work
Do resource-centric rather than graph-centric analysis (currently a big bottleneck)
Tackle the time aspect of the data
Find more analyses to perform and work out what they tell us