Upload
ansgar-scherp
View
1.107
Download
1
Tags:
Embed Size (px)
DESCRIPTION
Slides of my inauguration talk at the University of Mannheim in Germany in October 2012. Download this slide set to enjoy all animations.
Citation preview
Slide 1Ansgar Scherp – [email protected]
How to Juggle with more
than a Billion Triples?
Ansgar ScherpResearch Group on Data and Web Science
Universität MannheimOctober 2012
Image source: http://www.flickr.com/photos/pedromourapinheiro/2122754745/
Slide 2Ansgar Scherp – [email protected]
My thanks go to …
• Marianna• Simon Schenk• Carsten Saathoff• Thomas Franz• Thomas Gottron• Steffen Staab • Arne Peters• Bastian Krayer
• Daniel Eißing• Mathias Konrath• Daniel Schmeiß• Anton Baumesberger • Frederik Jochum • Alexander Kleinen
And many more …
Slide 3Ansgar Scherp – [email protected]
Scenario
• Tim plans to travel– from London – to a customer in Cologne
Slide 4Ansgar Scherp – [email protected]
Website of the German Railway
It works, why bother…?
Eurostar
DB
KVB
Slide 5Ansgar Scherp – [email protected]
Let‘s Try Different Queries
Bottlenecks in public transportation? Compare the connections with flights? Visualize on a map? …
All these queries cannot be answered, because the data …
Slide 6Ansgar Scherp – [email protected]
… locked in Silos!
– High Integration Effort– Lack in Reuse of Data
B. Jagendorf, http://www.flickr.com/photos/bobjagendorf/, CC-BY
Slide 7Ansgar Scherp – [email protected]
Linked Data
• Publishing and interlinking of data• Different quality and purpose• From different sources in the Web
World Wide Web Linked DataDocuments DataHyperlinks Typed LinksHTML RDFAddresses (URIs) Addresses (URIs)
Example: http://www.uni-mannheim.de/
Slide 8Ansgar Scherp – [email protected]
Relevance of Linked Data?
Slide 9Ansgar Scherp – [email protected]
Linked Data: May ‘07
< 31 Billion Triples
Media
Geographic
Publications
Web 2.0
eGovernment
Cross-Domain
LifeSciences
Sept. ‘11
Source: http://lod-cloud.net
Slide 10Ansgar Scherp – [email protected]
Linked Data Principles
1. Identification2. Interlinkage3. Dereferencing4. Description
Slide 11Ansgar Scherp – [email protected]
Example: Big Lynx
Big LynxCompany
Matt Briggs
Scott Miller
Source: http://lod-cloud.net< 31 Milliarde Triple
?
Slide 12Ansgar Scherp – [email protected]
1. Use URIs for Identification
B. Gazen,http://www.flickr.com/photos/bayat/, CC-BY
http://biglynx.co.uk/people/matt-briggs
http://biglynx.co.uk/people/scott-miller
Matt Briggs
Scott Miller
Slide 13Ansgar Scherp – [email protected]
Example: Big Lynx
Big LynxCompany
Matt Briggs
Scott Miller
How to model relationships like knows?
Slide 14Ansgar Scherp – [email protected]
• Description of Ressources with RDF triple
Matt Briggs is a Person
Resource Description Framework (RDF)
Subject Predicate Object
@prefix rdf:<http://w3.org/1999/02/22-rdf- syntax-ns#> .
@prefix foaf:<http://xmlns.com/foaf/0.1/> .
<http://biglynx.co.uk/people/matt-briggs> rdf:type foaf:Person .
Slide 15Ansgar Scherp – [email protected]
1. Use URIs also for Relations
B. Gazen,http://www.flickr.com/photos/bayat/, CC-BY
http://biglynx.co.uk/people/matt-briggs
http://biglynx.co.uk/people/scott-miller
foaf:knows
foaf:knows
Slide 16Ansgar Scherp – [email protected]
Example: Big Lynx
Big Lynx Company
DBpedia
Matts privateWebseite
„sameperson“
Matt Briggs
„lives here“
Dave SmithLondon
…
Matt Briggs
Scott Miller
Slide 17Ansgar Scherp – [email protected]
2. Establishing Interlinkage
• Relation links between ressources
<http://biglynx.co.uk/people/dave-smith> foaf:based_near <http://dbpedia.org/resource/London> .
Identity links between ressources
<http://biglynx.co.uk/people/matt-briggs> owl:sameAs <http://www.matt-briggs.eg.uk#me> .
Slide 18Ansgar Scherp – [email protected]
Example: Big Lynx
DBpedia
Big Lynx Company
Matts privateWebseite
foaf:based_near
owl:sameAs
Matt Briggs
Dave SmithLondon
Matt Briggs
„same Person“
„lives here“
Slide 19Ansgar Scherp – [email protected]
• Looking up of web documents
• How can we “look up” things of the real world?
3. Dereferencing of URIs
http://biglynx.co.uk/people/matt-briggs
Slide 20Ansgar Scherp – [email protected]
Two Approaches
1. Hash URIs– URI contains a part separated by #, e.g.,
http://biglynx.co.uk/vocab/sme#Team
2. Negotiation via „303 See Other“ request
http://biglynx.co.uk/people/matt-briggs
Response: „Look here:“http://biglynx.co.uk/people/matt-briggs.rdf
Slide 21Ansgar Scherp – [email protected]
Example: Big Lynx
DBpedia
Big Lynx Company
Matts privateWebseite
foaf:based_near
owl:sameAs
Matt Briggs
Dave SmithLondon
Matt BriggsDescription of Matt?
Slide 22Ansgar Scherp – [email protected]
4. Description of URIs
biglynx:matt-briggs
foaf:Person
biglynx:dave-smith
dp:Birmingham
rdf:type
foaf:knows
foaf:based_near
_:point
wgs84:lat
wgs84:long
dp:London
foaf:based_near
……
…
…
ex:loc
…
“-0.118”
“51.509”
Slide 23Ansgar Scherp – [email protected]
Formalization of Description
),,( EPVG Given a RDF graph with
VPBRELBRV )( and
SimpleCBD(n) = I with
∩j = 0
I 0 = { (s, p, o) | (s, p, o) E s = n }
I = { (o, p‘, o‘) E | (s, p, o) I : o B
j
jj+1
∩
k = 0k
j
(o, p‘, o‘) I }
∞
Slide 24Ansgar Scherp – [email protected]
W3C RDF / RDF Schema Vocabulary
• rdf:type • rdf:Property • rdf:XMLLiteral • rdf:List • rdf:first • rdf:rest • rdf:Seq • rdf:Bag • rdf:Alt • ... • rdf:value
• rdfs:domain • rdfs:range • rdfs:Resource • rdfs:Literal • rdfs:Datatype • rdfs:Class • rdfs:subClassOf • rdfs:subPropertyOf • rdfs:comment • …• rdfs:label
• Set of URIs defined in rdf:/rdfs: namespace
Slide 25Ansgar Scherp – [email protected]
Semantic Web Layer Cake (Simplified)
Slide 26Ansgar Scherp – [email protected]
Exploration of Linked Data
Source: http://lod-cloud.net< 31 Billion Triples
WordNet
Swoogle
GeoNames
Slide 27Ansgar Scherp – [email protected]
Naive Approach
• Download all data• Store in really big
database• Programming of
queries• Design of
user interface
Swoogle
WordNet
GeoNames
Inflexible Monolithic
Notscaleable
RDFS
Rules Geo
Slide 28Ansgar Scherp – [email protected]
SemaPlorer Approach
GeoQueries
GeoNames
Flexible
Scaleable
Extensible
RDFS Rules
WordNet Swoogle
Fulltext
12 Month in 2005/06 700 Mio. Triple
+ + +
> 1 Billion Triples
placeOfBirthbirthplace
birthplace
+
Slide 29Ansgar Scherp – [email protected]
SemaPlorer – Semantic Social Media
Watch video online: http://vimeo.com/2057249
Slide 31Ansgar Scherp – [email protected]
Searching for Linked Data Sources
Quelle: http://lod-cloud.net< 31 Milliarde Triples
Persons that are - Politicians and - Actors ? ?
Slide 32Ansgar Scherp – [email protected]
Idea: Index of Data Sources
“Politician and Actor”
?Query
SELECT ?xFROM …WHERE { ?x rdf:type ex:Actor . ?x rdf:type ex:Politician .}
Index
Slide 33Ansgar Scherp – [email protected]
The Naive Approach
1. Download the entire LOD cloud2. Put it into a (really) large triple store3. Process the data and extract schema4. Provide lookup
- Big machinery- Late in processing the data- High effort to scale with LOD cloudCan we do smarter?
Can we do smarter?
Slide 34Ansgar Scherp – [email protected]
Idea Schema-level index
Define families of graph patternsAssign instances to graph patternsMap graph patterns to context (source URI)
ConstructionStream-based for scalabilityLittle loss of accuracy
Note Index defined over instancesBut stores the context
Slide 35Ansgar Scherp – [email protected]
Input Data n-Quads
<subject> <predicate> <object> <context>
Example: <http://www.w3.org/People/Connolly/#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> <http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf>
w3p:#me
foaf:Person
http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf
Slide 36Ansgar Scherp – [email protected]
SchemEX Approach• Stream-based schema extraction• While crawling the data
Nquad-Stream
Schema-LevelIndex
Schema-Extractor
Parser
Instance-Cache
LOD-CrawlerRDF-DumpTriple Store
NxParser
RDFRDBMS
FIFO
Slide 37Ansgar Scherp – [email protected]
Building the Index from a Stream Stream of n-quads (coming from a LD crawler)
… Q16, Q15, Q14, Q13, Q12, Q11, Q10, Q9, Q8, Q7, Q6, Q5, Q4, Q3, Q2, Q1
FiFo
4
3
2
1
1
6
23
4
5
C3
C2
C2
C1
• Linear runtime complexity wrt # of input triples
Slide 38Ansgar Scherp – [email protected]
DS1 DS2 DS3 DS4 DS5 DSxData
sources
consistsOf
hasDataSource
Building the Schema and Index
…
EQC1 EQC2 EQCn Equivalenceclasses
hasEQClass p1 p2
…
TC1 TC2 TCm
Type clusters…
C2C1 C3 CkRDF
classes…
Slide 39Ansgar Scherp – [email protected]
Layer 1: RDF Classes
timbl:card#i
foaf:Person
foaf:Person
http://www.w3.org/People/Berners-Lee/card
http://dig.csail.mit.edu/2008/...
C1
DS 3DS 2DS 1
SELECT ?xFROM …WHERE { ?x rdfs:type foaf:Person .}
All instances of a particular type
Slide 40Ansgar Scherp – [email protected]
Layer 2: Type Clusters
timbl:card#i
foaf:Person
foaf:Person
http://www.w3.org/People/Berners-Lee/card
DS 3DS 2DS 1SELECT ?xFROM …WHERE { ?x rdfs:type foaf:Person . ?x rdfs:type pim:Male .}
C1
C2
pim:Male
tc4711
pim:Male
All instances belonging to exactly the same set of types
TC1
Slide 41Ansgar Scherp – [email protected]
Two instances are equivalent iff:They are in the same TCThey have the same
propertiesThe property targets are
in the same TC
Similar to 1-Bisimulation
Layer 3: Equivalence Classes
EQC1
DS 3DS 2DS 1
C1
C2
TC1
C3
TC2
p
Slide 42Ansgar Scherp – [email protected]
Layer 3: Equivalence Classes
timbl:card#i
foaf:Person
foaf:Person
http://www.w3.org/People/Berners-Lee/card
pim:Male
tc4711
pim:Male
foaf:PPD
timbl:card
eqc0815
foaf:PPD
tc1234
eqc0815-maker-tc1234
foaf:maker
SELECT ?x WHERE { ?x rdfs:type foaf:Person . ?x rdfs:type pim:Male . ?x foaf:maker ?y . ?y rdfs:type foaf:PersonalProfileDocument .}
Slide 43Ansgar Scherp – [email protected]
Computing SchemEX: TimBL Data Set
• Analysis of a smaller data set• 11 M triples, TimBL’s FOAF profile• LDspider with ~ 2k triples / sec
• Different cache sizes: 100, 1k, 10k, 50k, 100k• Compared SchemEX with reference schema• Index queries on all Types, TCs, EQCs• Good precision/recall ratio at 50k+
• Commodity hardware (4GB RAM, single CPU)
Slide 44Ansgar Scherp – [email protected]
Quality of Stream-based Index Construction
+ Runtime increases hardly with window size+ Memory consumption scales with window size
Slide 47Ansgar Scherp – [email protected]
And 2012? Get the Google Feeling!
Slide 48Ansgar Scherp – [email protected]
Semantic Data Management Chain• Research topics in a greater context
UseAggregatePublish Collect
OntoMDE
Core OntologiesKreuzverweis.com
SchemEX*
Mobile Facets
SemaPlorer*
* Winner of Billion Triple Challenge 2011/2008
See at: dws.informatik.uni-mannheim.de
Slide 49Ansgar Scherp – [email protected]
Slide 50Ansgar Scherp – [email protected]
Recommended Readings• Maciej Janik, Ansgar Scherp, Steffen Staab: The Semantic Web:
Collective Intelligence on the Web. Informatik Spektrum 34(5): 469-483 (2011) URL: http://dx.doi.org/10.1007/s00287-011-0535-x
• Simon Schenk, Carsten Saathoff, Steffen Staab, Ansgar Scherp: SemaPlorer - Interactive semantic exploration of data and media based on a federated cloud infrastructure. J. Web Sem. 7(4): 298-304 (2009)URL: http://dx.doi.org/10.1016/j.websem.2009.09.006
• Mathias Konrath, Thomas Gottron, Steffen Staab, Ansgar Scherp: SchemEX — Efficient construction of a data catalogue by stream-based indexing of linked data, J. of Web Semantics: Science, Services and Agents on the World Wide Web, Available online 23 June 2012URL: http://www.sciencedirect.com/science/article/pii/S1570826812000716
• Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global Data Space, Morgan & Claypool Publishers, 2011URL: http://dx.doi.org/10.2200/S00334ED1V01Y201102WBE001