Harald Sack Internet Technologies and Systems (ITS) Future Internet Technologies Hasso-Plattner-Institute for IT Systems Engineering 5th Annual Symposium on Future Trends in Service-Oriented Computing June 16th, 2010 Hasso-Plattner-Institute for IT Systems Engineering Potsdam Linked Open Data Universe

Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab


Page 2: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Harald Sack, 5th Annual Symposium on Future Trends in Service-Oriented Computing, June 16th, 2010, HPI, Potsdam

The Web is huge....

To be more precise, the WWW is rather huge...

• more than 25 × 10^9 documents in search engine indexes (TNL Blog: Google has 24 billion items index, considers MSN search nearest competitor, September 2005)

• Google Web Crawler found more than 10^12 documents (The Official Google Blog: We knew the Web was Big....., July 25, 2008)

• New Google Search Index Caffeine comprises 100 million gigabytes of data, i.e. 10^17 bytes (SMX Video: Google’s Matt Cutts On Caffeine Launch, June 9, 2010, http://searchengineland.com/smx-video-googles-matt-cutts-on-caffeine-launch-43933)

• And then, there is also the Deep Web (Dark Web) ... and it is supposed to be up to 500 times larger than the Surface Web (Bergman, 2001)
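The Caffeine figure can be checked with a quick unit conversion. The sketch below assumes decimal (SI) gigabytes, i.e. 1 GB = 10^9 bytes; the slide does not say which convention was meant.

```python
# Unit check for the Caffeine index size quoted on the slide.
# Assumption: decimal gigabytes (1 GB = 10**9 bytes).
GIGABYTE = 10**9

index_size_gb = 100 * 10**6                  # 100 million gigabytes
index_size_bytes = index_size_gb * GIGABYTE  # convert to bytes

print(index_size_bytes == 10**17)            # True
print(index_size_bytes // 10**15)            # size in petabytes: 100
```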

Page 3: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab


The Web is growing...

Multimedia, Real-Time Data, Sensor Data, ....

in 05/2010:
• 24 h of video uploaded per minute
• 2 billion streamed videos per day

in 06/2010: 7 TB/day

Page 5: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab


How to find something on the Web?


Page 6: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab


The ‘Web of Data‘

Semantic Web Technologies

• Interoperable and machine-understandable data semantics

• Based on formal knowledge representations

• Creating a ‘Web of Data‘

Page 7: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab


• Topic: Semantic Web and Linked Data

•Problems and Experiments

•Application: Exploratory Multimedia Search

Linked Open Data Universe

Page 8: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Semantic Web and Linked Data

From World Wide Web to Web of Data

„The Web was designed as an information space, with the goal that it should be useful not only for human-human communication, but also that machines would be able to participate and help…“ (Tim Berners-Lee, Semantic Web Roadmap, Sept 1998)

Prerequisites:
• Content can be read and interpreted correctly (= understood) by machines

Semantic Web
• (natural language) web content is explicitly annotated with semantic metadata
• semantic metadata encode the meaning (semantics) of web content and can be read and interpreted correctly by machines

Natural Language Processing
• Technology from traditional Information Retrieval (WWW search engines)

Page 13: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Semantic Web and Linked Data

Understanding Web Content - I

Natural Language Processing
• Technology from traditional Information Retrieval (WWW search engines)

Entity Mapping / Disambiguation: what does the text „FAB“ refer to?
• fabulous?
• Fabio Capello, manager of the UK national football team?
• David James, goalkeeper of the UK national football team?
• ...?
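The disambiguation problem on this slide can be illustrated with a very small sketch: each candidate meaning of „FAB“ gets a set of typical context words, and the candidate with the largest overlap with the surrounding text wins. The candidate list, the context words and the function name `disambiguate` are assumptions made for illustration; the talk does not describe a concrete algorithm.

```python
# Toy entity disambiguation: pick the candidate whose context words
# overlap most with the words surrounding the ambiguous mention.
# The candidate list and context words are illustrative assumptions.
CANDIDATES = {
    "FAB": {
        "fabulous (adjective)": {"great", "amazing", "party", "dress"},
        "Fabio Capello": {"manager", "coach", "football", "team", "squad"},
        "David James": {"goalkeeper", "save", "football", "team", "penalty"},
    }
}


def disambiguate(mention: str, context: str) -> str:
    """Return the candidate with the largest context-word overlap."""
    words = set(context.lower().split())
    candidates = CANDIDATES[mention]
    return max(candidates, key=lambda name: len(candidates[name] & words))


print(disambiguate("FAB", "the manager picked his squad for the match"))
```

Real systems would replace the hand-written word sets with statistics or with knowledge from a source such as DBpedia; the overlap count only stands in for that step.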

Page 16: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Semantic Web and Linked Data

Understanding Web Content - II

Entity Mapping: text „FAB“ → Fabio Capello

• Fabio Capello is a Soccer Manager
• Soccer Manager is a Person

Page 19: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Semantic Web and Linked Data

Understanding Web Content - III

• Fabio Capello (entity) is a Soccer Manager (class): class membership, „has type“
• Soccer Manager (subclass) is a Person (superclass): „is subclass of“

Page 24: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Semantic Web and Linked Data

Understanding Web Content - IV

Classes
• Soccer Manager is a Person
• Person hasBirthPlace Place
• Person hasBirthDate Date

Entities
• Fabio Capello is a Soccer Manager
• Fabio Capello hasBirthDate 1946-06-18 (1946-06-18 is a Date)
• Fabio Capello hasBirthPlace San Canzian d‘Isonzo (San Canzian d‘Isonzo is a Place)

Page 25: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab


Semantic Web and Linked Data

Page 26: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab


Semantic Web and Linked Data

Fabio Capello http://dbpedia.org/resource/Fabio_Capello

URI - Uniform Resource Identifier

Page 27: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab


Semantic Web and Linked Data

http://dbpedia.org/resource/Fabio_Capello

http://en.wikipedia.org/wiki/Fabio_Capello

Page 28: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Semantic Web and Linked Data

http://dbpedia.org/resource/Fabio_Capello

Page 29: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab


Semantic Web and Linked Data

http://dbpedia.org/resource/Fabio_Capello

RDF Resource Description Framework

:Fabio_Capello dbpp:birthPlace :San_Canzian_d%27Isonzo .
:Fabio_Capello dbpp:birthDate "1946-06-18" .
:Fabio_Capello rdf:type dbpo:SoccerManager .
:Fabio_Capello rdf:type dbpo:Person .
...

RDF Triple: RDF Subject (:Fabio_Capello), RDF Property (rdf:type), RDF Object (dbpo:SoccerManager)
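To make the subject / property / object structure concrete, the statements from the slide can be held as plain tuples. This is a minimal sketch that keeps prefixed names as strings and uses no RDF library; the class name `Triple` is introduced here for illustration only.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str   # called "RDF Property" on the slide
    object: str


# Prefixed names are kept as plain strings; no RDF library is used here.
triples = [
    Triple(":Fabio_Capello", "dbpp:birthPlace", ":San_Canzian_d%27Isonzo"),
    Triple(":Fabio_Capello", "dbpp:birthDate", '"1946-06-18"'),
    Triple(":Fabio_Capello", "rdf:type", "dbpo:SoccerManager"),
    Triple(":Fabio_Capello", "rdf:type", "dbpo:Person"),
]

# All classes that :Fabio_Capello is declared to be a member of.
types = [t.object for t in triples
         if t.subject == ":Fabio_Capello" and t.predicate == "rdf:type"]
print(types)
```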

Page 30: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab


Semantic Web and Linked Data

http://dbpedia.org/ontology/soccer_manager

RDF Schema

dbpo:SoccerManager rdf:type owl:Class .
dbpo:SoccerManager rdfs:subClassOf dbpo:Person .
dbpo:SoccerManager rdfs:label "Soccer Manager" .
dbpp:birthPlace rdf:type rdf:Property .
dbpp:birthPlace rdfs:domain dbpo:Person .
dbpp:birthPlace rdfs:range dbpo:Place .
dbpp:birthDate rdf:type rdf:Property .
dbpp:birthDate rdfs:domain dbpo:Person .
dbpp:birthDate rdfs:range xsd:date .
...

Schema shown in the diagram: Soccer Manager is a Person; Person hasBirthPlace Place; Person hasBirthDate Date

Page 31: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Semantic Web and Linked Data

Understanding Web Content - V

• Fabio Capello is a LivingPeople
• LivingPeople is a Person; DeadPeople is a Person
• Person hasBirthDate Date; Fabio Capello hasBirthDate 1946-06-18 (1946-06-18 is a Date)
• logical constraint: LivingPeople ∩ DeadPeople = ∅

+ Rules (Description Logics)

∀x.((∃y.hasDeathDate(x,y) ∧ Person(x) ∧ Date(y)) → DeadPeople(x))
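The rule and the disjointness constraint can be sketched together on toy data. The entry `Some_Person` and its death date are made up purely to trigger the rule; the helper `is_date` stands in for the `Date(y)` predicate and is an assumption of this sketch.

```python
from datetime import date


def is_date(value: str) -> bool:
    """Date(y): true if the value parses as an ISO calendar date."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


# Toy knowledge base; "Some_Person" and its death date are hypothetical.
persons = {"Fabio_Capello", "Some_Person"}
living_people = {"Fabio_Capello", "Some_Person"}
has_death_date = {"Some_Person": "1999-01-01"}

# Rule: every Person that has a death date (a Date) is in DeadPeople.
dead_people = {x for x, y in has_death_date.items()
               if x in persons and is_date(y)}

# Constraint: LivingPeople and DeadPeople must be disjoint.
violations = living_people & dead_people
print(sorted(violations))
```

Any member of `violations` marks an inconsistency: the data claims the individual is alive while the rule derives the opposite.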

Page 32: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab


Semantic Web and Linked Data

SELECT DISTINCT ?l ?l2 ?g
FROM <http://dbpedia.org>
WHERE {
  ?s dbpp:nationalteam ?o .
  ?s rdfs:label ?l FILTER langMatches( lang(?l), "EN" ) .
  ?s dbpp:nationalgoals ?g FILTER(?g>10) .
  ?s dbpp:nationalteam ?nat .
  ?nat rdfs:label ?l2 FILTER langMatches( lang(?l2), "EN" ) .
}
ORDER BY DESC(?g)

Select all players of a national soccer team who have scored more than 10 goals while in the team
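The filtering and ordering part of the query can be mirrored on a handful of in-memory rows. The rows below are invented placeholders, not DBpedia data; only the `FILTER(?g>10)` and `ORDER BY DESC(?g)` steps are reproduced, and the language filter is left out.

```python
# Toy rows standing in for query solutions (?l, ?l2, ?g).
# Names and numbers are invented placeholders, not DBpedia data.
rows = [
    {"player": "Player A", "team": "Team X", "goals": 25},
    {"player": "Player B", "team": "Team X", "goals": 7},
    {"player": "Player C", "team": "Team Y", "goals": 12},
]

# FILTER(?g > 10) followed by ORDER BY DESC(?g)
result = sorted(
    (r for r in rows if r["goals"] > 10),
    key=lambda r: r["goals"],
    reverse=True,
)
for r in result:
    print(r["player"], r["team"], r["goals"])
```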

Page 33: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab


Semantic Web and Linked Data

Select all players of a soccer nationalteam that have scored more than 10 goals while in the team

Page 34: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Semantic Web and Linked Data

Linked Data
■ Term was originally coined by Tim Berners-Lee (Tim Berners-Lee, Linked Data, 2006, http://www.w3.org/DesignIssues/LinkedData.html)

The Web of Data is about a data model (RDF) and a naming model (URI) on the Web

M. Hausenblas, Quick Linked Data Introduction, http://www.slideshare.net/mediasemanticweb/quick-linked-data-introduction

Page 35: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab


Semantic Web and Linked Data

Linked Data

■ Technical Principles

□ use URIs to identify things uniquely (not only documents...)

□ use HTTP URIs (URLs) so that these things can be referred to and looked up ("dereferenced") by people and user agents

□ use RDF as a universal data model to provide useful information about these things

□ include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
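Dereferencing, as named in the second principle, amounts to an ordinary HTTP GET on the URI, usually with an `Accept` header asking for an RDF serialization. The sketch below only builds the request with Python's standard library; the `text/turtle` media type is an assumption about what the server offers, and `dereference` is a hypothetical helper that needs network access when called.

```python
import urllib.request

# URI taken from the slides; the Accept header value is an assumption
# about which RDF serialization the server can deliver.
uri = "http://dbpedia.org/resource/Fabio_Capello"
request = urllib.request.Request(uri, headers={"Accept": "text/turtle"})


def dereference(req: urllib.request.Request) -> str:
    """Send the request and return the response body as text (needs network)."""
    with urllib.request.urlopen(req, timeout=10) as response:
        return response.read().decode("utf-8")


# The request is only constructed here; call dereference(request) to send it.
print(request.get_method(), request.full_url, request.get_header("Accept"))
```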

Page 36: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Semantic Web and Linked Data

Linked Data
□ The application of the Linked Data principles leads to the creation of a ‚Web of Data‘

Page 37: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Semantic Web and Linked Data

Linking Open Data
■ Publicly available structured data should be published as Linked Data

■ Various data sources should be interlinked

LOD-WikiPage: http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData/

Page 38: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab


Semantic Web and Linked Data

Page 39: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Semantic Web and Linked Data

Linked Data Achievements
■ Extension of the Web with a data commons (14b RDF triples = facts)
■ Vibrant global RTD community
■ Industrial uptake starting (BBC, Thomson, Reuters, etc.)
■ Emerging governmental adoption in sight
■ Establishing Linked Data as a deployment path for the Semantic Web

Page 40: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Semantic Web and Linked Data

Linked Data Challenges
■ Coherence: relatively few, expensively maintained links
■ Quality: partly low quality data and inconsistencies
■ Performance: still substantial penalties compared to relational database technologies
■ Data consumption: large scale processing, schema mapping and data fusion still in their infancy
■ Usability: missing direct end user tools and network effect

Sören Auer:"Linked Data: Now what?"ESWC2010 Panel Discussion

Page 41: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab


• Topic: Semantic Web and Linked Data

•Problems and Experiments

•Application: Exploratory Multimedia Search

Linked Open Data Universe

Page 42: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab


Problems and Experiments

Page 44: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Problems and Experiments

A. Hogan et al.: Weaving the Pedantic Web, LDOW 2010

Page 45: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Problems and Experiments

Experiment Summary

(1) Crawling the Semantic Web

(2) Structural Analysis

(3) Content-based Analysis

(4) Data Cleansing

(5) Heuristics for Ranking Semantic Web Data

(6) Augmenting Semantic Web Infrastructure

Page 46: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Problems and Experiments

So what?
■ Interesting facts to find out about Semantic Web & Linked Data
■ How big is the Semantic Universe?
  ■ # triples
  ■ # documents
  ■ # interlinks
■ Linking Open Data covers only the vocabularies/data registered in the LOD-Wiki → 14b RDF triples
■ What else is out there ... and how much of it?
■ ... and how do we get it?

Page 47: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Problems and Experiments

(1) Crawling the Semantic Web
■ Of course we are not the first to be out there...
■ Swoogle: Li Ding et al.: Finding and Ranking Knowledge on the Semantic Web, ISWC 2005
■ Scutter/Slug: Leigh Dodds: Slug: A Semantic Web Crawler, 2006
■ Sindice: Giovanni Tummarello et al.: Sindice.com - weaving the open linked data, ISWC 2007 → 2.1b RDF triples
■ SWSE: Andreas Harth et al.: SWSE: Objects before Documents, Semantic Web Challenge 2008, ISWC 2008 → 1.1b RDF triples
■ Falcons: G. Cheng et al.: Falcons: Searching and Browsing Entities on the Semantic Web, WWW17 2008 → 2.9b RDF triples

Page 48: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Problems and Experiments

(1) Crawling the Semantic Web
■ First experiments:

■ Adapting & Improving Slug Crawler

■ for parallelization (48 Cores) and

■ lots of RAM (256GB - 2TB)

■ first test run: >1GB RDF data/1h

■What‘s new:

■ crawl not only RDF/RDFS and OWL resources

■ include (X)HTML with RDFa extensions and

■ dynamic documents with (semantic) sitemaps

■What‘s next...?

Page 49: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Problems and Experiments

(2) Analyzing the Semantic Web I - Structural Analysis
■ Again we are not the first to be out there...
■ Structural Analysis of the ‚early‘ WWW

[Figure: bow-tie structure of the Web graph with IN (44m nodes), SCC (56m nodes), OUT (44m nodes), tunnels, appendices and unconnected components]

A. Broder et al.: Graph structure in the Web. In Comput. Netw. 33, 1-6 (Jun. 2000), 309-320.

Page 50: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Problems and Experiments

(2) Analyzing the Semantic Web I - Structural Analysis
■ Again we are not the first to be there...
■ Structural Analysis of the ‚early‘ Semantic Web
Weiyi Ge et al.: Object Link Structure in the Semantic Web, ESWC 2010
■ Experimental Setup
  ■ 18m RDF documents (Falcons crawl 2009)
  ■ 110m nodes with 190m edges
■ Analysis of RDF link graph
  ■ average node degree: ≈3.4
  ■ effective diameter: ≈11.5
  ■ Largest connected component: ≈88% of all nodes
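Two of the reported measures are easy to reproduce in miniature. If every edge counts towards the degree of both endpoints (an assumption about how the cited study counted), 110m nodes and 190m edges give an average degree of about 3.45, close to the ≈3.4 on the slide. The toy graph below is made up and only shows how the share of the largest connected component could be measured when link direction is ignored.

```python
from collections import defaultdict, deque

# Average degree from the counts quoted above, assuming each edge
# contributes to the degree of both of its endpoints.
nodes, edges = 110_000_000, 190_000_000
print(round(2 * edges / nodes, 2))  # 3.45

# Toy link graph (made-up) to show the connected-component measurement.
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]
toy_nodes = {"a", "b", "c", "d", "e", "f"}

neighbours = defaultdict(set)
for s, o in toy_edges:
    neighbours[s].add(o)
    neighbours[o].add(s)   # ignore direction: weak connectivity


def largest_component_share(all_nodes, adj):
    """Fraction of nodes that lie in the largest connected component."""
    seen, best = set(), 0
    for start in all_nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for nxt in adj[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        best = max(best, size)
    return best / len(all_nodes)


print(largest_component_share(toy_nodes, neighbours))  # 0.5
```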

Page 51: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Problems and Experiments

(3) Analyzing the Semantic Web II - Content-Based Analysis
■ Again we are not the first to be there...

http://pedantic-web.org/

A. Hogan et al.: Weaving the Pedantic Web, LDOW 2010

■ 150k documents with more than 12m RDF triples

■ Discovered categories of symptoms:

■ incomplete → dead links

■ incoherent → no correct interpretation (local)

■ hijack → no correct interpretation (remote)

■ inconsistent → contradictions

Page 55: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Problems and Experiments

(3) Analyzing the Semantic Web II - Content-Based Analysis
■ Again we are not the first to be there...

Urbani et al.: OWL Reasoning with WebPIE: Calculating the Closure of 100 Billion Triples, ESWC 2010
■ Artificial benchmark dataset used: Lehigh University Benchmark (LUBM) with 100b RDF triples
■ Computing the transitive closure (= reasoning)
■ Making implicit knowledge explicit

Example:
• Person hasBirthPlace Place (schema)
• Fabio Capello hasBirthPlace San Canzian d‘Isonzo (fact)
• → Fabio Capello is a Person (class membership can be deduced)
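The deduction in the example can be reproduced with a naive fixed-point loop over the domain and range statements from the schema slide. This is an in-memory sketch for illustration, not the distributed technique used by WebPIE; the function name `closure` and the restriction to two rules are assumptions of this sketch.

```python
# Facts and schema from the slides, written as (subject, property, object).
triples = {
    (":Fabio_Capello", "dbpp:birthPlace", ":San_Canzian_d%27Isonzo"),
    ("dbpp:birthPlace", "rdfs:domain", "dbpo:Person"),
    ("dbpp:birthPlace", "rdfs:range", "dbpo:Place"),
}


def closure(facts):
    """Apply the domain and range rules until no new triple appears."""
    facts = set(facts)
    while True:
        domains = {(s, o) for s, p, o in facts if p == "rdfs:domain"}
        ranges = {(s, o) for s, p, o in facts if p == "rdfs:range"}
        new = set()
        for s, p, o in facts:
            # domain rule: the subject of p is a member of p's domain class
            new |= {(s, "rdf:type", c) for prop, c in domains if prop == p}
            # range rule: the object of p is a member of p's range class
            new |= {(o, "rdf:type", c) for prop, c in ranges if prop == p}
        if new <= facts:
            return facts
        facts |= new


inferred = closure(triples) - triples
for t in sorted(inferred):
    print(t)
```

The loop stops as soon as a pass adds nothing new, which is what "computing the closure" means here: all implicit type statements have been made explicit.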

Page 56: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Problems and Experiments

(4) Analyzing the Semantic Web III - Data Cleansing
■ trying to clean out Linked Open Data and possibly also (partially) the Semantic Web...

(1) Identify inconsistencies and ambiguities by (automated) content-based analysis
(2) Solve inconsistencies & ambiguities
  ■ if possible by reasoning
  ■ else by crowdsourcing (game-based evaluation, etc.)

Cleaning out the Augean stables...
AUGEAN-STABLES: Extremely nasty and smelly warehouses of filth, straw and manure

Page 57: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Problems and Experiments

(5) Analyzing the Semantic Web IV - Data Ranking
■ Linked Data provides (unbiased) knowledge
■ unbiased = no distinction between what is important and what is not
■ e.g., Albert Einstein
  ■ > 600 facts (triples)
  ■ > 80 properties
  ■ no ranking
  ■ no relevance

http://dbpedia.org/page/Albert_Einstein

Page 61: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Problems and Experiments

(5) Analyzing the Semantic Web IV - Data Ranking
■ We have developed heuristics for ranking objects and properties, e.g.

[Figure: RDF graph with the nodes :Albert_Einstein, :AmericanVegetarian, :Scientist, :Alfred_Kleiner and :Bill_Cosby, connected by rdf:type and :doctoralAdviser edges; one connection is marked „considered to be relevant“]

Page 62: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Problems and Experiments

(6) Semantic Web Infrastructure - Triple Stores
■ RDF(S) data is stored in triple stores
■ Basic idea:
  ■ Use 1 table with 3 columns (s,p,o)
  ■ For every column / column combination create index structures for fast access (spo, sop, pos, pso, ops, osp)
■ Drawback: many self-joins needed (memory consumption)
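The basic idea can be sketched as one logical (s, p, o) table with a nested index for each of the six column orderings. The class `TripleStore` and its method names are hypothetical and chosen for illustration; real triple stores use far more compact index structures.

```python
from collections import defaultdict
from itertools import permutations


class TripleStore:
    """One logical (s, p, o) table plus an index per column ordering."""

    ORDERS = ["".join(p) for p in permutations("spo")]  # six orderings

    def __init__(self):
        # index[order][first][second] -> set of third components
        self.index = {o: defaultdict(lambda: defaultdict(set))
                      for o in self.ORDERS}

    def add(self, s, p, o):
        row = {"s": s, "p": p, "o": o}
        for order in self.ORDERS:
            a, b, c = (row[k] for k in order)
            self.index[order][a][b].add(c)

    def objects(self, s, p):
        """Look up ?o for a fixed subject and property via the spo index."""
        return self.index["spo"][s][p]

    def subjects(self, p, o):
        """Look up ?s for a fixed property and object via the pos index."""
        return self.index["pos"][p][o]


store = TripleStore()
store.add(":Fabio_Capello", "rdf:type", "dbpo:SoccerManager")
store.add(":Fabio_Capello", "dbpp:birthDate", '"1946-06-18"')
print(store.subjects("rdf:type", "dbpo:SoccerManager"))
```

Every triple is written six times, which illustrates the memory cost; a query with several triple patterns still has to join the results of several such lookups.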

Page 63: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Problems and Experiments

Experiment Summary

(1) Crawling the Semantic Web

(2) Structural Analysis

(3) Content-based Analysis

(4) Data Cleansing

(5) Heuristics for Ranking Semantic Web Data

(6) Augmenting Semantic Web Infrastructure

Page 64: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

• Topic: Semantic Web and Linked Data

• Problem Definition and Experiments

•Application: Exploratory Multimedia Search

Linked Open Data Universe

Page 65: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab


http://www.yovisto.com

Application: Exploratory Multimedia Search

Yovisto semantic video search engine

■ specialized in academic video content, e.g., lecture recordings
■ enables search within the content of videos
■ automated video analysis: video scene cut detection, intelligent character recognition, complemented by collaborative user annotation
■ more than 8,000 h of video

Semantic Metadata:
■ Ontology: http://www.yovisto.com/ontology/0.9/
■ DBpedia, FOAF, DublinCore, MPEG-7, Tagging
■ RDFa annotation
■ public SPARQL Endpoint: http://sparql.yovisto.com/

J. Waitelonis, H. Sack: Augmenting Video Search with Linked Open Data, in Proc. of International Conference on Semantic Systems 2009 (i-semantics 2009), September 2-4, 2009, Graz, Journal of Universal Computer Science

Page 68: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

■ Semantic Annotation

[Figure: video timeline (time) with the steps Metadata Extraction and Entity Recognition / Mapping; example entities: person xy, location yz, event abc; example data: bibliographical data, geographical data, encyclopedic data, ...]

Application: Exploratory Multimedia Search

Page 69: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

Exploratory Search
• Is a kind of investigation task, where the user is
(a) not familiar with the domain of the search result, i.e. before entering appropriate keywords, she needs to learn about the domain
(b) not sure how to reach the search goal (concerning search process and search technology)
(c) not really sure what she’s looking for, i.e. “Can you please find something out about ... ?”

White, R.W., Kules, B., Drucker, S.M., and schraefel, M.C.: Supporting Exploratory Search, Introduction to Special Section of Communications of the ACM, Vol. 49, Issue 4, (2006), pp. 36-39.

„Which modern philosophers build on the theories of the Greek philosopher Plato?“

Application: Exploratory Multimedia Search

Page 70: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab


history

search term

related resources with properties

Waitelonis, Sack: Augmenting Video Search with Linked Open Data, in Proc. I-Semantics , Graz 2009.

Page 71: Linked Data Universe - Large Scale Computing Tasks for the HPI FutureSOC-Lab

• Topic: Semantic Web and Linked Data

• Problem Definition and Experiments

•Application: Exploratory Multimedia Search

Linked Open Data Universe

Thank you for your attention!