Gerhard Weikum Max Planck Institute for Informatics mpi-inf.mpg.de/~weikum

Preview:

DESCRIPTION

Knowledge Harvesting f rom Text and Web Sources. Part 2: Search and Ranking of Knowledge. Gerhard Weikum Max Planck Institute for Informatics http://www.mpi-inf.mpg.de/~weikum/. The Web Speaks to Us. Source: DB & IR methods for knowledge discovery. Communications of - PowerPoint PPT Presentation

Citation preview

Gerhard Weikum Max Planck Institute for Informaticshttp://www.mpi-inf.mpg.de/~weikum/

Knowledge Harvesting from Text and Web Sources

Part 2: Search and Ranking of Knowledge

The Web Speaks to Us

• Web 2012 contains more DB-style data than ever• getting better at making structured content explicit: entities, classes (types), relationships• but no (hope for) schema here!

Source: DB & IR methods for knowledge discovery.Communications ofthe ACM 52(4), 2009

2-2

Structure Now!

Bob Dylan

CarlaBruni

Nicolas Sarkozy

JoanBaez

Bob Dylan

Bob Dylan

France

ChampsElysees

Grammy

Nicolas Sarkozy

Paris

Actor AwardChristoph Waltz OscarSandra Bullock OscarSandra Bullock Golden Raspberry…

Movie ReportedRevenueAvatar $ 2,718,444,933The Reader $ 108,709,522 Facebook FriendFeedSoftware AG IDS Scheer…

Company CEOGoogle Eric SchmidtYahoo OvertureFacebook FriendFeedSoftware AG IDS Scheer…

Knowledge bases with factsfrom Web and IE witnesses

IE-enriched Web pages with embedded entities and facts

2-3

http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png

Distributed Structure: Linking Open Data30 Bio. triples500 Mio. links

2-4

owl:s

ameAs

rdf.freebase.com/ns/en.rome

owl:sameAs

o

wl:sameAs

data.nytimes.com/51688803696189142301

Coord

geonames.org/3169070/roma

N 41° 54' 10'' E 12° 29' 2''

dbpprop:citizenOf

dbpedia.org/resource/Rome

rd

f:type

rdfs:subclassOf

yago/wordnet:Actor109765278

rd

f:type

rdfs:subclassOfyago/wikicategory:ItalianComposer

yago/wordnet: Artist109812338

prop:a

ctedIn

imdb.com/name/nm0910607/

Distributed Structure: Linking Open Data

prop: composedMusicFor

imdb.com/title/tt0361748/

dbpedia.org/resource/Ennio_Morricone

2-5

Entity Search

http://entitycube.research.microsoft.com/

2-6

Semantic Search with Entities, Classes, Relationships

Politicians who are also scientists?European composers who won the Oscar?Chinese female astronauts?

Enzymes that inhibit HIV? Antidepressants that interfere with blood-pressure drugs?German philosophers influenced by William of Ockham?

US president when Barack Obama was born?Nobel laureate who outlived two world wars and all his children?

Commonalities & relationships among:Li Lianjie, Steve Jobs, Carla Bruni, Plato, Andres Iniesta?

FIFA 2010 finalists who played in a Champions League final?German football clubs that won against Real Madrid?

instances of classes

properties of entity

relationships

multiple entities

applications

2-7

Quiz Time

2-8

Rank the Folllowing Numbers

# ants on the Earth

# RDF triples in LOD

# Wikipedia links

# followers of Lady Gaga

# Google queries per day

# Yuan by Mark Zuckerberg

12*106

3*109

80*109

1015

25*109

600*106

1

2

4

6

5

3

2-8

Outline

...

Searching for Entities & Relations

Efficient Query Processing

Motivation

Wrap-up

Informative Ranking

User Interface

2-9

RDF: Structure, Diversity, No Schema

• SPO triples: Subject – Property/Predicate – Object/Value)• pay-as-you-go: schema-agnostic or schema later• RDF triples form fine-grained ER graph• popular for Linked Data, comp. biology (UniProt, KEGG, etc.)• open-source engines: Jena, Sesame, RDF-3X, etc.

EnnioMorricone RomebornIn

Rome ItalylocatedIn

SPO triples (statements, facts):(EnnioMorricone, bornIn, Rome)(Rome, locatedIn, Italy)(JavierNavarrete, birthPlace, Teruel)(Teruel, locatedIn, Spain)(EnnioMorricone, composed, l‘Arena)(JavierNavarrete, composerOf, aTale)

CityRomeinstanceOf

(uri1, hasName, EnnioMorricone)(uri1, bornIn, uri2)(uri2, hasName, Rome)(uri2, locatedIn, uri3)…

bornIn (EnnioMorricone, Rome) locatedIn(Rome, Italy)

Facts about Facts

• temporal annotations, witnesses/sources, confidence, etc. can refer to reified facts via fact identifiers (approx. equiv. to RDF quadruples: Col Sub Prop Obj)

facts: (EnnioMorricone, composed, l‘Arena)

(JavierNavarrete, composerOf, aTale)

(Berlin, capitalOf, Germany)

(Madonna, marriedTo, GuyRitchie)

(NicolasSarkozy, marriedTo, CarlaBruni)

facts:1:

2:

3:

4:

5:

temporal facts:6: (1, inYear, 1968)

7: (2, inYear, 2006)

8: (3, validFrom, 1990)

9: (4, validFrom, 22-Dec-2000)

10: (4, validUntil, Nov-2008)

11: (5, validFrom, 2-Feb-2008)provenance:12: (1, witness, http://www.last.fm/music/Ennio+Morricone/)

13: (1, confidence, 0.9)

14: (4, witness, http://en.wikipedia.org/wiki/Guy_Ritchie)

15: (4, witness, http://en.wikipedia.org/wiki/Madonna_(entertainer))

16: (10, witness, http://www.intouchweekly.com/2007/12/post_1.php)

17: (10, confidence, 0.1)

owl:s

ameAs

rdf.freebase.com/ns/en.rome

owl:sameAs

o

wl:sameAs

data.nytimes.com/51688803696189142301

Coord

geonames.org/3169070/roma

N 41° 54' 10'' E 12° 29' 2''

dbpprop:citizenOf

dbpedia.org/resource/Rome

rd

f:type

rdfs:subclassOf

yago/wordnet:Actor109765278

rd

f:type

rdfs:subclassOfyago/wikicategory:ItalianComposer

yago/wordnet: Artist109812338

prop:a

ctedIn

imdb.com/name/nm0910607/

Distributed Structure: Linking Open Data

prop: composedMusicFor

imdb.com/title/tt0361748/

dbpedia.org/resource/Ennio_Morricone

2-12

SPARQL Query LanguageSPJ combinations of triple patterns(triples with S,P,O replaced by variable(s)) Select ?p, ?c Where { ?p instanceOf Composer . ?p bornIn ?t . ?t inCountry ?c . ?c locatedIn Europe . ?p hasWon ?a .?a Name AcademyAward . }

+ filter predicates, duplicate handling, RDFS types, etc.

Select Distinct ?c Where { ?p instanceOf Composer . ?p bornIn ?t . ?t inCountry ?c . ?c locatedIn Europe . ?p hasWon ?a .?a Name ?n . ?p bornOn ?b . Filter (?b > 1945) . Filter(regex(?n, “Academy“) . }

Semantics:return all bindings to variables that match all triple patterns(subgraphs in RDF graph that are isomorphic to query graph)

Querying the Structured WebStructure but no schema: SPARQL well suited

wildcards for properties (relaxed joins): Select ?p, ?c Where { ?p instanceOf Composer . ?p ?r1 ?t . ?t ?r2 ?c . ?c isa Country . ?c locatedIn Europe . }

Extension: transitive paths [K. Anyanwu et al.: WWW‘07] Select ?p, ?c Where { ?p instanceOf Composer . ?p ??r ?c . ?c isa Country . ?c locatedIn Europe . PathFilter(cost(??r) < 5) . PathFilter (containsAny(??r,?t ) . ?t isa City . }

Extension: regular expressions [G. Kasneci et al.: ICDE‘08] Select ?p, ?c Where { ?p instanceOf Composer . ?p (bornIn | livesIn | citizenOf) locatedIn* Europe . }

flexiblesubgraphmatching

Querying Facts & Text

• Consider witnesses/sources (provenance meta-facts)• Allow text predicates with each triple pattern (à la XQ-FT)

Problem: not everything is triplified

European composers who have won the Oscar,whose music appeared in dramatic western scenes,and who also wrote classical pieces ?

Select ?p Where { ?p instanceOf Composer . ?p bornIn ?t . ?t inCountry ?c . ?c locatedIn Europe . ?p hasWon ?a .?a Name AcademyAward . ?p contributedTo ?movie [western, gunfight, duel, sunset] . ?p composed ?music [classical, orchestra, cantata, opera] . }

Semantics: triples match struct. pred.witnesses match text pred.

Querying Facts & Text

• Consider witnesses/sources (provenance meta-facts)• Allow text predicates with each triple pattern (à la XQ-FT)

Problem: not everything is triplified

French politicians married to Italian singers?

Select ?p1, ?p2 Where { ?p1 instanceOf politician [France] . ?p2 instanceOf singer [Italy] . ?p1 marriedTo ?p2 . }

Grouping ofkeywords or phrasesboosts expressiveness

Select ?p1, ?p2 Where { ?p1 instanceOf ?c1 [France, politics] . ?p2 instanceOf ?c2 [Italy, singer] . ?p1 marriedTo ?p2 . }

CS researchers whose advisors worked on the Manhattan project?Select ?r, ?a Where {?r instOf researcher [computer science] . ?a workedOn ?x [Manhattan project] .?r hasAdvisor ?a . }

Select ?r, ?a Where {?r ?p1 ?o1 [computer science] . ?a ?p2 ?o2 [Manhattan project] .?r ?p3 ?a . }

Relatedness Queries

Schema-agnostic keyword search(on RDF, ER graph, relational DB)becomes a special case

Relationship between Jet Li, Steve Jobs, Carla Bruni?

Select ??p1, ??p2, ??p3 Where {LiLianjie ??p1 SteveJobs. SteveJobs ??p2 CarlaBruni .CarlaBruni ??p3 LiLianjie . }

Select ??p1, ??p2, ??p3 Where {?e1 ?r1 ?c1 [“Li Lianjie“] . ?e2 ?r2 ?c2 [“Steve Jobs“] . ?e3 ?r3 ?c3 [“Carla Bruni“] . ?e1 ??p1 ?e2 .?e2 ??p2 ?e3 .?e3 ??p3 ?e1 . }

Querying Temporal Facts

• Consider temporal scopes of reified facts• Extend Sparql with temporal predicates

Managers of German clubs who won the Champions League?Select ?m Where { isa soccerClub . ?c inCountry Germany .

?id1: ?c hasWon ChampionsLeague . ?id1 validOn ?t . ?id2: ?m manages ?c . ?id2 validSince ?s . ?id2 validUntil ?u .

[?s,?u] overlaps [?t,?t] . }

Problem: not all facts hold forever (e.g. CEOs, spouses, …)

When did a German soccer club win the Champions League?Select ?c, ?t Where { ?c isa soccerClub . ?c inCountry Germany .?id1: ?c hasWon ChampionsLeague . ?id1 validOn ?t . }

1: (BayernMunich, hasWon, ChampionsLeague)2: (BorussiaDortmund, hasWon, ChampionsLeague)3: (1, validOn, 23May2001) 4: (1, validOn, 15May1974) 5: (2, validOn, 28May1997)6: (OttmarHitzfeld, manages, BayernMunich)7: (6, validSince, 1Jul1998) 8:(6, validUntil, 30Jun2004)

[Y. Wang et al.: EDBT’10][O. Udrea et al.: TOCL‘10

Querying with Vague Temporal Scope

• Consider temporal phrases as text conditions• Allow approximate matching and rank results wisely

Problem: user‘s temporal interest is imprecise1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010

Problem: user‘s temporal interest is often imprecise

German Champion League winners in the nineties?Select ?c Where {

?c isa soccerClub . ?c inCountry Germany .?c hasWon ChampionsLeague [nineties] . }

Soccer final winners in summer 2001?Select ?c Where {

?c isa soccerClub . ?id: ?c matchAgainst ?o [final] .?id winner ?c [“summer 2001“] . }

Keyword Search on Graphs

Example: Conferences (CId, Title, Location, Year) Journals (JId, Title)CPublications (PId, Title, CId) JPublications (PId, Title, Vol, No, Year) Authors (PId, Person) Editors (CId, Person)Select * From * Where * Contains ”Gray, DeWitt, XML, Performance“ And Year > 95

Schema-agnostic keyword search over multiple tables:graph of tuples with foreign-key relationships as edges

[BANKS, Discover, DBExplorer, KUPS, SphereSearch, BLINKS, NAGA, …]

Result is connected tree with nodes that contain as many query keywords as possible

Ranking: 1)(1)1(),(),(

eedgesnnodes

eedgeScoreqnnodeScoreqtrees

with nodeScore based on tf*idf or prob. IRand edgeScore reflecting importance of relationships (or confidence, authority, etc.)

Top-k querying: compute best trees, e.g. Steiner trees (NP-hard)

2-20

Best Result: Group Steiner Tree

Result is connected tree with nodes that contain as many query keywords as possible

Group Steiner tree: • match individual keywords terminal nodes, grouped by keyword• compute tree that connects at least one terminal node per keyword and has best total edge weight

y

x

x

y

zw

w w

x

y

zw

for query: x w y z2-21

Outline

...

Searching for Entities & Relations

Efficient Query Processing

Motivation

Wrap-up

Informative Ranking

User Interface

2-22

Ranking CriteriaConfidence:Prefer results that are likely correct

accuracy of info extraction trust in sources

(authenticity, authority)

Informativeness:Prefer results with salient factsStatistical estimation from:

frequency in answer frequency on Web frequency in query log

Conciseness:Prefer results that are tightly connected

size of answer graph cost of Steiner tree

bornIn (Jim Gray, San Francisco) from„Jim Gray was born in San Francisco“(en.wikipedia.org)

livesIn (Michael Jackson, Tibet) from„Fans believe Jacko hides in Tibet“(www.michaeljacksonsightings.com)

q: Einstein isa ?Einstein isa scientistEinstein isa vegetarian

q: ?x isa vegetarianEinstein isa vegetarianWhocares isa vegetarian

Diversity:Prefer variety of facts

Einstein won NobelPrizeBohr won NobelPrize

Einstein isa vegetarianCruise isa vegetarianCruise born 1962 Bohr died 1962

E won … E discovered … E played … E won … E won … E won … E won …

Ranking ApproachesConfidence:Prefer results that are likely correct

accuracy of info extraction trust in sources

(authenticity, authority)

Informativeness:Prefer results with salient factsStatistical LM with estimations from:

frequency in answer frequency in corpus (e.g. Web) frequency in query log

Conciseness:Prefer results that are tightly connected

size of answer graph cost of Steiner tree

PR/HITS-style entity/fact ranking[V. Hristidis et al., S.Chakrabarti, …]

IR models: tf*idf … [K.Chang et al., …]Statistical Language Models

Diversity:Prefer variety of facts

empirical accuracy of IEPR/HITS-style estimate of trustcombine into: max { accuracy (f,s) * trust(s) | s witnesses(f) }

Statistical Language Models

graph algorithms (BANKS, STAR, …) [J.X. Yu et al., S.Chakrabarti et al.,B. Kimelfeld et al., G.Kasneci et al., …]

or

Example: RDF-Express[S. Elbassuoni: SIGIR‘12, ESWC‘11, CIKM‘09]

?a1 isMarriedTo ?a2 .?a1 actedIn ?m .?a2 actedIn ?m .

http://www.mpi-inf.mpg.de/yago-naga/rdf-express/

Example: RDF-Express

?a1 isMarriedTo ?a2 {Bollywood} .?a1 actedIn ?m .?a2 actedIn ?m .

Example: RDF-Express

?a1 isMarriedTo ?a2 .?a1 actedIn ?m {thriller} .?a2 actedIn ?m {love} .

Example: RDF-Express[S. Elbassuoni: SIGIR‘12, ESWC‘11, CIKM‘09]

?a1 directed ?m .?a2 actedIn ?m .?a1 hasWonPrize Academy_Award .?a2 type wordnet_musician .

Information Retrieval Basics[Textbooks by R. Baza-Yates, B. Croft, Manning/Raghavan/Schuetze, …]

......

.....

......

.....

crawlextract& clean

index search rank present

strategies forcrawl schedule andpriority queue for crawl frontier

handle dynamic pages,detect duplicates,detect spam

build and analyzeWeb graph,index all tokensor word stems

server farm with 10 000‘s of computers,distributed/replicated data in high-performance file system,massive parallelism for query processing

fast top-k queries,query logging,auto-completion

scoring functionover many dataand context criteria

GUI, user guidance,personalization

Ranking bydescendingrelevance

Vector Space Model for Content Relevance Ranking

Search engine

Query (set of weightedfeatures)

||]1,0[ Fid Documents are feature vectors

(bags of words)

||]1,0[ Fq

||

1

2||

1

2

||

1:),(F

jj

F

jij

F

jjij

i

qd

qd

qdsim

Similarity metric:

2-30

Vector Space Model for Content Relevance Ranking

Search engine

Query (Set of weightedfeatures)

||]1,0[ Fid Documents are feature vectors

(bags of words)

||]1,0[ Fq

||

1

2||

1

2

||

1:),(F

jj

F

jij

F

jjij

i

qd

qd

qdsim

Similarity metric:Ranking bydescendingrelevance

k ikijij wwd 2/:

iikk

ijij fwithdocs

docs

dffreq

dffreqw

#

#log

),(max

),(1log:

tf*idfformula:term frequency *document frequency

2-31

Statistical Language Models (LM‘s)

qLM(1)

d1

d2

LM(2)

?

?

• each doc di has LM: generative prob. distr. with params i

• query q viewed as sample from LM(1), LM(2), …

• estimate likelihood P[ q | LM(i) ]

that q is sample of LM of doc di (q is „generated by“ di)• rank by descending likelihoods (best „explanation“ of q)

[Maron/Kuhns 1960, Ponte/Croft 1998, Hiemstra 1998, Lafferty/Zhai 2001]

„God does not play dice“ (Albert Einstein)

„IR does“ (anonymous)

„God rolls dice in places where you can‘t see them“ (Stephen Hawking)

4-33

LM: Doc as Model, Query as Sample

A A

C

A

D

E E E E

C C

B

A

E

B

model M

document d: sample of Mused for parameter estimation

P [ | M]A A B C E E

estimate likelihoodof observing query

query

)(

||21

)()(...)()(

||]|[

qfjqj

q

jdpjfjfjf

qdqP

multinomial prob. distribution

LM: Need for Smoothing

A A

C

A

D

E E E E

C C

B

A

E

B

model M

document d

P [ | M]A B C E F

estimate likelihoodof observing query

query

+ background corpus and/or smoothing

used for parameter estimation

C

AD

A

BE

F

+

Laplace smoothingJelinek-MercerDirichlet smoothing…

2-34

Some LM Basics

i i dqPdqPqds ]|[]|[),(simple MLE: overfitting

][)1(]|[),( qPdqPqds

ikk

kdf

idf

dktf

ditf

)(

)()1(

),(

),(log~

i

k

kidf

kdf

dktf

ditf

)(

)(

1),(

),(1log~

mixture modelfor smoothing

i diP

qiPqiPdqKL

]|[

]|[log]|[)|(~

KL divergence(Kullback-Leibler div.)aka. relative entropy

tf*idffamily

ik

dktf

ditf

),(

),(log~

independ. assumpt.

efficientimplementation

• Precompute per-keyword scores • Store in postings of inverted index• Score aggregration for (top-k) multi-keyword query

P[q] est. fromlog or corpus

IR as LM Estimation

P[R|d,q]user likes doc (R)given that it has features dand user poses query q

],|[

],|[~

qRdP

qRdP

prob. IR

][]|,[~ RPRdqP

][]|[],|[ RPRdPRdqP

]|[~ dqP statist. LM

qj d ]|j[Plog]d|q[Plog)d,q(s

query likelihood:

]d|q[Plogargmax-k d

top-k query result:

MLE would be tf2-36

Multi-Bernoulli vs. Multinomial LM

Multi-Bernoulli:)q(X1

j)q(X

jjj ))d(p1()d(p]d|q[P

with Xj(q)=1 if jq, 0 otherwise

Multinomial:

)(

||21

)()(...)()(

||]|[ qf

jqjq

jdpjfjfjf

qdqP

with fj(q) = f(j) = frequency of j in q

multinomial LM more expressive and usually preferred2-37

LM Scoring by Kullback-Leibler Divergence

)(

||2122 )(

)(...)()(

||log]|[log qf

jqjq

jdpjfjfjf

qdqP

)(log)(~ 2 dpqf jjqj

))(),(( dpqfH neg. cross-entropy

))(())(),((~ qfHdpqfH ))(||)(( dpqfD

)(

)(log)( 2 dp

qfqf

j

j

j j neg. KL divergenceof q and d

2-38

Jelinek-Mercer SmoothingIdea:use linear combination of doc LM withbackground LM (corpus, common language);

could also consider query log as background LMfor query||

),()1(

||

),()(ˆ

C

Cjfreq

d

djfreqdp j

parameter tuning of by cross-validation with held-out data:• divide set of relevant (d,q) pairs into n partitions• build LM on the pairs from n-1 partitions• choose to maximize precision (or recall or F1) on nth

partition• iterate with different choice of nth partition and average

2-39

Jelinek-Mercer Smoothing:Relationship to tf*idf

]q[P)1(]d|q[P]|q[P

qikk

)k(df

)i(df)1(

)d,k(tf

)d,i(tflog~

qi

k

k)i(df

)k(df

1)d,k(tf

)d,i(tf1log~

with absolutefrequencies tf, df

relative tf ~ relative idf

2-40

Dirichlet-Prior Smoothing

|d|

]C|j[P

|d|

]d|j[P|d|

mn

1f

i

ii)(Mmaxargˆ)d(p̂ jj

with i set to P[i|C]+1 for the Dirichlet hypergeneratorand > 1 set to multiple of average document length

Dirichlet (): 1jm..1j

jm..1j

jm..1jm1

j

)(

)(),...,(f

with

m..1j j 1

(Dirichlet is conjugate prior for parameters of multinomial distribution: Dirichlet prior implies Dirichlet posterior, only with different parameters)

d]][|f[P

][P]|f[P]f|[P:)(M

Posterior distr. with Dirichlet distribution as prior

)f(Dirichlet with term frequencies fin document d

MAP (Maximum Posterior) for

2-41

Dirichlet-Prior Smoothing:Relationship to Jelinek-Mercer Smoothing

|d|

]C|j[P

|d|

]d|j[P|d| ]|[)1(]|[)(ˆ CjPdjPdp j

with

|d|

|d|

where 1= P[1|C], ..., m= P[m|C] are the parameters

of the underlying Dirichlet distribution, with constant > 1typically set to multiple of average document length

with MLEsP[j|d], P[j|C]

tf

j

fromcorpus

2-42

Entity Search with LM Ranking [Z. Nie et al.: WWW’07, H. Fang et al.: ECIR‘07, P. Serdyukov et al.: ECIR‘08, …]

LM (entity e) = prob. distr. of words seen in context of e

][)1(]|[),( qPeqPqes ]q[P

]e|q[P~

i

ii

query q: „French player who won world championship“

candidate entities:

e1: David Beckham

e2: Ruud van Nistelroy

e3: Ronaldinho

e4: Zinedine Zidane

e5: FC Barcelona

played for ManU, Real, LA GalaxyDavid Beckham champions leagueEngland lost match against Francemarried to spice girl …

weightedby conf.

Zizou champions league 2002Real Madrid won final ...Zinedine Zidane best playerFrance world cup 1998 ...

))e(|)q((KL~ LMLM

query: keywords answer: entities

LM‘s: from Entities to Facts

Document / Entity LM‘s

Triple LM‘s

LM for doc/entity: prob. distr. of words

LM for query: (prob. distr. of) words

LM‘s: rich for docs/entities, super-sparse for queries

richer query LM with query expansion, etc.

LM for facts: (degen. prob. distr. of) triple

LM for queries: (degen. prob. distr. of) triple pattern

LM‘s: apples and oranges• expand query variables by S,P,O values from DB/KB• enhance with witness statistics• query LM then is prob. distr. of triples !

LM‘s for Triples and Triple Patterns

f1: Beckham p ManchesterUf2: Beckham p RealMadridf3: Beckham p LAGalaxyf4: Beckham p ACMilanF5: Kaka p ACMilanF6: Kaka p RealMadridf7: Zidane p ASCannesf8: Zidane p Juventusf9: Zidane p RealMadridf10: Tidjani p ASCannesf11: Messi p FCBarcelonaf12: Henry p Arsenalf13: Henry p FCBarcelonaf14: Ribery p BayernMunichf15: Drogba p Chelseaf16: Casillas p RealMadrid

triples (facts f):triple patterns (queries q):q: Beckham p ?y

200 300 20 30300150 20200350 10400200150100150 20

: 2600

q: Beckham p ManUq: Beckham p Realq: Beckham p Galaxyq: Beckham p Milan

200/550300/550 20/550 30/550

witness statistics

q: Cruyff ?r FCBarcelonaCruyff playedFor FCBarca 200/500 Cruyff playedAgainst FCBarca 50/500Cruyff coached FCBarca 250/500

q: ?x p ASCannesZidane p ASCannes 20/30Tidjani p ASCannes 10/30

LM(q) + smoothing

q: ?x p ?yMessi p FCBarcelona 400/2600Zidane p RealMadrid 350/2600Kaka p ACMilan 300/2600…

LM(q): {t P [t | t matches q] ~ #witnesses(t)}LM(answer f): {t P [t | t matches f] ~ 1 for f}smooth all LM‘srank results by ascending KL(LM(q)|LM(f))

[G. Kasneci et al.: ICDE’08; S. Elbassuoni et al.: CIKM’09, ESWC‘11]

LM‘s for Composite Queriesq: Select ?x,?c Where { France ml ?x . ?x p ?c . ?c in UK . }

f1: Beckham p ManU 200f7: Zidane p ASCannes 20f8: Zidane p Juventus 200f9: Zidane p RealMadrid 300f10: Tidjani p ASCannes 10f12: Henry p Arsenal 200f13: Henry p FCBarca 150f14: Ribery p Bayern 100f15: Drogba p Chelsea 150

f31: ManU in UK 200f32: Arsenal in UK 160f33: Chelsea in UK 140

f21: F ml Zidane 200f22: F ml Tidjani 20f23: F ml Henry 200f24: F ml Ribery 200f25: F ml Drogba 30f26: IC ml Drogba 100f27 ALG ml Zidane 50

queries q with subqueries q1 … qn

results are n-tuples of triples t1 … tn

LM(q): P[q1…qn] = i P[qi]

LM(answer): P[t1…tn] = i P[ti]

KL(LM(q)|LM(answer)) = i KL(LM(qi)|LM(ti))

P [ F ml Henry, Henry p Arsenal, Arsenal in UK ]

500

160

2600

200

650

200~

P [ F ml Drogba, Drogba p Chelsea, Chelsea in UK ]

500

140

2600

150

650

30~

LM‘s for Keyword-Augmented Queries

q: Select ?x, ?c Where { France ml ?x [goalgetter, “top scorer“] . ?x p ?c . ?c in UK [champion, “cup winner“, double] . }

subqueries qi with keywords w1 … wm

results are still n-tuples of triples ti

LM(qi): P[triple ti | w1 … wm] = k P[ti | wk] + (1) P[ti]

LM(answer fi) analogous

KL(LM(q)|LM(answer fi)) = i KL (LM(qi) | LM(fi))

result ranking prefers (n-tuples of) tripleswhose witnesses score high on the subquery keywords

LM‘s for Temporal Phrases

q: Select ?c Where { ?c instOf nationalTeam . ?c hasWon WorldCup [nineties] . }

Problem: user‘s temporal interest is imprecise1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010

July 1994mid 90s

extract temp expr‘s xj

from witnesses& normalize

[K.Berberich et al.: ECIR‘10]

“mid 90s“ xj = [1Jan93,31Dec97] normalize

temp expr vin query

“nineties“ v = [1Jan90,31Dec99]

P[q | t] ~ … j P[v | xj]

~ … j { P[ [b,e] | xj] | interval [b,e]v}

~ … j overlap(v,xj) / union(v,xj)

• enhanced ranking• efficiently computable• plug into doc/entity/triples LM‘s

lastcentury

summer 1990

1998

P[v=„nineties“ | xj = „mid 90s“] = 1/2P[„nineties“ | „summer 1990“] = 1/30P[„nineties“ | „last century“] = 1/10

Query Relaxationq: … Where { France ml ?x . ?x p ?c . ?c in UK . }

f1: Beckham p ManU 200f7: Zidane p ASCannes 20f9: Zidane p Real 300f10: Tidjani p ASCannes 10f12: Henry p Arsenal 200f15: Drogba p Chelsea 150

f31: ManU in UK 200f32: Arsenal in UK 160f33: Chelsea in UK 140

f21: F ml Zidane 200f22: F ml Tidjani 20F23: F ml Henry 200F24: F ml Ribery 200F26: IC ml Drogba 100F27 ALG ml Zidane 50

[ F ml Zidane, Zidane p Real, Real in ESP ]

q(1): … Where { France ml ?x . ?x p ?c . ?c in ?y . }

[ IC ml Drogba, Drogba p Chelsea, Chelsea in UK] [ F resOf Drogba,

Drogba p Chelsea, Chelsea in UK] [ IC ml Drogba,

Drogba p Chelsea, Chelsea in UK]

q(2): … Where { ?x ml ?x . ?x p ?c . ?c in UK . }q(3): … Where { France ?r ?x . ?x p ?c . ?c in UK . }q(4): … Where { IC ml ?x . ?x p ?c . ?c in UK . }

LM(q*) = LM(q) + 1 LM(q(1)) + 2 LM(q(2)) + …

replace e in q by e(i) in q(i):precompute P:=LM (e ?p ?o) and Q:=LM (e(i) ?p ?o)set i ~ 1/2 (KL (P|Q) + KL (Q|P))

replace r in q by r(i) in q(i) LM (?s r(i) ?o)replace e in q by ?x in q(i) LM (?x r ?o)… LM‘s of e, r, ...

are prob. distr.‘s of triples !

Result Personalization

Open issue: „insightful“ results (new to the user)

q

q

q1

q2

q3f3

f4

f1f2f5

q3

q4 q5f3

q6

f6

same answer for everyone?

Personal histories ofqueries & clicked facts LM(user u): prob. distr. of triples !

[S. Elbassuoni et al.: PersDB‘08]

u1 [classical music] q: ?p from Europe . ?p hasWon AcademyAward

u2 [romantic comedy] q: ?p from Europe . ?p hasWon AcademyAward

u3 [from Africa] q: ?p isa SoccerPlayer . ?p hasWon ?a

LM(q|u] = LM(q) + (1) LM(u)then business as usual

Result Diversification

q: Select ?p, ?c Where { ?p isa SoccerPlayer . ?p playedFor ?c . }

1 Beckham, ManchesterU2 Beckham, RealMadrid3 Beckham, LAGalaxy4 Beckham, ACMilan5 Zidane, RealMadrid6 Kaka, RealMadrid7 Cristiano Ronaldo, RealMadrid8 Raul, RealMadrid9 van Nistelrooy, RealMadrid10 Casillas, RealMadrid

1 Beckham, ManchesterU2 Beckham, RealMadrid3 Zidane, RealMadrid4 Kaka, ACMilan5 Cristiano Ronaldo, ManchesterU6 Messi, FCBarcelona7 Henry, Arsenal8 Ribery, BayernMunich9 Drogba, Chelsea10 Luis Figo, Sporting Lissabon

rank results f1 ... fk by ascending

KL(LM(q) | LM(fi)) (1) KL( LM(fi) | LM({f1..fk}\{fi}))implemented by greedy re-ranking of fi‘s in candidate pool

[J. Carbonell, J. Goldstein: SIGIR‘98]

Entity-Search Ranking by Link Analysis[A. Balmin et al. 2004, Nie et al. 2005, Chakrabarti 2007, J. Stoyanovich 2007]

EntityAuthority (ObjectRank, PopRank, HubRank, EVA, etc.):

• define authority transfer graph

among entities and pages with edges:• entity page if entity appears in page• page entity if entity is extracted from page• page1 page2 if there is hyperlink or implicit link between pages• entity1 entity2 if there is a semantic relation between entities

• edges can be typed and (degree- or weight-) normalized and

are weighted by confidence and type-importance• also applicable to graph of DB records with foreign-key relations

(e.g. bibliography with different weights of publisher vs. location for conference record)

• compared to standard Web graph, ER graphs of this kind

have higher variation of edge weights

2-52

Outline

...

Searching for Entities & Relations

Efficient Query Processing

Motivation

Wrap-up

Informative Ranking

User Interface

2-53

54/34

Scalable Semantic Web: Pattern Queries on Large RDF Graphs

schema-free RDF triples: subject-property-object (SPO) example: Einstein hasWon NobelPrizeSPARQL triple patterns: Select ?p,?c Where { ?p isa scientist . ?p hasWon NobelPrize . ?p bornIn ?t . ?t inCountry ?c . ?c partOf Europe}large join queries, unpredictable workload,difficult physical design, difficult query optimization

Einstein hasWon NobelEinstein bornIn UlmRonaldo hasWon FIFASpain partOf EuropeFrance partOf Europe… … .,.

S P O S O Einstein NobelRonaldo FIFA… …

hasWon

S hasWon bornIn .,.

Person

Einstein Nobel Ulm .,.Ronaldo FIFA Rio .,..,. .,. .,. .,.

Semantic-Web engines (Sesame, Jena, etc.)did not provide scalable query performance

CountryS partOf capital .,..,. .,. .,. .,.

S O … …

bornIn

AllTriples

?p ?b ?t

Scalable Semantic Web: RDF-3X Engine[T. Neumann et al.: VLDB 2008, SIGMOD 2009, VLDBJ’10]

• RISC-style, tuning-free system architecture

• map literals into ids (dictionary) and precompute

exhaustive indexing for SPO triples:

SPO, SOP, PSO, POS, OSP, OPS,

SP*, PS*, SO*, OS*, PO*, OP*, S*, P*, O*

very high compression

• efficient merge joins with order-preservation

• join-order optimization

by dynamic programming over subplan result-order

• statistical synopses for accurate result-size estimation

http://code.google.com/p/rdf3x/

http://www.mpi-inf.mpg.de/~neumann/rdf3x/

56/34

Scalable Indexing & Merge Joins in RDF-3X

scan OPS[O=scientist

P=isa]

scan OPS[O=NobelPrize,

P=hasWon]

scan PSO[P=bornIn]

|| [S=S]

...

…3007 2015 10113007 2015 10553007 2015 11193007 2015 11273007 2015 11353007 2015 13393007 2015 13483007 2015 1418

O P S…

3111 2003 10193111 2003 10553111 2003 11113111 2003 11273111 2003 11353111 2003 11993111 2003 12273111 2003 1243

O P S…

2100 1003 56482100 1119 55552100 1127 57122100 1255 55152100 1266 57122100 1418 55102100 1429 55552100 1683 5843

P S O

scan PSO[P=inCountry]

P S O

2077 1005 8311

2077 1431 8200

2077 1333 8751

2077 1148 8244

2077 1099 80182077 1127 8123

2077 1266 8510

2077 1429 8135

?p isa scientist . ?p hasWon NobelPrize . ?p bornIn ?t . ?t inCountry ?c . ?c partOf Europe

|| [S=S]

|| [O=S]

poorjoin order

57/34

Scalable Indexing & Merge Joins in RDF-3X

scan OPS[O=scientist

P=isa]

scan OPS[O=NobelPrize,

P=hasWon]

scan PSO[P=bornIn]

|| [S=S]

|| [S=S]

...

…3007 2015 10113007 2015 10553007 2015 11193007 2015 11273007 2015 11353007 2015 13393007 2015 13483007 2015 1418

O P S…

3111 2003 10193111 2003 10553111 2003 11113111 2003 11273111 2003 11353111 2003 11993111 2003 12273111 2003 1243

O P S…

2100 1003 56482100 1119 55552100 1127 57122100 1255 55152100 1266 57122100 1418 55102100 1429 55552100 1683 5843

P S O

scan PSO[P=inCountry]

P S O

|| [O=S]

2077 1005 8311

2077 1431 8200

2077 1333 8751

2077 1148 8244

2077 1099 80182077 1127 8123

2077 1266 8510

2077 1429 8135

?p isa scientist . ?p hasWon NobelPrize . ?p bornIn ?t . ?t inCountry ?c . ?c partOf Europe good

join order

58/34

Scalable Indexing & Merge Joins in RDF-3X

scan OPS[O=scientist

P=isa]

scan OPS[O=NobelPrize,

P=hasWon]

scan PSO[P=bornIn]

|| [S=S]

|| [S=S]

...

…3007 2015 10113007 2015 10553007 2015 11193007 2015 11273007 2015 11353007 2015 13393007 2015 13483007 2015 1418

O P S…

3111 2003 10193111 2003 10553111 2003 11113111 2003 11273111 2003 11353111 2003 11993111 2003 12273111 2003 1243

O P S…

2100 1003 56482100 1119 55552100 1127 57122100 1255 55152100 1266 57122100 1418 55102100 1429 55552100 1683 5843

P S O

scan PSO[P=inCountry]

P S O

|| [O=S]

2077 1005 8311

2077 1431 8200

2077 1333 8751

2077 1148 8244

2077 1099 80182077 1127 8123

2077 1266 8510

2077 1429 8135

?p isa scientist . ?p hasWon NobelPrize . ?p bornIn ?t . ?t inCountry ?c . ?c partOf Europe

sidewaysinformationpassing

run-time filters

Join-Order Optimization for SPARQL

mehasFriend

gender

livesIn

Berlin

?f ?s ?p ?a

?x ?y

singer

Berlin

performedIn

inC

ity

2009

inYe

ar

likes

BobDylan

composedBy

prot

est antiwar

female

Fine-grained RDF (e.g. over social-tagging data)often entails complex queries with 10, 20 or more joins

Join-order optimization operates on join graph:nodes are triple patterns, edges denote shared variables• need selectivity (result cardinality) estimator • join-order optimization is O(n3) for chains, O(2n) for stars• exact optimization uses dynamic programming over sub-graphs and the sort order of intermediate results

[T. Neumann: VLDBJ‘10]

?f gender female .?f hasFriend me .?f livesIn Berlin .?f likes ?s .?s taggedProtestBy ?x .?s taggedAntiwarBy ?y .?s composedBy BobDylan .?s performedIn ?p .?p inCity Berlin .?s inYear 2009 .?p singer ?a .

60/34

Experimental Evaluation: Setup

Setup and competitors:2GHz dual core, 2 GB RAM, 30MB/s disk, Linux• column-store property tables by MIT folks, using MonetDB• triples store with SPO, POS, PSO indexes, using PostgreSQL

Datasets:1) Barton library catalog: 51 Mio. triples (4.1 GB)2) YAGO knowledge base: 40 Mio. triples (3.1 GB)3) Librarything social-tagging excerpt: 30 Mio. triples (1.8 GB)

Benchmark queries such as:1) counts of French library items (books, music, etc.), with creator, publisher, language, etc.2) scientist from Poland with French advisor who both won awards3) books tagged with romance, love, mystery, suspense by users who like crime novels and have friends who ...

Select ?t Where {?b hasTitle ?t . ?u romance ?b .?u love ?b .?u mystery ?b .?u suspense ?b .?u crimeNovel ?c .?u hasFriend ?f .?f ... }

[T. Neumann: SIGMOD‘09]

61/34

Experimental Evaluation: Results

YAGO dataset(40 Mio. triples, 3.1 GB):2.7 GB in RDF-3X, loaded in 25 min

Librarything dataset(36 Mio. triples, 1.8 GB):1.6 GB in RDF-3X, loaded in 20 min

measurements on Uniprot (845 Mio. triples) and Billion-Triples (562 Mio. triples)confirm these experimental results

exe

cuti

on

tim

e [s

]

exe

cuti

on

tim

e [s

]

[T. Neumann: SIGMOD‘09]

Real

Madrid

Özil

Distributed RDF Stores & Sparql at Scale[D. Abadi et al.: VLDB‘11]

• Partition data graph by hashing triples on subject• Enhance locality by replicating N-hop neighbors with each triple• Can automatically determine when query is parallelizable without communication• For remaining cases: distributed joins, semi-joins, etc.

Iniesta

FC Barca

Xavi

Barcelona

?player

?pos

?pop

?city?team

> 1 000 000

locatedIn

bornIn

playsPos

playsForhasPop

midfielderMadrid

Rosario

Real

Gelsenkirchen

Casillas

Özil

Messi

Casillas

Iniesta

Messi

Xavi

Özil

FC Barca

Real

Barcelona

Madrid

Rosario

Gelsenkirchen

Top-k Query Processing for Ranked Results

Index lists

s(t1,d1) = 0.7…s(tm,d1) = 0.2

s(t1,d1) = 0.7…s(tm,d1) = 0.2

Data items: d1, …, dn

Query: q = (t1, t2, t3)aggr: summation

t1d780.9

d10.7

d880.2

d100.2

d780.1

d990.2

d340.1

d230.8

d100.8

d1d1

t2d640.8

d230.6

d100.6

t3d100.7

d780.5

d640.4

fetch listsjoin&sort

simple & DB-style;needs only O(k) memory;for monotone score aggr.

2-63

Index lists

s(t1,d1) = 0.7…s(tm,d1) = 0.2

s(t1,d1) = 0.7…s(tm,d1) = 0.2

Data items: d1, …, dn

Query: q = (t1, t2, t3)aggr: summation

t1d780.9

d880.2d780.1d340.1

d230.8

d100.8

d1d1

t2d640.9

d230.6

d100.6

t3d100.7

d780.5

d640.3

Threshold algorithm (TA):scan index lists; consider d at posi in Li;highi := s(ti,d);if d top-k then { look up s(d) in all lists L with i; score(d) := aggr {s(d) | =1..m};if score(d) > min-k then add d to top-k and remove min-score d’; min-k := min{score(d’) | d’ top-k};threshold := aggr {high | =1..m};if threshold min-k then exit;

Scan depth 1Scan

depth 1Scan

depth 2Scan

depth 2Scan

depth 3Scan

depth 3

k = 2

simple & DB-style;needs only O(k) memory;for monotone score aggr.

Scan depth 4Scan

depth 4

d10.7

d990.2

d120.2 2 d64 0.9

Rank Doc Score

2 d64 1.2

1 d78 0.91 d78 1.51 d78 1.5

2 d64 0.9

Rank Doc Score

2 d78 1.5

1 d10 2.1Rank Doc Score

1 d10 2.1

2 d78 1.5

Rank Doc Score

1 d10 2.1

2 d78 1.5

Rank Doc Score

1 d10 2.1

2 d78 1.5

STOP!STOP!

Threshold Algorithm (TA) [Fagin 01, Güntzer 00, Nepal 99, Buckley85]

2-64

TA with Sorted Access Only (NRA) [Fagin 01, Güntzer et al. 01]

Index lists

s(t1,d1) = 0.7…s(tm,d1) = 0.2

s(t1,d1) = 0.7…s(tm,d1) = 0.2

Data items: d1, …, dn

Query: q = (t1, t2, t3)aggr: summation

Rank Doc Worst-score

Best-score

1 d78 0.9 2.4

2 d64 0.8 2.4

3 d10 0.7 2.4

Rank Doc Worst-score

Best-score

1 d78 1.4 2.0

2 d23 1.4 1.9

3 d64 0.8 2.1

4 d10 0.7 2.1

Rank Doc Worst-score

Best-score

1 d10 2.1 2.1

2 d78 1.4 2.0

3 d23 1.4 1.8

4 d64 1.2 2.0

t1d780.9

d10.7

d880.2

d120.2

d780.1

d990.2

d340.1

d230.8

d100.8

d1d1

t2d640.8

d230.6

d100.6

t3d100.7

d780.5

d640.4

STOP!STOP!

No-random-access algorithm (NRA):scan index lists; consider d at posi in Li;E(d) := E(d) {i}; highi := s(ti,d);worstscore(d) := aggr{s(t,d) | E(d)};bestscore(d) := aggr{worstscore(d), aggr{high | E(d)}};if worstscore(d) > min-k then add d to top-k min-k := min{worstscore(d’) | d’ top-k};else if bestscore(d) > min-k then cand := cand {d};threshold := max {bestscore(d’) | d’ cand};if threshold min-k then exit;

Scan depth 1Scan

depth 1Scan

depth 2Scan

depth 2Scan

depth 3Scan

depth 3

k = 1

sequential access (SA) faster than random access (RA)by factor of 20-1000

2-65

Query Expansion (Synonyms, Related Terms, …)

magician

wizard

intellectual

artist

alchemist

directorprimadonna

lecturer

professor

teacher

educator

scholar

academic, academician, faculty member

scientist

researcher

HYPONYM (0.749)HYPONYM (0.749)

Thesaurus/Ontology:concepts, relationships, glosses from WordNet, Wikipedia, KB, Web, etc.

relationships quantified bystatistical correlation measurese.g. Jaccard coefficients: |XY| / |XY|

Query expansion

Weighted expanded queryExample:(professor lecturer (0.749) scholar (0.71) ...)and Germanyand ( (course class (1.0) seminar (0.84) ... ) and („IR“ „Web search“ (0.653) ... ) )

Term2Concept with WSD

exp(ti)={w | sim(ti,w) }

Efficient top-k searchwith dynamic expansion

better recall, better meanprecision for hard queries

investigator

mentor

Problems:• efficiency• tuning the threshold

User query: ~t1 ... ~tmExample:~professor Germany ~course ~IR

2-66

Query Expansion with Incremental Merging

relaxable query q: ~professor researchwith expansions based on ontology relatedness modulatingmonotonic score aggregation

Better: dynamic query expansion with incremental merging of additional index lists

efficient, robust, self-tuning

lecturer: 0.7

37: 0.944: 0.8

...

22: 0.723: 0.651: 0.652: 0.6

scholar: 0.692: 0.967: 0.9

...

52: 0.944: 0.855: 0.8

research

B+ tree index on terms

57: 0.644: 0.4

...

professor

52: 0.433: 0.375: 0.3

12: 0.914: 0.8

...

28: 0.617: 0.5561: 0.544: 0.5

44: 0.4

ontology / meta-index

professorlecturer: 0.7scholar: 0.6academic: 0.53scientist: 0.5...

exp(i)={w | sim(i,w) }

primadonna

teacher

investigator

magician

wizard

intellectual

artist

alchemist

director

professor

scholar

academic,academician,faculty member

scientist

researcher

Hyponym (0.749)Hyponym (0.749)

mentor

Related (0.48)Related (0.48)

lecturerTA scans of index lists for iq exp(i)

[M. Theobald et al.: SIGIR 2005]

2-67

Query Expansion with Incremental Merging[M. Theobald et al.: SIGIR 2005]

Principles of the algorithm:• organize possible expansions Exp(t)={w | sim(t,w) , tq} in priority queues based on sim(t,w) for each t

for given q = {t1, tm}: • keep track of active expansions for each ti: ActExp(ti) = {w1(ti), w2(ti), ...w j(i)(ti)}

~ position j(i) in the priority queue for Exp(ti)

• scan index lists for t1, w1(t1), ..., wj(1)(t1), ..., tm, w1(tm), ..., wj(m)(tm)

maintaining cursors pos() and score bounds high() for each list • in each scan step do for each ti:

advance cursor of list for w(ti) ActExp(ti)

for which best := high()*sim(w,ti) is largest among all lists of ti

or add next largest w(ti) to ActExp(ti) if sim(w,ti) > best

and proceed in this new list

2-68

Nested Top-k [Theobald et al.: SIGIR’05]

top-k

research professor XML „information retrieval“ „query result ranking“

research prof XML informationretrieval

query result ranking

top-k top-k

multithreaded coroutine-like evaluation with synchronization of PQs

2-69

Nested Top-k [Theobald et al.: SIGIR’05]

top-k

~research ~professor XML „information retrieval“ „query result ranking“

top-k top-k

research prof XML informationretrieval

query result ranking

top-k top-k

science scholar lecturer

multithreaded coroutine-style evaluation with synchronization of PQs

2-70

Top-k Rank Joins on Structured Data[Ilyas et al. 2008]

extend TA/NRA/etc. to ranked query results from structured data(improve over baseline: evaluate query, then sort)

Select R.Name, C.Theater, C.MovieFrom RestaurantsGuide R, CinemasProgram CWhere R.City = C.CityOrder By R.Quality/R.Price + C.Rating Desc

BlueDragon Chinese €15 SBHaiku Japanese €30 SBMahatma Indian €20 IGBMescal Mexican €10 IGBBigSchwenk German €25 SLS...

Name Type Quality Price City

RestaurantsGuide

BlueSmoke Tombstone 7.5 SBOscar‘s Hero 8.2 SBHolly‘s Die Hard 6.6 SBGoodNight Seven 7.7 IGBBigHits Godfather 9.1 IGB...

Theater Movie Rating City

CinemasProgram

process index lists on quality desc, price asc, rating desc2-71

Top-k Sparql Queries over RDF

?w wrote ?b . ?b hasTag Memoir .

An_Hour_before_Daylight hasTag Memoir 0.880The_Grand_Alliance hasTag Memoir 0.873The_Night_Trilogy hasTag Memoir 0.800Roughing_it hasTag Memoir 0.796................................................ …….

Mark_Twain wrote Roughing_it 0.999J_K_Rowling wrote Harry_Potter 0.982Stephanie_Meyer wrote Twilight 0.966Cormac_McCarthy wrote The_Road 0.864Jimmy_Carter wrote Palestine 0.732………………………………… …….

[S. Elbassuoni 2012]• Precompute score-sorted index lists for each S value, P, O, SP value pair, PO, SO• Perform rank-join over index lists

2-72

Top-k Sparql Queries over RDF + Text [S. Elbassuoni 2012]

?w wrote ?b[nobel prize]; ?b hasTag Memoir

An_Hour_before_Daylight hasTag Memoir 0.880The_Grand_Alliance hasTag Memoir 0.873The_Night_Trilogy hasTag Memoir 0.800Roughing_it hasTag Memoir 0.796................................................ …….

Jimmy_Carter wrote Palestine 0.712Albert_Camus wrote The_Rebel 0.667 Doris_Lessing wrote Alfred_and_Emily 0.660  ………………………………… …….

nobel

Ian_McEwan wrote On_Chesil_Beach 0.882Jimmy_Carter wrote Palestine 0.774 Cormac_McCarthy wrote The_Road 0.623 Jhumpa_Lahiri wrote The_Namesake 0.600………………………………… …….

prize

• Precompute score-sorted index lists for each S+keyword, P+keyword, O+…, SP+keyword, PO+…, SO+…• Perform rank-join over index lists

2-73

Outline

...

Searching for Entities & Relations

Efficient Query Processing

Motivation

Wrap-up

Informative Ranking

User Interface

2-74

UI: Table Search by Example & Description Expect result table specify query in tabular form (QBE, QueryByDescription)

[S.Sarawagi et al.: VLDB‘09]

q1: David Beckham Real MadridZinedine Zidane JuventusLionel Messi FC Barcelona

Bayern Munich Real Madrid1.FC Kaiserslautern Real MadridBorussia Dortmund Real Madrid

q2: 2:15:05:1

match against table rows

infer query fromtable headers ???

QBE

QBE

soccer club from Germany and winner against Real Madridq2:

output class descriptionQBD

player club ???

???q1:QBD

UI: Structured Keyword Search Need to map (groups of) keywords onto entities & relationshipsbased on name-entity similarities/probabilities

q: Champions League finals with Real Madrid

[Ilyas et al. Sigmod‘10]

UEFA Champions League

Real Madrid C.F.League ofWrestlingChampions

final match

final exam

q: German football clubs that won (a match) against Real

more ambiguity, more sophisticated relation more candidates combinatorial complexity

Real Madrid C.F.

Real Zaragoza

Brazilian Real

Familia Real

Real Number

soccer

American football

social club

night club

sports team

UI: Natural Language Questions

map resultsinto tabular or visual presentationor speech

Which German soccer clubs have won against Real Madrid ?

translate question into Sparql query:• dependency parsing to decompose question• mapping of question units onto entities, classes, relations

Which ?c

?c German soccer clubs ?r ?r have won against ?o

?o Real Madrid ?

UI: Natural Language Questions

map resultsinto tabular or visual presentationor speech

Select ?c Where { ?c isa SoccerClub . c? in Germany . }

?c wonAgainst RealMadrid . }?id: ?c playedAgainst RealMadrid . ?c isWinnerOf ?id .}

translate question into Sparql query:• dependency parsing to decompose question• mapping of question units onto entities, classes, relations

Select ?c Where { ?c isa GermanSoccerClub . c? wonAgainst RealMadrid . }

SPARQL or Keywords or … ?

SPARQL query: Select ?p Where {?p created ?s . ?s type music . ?s contributesTo ?m . ?m type movie .?p bornIn Rome . }

Who composed scores for westerns and is from Rome?

Keywords query:composer score western Rome

or: {composer Rome} score westernor: …

Faceted search with interactive navigation:composer

Romescore music movie

western2-79

Natural Language Question AnsweringIBM Watson in Jeopardy show: Which city has two airports named after a war hero and a battlefield?

used Yago & Dbpedia for type checking

[IBM Journal of R&D 56(3-4), 2012]

question

QuestionAnalysis:ClassificationDecomposition

HypothesesGeneration(Search):AnswerCandidates

CandidateFiltering &Ranking

Hypotheses& EvidenceScoring

answer

Overall architecture of Watson (simplified)

2-80

Semantic Technologies in IBM WatsonIBM Watson in Jeopardy show: Which city has two airports named after a war hero and a battlefield?

used Yago & Dbpedia for type checking

[A. Kalyanpur et al.: ISWC 2011]

Semantic checking of answer candidatesquestion

candidatestring

ConstraintChecker

TypeChecker

lexicalanswer type

KB

RelationDetection

EntityDisambiguation& Matching

PredicateDisambiguation& Matching

candidatescore

KB instances

semantic types

spatial & temporalrelations

2-81

Scoring of Semantic Answer Types

[A. Kalyanpur et al.: ISWC 2011]

Check for 1) Yago classes, 2) Dbpedia classes, 3) Wikipedia lists

Match lexical answer type against class candidatesbased on string similarity and class sizes (popularity)Examples: Scottish inventor inventor, star movie star

Compute scores for semantic types, considering:class match, subclass match, superclass match,sibling class match, lowest common ancestor, class disjointness, …

no types Yago Dbpedia Wikipedia all 3Standard QAaccuracy 50.1% 54.4% 54.7% 53.8% 56.5%Watsonaccuracy 65.6% 68.6% 67.1% 67.4% 69.0%

see also: j-archive.com 2-82

From Questions to Queries

NL question:

Who composed scores for westerns and is from Rome?

scores for westerns

is from Rome Who composed scores

Dependency parsing exposes structure of question „triploids“ (sub-cues)

2-83

From Triploids to TriplesWho composed scores for westerns and is from Rome?

Who is from Rome

Who composed scores

scores for westerns

?x composed scores

?x bornIn Rome

scores contributesTo ?y

?y type westernMovie

?x type composer

?x composed ?s

?s contributesTo ?y

?s type music

2-84

Disambiguation Mapping for TriploidsWho composed scores for westerns and is from Rome?

composed

composedscores

scores for

westerns

is from

Rome

Who

q1

q2

q3

q4

Combinatorial Optimization by ILP (with type constraints etc.)

Rome (Italy)

Lazio Roma

person

musicianThe Who

createdwroteCompositionwroteSoftware

soundtracksoundtrackFor

shootsGoalFor

bornIn

actedIn

western movie

Western Digital

we

igh

ted

ed

ge

s (c

oh

ere

nc

e, s

imila

rity

, etc

.)

2-85

Relaxing Overconstrained QueriesSelect ?p Where {

?p composed ?s . ?s type music . ?s for ?m . ?m type movie .?p bornIn Rome . }

Select ?p Where {?p composed ?s . ?s type music . ?s for ?m . ?m type movie [western] .?p bornIn Rome . }

Select ?p Where {?p ?rel1 ?s [composed] . ?s type music . ?s ?rel2 ?m . ?m type movie [western] .?p bornIn Rome . }

with extended SPARQL-FullText: SPOX quad patterns

(S. Elbassuoni et al.: CIKM‘10, ESWC’11) 2-86

Preliminary Resultshttp://www.mpi-inf.mpg.de/yago-naga/deanna/

(M. Yahya et al.: WWW’12)

2-87

Preliminary Results

2-88

Preliminary Results

2-89

Preliminary Results

2-90

Outline

...

Searching for Entities & Relations

Efficient Query Processing

Motivation

Wrap-up

Informative Ranking

User Interface

2-91

Take-Home LessonsSemantic Search over Entities and Relationships:Sparql is natural choice, but text extension is crucial

For ranking, LM‘s are effective, elegant, versatilestatistically principled & compositionalapplicable to entities, RDF, triples+text, temporal expr, …

keywords++, tables by example, natural language, speech, …

APIs becoming clear, UIs widely open

triple indexing, join order optimization, …top-k query processing

Efficient implementation techniques

2-92

Open Problems and Grand Challenges

Reconcile expressiveness of Sparql + extensionswith ease-of-use of natural-language QA

Efficient and effective (LM-based) Top-k Rankingfor entire Web of Linked Data (& Text)

Extended LM-based Ranking for Entitiesin Temporal & Spatial (& Social) Context

Extend Sparql with Text, Time, Space, … Influence W3C standard …

Distributed / federated query processingfor entire Web of Linked Data

2-93

Recommended Readings: Search & Ranking• G. Kasneci, F. Suchanek, G. Ifrim: NAGA: Searching and Ranking Knowledge. ICDE 2008• S. Elbassuoni, M. Ramanath, G. Weikum: Query Relaxation for Entity-Relationship Search. ESWC 2011• S. Elbassuoni, M. Ramanath, et al.: Language-model-based ranking for queries on RDF-graphs. CIKM 2009• Z. Nie, Y. Ma, S. Shi, J.-R. Wen, W.-Y. Ma: Web Object Retrieval. WWW 2007• H. Bast, A. Chitea, F. Suchanek, I. Weber: ESTER: efficient search on text, entities, and relations. SIGIR 2007• J. Pound, P. Mika, H. Zaragoza: Ad-hoc object retrieval in the web of data. WWW 2010• O. Udrea, D. Recupero, V.S. Subrahmanian: Annotated RDF. ACM Trans. Comput. Log. 11(2), 2010• V. Hristidis, H.Hwang, Y.Papakonstantinou: Authority-based keyword search in databases. TODS 33(1), 2008• T. Cheng, X. Yan, K. Chang: EntityRank: Searching Entities Directly and Holistically. VLDB 2007• D. Vallet, H. Zaragoza: Inferring the Most Important Types of a Query: a Semantic Approach. SIGIR 2008• R. Blanco, H. Zaragoza: Finding support sentences for entities. SIGIR 2010• P. Serdyukov, D. Hiemstra: Modeling Documents as Mixtures of Persons for Expert Finding. ECIR 2008• R. Kaptein, P. Serdyukov, A.P. de Vries, J. Kamps: Entity ranking using Wikipedia as a pivot. CIKM 2010• H. Fang, C. Zhai: Probabilistic Models for Expert Finding. ECIR 2007• D. Petkova, W.B. Croft: Hierarchical Language Models for Expert Finding in Enterprise Corpora. ICTAI 2006• M. Pasca: Towards Temporal Web Search. SAC 2008• K. Berberich, S.J. Bedathur, O. Alonso, G. Weikum: A Language Modeling Approach for Temporal Information Needs. ECIR 2010: 13-25• J. Tappolet et al.: Applied temporal RDF: efficient temporal querying of RDF data with SPARQL. ESWC 2009• C. Zhai: Statistical Language Models for Information Retrieval. Morgan&Claypool, 2008• B.C. Ooi (Ed.): Special Issue on Keyword Search, IEEE Data Eng. Bull. 33(1), 2010• J.X. Yu, L. Qin, L. Chang: Keyword Search in Databases Morgan & Claypool, 2010• S. Ceri, M. Brambilla (Eds.): Search Computing: Challenges and Directions, Springer, 2010• M. Arenas, S. Conca, J. Perez: Counting beyond a Yottabyte, or how SPARQL 1.1 property paths will prevent adoption of the standard. WWW 2012

2-94

Recommended Readings: Efficiency & UI• T. Neumann, G. Weikum: The RDF-3X engine for scalable management of RDF data. VLDB J. 19(1), 2010• J. Huang, D.J. Abadi, K. Ren: Scalable SPARQL Querying of Large RDF Graphs. PVLDB 4(11), 2011• L. Sidirourgos, R. Goncalves, M.L. Kersten, N. Nes, S. Manegold: Column-store support for RDF data management: not all swans are white. PVLDB 1(2), 2008• R. Delbru, S. Campinas, G. Tummarello: Searching web data: An entity retrieval and high-performance indexing model. J. Web Sem. 10, 2012• G. Tummarello et al.: Sig.ma: live views on the web of data. WWW 2010• M. Schmidt, T. Hornung, G. Lausen, C. Pinkel: SP^2Bench: A SPARQL Performance Benchmark. ICDE 2009• M. Morsey, J. Lehmann, S. Auer, A.N. Ngomo: DBpedia SPARQL Benchmark - Performance Assessment with Real Queries on Real Data. ISWC 2011• K. Hose, R. Schenkel, M. Theobald, G. Weikum: Database Foundations for Scalable RDF Processing. in: A. Polleres et al. (Eds.): Reasoning Web - Semantic Technologies for the Web of Data, Springer 2011• A. Schwarte, P. Haase, K. Hose, R. Schenkel, M. Schmidt: FedX: Optimization Techniques for Federated Query Processing on Linked Data. ISWC 2011• J. Pound, I.F. Ilyas, G.E. Weddell: Expressive and flexible access to web-extracted data: a keyword-based structured query language. SIGMOD 2010 • R. Pimplikar, S. Sarawagi: Answering Table Queries on the Web using Column Keywords. PVLDB 5(10), 2012• A. Kalyanpur, J.W. Murdock, J. Fan, C.A. Welty: Leveraging Community-Built Knowledge for Type Coercion in Question Answering. ISWC 2011• C. Unger, L. Bühmann, J. Lehmann, A.N. Ngomo, D. Gerber, P. Cimiano: Template-based question answering over RDF data. WWW 2012• M. Yahya. K. Berberich, S. Elbassuoni, M. Ramanath, V. Tresp, G. Weikum: Natural Language Questions for the Web of Data. EMNLP 2012• D.A. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine 31(3), 2010• IBM Journal of Research and Development 56(3-4), Special Issue on “This is Watson”, 2012

2-95

Thank You!

2-96

Recommended