Gerhard Weikum Max Planck Institute for Informatics mpi-inf.mpg.de/~weikum

Gerhard Weikum Max Planck Institute for Informaticshttp://www.mpi-inf.mpg.de/~weikum/

Knowledge Harvesting from Text and Web Sources

Part 2: Search and Ranking of Knowledge

The Web Speaks to Us

• Web 2012 contains more DB-style data than ever• getting better at making structured content explicit: entities, classes (types), relationships• but no (hope for) schema here!

Source: DB & IR methods for knowledge discovery.Communications ofthe ACM 52(4), 2009

Structure Now!

Bob Dylan

CarlaBruni

Nicolas Sarkozy

JoanBaez

Bob Dylan

France

ChampsElysees

Grammy

Nicolas Sarkozy

Actor AwardChristoph Waltz OscarSandra Bullock OscarSandra Bullock Golden Raspberry…

Movie ReportedRevenueAvatar $ 2,718,444,933The Reader $ 108,709,522 Facebook FriendFeedSoftware AG IDS Scheer…

Company CEOGoogle Eric SchmidtYahoo OvertureFacebook FriendFeedSoftware AG IDS Scheer…

Knowledge bases with factsfrom Web and IE witnesses

IE-enriched Web pages with embedded entities and facts

http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png

Distributed Structure: Linking Open Data30 Bio. triples500 Mio. links

rdf.freebase.com/ns/en.rome

owl:sameAs

wl:sameAs

data.nytimes.com/51688803696189142301

geonames.org/3169070/roma

N 41° 54' 10'' E 12° 29' 2''

dbpprop:citizenOf

dbpedia.org/resource/Rome

f:type

rdfs:subclassOf

yago/wordnet:Actor109765278

f:type

rdfs:subclassOfyago/wikicategory:ItalianComposer

yago/wordnet: Artist109812338

prop:a

ctedIn

imdb.com/name/nm0910607/

Distributed Structure: Linking Open Data

prop: composedMusicFor

imdb.com/title/tt0361748/

dbpedia.org/resource/Ennio_Morricone

Entity Search

http://entitycube.research.microsoft.com/

Semantic Search with Entities, Classes, Relationships

Politicians who are also scientists?European composers who won the Oscar?Chinese female astronauts?

Enzymes that inhibit HIV? Antidepressants that interfere with blood-pressure drugs?German philosophers influenced by William of Ockham?

US president when Barack Obama was born?Nobel laureate who outlived two world wars and all his children?

Commonalities & relationships among:Li Lianjie, Steve Jobs, Carla Bruni, Plato, Andres Iniesta?

FIFA 2010 finalists who played in a Champions League final?German football clubs that won against Real Madrid?

instances of classes

properties of entity

relationships

multiple entities

applications

Quiz Time

Rank the Folllowing Numbers

# ants on the Earth

# RDF triples in LOD

# Wikipedia links

# followers of Lady Gaga

# Google queries per day

# Yuan by Mark Zuckerberg

12*106

80*109

25*109

600*106

Outline

Searching for Entities & Relations

Efficient Query Processing

Motivation

Wrap-up

Informative Ranking

User Interface

RDF: Structure, Diversity, No Schema

• SPO triples: Subject – Property/Predicate – Object/Value)• pay-as-you-go: schema-agnostic or schema later• RDF triples form fine-grained ER graph• popular for Linked Data, comp. biology (UniProt, KEGG, etc.)• open-source engines: Jena, Sesame, RDF-3X, etc.

EnnioMorricone RomebornIn

Rome ItalylocatedIn

SPO triples (statements, facts):(EnnioMorricone, bornIn, Rome)(Rome, locatedIn, Italy)(JavierNavarrete, birthPlace, Teruel)(Teruel, locatedIn, Spain)(EnnioMorricone, composed, l‘Arena)(JavierNavarrete, composerOf, aTale)

CityRomeinstanceOf

(uri1, hasName, EnnioMorricone)(uri1, bornIn, uri2)(uri2, hasName, Rome)(uri2, locatedIn, uri3)…

bornIn (EnnioMorricone, Rome) locatedIn(Rome, Italy)

Facts about Facts

• temporal annotations, witnesses/sources, confidence, etc. can refer to reified facts via fact identifiers (approx. equiv. to RDF quadruples: Col Sub Prop Obj)

facts: (EnnioMorricone, composed, l‘Arena)

(JavierNavarrete, composerOf, aTale)

(Berlin, capitalOf, Germany)

(Madonna, marriedTo, GuyRitchie)

(NicolasSarkozy, marriedTo, CarlaBruni)

facts:1:

temporal facts:6: (1, inYear, 1968)

7: (2, inYear, 2006)

8: (3, validFrom, 1990)

9: (4, validFrom, 22-Dec-2000)

10: (4, validUntil, Nov-2008)

11: (5, validFrom, 2-Feb-2008)provenance:12: (1, witness, http://www.last.fm/music/Ennio+Morricone/)

13: (1, confidence, 0.9)

14: (4, witness, http://en.wikipedia.org/wiki/Guy_Ritchie)

15: (4, witness, http://en.wikipedia.org/wiki/Madonna_(entertainer))

16: (10, witness, http://www.intouchweekly.com/2007/12/post_1.php)

17: (10, confidence, 0.1)

rdf.freebase.com/ns/en.rome

owl:sameAs

wl:sameAs

data.nytimes.com/51688803696189142301

geonames.org/3169070/roma

N 41° 54' 10'' E 12° 29' 2''

dbpprop:citizenOf

dbpedia.org/resource/Rome

f:type

rdfs:subclassOf

yago/wordnet:Actor109765278

f:type

rdfs:subclassOfyago/wikicategory:ItalianComposer

yago/wordnet: Artist109812338

prop:a

ctedIn

imdb.com/name/nm0910607/

Distributed Structure: Linking Open Data

prop: composedMusicFor

imdb.com/title/tt0361748/

dbpedia.org/resource/Ennio_Morricone

SPARQL Query LanguageSPJ combinations of triple patterns(triples with S,P,O replaced by variable(s)) Select ?p, ?c Where { ?p instanceOf Composer . ?p bornIn ?t . ?t inCountry ?c . ?c locatedIn Europe . ?p hasWon ?a .?a Name AcademyAward . }

+ filter predicates, duplicate handling, RDFS types, etc.

Select Distinct ?c Where { ?p instanceOf Composer . ?p bornIn ?t . ?t inCountry ?c . ?c locatedIn Europe . ?p hasWon ?a .?a Name ?n . ?p bornOn ?b . Filter (?b > 1945) . Filter(regex(?n, “Academy“) . }

Semantics:return all bindings to variables that match all triple patterns(subgraphs in RDF graph that are isomorphic to query graph)

Querying the Structured WebStructure but no schema: SPARQL well suited

wildcards for properties (relaxed joins): Select ?p, ?c Where { ?p instanceOf Composer . ?p ?r1 ?t . ?t ?r2 ?c . ?c isa Country . ?c locatedIn Europe . }

Extension: transitive paths [K. Anyanwu et al.: WWW‘07] Select ?p, ?c Where { ?p instanceOf Composer . ?p ??r ?c . ?c isa Country . ?c locatedIn Europe . PathFilter(cost(??r) < 5) . PathFilter (containsAny(??r,?t ) . ?t isa City . }

Extension: regular expressions [G. Kasneci et al.: ICDE‘08] Select ?p, ?c Where { ?p instanceOf Composer . ?p (bornIn | livesIn | citizenOf) locatedIn* Europe . }

flexiblesubgraphmatching

Querying Facts & Text

• Consider witnesses/sources (provenance meta-facts)• Allow text predicates with each triple pattern (à la XQ-FT)

Problem: not everything is triplified

European composers who have won the Oscar,whose music appeared in dramatic western scenes,and who also wrote classical pieces ?

Select ?p Where { ?p instanceOf Composer . ?p bornIn ?t . ?t inCountry ?c . ?c locatedIn Europe . ?p hasWon ?a .?a Name AcademyAward . ?p contributedTo ?movie [western, gunfight, duel, sunset] . ?p composed ?music [classical, orchestra, cantata, opera] . }

Semantics: triples match struct. pred.witnesses match text pred.

Querying Facts & Text

• Consider witnesses/sources (provenance meta-facts)• Allow text predicates with each triple pattern (à la XQ-FT)

Problem: not everything is triplified

French politicians married to Italian singers?

Select ?p1, ?p2 Where { ?p1 instanceOf politician [France] . ?p2 instanceOf singer [Italy] . ?p1 marriedTo ?p2 . }

Grouping ofkeywords or phrasesboosts expressiveness

Select ?p1, ?p2 Where { ?p1 instanceOf ?c1 [France, politics] . ?p2 instanceOf ?c2 [Italy, singer] . ?p1 marriedTo ?p2 . }

CS researchers whose advisors worked on the Manhattan project?Select ?r, ?a Where {?r instOf researcher [computer science] . ?a workedOn ?x [Manhattan project] .?r hasAdvisor ?a . }

Select ?r, ?a Where {?r ?p1 ?o1 [computer science] . ?a ?p2 ?o2 [Manhattan project] .?r ?p3 ?a . }

Relatedness Queries

Schema-agnostic keyword search(on RDF, ER graph, relational DB)becomes a special case

Relationship between Jet Li, Steve Jobs, Carla Bruni?

Select ??p1, ??p2, ??p3 Where {LiLianjie ??p1 SteveJobs. SteveJobs ??p2 CarlaBruni .CarlaBruni ??p3 LiLianjie . }

Select ??p1, ??p2, ??p3 Where {?e1 ?r1 ?c1 [“Li Lianjie“] . ?e2 ?r2 ?c2 [“Steve Jobs“] . ?e3 ?r3 ?c3 [“Carla Bruni“] . ?e1 ??p1 ?e2 .?e2 ??p2 ?e3 .?e3 ??p3 ?e1 . }

Querying Temporal Facts

• Consider temporal scopes of reified facts• Extend Sparql with temporal predicates

Managers of German clubs who won the Champions League?Select ?m Where { isa soccerClub . ?c inCountry Germany .

?id1: ?c hasWon ChampionsLeague . ?id1 validOn ?t . ?id2: ?m manages ?c . ?id2 validSince ?s . ?id2 validUntil ?u .

[?s,?u] overlaps [?t,?t] . }

Problem: not all facts hold forever (e.g. CEOs, spouses, …)

When did a German soccer club win the Champions League?Select ?c, ?t Where { ?c isa soccerClub . ?c inCountry Germany .?id1: ?c hasWon ChampionsLeague . ?id1 validOn ?t . }

1: (BayernMunich, hasWon, ChampionsLeague)2: (BorussiaDortmund, hasWon, ChampionsLeague)3: (1, validOn, 23May2001) 4: (1, validOn, 15May1974) 5: (2, validOn, 28May1997)6: (OttmarHitzfeld, manages, BayernMunich)7: (6, validSince, 1Jul1998) 8:(6, validUntil, 30Jun2004)

[Y. Wang et al.: EDBT’10][O. Udrea et al.: TOCL‘10

Querying with Vague Temporal Scope

• Consider temporal phrases as text conditions• Allow approximate matching and rank results wisely

Problem: user‘s temporal interest is imprecise1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010

Problem: user‘s temporal interest is often imprecise

German Champion League winners in the nineties?Select ?c Where {

?c isa soccerClub . ?c inCountry Germany .?c hasWon ChampionsLeague [nineties] . }

Soccer final winners in summer 2001?Select ?c Where {

?c isa soccerClub . ?id: ?c matchAgainst ?o [final] .?id winner ?c [“summer 2001“] . }

Keyword Search on Graphs

Example: Conferences (CId, Title, Location, Year) Journals (JId, Title)CPublications (PId, Title, CId) JPublications (PId, Title, Vol, No, Year) Authors (PId, Person) Editors (CId, Person)Select * From * Where * Contains ”Gray, DeWitt, XML, Performance“ And Year > 95

Schema-agnostic keyword search over multiple tables:graph of tuples with foreign-key relationships as edges

[BANKS, Discover, DBExplorer, KUPS, SphereSearch, BLINKS, NAGA, …]

Result is connected tree with nodes that contain as many query keywords as possible

Ranking: 1)(1)1(),(),(

eedgesnnodes

eedgeScoreqnnodeScoreqtrees

with nodeScore based on tf*idf or prob. IRand edgeScore reflecting importance of relationships (or confidence, authority, etc.)

Top-k querying: compute best trees, e.g. Steiner trees (NP-hard)

Best Result: Group Steiner Tree

Result is connected tree with nodes that contain as many query keywords as possible

Group Steiner tree: • match individual keywords terminal nodes, grouped by keyword• compute tree that connects at least one terminal node per keyword and has best total edge weight

for query: x w y z2-21

Outline

Motivation

Wrap-up

Informative Ranking

User Interface

Ranking CriteriaConfidence:Prefer results that are likely correct

accuracy of info extraction trust in sources

(authenticity, authority)

Informativeness:Prefer results with salient factsStatistical estimation from:

frequency in answer frequency on Web frequency in query log

Conciseness:Prefer results that are tightly connected

size of answer graph cost of Steiner tree

bornIn (Jim Gray, San Francisco) from„Jim Gray was born in San Francisco“(en.wikipedia.org)

livesIn (Michael Jackson, Tibet) from„Fans believe Jacko hides in Tibet“(www.michaeljacksonsightings.com)

q: Einstein isa ?Einstein isa scientistEinstein isa vegetarian

q: ?x isa vegetarianEinstein isa vegetarianWhocares isa vegetarian

Diversity:Prefer variety of facts

Einstein won NobelPrizeBohr won NobelPrize

Einstein isa vegetarianCruise isa vegetarianCruise born 1962 Bohr died 1962

E won … E discovered … E played … E won … E won … E won … E won …

Ranking ApproachesConfidence:Prefer results that are likely correct

accuracy of info extraction trust in sources

(authenticity, authority)

Informativeness:Prefer results with salient factsStatistical LM with estimations from:

frequency in answer frequency in corpus (e.g. Web) frequency in query log

Conciseness:Prefer results that are tightly connected

size of answer graph cost of Steiner tree

PR/HITS-style entity/fact ranking[V. Hristidis et al., S.Chakrabarti, …]

IR models: tf*idf … [K.Chang et al., …]Statistical Language Models

Diversity:Prefer variety of facts

empirical accuracy of IEPR/HITS-style estimate of trustcombine into: max { accuracy (f,s) * trust(s) | s witnesses(f) }

Statistical Language Models

graph algorithms (BANKS, STAR, …) [J.X. Yu et al., S.Chakrabarti et al.,B. Kimelfeld et al., G.Kasneci et al., …]

Example: RDF-Express[S. Elbassuoni: SIGIR‘12, ESWC‘11, CIKM‘09]

?a1 isMarriedTo ?a2 .?a1 actedIn ?m .?a2 actedIn ?m .

http://www.mpi-inf.mpg.de/yago-naga/rdf-express/

Example: RDF-Express

?a1 isMarriedTo ?a2 {Bollywood} .?a1 actedIn ?m .?a2 actedIn ?m .

Example: RDF-Express

?a1 isMarriedTo ?a2 .?a1 actedIn ?m {thriller} .?a2 actedIn ?m {love} .

Example: RDF-Express[S. Elbassuoni: SIGIR‘12, ESWC‘11, CIKM‘09]

?a1 directed ?m .?a2 actedIn ?m .?a1 hasWonPrize Academy_Award .?a2 type wordnet_musician .

Information Retrieval Basics[Textbooks by R. Baza-Yates, B. Croft, Manning/Raghavan/Schuetze, …]

......

crawlextract& clean

index search rank present

strategies forcrawl schedule andpriority queue for crawl frontier

handle dynamic pages,detect duplicates,detect spam

build and analyzeWeb graph,index all tokensor word stems

server farm with 10 000‘s of computers,distributed/replicated data in high-performance file system,massive parallelism for query processing

fast top-k queries,query logging,auto-completion

scoring functionover many dataand context criteria

GUI, user guidance,personalization

Ranking bydescendingrelevance

Vector Space Model for Content Relevance Ranking

Search engine

Query (set of weightedfeatures)

||]1,0[ Fid Documents are feature vectors

(bags of words)

||]1,0[ Fq

1:),(F

Similarity metric:

Vector Space Model for Content Relevance Ranking

Search engine

Query (Set of weightedfeatures)

||]1,0[ Fid Documents are feature vectors

(bags of words)

||]1,0[ Fq

1:),(F

Similarity metric:Ranking bydescendingrelevance

k ikijij wwd 2/:

ijij fwithdocs

dffreq

dffreqw

),(max

),(1log:

tf*idfformula:term frequency *document frequency

Statistical Language Models (LM‘s)

qLM(1)

• each doc di has LM: generative prob. distr. with params i

• query q viewed as sample from LM(1), LM(2), …

• estimate likelihood P[ q | LM(i) ]

that q is sample of LM of doc di (q is „generated by“ di)• rank by descending likelihoods (best „explanation“ of q)

[Maron/Kuhns 1960, Ponte/Croft 1998, Hiemstra 1998, Lafferty/Zhai 2001]

„God does not play dice“ (Albert Einstein)

„IR does“ (anonymous)

„God rolls dice in places where you can‘t see them“ (Stephen Hawking)

LM: Doc as Model, Query as Sample

E E E E

model M

document d: sample of Mused for parameter estimation

P [ | M]A A B C E E

estimate likelihoodof observing query

)()(...)()(

jdpjfjfjf

multinomial prob. distribution

LM: Need for Smoothing

E E E E

model M

document d

P [ | M]A B C E F

estimate likelihoodof observing query

+ background corpus and/or smoothing

used for parameter estimation

Laplace smoothingJelinek-MercerDirichlet smoothing…

Some LM Basics

i i dqPdqPqds ]|[]|[),(simple MLE: overfitting

][)1(]|[),( qPdqPqds

),(log~

),(1log~

mixture modelfor smoothing

qiPqiPdqKL

]|[log]|[)|(~

KL divergence(Kullback-Leibler div.)aka. relative entropy

tf*idffamily

),(log~

independ. assumpt.

efficientimplementation

• Precompute per-keyword scores • Store in postings of inverted index• Score aggregration for (top-k) multi-keyword query

P[q] est. fromlog or corpus

IR as LM Estimation

P[R|d,q]user likes doc (R)given that it has features dand user poses query q

prob. IR

][]|,[~ RPRdqP

][]|[],|[ RPRdPRdqP

]|[~ dqP statist. LM

qj d ]|j[Plog]d|q[Plog)d,q(s

query likelihood:

]d|q[Plogargmax-k d

top-k query result:

MLE would be tf2-36

Multi-Bernoulli vs. Multinomial LM

Multi-Bernoulli:)q(X1

jjj ))d(p1()d(p]d|q[P

with Xj(q)=1 if jq, 0 otherwise

Multinomial:

)()(...)()(

||]|[ qf

jdpjfjfjf

with fj(q) = f(j) = frequency of j in q

multinomial LM more expressive and usually preferred2-37

LM Scoring by Kullback-Leibler Divergence

||2122 )(

)(...)()(

||log]|[log qf

jdpjfjfjf

)(log)(~ 2 dpqf jjqj

))(),(( dpqfH neg. cross-entropy

))(())(),((~ qfHdpqfH ))(||)(( dpqfD

)(log)( 2 dp

j j neg. KL divergenceof q and d

Jelinek-Mercer SmoothingIdea:use linear combination of doc LM withbackground LM (corpus, common language);

could also consider query log as background LMfor query||

),()1(

),()(ˆ

Cjfreq

djfreqdp j

parameter tuning of by cross-validation with held-out data:• divide set of relevant (d,q) pairs into n partitions• build LM on the pairs from n-1 partitions• choose to maximize precision (or recall or F1) on nth

partition• iterate with different choice of nth partition and average

Jelinek-Mercer Smoothing:Relationship to tf*idf

]q[P)1(]d|q[P]|q[P

)i(df)1(

)d,k(tf

)d,i(tflog~

k)i(df

1)d,k(tf

)d,i(tf1log~

with absolutefrequencies tf, df

relative tf ~ relative idf

Dirichlet-Prior Smoothing

]C|j[P

]d|j[P|d|

ii)(Mmaxargˆ)d(p̂ jj

with i set to P[i|C]+1 for the Dirichlet hypergeneratorand > 1 set to multiple of average document length

Dirichlet (): 1jm..1j

jm..1j

jm..1jm1

)(),...,(f

m..1j j 1

(Dirichlet is conjugate prior for parameters of multinomial distribution: Dirichlet prior implies Dirichlet posterior, only with different parameters)

d]][|f[P

][P]|f[P]f|[P:)(M

Posterior distr. with Dirichlet distribution as prior

)f(Dirichlet with term frequencies fin document d

MAP (Maximum Posterior) for

Dirichlet-Prior Smoothing:Relationship to Jelinek-Mercer Smoothing

]C|j[P

]d|j[P|d| ]|[)1(]|[)(ˆ CjPdjPdp j

where 1= P[1|C], ..., m= P[m|C] are the parameters

of the underlying Dirichlet distribution, with constant > 1typically set to multiple of average document length

with MLEsP[j|d], P[j|C]

fromcorpus

Entity Search with LM Ranking [Z. Nie et al.: WWW’07, H. Fang et al.: ECIR‘07, P. Serdyukov et al.: ECIR‘08, …]

LM (entity e) = prob. distr. of words seen in context of e

][)1(]|[),( qPeqPqes ]q[P

]e|q[P~

query q: „French player who won world championship“

candidate entities:

e1: David Beckham

e2: Ruud van Nistelroy

e3: Ronaldinho

e4: Zinedine Zidane

e5: FC Barcelona

played for ManU, Real, LA GalaxyDavid Beckham champions leagueEngland lost match against Francemarried to spice girl …

weightedby conf.

Zizou champions league 2002Real Madrid won final ...Zinedine Zidane best playerFrance world cup 1998 ...

))e(|)q((KL~ LMLM

query: keywords answer: entities

LM‘s: from Entities to Facts

Document / Entity LM‘s

Triple LM‘s

LM for doc/entity: prob. distr. of words

LM for query: (prob. distr. of) words

LM‘s: rich for docs/entities, super-sparse for queries

richer query LM with query expansion, etc.

LM for facts: (degen. prob. distr. of) triple

LM for queries: (degen. prob. distr. of) triple pattern

LM‘s: apples and oranges• expand query variables by S,P,O values from DB/KB• enhance with witness statistics• query LM then is prob. distr. of triples !

LM‘s for Triples and Triple Patterns

f1: Beckham p ManchesterUf2: Beckham p RealMadridf3: Beckham p LAGalaxyf4: Beckham p ACMilanF5: Kaka p ACMilanF6: Kaka p RealMadridf7: Zidane p ASCannesf8: Zidane p Juventusf9: Zidane p RealMadridf10: Tidjani p ASCannesf11: Messi p FCBarcelonaf12: Henry p Arsenalf13: Henry p FCBarcelonaf14: Ribery p BayernMunichf15: Drogba p Chelseaf16: Casillas p RealMadrid

triples (facts f):triple patterns (queries q):q: Beckham p ?y

200 300 20 30300150 20200350 10400200150100150 20

: 2600

q: Beckham p ManUq: Beckham p Realq: Beckham p Galaxyq: Beckham p Milan

200/550300/550 20/550 30/550

witness statistics

q: Cruyff ?r FCBarcelonaCruyff playedFor FCBarca 200/500 Cruyff playedAgainst FCBarca 50/500Cruyff coached FCBarca 250/500

q: ?x p ASCannesZidane p ASCannes 20/30Tidjani p ASCannes 10/30

LM(q) + smoothing

q: ?x p ?yMessi p FCBarcelona 400/2600Zidane p RealMadrid 350/2600Kaka p ACMilan 300/2600…

LM(q): {t P [t | t matches q] ~ #witnesses(t)}LM(answer f): {t P [t | t matches f] ~ 1 for f}smooth all LM‘srank results by ascending KL(LM(q)|LM(f))

[G. Kasneci et al.: ICDE’08; S. Elbassuoni et al.: CIKM’09, ESWC‘11]

LM‘s for Composite Queriesq: Select ?x,?c Where { France ml ?x . ?x p ?c . ?c in UK . }

f1: Beckham p ManU 200f7: Zidane p ASCannes 20f8: Zidane p Juventus 200f9: Zidane p RealMadrid 300f10: Tidjani p ASCannes 10f12: Henry p Arsenal 200f13: Henry p FCBarca 150f14: Ribery p Bayern 100f15: Drogba p Chelsea 150

f31: ManU in UK 200f32: Arsenal in UK 160f33: Chelsea in UK 140

f21: F ml Zidane 200f22: F ml Tidjani 20f23: F ml Henry 200f24: F ml Ribery 200f25: F ml Drogba 30f26: IC ml Drogba 100f27 ALG ml Zidane 50

queries q with subqueries q1 … qn

results are n-tuples of triples t1 … tn

LM(q): P[q1…qn] = i P[qi]

LM(answer): P[t1…tn] = i P[ti]

KL(LM(q)|LM(answer)) = i KL(LM(qi)|LM(ti))

P [ F ml Henry, Henry p Arsenal, Arsenal in UK ]

P [ F ml Drogba, Drogba p Chelsea, Chelsea in UK ]

LM‘s for Keyword-Augmented Queries

q: Select ?x, ?c Where { France ml ?x [goalgetter, “top scorer“] . ?x p ?c . ?c in UK [champion, “cup winner“, double] . }

subqueries qi with keywords w1 … wm

results are still n-tuples of triples ti

LM(qi): P[triple ti | w1 … wm] = k P[ti | wk] + (1) P[ti]

LM(answer fi) analogous

KL(LM(q)|LM(answer fi)) = i KL (LM(qi) | LM(fi))

result ranking prefers (n-tuples of) tripleswhose witnesses score high on the subquery keywords

LM‘s for Temporal Phrases

q: Select ?c Where { ?c instOf nationalTeam . ?c hasWon WorldCup [nineties] . }

Problem: user‘s temporal interest is imprecise1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010

July 1994mid 90s

extract temp expr‘s xj

from witnesses& normalize

[K.Berberich et al.: ECIR‘10]

“mid 90s“ xj = [1Jan93,31Dec97] normalize

temp expr vin query

“nineties“ v = [1Jan90,31Dec99]

P[q | t] ~ … j P[v | xj]

~ … j { P[ [b,e] | xj] | interval [b,e]v}

~ … j overlap(v,xj) / union(v,xj)

• enhanced ranking• efficiently computable• plug into doc/entity/triples LM‘s

lastcentury

summer 1990

P[v=„nineties“ | xj = „mid 90s“] = 1/2P[„nineties“ | „summer 1990“] = 1/30P[„nineties“ | „last century“] = 1/10

Query Relaxationq: … Where { France ml ?x . ?x p ?c . ?c in UK . }

f1: Beckham p ManU 200f7: Zidane p ASCannes 20f9: Zidane p Real 300f10: Tidjani p ASCannes 10f12: Henry p Arsenal 200f15: Drogba p Chelsea 150

f31: ManU in UK 200f32: Arsenal in UK 160f33: Chelsea in UK 140

f21: F ml Zidane 200f22: F ml Tidjani 20F23: F ml Henry 200F24: F ml Ribery 200F26: IC ml Drogba 100F27 ALG ml Zidane 50

[ F ml Zidane, Zidane p Real, Real in ESP ]

q(1): … Where { France ml ?x . ?x p ?c . ?c in ?y . }

[ IC ml Drogba, Drogba p Chelsea, Chelsea in UK] [ F resOf Drogba,

Drogba p Chelsea, Chelsea in UK] [ IC ml Drogba,

Drogba p Chelsea, Chelsea in UK]

q(2): … Where { ?x ml ?x . ?x p ?c . ?c in UK . }q(3): … Where { France ?r ?x . ?x p ?c . ?c in UK . }q(4): … Where { IC ml ?x . ?x p ?c . ?c in UK . }

LM(q*) = LM(q) + 1 LM(q(1)) + 2 LM(q(2)) + …

replace e in q by e(i) in q(i):precompute P:=LM (e ?p ?o) and Q:=LM (e(i) ?p ?o)set i ~ 1/2 (KL (P|Q) + KL (Q|P))

replace r in q by r(i) in q(i) LM (?s r(i) ?o)replace e in q by ?x in q(i) LM (?x r ?o)… LM‘s of e, r, ...

are prob. distr.‘s of triples !

Result Personalization

Open issue: „insightful“ results (new to the user)

f1f2f5

q4 q5f3

same answer for everyone?

Personal histories ofqueries & clicked facts LM(user u): prob. distr. of triples !

[S. Elbassuoni et al.: PersDB‘08]

u1 [classical music] q: ?p from Europe . ?p hasWon AcademyAward

u2 [romantic comedy] q: ?p from Europe . ?p hasWon AcademyAward

u3 [from Africa] q: ?p isa SoccerPlayer . ?p hasWon ?a

LM(q|u] = LM(q) + (1) LM(u)then business as usual

Result Diversification

q: Select ?p, ?c Where { ?p isa SoccerPlayer . ?p playedFor ?c . }

1 Beckham, ManchesterU2 Beckham, RealMadrid3 Beckham, LAGalaxy4 Beckham, ACMilan5 Zidane, RealMadrid6 Kaka, RealMadrid7 Cristiano Ronaldo, RealMadrid8 Raul, RealMadrid9 van Nistelrooy, RealMadrid10 Casillas, RealMadrid

1 Beckham, ManchesterU2 Beckham, RealMadrid3 Zidane, RealMadrid4 Kaka, ACMilan5 Cristiano Ronaldo, ManchesterU6 Messi, FCBarcelona7 Henry, Arsenal8 Ribery, BayernMunich9 Drogba, Chelsea10 Luis Figo, Sporting Lissabon

rank results f1 ... fk by ascending

KL(LM(q) | LM(fi)) (1) KL( LM(fi) | LM({f1..fk}\{fi}))implemented by greedy re-ranking of fi‘s in candidate pool

[J. Carbonell, J. Goldstein: SIGIR‘98]

Entity-Search Ranking by Link Analysis[A. Balmin et al. 2004, Nie et al. 2005, Chakrabarti 2007, J. Stoyanovich 2007]

EntityAuthority (ObjectRank, PopRank, HubRank, EVA, etc.):

• define authority transfer graph

among entities and pages with edges:• entity page if entity appears in page• page entity if entity is extracted from page• page1 page2 if there is hyperlink or implicit link between pages• entity1 entity2 if there is a semantic relation between entities

• edges can be typed and (degree- or weight-) normalized and

are weighted by confidence and type-importance• also applicable to graph of DB records with foreign-key relations

(e.g. bibliography with different weights of publisher vs. location for conference record)

• compared to standard Web graph, ER graphs of this kind

have higher variation of edge weights

Outline

Motivation

Wrap-up

Informative Ranking

User Interface

Scalable Semantic Web: Pattern Queries on Large RDF Graphs

schema-free RDF triples: subject-property-object (SPO) example: Einstein hasWon NobelPrizeSPARQL triple patterns: Select ?p,?c Where { ?p isa scientist . ?p hasWon NobelPrize . ?p bornIn ?t . ?t inCountry ?c . ?c partOf Europe}large join queries, unpredictable workload,difficult physical design, difficult query optimization

Einstein hasWon NobelEinstein bornIn UlmRonaldo hasWon FIFASpain partOf EuropeFrance partOf Europe… … .,.

S P O S O Einstein NobelRonaldo FIFA… …

hasWon

S hasWon bornIn .,.

Person

Einstein Nobel Ulm .,.Ronaldo FIFA Rio .,..,. .,. .,. .,.

Semantic-Web engines (Sesame, Jena, etc.)did not provide scalable query performance

CountryS partOf capital .,..,. .,. .,. .,.

S O … …

bornIn

AllTriples

?p ?b ?t

Scalable Semantic Web: RDF-3X Engine[T. Neumann et al.: VLDB 2008, SIGMOD 2009, VLDBJ’10]

• RISC-style, tuning-free system architecture

• map literals into ids (dictionary) and precompute

exhaustive indexing for SPO triples:

SPO, SOP, PSO, POS, OSP, OPS,

SP*, PS*, SO*, OS*, PO*, OP*, S*, P*, O*

very high compression

• efficient merge joins with order-preservation

• join-order optimization

by dynamic programming over subplan result-order

• statistical synopses for accurate result-size estimation

http://code.google.com/p/rdf3x/

http://www.mpi-inf.mpg.de/~neumann/rdf3x/

Scalable Indexing & Merge Joins in RDF-3X

scan OPS[O=scientist

P=isa]

scan OPS[O=NobelPrize,

P=hasWon]

scan PSO[P=bornIn]

|| [S=S]

…3007 2015 10113007 2015 10553007 2015 11193007 2015 11273007 2015 11353007 2015 13393007 2015 13483007 2015 1418

O P S…

3111 2003 10193111 2003 10553111 2003 11113111 2003 11273111 2003 11353111 2003 11993111 2003 12273111 2003 1243

O P S…

2100 1003 56482100 1119 55552100 1127 57122100 1255 55152100 1266 57122100 1418 55102100 1429 55552100 1683 5843

scan PSO[P=inCountry]

2077 1005 8311

2077 1431 8200

2077 1333 8751

2077 1148 8244

2077 1099 80182077 1127 8123

2077 1266 8510

2077 1429 8135

?p isa scientist . ?p hasWon NobelPrize . ?p bornIn ?t . ?t inCountry ?c . ?c partOf Europe

|| [S=S]

|| [O=S]

poorjoin order

P=isa]

P=hasWon]

scan PSO[P=bornIn]

|| [S=S]

…3007 2015 10113007 2015 10553007 2015 11193007 2015 11273007 2015 11353007 2015 13393007 2015 13483007 2015 1418

O P S…

3111 2003 10193111 2003 10553111 2003 11113111 2003 11273111 2003 11353111 2003 11993111 2003 12273111 2003 1243

O P S…

2100 1003 56482100 1119 55552100 1127 57122100 1255 55152100 1266 57122100 1418 55102100 1429 55552100 1683 5843

|| [O=S]

2077 1005 8311

2077 1431 8200

2077 1333 8751

2077 1148 8244

2077 1099 80182077 1127 8123

2077 1266 8510

2077 1429 8135

?p isa scientist . ?p hasWon NobelPrize . ?p bornIn ?t . ?t inCountry ?c . ?c partOf Europe good

join order

P=isa]

P=hasWon]

scan PSO[P=bornIn]

|| [S=S]

…3007 2015 10113007 2015 10553007 2015 11193007 2015 11273007 2015 11353007 2015 13393007 2015 13483007 2015 1418

O P S…

3111 2003 10193111 2003 10553111 2003 11113111 2003 11273111 2003 11353111 2003 11993111 2003 12273111 2003 1243

O P S…

2100 1003 56482100 1119 55552100 1127 57122100 1255 55152100 1266 57122100 1418 55102100 1429 55552100 1683 5843

|| [O=S]

2077 1005 8311

2077 1431 8200

2077 1333 8751

2077 1148 8244

2077 1099 80182077 1127 8123

2077 1266 8510

2077 1429 8135

?p isa scientist . ?p hasWon NobelPrize . ?p bornIn ?t . ?t inCountry ?c . ?c partOf Europe

sidewaysinformationpassing

run-time filters

Join-Order Optimization for SPARQL

mehasFriend

gender

livesIn

Berlin

?f ?s ?p ?a

singer

Berlin

performedIn

BobDylan

composedBy

est antiwar

female

Fine-grained RDF (e.g. over social-tagging data)often entails complex queries with 10, 20 or more joins

Join-order optimization operates on join graph:nodes are triple patterns, edges denote shared variables• need selectivity (result cardinality) estimator • join-order optimization is O(n3) for chains, O(2n) for stars• exact optimization uses dynamic programming over sub-graphs and the sort order of intermediate results

[T. Neumann: VLDBJ‘10]

?f gender female .?f hasFriend me .?f livesIn Berlin .?f likes ?s .?s taggedProtestBy ?x .?s taggedAntiwarBy ?y .?s composedBy BobDylan .?s performedIn ?p .?p inCity Berlin .?s inYear 2009 .?p singer ?a .

Experimental Evaluation: Setup

Setup and competitors:2GHz dual core, 2 GB RAM, 30MB/s disk, Linux• column-store property tables by MIT folks, using MonetDB• triples store with SPO, POS, PSO indexes, using PostgreSQL

Datasets:1) Barton library catalog: 51 Mio. triples (4.1 GB)2) YAGO knowledge base: 40 Mio. triples (3.1 GB)3) Librarything social-tagging excerpt: 30 Mio. triples (1.8 GB)

Benchmark queries such as:1) counts of French library items (books, music, etc.), with creator, publisher, language, etc.2) scientist from Poland with French advisor who both won awards3) books tagged with romance, love, mystery, suspense by users who like crime novels and have friends who ...

Select ?t Where {?b hasTitle ?t . ?u romance ?b .?u love ?b .?u mystery ?b .?u suspense ?b .?u crimeNovel ?c .?u hasFriend ?f .?f ... }

[T. Neumann: SIGMOD‘09]

Experimental Evaluation: Results

YAGO dataset(40 Mio. triples, 3.1 GB):2.7 GB in RDF-3X, loaded in 25 min

Librarything dataset(36 Mio. triples, 1.8 GB):1.6 GB in RDF-3X, loaded in 20 min

measurements on Uniprot (845 Mio. triples) and Billion-Triples (562 Mio. triples)confirm these experimental results

[T. Neumann: SIGMOD‘09]

Madrid

Distributed RDF Stores & Sparql at Scale[D. Abadi et al.: VLDB‘11]

• Partition data graph by hashing triples on subject• Enhance locality by replicating N-hop neighbors with each triple• Can automatically determine when query is parallelizable without communication• For remaining cases: distributed joins, semi-joins, etc.

Iniesta

FC Barca

Barcelona

?player

?city?team

> 1 000 000

locatedIn

bornIn

playsPos

playsForhasPop

midfielderMadrid

Rosario

Gelsenkirchen

Casillas

Iniesta

FC Barca

Barcelona

Madrid

Rosario

Gelsenkirchen

Top-k Query Processing for Ranked Results

Index lists

s(t1,d1) = 0.7…s(tm,d1) = 0.2

Data items: d1, …, dn

Query: q = (t1, t2, t3)aggr: summation

t1d780.9

d880.2

d100.2

d780.1

d990.2

d340.1

d230.8

d100.8

t2d640.8

d230.6

d100.6

t3d100.7

d780.5

d640.4

fetch listsjoin&sort

simple & DB-style;needs only O(k) memory;for monotone score aggr.

Index lists

s(t1,d1) = 0.7…s(tm,d1) = 0.2

t1d780.9

d880.2d780.1d340.1

d230.8

d100.8

t2d640.9

d230.6

d100.6

t3d100.7

d780.5

d640.3

Threshold algorithm (TA):scan index lists; consider d at posi in Li;highi := s(ti,d);if d top-k then { look up s(d) in all lists L with i; score(d) := aggr {s(d) | =1..m};if score(d) > min-k then add d to top-k and remove min-score d’; min-k := min{score(d’) | d’ top-k};threshold := aggr {high | =1..m};if threshold min-k then exit;

Scan depth 1Scan

depth 1Scan

depth 2Scan

depth 3Scan

depth 3

simple & DB-style;needs only O(k) memory;for monotone score aggr.

Scan depth 4Scan

depth 4

d990.2

d120.2 2 d64 0.9

Rank Doc Score

2 d64 1.2

1 d78 0.91 d78 1.51 d78 1.5

2 d64 0.9

Rank Doc Score

2 d78 1.5

1 d10 2.1Rank Doc Score

1 d10 2.1

2 d78 1.5

Rank Doc Score

1 d10 2.1

2 d78 1.5

Rank Doc Score

1 d10 2.1

2 d78 1.5

STOP!STOP!

Threshold Algorithm (TA) [Fagin 01, Güntzer 00, Nepal 99, Buckley85]

TA with Sorted Access Only (NRA) [Fagin 01, Güntzer et al. 01]

Index lists

s(t1,d1) = 0.7…s(tm,d1) = 0.2

Rank Doc Worst-score

Best-score

1 d78 0.9 2.4

2 d64 0.8 2.4

3 d10 0.7 2.4

Best-score

1 d78 1.4 2.0

2 d23 1.4 1.9

3 d64 0.8 2.1

4 d10 0.7 2.1

Best-score

1 d10 2.1 2.1

2 d78 1.4 2.0

3 d23 1.4 1.8

4 d64 1.2 2.0

t1d780.9

d880.2

d120.2

d780.1

d990.2

d340.1

d230.8

d100.8

t2d640.8

d230.6

d100.6

t3d100.7

d780.5

d640.4

STOP!STOP!

No-random-access algorithm (NRA):scan index lists; consider d at posi in Li;E(d) := E(d) {i}; highi := s(ti,d);worstscore(d) := aggr{s(t,d) | E(d)};bestscore(d) := aggr{worstscore(d), aggr{high | E(d)}};if worstscore(d) > min-k then add d to top-k min-k := min{worstscore(d’) | d’ top-k};else if bestscore(d) > min-k then cand := cand {d};threshold := max {bestscore(d’) | d’ cand};if threshold min-k then exit;

Scan depth 1Scan

depth 1Scan

depth 2Scan

depth 3Scan

depth 3

sequential access (SA) faster than random access (RA)by factor of 20-1000

Query Expansion (Synonyms, Related Terms, …)

magician

wizard

intellectual

artist

alchemist

directorprimadonna

lecturer

professor

teacher

educator

scholar

academic, academician, faculty member

scientist

researcher

HYPONYM (0.749)HYPONYM (0.749)

Thesaurus/Ontology:concepts, relationships, glosses from WordNet, Wikipedia, KB, Web, etc.

relationships quantified bystatistical correlation measurese.g. Jaccard coefficients: |XY| / |XY|

Query expansion

Weighted expanded queryExample:(professor lecturer (0.749) scholar (0.71) ...)and Germanyand ( (course class (1.0) seminar (0.84) ... ) and („IR“ „Web search“ (0.653) ... ) )

Term2Concept with WSD

exp(ti)={w | sim(ti,w) }

Efficient top-k searchwith dynamic expansion

better recall, better meanprecision for hard queries

investigator

mentor

Problems:• efficiency• tuning the threshold

User query: ~t1 ... ~tmExample:~professor Germany ~course ~IR

Query Expansion with Incremental Merging

relaxable query q: ~professor researchwith expansions based on ontology relatedness modulatingmonotonic score aggregation

Better: dynamic query expansion with incremental merging of additional index lists

efficient, robust, self-tuning

lecturer: 0.7

37: 0.944: 0.8

22: 0.723: 0.651: 0.652: 0.6

scholar: 0.692: 0.967: 0.9

52: 0.944: 0.855: 0.8

research

B+ tree index on terms

57: 0.644: 0.4

professor

52: 0.433: 0.375: 0.3

12: 0.914: 0.8

28: 0.617: 0.5561: 0.544: 0.5

44: 0.4

ontology / meta-index

professorlecturer: 0.7scholar: 0.6academic: 0.53scientist: 0.5...

exp(i)={w | sim(i,w) }

primadonna

teacher

investigator

magician

wizard

intellectual

artist

alchemist

director

professor

scholar

academic,academician,faculty member

scientist

researcher

Hyponym (0.749)Hyponym (0.749)

mentor

Related (0.48)Related (0.48)

lecturerTA scans of index lists for iq exp(i)

[M. Theobald et al.: SIGIR 2005]

Query Expansion with Incremental Merging[M. Theobald et al.: SIGIR 2005]

Principles of the algorithm:• organize possible expansions Exp(t)={w | sim(t,w) , tq} in priority queues based on sim(t,w) for each t

for given q = {t1, tm}: • keep track of active expansions for each ti: ActExp(ti) = {w1(ti), w2(ti), ...w j(i)(ti)}

~ position j(i) in the priority queue for Exp(ti)

• scan index lists for t1, w1(t1), ..., wj(1)(t1), ..., tm, w1(tm), ..., wj(m)(tm)

maintaining cursors pos() and score bounds high() for each list • in each scan step do for each ti:

advance cursor of list for w(ti) ActExp(ti)

for which best := high()*sim(w,ti) is largest among all lists of ti

or add next largest w(ti) to ActExp(ti) if sim(w,ti) > best

and proceed in this new list

Nested Top-k [Theobald et al.: SIGIR’05]

research professor XML „information retrieval“ „query result ranking“

research prof XML informationretrieval

query result ranking

top-k top-k

multithreaded coroutine-like evaluation with synchronization of PQs

Nested Top-k [Theobald et al.: SIGIR’05]

~research ~professor XML „information retrieval“ „query result ranking“

top-k top-k

research prof XML informationretrieval

query result ranking

top-k top-k

science scholar lecturer

multithreaded coroutine-style evaluation with synchronization of PQs

Top-k Rank Joins on Structured Data[Ilyas et al. 2008]

extend TA/NRA/etc. to ranked query results from structured data(improve over baseline: evaluate query, then sort)

Select R.Name, C.Theater, C.MovieFrom RestaurantsGuide R, CinemasProgram CWhere R.City = C.CityOrder By R.Quality/R.Price + C.Rating Desc

BlueDragon Chinese €15 SBHaiku Japanese €30 SBMahatma Indian €20 IGBMescal Mexican €10 IGBBigSchwenk German €25 SLS...

Name Type Quality Price City

RestaurantsGuide

BlueSmoke Tombstone 7.5 SBOscar‘s Hero 8.2 SBHolly‘s Die Hard 6.6 SBGoodNight Seven 7.7 IGBBigHits Godfather 9.1 IGB...

Theater Movie Rating City

CinemasProgram

process index lists on quality desc, price asc, rating desc2-71

Top-k Sparql Queries over RDF

?w wrote ?b . ?b hasTag Memoir .

An_Hour_before_Daylight hasTag Memoir 0.880The_Grand_Alliance hasTag Memoir 0.873The_Night_Trilogy hasTag Memoir 0.800Roughing_it hasTag Memoir 0.796................................................ …….

Mark_Twain wrote Roughing_it 0.999J_K_Rowling wrote Harry_Potter 0.982Stephanie_Meyer wrote Twilight 0.966Cormac_McCarthy wrote The_Road 0.864Jimmy_Carter wrote Palestine 0.732………………………………… …….

[S. Elbassuoni 2012]• Precompute score-sorted index lists for each S value, P, O, SP value pair, PO, SO• Perform rank-join over index lists

Top-k Sparql Queries over RDF + Text [S. Elbassuoni 2012]

?w wrote ?b[nobel prize]; ?b hasTag Memoir

An_Hour_before_Daylight hasTag Memoir 0.880The_Grand_Alliance hasTag Memoir 0.873The_Night_Trilogy hasTag Memoir 0.800Roughing_it hasTag Memoir 0.796................................................ …….

Jimmy_Carter wrote Palestine 0.712Albert_Camus wrote The_Rebel 0.667 Doris_Lessing wrote Alfred_and_Emily 0.660 ………………………………… …….

Ian_McEwan wrote On_Chesil_Beach 0.882Jimmy_Carter wrote Palestine 0.774 Cormac_McCarthy wrote The_Road 0.623 Jhumpa_Lahiri wrote The_Namesake 0.600………………………………… …….

• Precompute score-sorted index lists for each S+keyword, P+keyword, O+…, SP+keyword, PO+…, SO+…• Perform rank-join over index lists

Outline

Motivation

Wrap-up

Informative Ranking

User Interface

UI: Table Search by Example & Description Expect result table specify query in tabular form (QBE, QueryByDescription)

[S.Sarawagi et al.: VLDB‘09]

q1: David Beckham Real MadridZinedine Zidane JuventusLionel Messi FC Barcelona

Bayern Munich Real Madrid1.FC Kaiserslautern Real MadridBorussia Dortmund Real Madrid

q2: 2:15:05:1

match against table rows

infer query fromtable headers ???

soccer club from Germany and winner against Real Madridq2:

output class descriptionQBD

player club ???

???q1:QBD

UI: Structured Keyword Search Need to map (groups of) keywords onto entities & relationshipsbased on name-entity similarities/probabilities

q: Champions League finals with Real Madrid

[Ilyas et al. Sigmod‘10]

UEFA Champions League

Real Madrid C.F.League ofWrestlingChampions

final match

final exam

q: German football clubs that won (a match) against Real

more ambiguity, more sophisticated relation more candidates combinatorial complexity

Real Madrid C.F.

Real Zaragoza

Brazilian Real

Familia Real

Real Number

soccer

American football

social club

night club

sports team

UI: Natural Language Questions

map resultsinto tabular or visual presentationor speech

Which German soccer clubs have won against Real Madrid ?

translate question into Sparql query:• dependency parsing to decompose question• mapping of question units onto entities, classes, relations

Which ?c

?c German soccer clubs ?r ?r have won against ?o

?o Real Madrid ?

UI: Natural Language Questions

map resultsinto tabular or visual presentationor speech

Select ?c Where { ?c isa SoccerClub . c? in Germany . }

?c wonAgainst RealMadrid . }?id: ?c playedAgainst RealMadrid . ?c isWinnerOf ?id .}

translate question into Sparql query:• dependency parsing to decompose question• mapping of question units onto entities, classes, relations

Select ?c Where { ?c isa GermanSoccerClub . c? wonAgainst RealMadrid . }

SPARQL or Keywords or … ?

SPARQL query: Select ?p Where {?p created ?s . ?s type music . ?s contributesTo ?m . ?m type movie .?p bornIn Rome . }

Who composed scores for westerns and is from Rome?

Keywords query:composer score western Rome

or: {composer Rome} score westernor: …

Faceted search with interactive navigation:composer

Romescore music movie

western2-79

Natural Language Question AnsweringIBM Watson in Jeopardy show: Which city has two airports named after a war hero and a battlefield?

used Yago & Dbpedia for type checking

[IBM Journal of R&D 56(3-4), 2012]

question

QuestionAnalysis:ClassificationDecomposition

HypothesesGeneration(Search):AnswerCandidates

CandidateFiltering &Ranking

Hypotheses& EvidenceScoring

answer

Overall architecture of Watson (simplified)

Semantic Technologies in IBM WatsonIBM Watson in Jeopardy show: Which city has two airports named after a war hero and a battlefield?

used Yago & Dbpedia for type checking

[A. Kalyanpur et al.: ISWC 2011]

Semantic checking of answer candidatesquestion

candidatestring

ConstraintChecker

TypeChecker

lexicalanswer type

RelationDetection

EntityDisambiguation& Matching

PredicateDisambiguation& Matching

candidatescore

KB instances

semantic types

spatial & temporalrelations

Scoring of Semantic Answer Types

[A. Kalyanpur et al.: ISWC 2011]

Check for 1) Yago classes, 2) Dbpedia classes, 3) Wikipedia lists

Match lexical answer type against class candidatesbased on string similarity and class sizes (popularity)Examples: Scottish inventor inventor, star movie star

Compute scores for semantic types, considering:class match, subclass match, superclass match,sibling class match, lowest common ancestor, class disjointness, …

no types Yago Dbpedia Wikipedia all 3Standard QAaccuracy 50.1% 54.4% 54.7% 53.8% 56.5%Watsonaccuracy 65.6% 68.6% 67.1% 67.4% 69.0%

Gerhard Weikum Max Planck Institute for Informatics mpi-inf.mpg.de/~weikum

Documents

Gerhard Weikum Max Planck Institute for Informatics & Saarland University weikum

Query Routing in Peer-to-Peer Web Search Engine Speaker: Pavel Serdyukov Supervisors: Gerhard Weikum Christian Zimmer Matthias Bender International Max

What Computers Should Knowpeople.mpi-inf.mpg.de/~weikum/weikum-CCKS-Beijing-sep...(Liu Cixin, computer engineer) translatedBy (Liu Cixin, Liu Ken) hasFavoriteBooks (Liu Cixin, { Arthur

Gerhard Weikum - Max Planck Institute for Informatics

Gerhard Weikum (weikum@mpi-inf.mpg.de) Peer-to-Peer Web Search with Minerva In collaboration with: Matthias Bender, Debora Donato, Alessandro Linari, Julia

Gerhard Weikum Max Planck Institute for Informatics weikum/ Heart of Gold Searching and Ranking Web Nuggets based on joint work

Weikum@mpi-inf.mpg.de weikum/ Gerhard Weikum Harvesting, Searching, and Ranking Knowledge from the Web joint work with Shady

Tunable Compression of Word-level Index for Versioned Corpora Klaus Berberich, Srikanta Bedathur, Gerhard Weikum Max-Planck Institute for Informatics Saarbruecken,

YAGO:A LARGE ONTOLOGY FROM WIKIPEDIA AND WORDNET FABIAN M. SUCHANEK, GJERGJI KASNECI, GERHARD WEIKUM Subbalakshmi Iyer

Gerhard Weikum Max Planck Institute for Informatics mpi-inf.mpg.de/~weikum

VLDB Database School (China) VLDB 中国数据库学院people.mpi-inf.mpg.de/~weikum/weikum-vldbschool-kunming-final... · VLDB Database School (China) ... Michael Franklin is the

1 NAGA: Searching and Ranking Knowledge Gjergji Kasneci Joint work with: Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath, and Gerhard Weikum

Martin Theobald Max Planck Institute for Computer Science Stanford University Joint work with Ralf Schenkel, Gerhard Weikum TopX Efficient & Versatile

What Computers Know, Should Know, andpeople.mpi-inf.mpg.de/~weikum/weikum-WebDB-June2016.pdf · „Morricone‘smusic“ „appeared in“ „For a Few Dollars More“ „ the m aestro‘s

Graph-based Text Classification: Learn from Your Neighbors Ralitsa Angelova ， Gerhard Weikum : Max Planck Institute for Informatics Stuhlsatzenhausweg

DEANNA: Natural Language Questions for the Web of Data Mohamed Yahya † Klaus Berberich † Shady Elbassiousni * Maya Ramanath ‡ Volker Tresp # Gerhard Weikum

Relevance feedback using query-logs Gaurav Pandey Supervisors: Prof. Gerhard Weikum Julia Luxenburger

VLDB ´04 Top-k Query Evaluation with Probabilistic Guarantees Martin Theobald Gerhard Weikum Ralf Schenkel Max-Planck Institute for Computer Science SaarbrückenGermany

Sreyasi Nag Chowdhury, Niket Tandon, Gerhard Weikum · Sreyasi Nag Chowdhury, Niket Tandon, Gerhard Weikum ... “Wow! Double-decker buses ... Existing CSK knowledge bases: WordNet,

INEX ‘05 TopX @ INEX ‘05 Martin Theobald Ralf Schenkel Gerhard Weikum Max Planck Institute for Informatics Saarbrücken