
TopX @ INEX ‘05


Page 1: TopX @ INEX ‘05

TopX @ INEX '05
An Efficient and Versatile Query Engine for TopX Search

Martin Theobald, Ralf Schenkel, Gerhard Weikum
Max Planck Institute for Informatics, Saarbrücken

Page 2: TopX @ INEX ‘05


//article[ //sec[about(., "XML retrieval")]
           //par[about(., "native XML database")] ]
         //bib[about(.//item, "W3C")]

[Figure: two example article documents matched against the query, shown as element trees (article, sec, par, bib, title, item, inproc, url). Recognizable content includes the titles "Current Approaches to XML Data Management", "XML-QL: A Query Language for XML", "Native XML databases", "The XML Files", "The Ontology Game", and "The Dirty Little Secret"; paragraph snippets on data management systems, XML queries with Datalog-like expressive power, native XML databases storing schemaless data, and what XML adds for retrieval; and bib entries such as "Proc. Query Languages Workshop, W3C, 1998" and the url w3c.org/xml.]

Page 3: TopX @ INEX ‘05


TopX: Efficient XML-IR [VLDB ’05]

Extend top-k query processing algorithms for sorted lists [Buckley ’85; Güntzer, Balke & Kießling ’00; Fagin ‘01] to XML data

Non-schematic, heterogeneous data sources

Combined inverted index for content & structure

Avoid full index scans, postpone expensive random accesses to large disk-resident data structures

Exploit cheap disk space for redundant indexing

Goal: Efficiently retrieve the best results of a similarity query


Page 4: TopX @ INEX ‘05


Data Model

Simplified XML model, disregarding IDRef & XLink/XPointer

Redundant full-contents
Per-element term frequencies ftf(t_i, e) for full-contents
Pre/postorder labels for each tag-term pair

<article>
  <title>XML-IR</title>
  <abs>IR techniques for XML</abs>
  <sec>
    <title>Clustering on XML</title>
    <par>Evaluation</par>
  </sec>
</article>

[Figure: element tree of the example document with pre/postorder labels (e.g., article = (1, 6)) and the redundant full-content string of each element: article "xml ir ir technique xml clustering xml evaluation", title "xml ir", abs "ir technique xml", sec "clustering xml evaluation", title "clustering xml", par "evaluation"; hence ftf("xml", article1) = 3.]
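To make the data model concrete, here is a minimal sketch (Python, not TopX code) that assigns pre/postorder labels and derives the redundant full-content term frequencies ftf(t, e) for the example document; the tokenizer is deliberately crude, without the stemming and stopword removal TopX applies:

# Minimal sketch, not TopX code: pre/postorder labeling plus redundant
# full-content term frequencies ftf(t, e) for the example document.
import re
import xml.etree.ElementTree as ET
from collections import Counter

DOC = ("<article><title>XML-IR</title><abs>IR techniques for XML</abs>"
       "<sec><title>Clustering on XML</title><par>Evaluation</par></sec></article>")

def tokenize(text):
    # crude tokenizer; TopX additionally applies stemming and stopword removal
    return re.findall(r"[a-z]+", (text or "").lower())

labels = {}   # element -> (pre, post)
ftf = {}      # element -> Counter over its full-content terms
counter = {"pre": 0, "post": 0}

def visit(elem):
    counter["pre"] += 1
    pre = counter["pre"]
    tokens = tokenize(elem.text)
    for child in elem:
        tokens += visit(child)            # full-contents include all descendants
        tokens += tokenize(child.tail)
    counter["post"] += 1
    labels[elem] = (pre, counter["post"])
    ftf[elem] = Counter(tokens)
    return tokens

root = ET.fromstring(DOC)
visit(root)
print(labels[root], ftf[root]["xml"])     # -> (1, 6) and ftf("xml", article) = 3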

Page 5: TopX @ INEX ‘05


Full-Content Scoring Model

Full-content scores cast into an Okapi-BM25 probabilistic model with element-specific parameterization

Basic scoring idea within IR-style family of TF*IDF ranking functions

Per-element statistics:

tag      N          avg. length  k1    b
article  12,223     2,903        10.5  0.75
sec      96,709     413          10.5  0.75
par      1,024,907  32           10.5  0.75
fig      109,230    13           10.5  0.75

Additional static score mass c for relaxable structural conditions
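A hedged sketch of how such an element-level BM25 score could be computed from the per-tag statistics above; the function and parameter names are illustrative, and the exact TopX weighting (e.g., idf smoothing and the handling of the static score mass c) may differ:

# Illustrative sketch, assuming the standard Okapi-BM25 form with
# element-type-specific statistics; not the exact TopX formula.
import math

# per-tag statistics from the table above
TAG_STATS = {
    "article": dict(N=12_223,    avglength=2_903, k1=10.5, b=0.75),
    "sec":     dict(N=96_709,    avglength=413,   k1=10.5, b=0.75),
    "par":     dict(N=1_024_907, avglength=32,    k1=10.5, b=0.75),
    "fig":     dict(N=109_230,   avglength=13,    k1=10.5, b=0.75),
}

def element_score(tag, ftf, ef, length):
    """BM25-style score of one term for one element.
    ftf    -- full-content term frequency of the term in this element
    ef     -- number of elements with this tag whose full-contents contain the term
    length -- full-content length of this element (number of terms)"""
    s = TAG_STATS[tag]
    K = s["k1"] * ((1 - s["b"]) + s["b"] * length / s["avglength"])
    tf_part = (s["k1"] + 1) * ftf / (K + ftf)
    idf_part = math.log((s["N"] - ef + 0.5) / (ef + 0.5))
    return tf_part * idf_part

# hypothetical numbers: a 50-term sec element with ftf("clustering") = 3,
# where 1,200 sec elements contain "clustering" at all
print(element_score("sec", ftf=3, ef=1_200, length=50))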

Page 6: TopX @ INEX ‘05


Inverted Block-Index for Content & Structure

Inverted index over tag-term pairs (full-contents)
Benefits from increased selectivity of combined tag-term pairs
Accelerates child-or-descendant axis, e.g., sec//"clustering"

Index lists for the query conditions sec[clustering], title[xml], and par[evaluation]:

sec[clustering]
eid  docid  score  pre  post  max-score
46   2      0.9    2    15    0.9
9    2      0.5    10   8     0.9
171  5      0.85   1    20    0.85
84   3      0.1    1    12    0.1

title[xml]
eid  docid  score  pre  post  max-score
216  17     0.9    2    15    0.9
72   3      0.8    10   8     0.8
51   2      0.5    4    12    0.5
671  31     0.4    12   23    0.4

par[evaluation]
eid  docid  score  pre  post  max-score
3    1      1.0    1    21    1.0
28   2      0.8    8    14    0.8
182  5      0.75   3    7     0.75
96   4      0.75   6    4     0.75

Sequential block-scans
Re-order elements in descending order of (maxscore, docid, score) per list
Fetch all tag-term pairs per doc in one sequential block-access
docid limits the range of in-memory structural joins

Stored as inverted files or database tables (B+-tree indexes)
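The pre/post columns allow containment tests without touching the documents again; below is a minimal sketch of the descendant check used by an in-memory structural join, here in its simplest nested-loops form and with entries borrowed from the example lists above:

# Minimal sketch of an in-memory structural join over index entries that
# carry (eid, docid, score, pre, post); the nested-loops variant shown here
# is the simplest of the join strategies mentioned in the talk.
from collections import namedtuple

Entry = namedtuple("Entry", "eid docid score pre post")

def is_descendant(anc, desc):
    # pre/postorder containment: the ancestor starts earlier and ends later
    return (anc.docid == desc.docid
            and anc.pre < desc.pre
            and anc.post > desc.post)

def join_descendants(ancestors, descendants):
    """Pairs (a, d) where d lies in the subtree of a; both inputs are the
    entries of one document fetched in a single block-scan."""
    return [(a, d) for a in ancestors for d in descendants if is_descendant(a, d)]

# entries of doc 2 from the sec[clustering] and title[xml] lists above
secs   = [Entry(46, 2, 0.9, 2, 15)]
titles = [Entry(51, 2, 0.5, 4, 12)]
print(join_descendants(secs, titles))   # element 51 is a descendant of element 46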

Page 7: TopX @ INEX ‘05


Navigational Index

Additional navigational index
Non-redundant element directory
Supports element paths and branching path queries
Random accesses using (docid, tag) as key

Schema-oblivious indexing & querying

Navigational lists for the tags sec, title, and par:

sec
eid  docid  pre  post
46   2      2    15
9    2      10   8
171  5      1    20
84   3      1    12

title
eid  docid  pre  post
216  17     2    15
72   3      10   8
51   2      4    12
671  31     12   23

par
eid  docid  pre  post
3    1      1    21
28   2      8    14
182  5      3    7
96   4      6    4
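A small illustrative sketch of how such a (docid, tag)-keyed element directory can be probed; names and data layout are hypothetical, and a real deployment keeps the directory in an inverted file or B+-tree index on disk:

# Illustrative sketch of a (docid, tag)-keyed element directory; one random
# access retrieves all elements of a given tag within one document.
from collections import defaultdict

navigational_index = defaultdict(list)   # (docid, tag) -> [(eid, pre, post), ...]

def add(docid, tag, eid, pre, post):
    navigational_index[(docid, tag)].append((eid, pre, post))

# entries from the tables above
add(2, "sec", 46, 2, 15)
add(2, "sec", 9, 10, 8)
add(2, "title", 51, 4, 12)

def random_access(docid, tag):
    """One random access: all elements with this tag inside document docid."""
    return navigational_index.get((docid, tag), [])

print(random_access(2, "sec"))   # -> [(46, 2, 15), (9, 10, 8)]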

Page 8: TopX @ INEX ‘05


TopX Query Processing

Adapt Threshold Algorithm (TA) paradigm
Focus on inexpensive sequential/sorted accesses
Postpone expensive random accesses

Candidate d = connected sub-pattern with element ids and scores
Incrementally evaluate path constraints using pre/postorder labels
In-memory structural joins (nested loops, staircase, or holistic twig joins)

Upper/lower score guarantees per candidate
Remember set of evaluated dimensions E(d)

worstscore(d) = Σ_{i ∈ E(d)} score(t_i, e)
bestscore(d) = worstscore(d) + Σ_{i ∉ E(d)} high_i

Early threshold termination
Candidate queuing
Stop if no remaining candidate has bestscore(d) > min-k, the worstscore of the current k-th result

Extensions
Batching of sorted accesses & efficient queue management
Cost model for random access scheduling

Probabilistic candidate pruning for approximate top-k results [VLDB ’04]

[Fagin et al., PODS '01; Güntzer et al., VLDB '00; Buckley & Lewit, SIGIR '85]
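A toy sketch of the TA-style control flow with worstscore/bestscore bookkeeping and min-k termination; it ignores the structural joins, random accesses, and queue management of the real engine, and the example lists are abridged, document-level versions of the index lists from the earlier slides:

# Toy sketch of TA-style top-k processing with worstscore/bestscore bounds;
# the real TopX engine additionally performs in-memory structural joins,
# candidate queuing, and cost-based scheduling of random accesses.
def top_k(lists, k):
    """lists: one score-descending list of (docid, score) per query condition."""
    high = [lst[0][1] if lst else 0.0 for lst in lists]  # last score seen per list
    seen = {}                                            # docid -> {list index: score}
    worst, top = {}, []
    depth, max_depth = 0, max(len(lst) for lst in lists)

    while depth < max_depth:
        for i, lst in enumerate(lists):                  # one round of sorted accesses
            if depth < len(lst):
                docid, score = lst[depth]
                high[i] = score
                seen.setdefault(docid, {})[i] = score
            else:
                high[i] = 0.0                            # list i is exhausted
        depth += 1

        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(h for i, h in enumerate(high) if i not in s)
                for d, s in seen.items()}
        top = sorted(worst, key=worst.get, reverse=True)[:k]
        min_k = worst[top[-1]] if len(top) == k else 0.0

        # early termination: neither a partially seen candidate nor a completely
        # unseen document (bestscore = sum of all highs) can still beat min-k
        if sum(high) <= min_k and all(best[d] <= min_k for d in seen if d not in top):
            break

    return [(d, worst[d]) for d in top]

# abridged document-level versions of the three example index lists
lists = [[(2, 0.9), (5, 0.85), (3, 0.1)],    # sec[clustering]
         [(17, 0.9), (3, 0.8), (2, 0.5)],    # title[xml]
         [(1, 1.0), (2, 0.8), (5, 0.75)]]    # par[evaluation]
print(top_k(lists, k=2))                     # -> [(2, 2.2...), (5, 1.6)]

On these lists the sketch returns doc 2 (score 2.2) and doc 5 (score 1.6), which matches the top-2 results of the walk-through on the next slide.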

Page 9: TopX @ INEX ‘05


TopX Query Processing By Example

Query: sec[clustering] with title[xml] and par[evaluation]

[Animated example: round-robin sorted accesses over the three index lists from the previous slides. Each scanned element spawns or extends a candidate with a [worstscore, bestscore] interval (initially a pseudo-candidate with worstscore = 0.0, bestscore = 2.9); candidates are joined in memory per document and kept in the candidate queue, and the min-2 threshold rises from 0.0 over 0.5 and 0.9 to 1.6 as scanning proceeds. The final top-2 results are doc 2 (elements 46, 51, 28; worstscore = 2.2) and doc 5 (elements 171, 182; worstscore = 1.6).]

Page 10: TopX @ INEX ‘05


CO.Thorough

Element-granularity

Turn query into pseudo CAS query using “//*”

No post-filtering on specific element types

nxCG@10 = 0.0379 (rank 22 of 55)

MAP = 0.008 (rank 37 of 55)

Old INEX_eval: MAP=0.058 (rank 3)

Page 11: TopX @ INEX ‘05


COS.Fetch&Browse

Document-granularity

Rank documents according to their best target element

Strict evaluation of support & target elements

Return all target elements per doc using the document score (no overlap)

MAP = 0.0601 (rank 4 of 19)

Page 12: TopX @ INEX ‘05


SSCAS

Element-granularity with strict support & target elements (no overlap)

nxCG@10 = 0.45 (ranks 1 & 2 of 25)

MAP = 0.0322 & 0.0272 (ranks 1 & 6)

Page 13: TopX @ INEX ‘05


Top-k Efficiency

[Chart: #SA + #RA (y-axis up to 12,000,000) as a function of k for Join&Sort, StructIndex, StructIndex+, TopX BenProbe, and TopX MinProbe.]

[Table (k = 10, ε = 0.0): TopX MinProbe needs 635,507 sorted and 64,807 random accesses, TopX BenProbe 723,169 sorted and 84,424 random accesses (882,929 SA and 1,902,427 RA at k = 1,000), and Join&Sort 109,122,318 accesses; further rows report CPU seconds, P@k, MAP@k, and relPrec for the StructIndex and StructIndex+ baselines.]

Page 14: TopX @ INEX ‘05


Probabilistic Pruning

[Charts: relPrec, P@10, and MAP as a function of the pruning threshold ε (0.0 to 1.0), and the corresponding #SA + #RA (up to 800,000) for TopX MinProbe.]

TopX MinProbe, k = 10:

epsilon  # SA     # RA    MAP@k  P@k   relPrec
0.00     635,507  64,807  0.03   0.34  1.00
0.25     392,395  56,952  0.05   0.34  0.77
0.50     231,109  48,963  0.02   0.31  0.65
0.75     102,118  42,174  0.01   0.33  0.51
1.00      36,936  35,327  0.01   0.30  0.38
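A hedged sketch of the idea behind probabilistic candidate pruning: the unknown contribution of each unevaluated condition is modeled by a score distribution (a uniform distribution and a Monte-Carlo estimate here, purely for illustration; TopX uses histogram convolutions), and a candidate is dropped once its estimated chance of still exceeding min-k falls below ε:

# Illustrative sketch of probabilistic candidate pruning; the uniform score
# assumption and the Monte-Carlo estimate stand in for histogram convolutions.
import random

def prob_above(worstscore, highs_unseen, min_k, samples=10_000):
    """Estimate P[final score of the candidate > min_k] by sampling the unknown
    score of every unevaluated condition uniformly from [0, high_i]."""
    if not highs_unseen:
        return 1.0 if worstscore > min_k else 0.0
    hits = 0
    for _ in range(samples):
        total = worstscore + sum(random.uniform(0.0, h) for h in highs_unseen)
        if total > min_k:
            hits += 1
    return hits / samples

def keep_candidate(worstscore, highs_unseen, min_k, epsilon):
    # epsilon = 0.0 keeps every candidate (exact top-k); larger epsilon prunes
    # more aggressively, trading result quality for fewer accesses
    return prob_above(worstscore, highs_unseen, min_k) >= epsilon

# illustrative numbers: a candidate with worstscore 0.9, two unevaluated
# conditions with current highs 0.1 and 0.75, and min-k = 1.6
print(keep_candidate(worstscore=0.9, highs_unseen=[0.1, 0.75], min_k=1.6, epsilon=0.25))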

Page 15: TopX @ INEX ‘05


Conclusions & Ongoing Work

Efficient and versatile TopX query processor
Extensible framework for text, semi-structured & structured data

Probabilistic extensions
Probabilistic cost model for random access scheduling
Very good precision/runtime ratio for probabilistic candidate pruning

Full NEXI support
Phrase matching, mandatory terms "+", negation "-", attributes "@"
Query weights (e.g., relevance feedback, ontological similarities)

Scalability
Optimized for runtime, exploits cheap disk space (redundancy factor 4-5 for INEX)
Participated in TREC Terabyte Efficiency Task

Dynamic and self-tuning query expansions [SIGIR '05]
Incrementally merges inverted lists for a set of active expansions

Vague Content & Structure (VCAS) queries (maybe next year..)