Tunable Compression of Word-level Index for Versioned Corpora
Klaus Berberich, Srikanta Bedathur, Gerhard Weikum
Max-Planck Institute for Informatics
Saarbruecken, Germany
EIIR 2008, Glasgow 2/19
Introduction
• Most document collections are not static
– Intranet documents, mail folders, blogs, source code, and contents of the World Wide Web
– Contents are being archived, possibly time-stamped and/or versioned
• Wikis
• Document repositories (SVN, CVS, …)
• Desktop
• Web archives!
• Search over evolving collections
– Ability to query the collection “as of” a given time
• Time-travel Search [BBNW’07]
Outline
• Time-travel Search
• Our Time-machine: FluxCapacitor/TTIX
– Phrase Queries in TTIX
• FUSION and Controlled FUSION
• Experimental Evaluation
Historical Information Needs
1. News articles discussing the cola-drinks cancer controversy during 2005-2006
2. Contemporary articles about “Harry Potter and the Philosopher’s Stone”
3. Angela Merkel’s interview during 2002
Time-Travel Search
Example: Angela Merkel Interview @ 2002
– “Angela Merkel Interview”: keyword query
– “@ 2002”: time-context for evaluation & ranking
Keyword search extended with a time-context for evaluation
Q = q @ ts
Evaluate q using the collection that existed at time ts
Key Challenges
• Dealing with the MASSIVE size
• Adapting the scoring models (typically defined for static collections)
• Efficient query processing
Opportunities
• Redundancy in content
• Sufficiency of good approximations
• Append-only data growth
FluxCapacitor/TTIX
Adapt Inverted Index structure to include validity time-interval of each document-version
Documents D1, D2, D3 are observed to have changed at different times
[Figure: Version-history of Documents. A timeline from t0 through t13 to “now” shows when new versions of D1, D2 and D3 were observed, including a “deletion” of D3.]
[Figure: Time-stamped Inverted Index. Each vocabulary entry points to a list of postings, each holding a doc. id, a value and a validity time-interval, e.g. D3 | 2.2 | [t0,t3), D1 | 2.0 | [t0,t2), D3 | 1.9 | [t3,t7), D2 | 1.87 | [t0,t1), D1 | 1.6 | [t2,t4), …]
[Berberich, Bedathur, Neumann, Weikum: SIGIR 2007, VLDB 2007]
• Index compaction via approximate temporal coalescing
• A sublist materialization framework for trading off space and performance
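As a rough illustration of the time-stamped inverted index, the Python sketch below looks up the postings of one term as of a time ts, i.e. the core of evaluating Q = q @ ts. The `Posting` type, the numeric timestamps and the reading of the number next to each doc. id as a score are assumptions made for this example. The values mirror the postings in the figure.

```python
from typing import List, NamedTuple, Tuple


class Posting(NamedTuple):
    doc_id: str
    score: float   # assumed meaning of the number shown next to each doc. id
    begin: float   # validity interval is half-open: [begin, end)
    end: float


def postings_at(postings: List[Posting], ts: float) -> List[Tuple[str, float]]:
    """Return (doc_id, score) for every document version valid at time ts."""
    return [(p.doc_id, p.score) for p in postings if p.begin <= ts < p.end]


# Postings from the figure, with t0..t7 mapped to the numbers 0..7.
example_list = [
    Posting("D3", 2.2, 0, 3),
    Posting("D1", 2.0, 0, 2),
    Posting("D3", 1.9, 3, 7),
    Posting("D2", 1.87, 0, 1),
    Posting("D1", 1.6, 2, 4),
]

print(postings_at(example_list, 2.5))
```

This linear scan reads the whole list. It shows only the validity-interval filter, not the index compaction or sublist materialization named above.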
Phrase Queries
• Significantly improve effectiveness
• Essential for quickly locating
– entities, e.g., “Coca Cola”, “Where Eagles Dare”, …
– concepts, e.g., “Water filtering”
– …
• Indexing for phrase queries
– For each word, need to store positional information for every occurrence
– Index-size blowup
– Size reduction via gap encoding + space-efficient coding on positions [Scholer et al. 2002]
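The size reduction mentioned in the last bullet can be sketched as follows: positions are turned into gaps, and each gap is written with a variable-length code. Elias-gamma is used here as one example of such a space-efficient code. The function names are hypothetical, and positions are assumed to be strictly increasing integers starting at 1 or higher.

```python
from typing import List


def to_gaps(positions: List[int]) -> List[int]:
    """Replace each sorted position by its difference to the previous one."""
    gaps, previous = [], 0
    for pos in positions:
        gaps.append(pos - previous)
        previous = pos
    return gaps


def elias_gamma(n: int) -> str:
    """Elias-gamma code of n >= 1: (length - 1) zeros, then n in binary."""
    if n < 1:
        raise ValueError("Elias-gamma is defined for integers >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def encode_positions(positions: List[int]) -> str:
    """Gap-encode the positions and concatenate the Elias-gamma codes."""
    return "".join(elias_gamma(g) for g in to_gaps(positions))


def decode_positions(bits: str) -> List[int]:
    """Invert encode_positions: read gamma codes and sum up the gaps."""
    positions, current, i = [], 0, 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":      # unary prefix gives the code length
            zeros += 1
            i += 1
        current += int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        positions.append(current)
    return positions


encoded = encode_positions([3, 6, 7])
print(encoded, decode_positions(encoded))
```

Small gaps get short codes, which is why numerically close positions compress well.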
Phrase Queries in FluxCapacitor
• Baseline: for each document version $d^{t_b}$, a posting of the following structure
$[t_b, t_e) \mid d_{id} \mid \text{positions}$
– $[t_b, t_e)$: validity time-interval (= 64 bits)
– $d_{id}$: document identifier (= 64 bits)
– positions: list of word-positions
• Word-positions compressed using standard techniques
– (Gap + Elias-/Golomb-) encodings
Can this be improved?
Word-Positions across Versions
• High level of redundancy between versions
– Append-only changes leave most parts unchanged
– word b between $d^{t_1}$ and $d^{t_2}$
• Numerical closeness of positions
– Small shifts in positions
– word c between $d^{t_2}$ and $d^{t_3}$
b: $[t_1, t_2) \mid d \mid 2, 4$ ; $[t_2, t_3) \mid d \mid 2, 4$ ; $[t_3, \mathit{now}) \mid d \mid 2, 6$
c: $[t_1, t_2) \mid d \mid 3$ ; $[t_2, t_3) \mid d \mid 3, 6, 7$ ; $[t_3, \mathit{now}) \mid d \mid 3, 5, 8, 9$
FUSION
• Idea:
– Merge (or fuse) multiple consecutive document versions, and exploit redundancy and positional proximity => better compressibility
$[t_0, \mathit{now}) \mid d_{id} \mid \text{positions} \mid \text{timestamps} \mid \text{signatures}$
• Positions: all word-positions in any of the versions
• Timestamps: all intermediate version timestamps
• Signatures: for each version, a bit-signature of positions
b: $[t_1, \mathit{now}) \mid d \mid 2, 4, 6 \mid t_2, t_3 \mid 110, 110, 101$
c: $[t_1, \mathit{now}) \mid d \mid 3, 5, 6, 7, 8, 9 \mid t_2, t_3 \mid 100000, 101100, 110011$
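The fused postings for b and c can be reproduced with a short Python sketch. The tuple layout, the string timestamps and the function name `fuse` are assumptions chosen for illustration, not the actual FluxCapacitor/TTIX data structures.

```python
from typing import List, Tuple

# One per-version posting of a word in a single document:
# (t_begin, t_end, positions), valid during [t_begin, t_end).
VersionPosting = Tuple[str, str, List[int]]


def fuse(postings: List[VersionPosting]):
    """Fuse consecutive per-version postings into one posting.

    Returns (interval, positions, timestamps, signatures) where
    - positions holds every position seen in any version,
    - timestamps holds the intermediate version boundaries,
    - signatures has one bit string per version, with bit i set
      if the version contains positions[i].
    """
    positions = sorted({p for _, _, ps in postings for p in ps})
    timestamps = [begin for begin, _, _ in postings[1:]]
    signatures = [
        "".join("1" if p in set(ps) else "0" for p in positions)
        for _, _, ps in postings
    ]
    interval = (postings[0][0], postings[-1][1])
    return interval, positions, timestamps, signatures


word_b = [("t1", "t2", [2, 4]), ("t2", "t3", [2, 4]), ("t3", "now", [2, 6])]
word_c = [("t1", "t2", [3]), ("t2", "t3", [3, 6, 7]), ("t3", "now", [3, 5, 8, 9])]

print(fuse(word_b))
print(fuse(word_c))
```

The position list is stored once for all versions, and each version only adds one bit per fused position.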
Query Processing – win some, lose some
• Save on overall space
– Naïve organization + processing => reads the whole list, computes ranking
– FUSION maintains smaller list, so faster (naïve) query processing
• Who is naïve!?
– Skip pointers to jump ahead during query proc.
– In the worst case, FUSION ends up reading and processing all the versions, instead of just one version!
Baseline: good performance, bad storage
FUSION: bad (worst-case) performance, good storage
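To make the worst-case cost concrete, the sketch below recovers the positions of a single version from a fused posting at query time. The tuple layout, the numeric timestamps (t1 = 1, t2 = 2, t3 = 3, now = infinity) and the function name are assumptions for illustration. The data is the fused posting for word c.

```python
from bisect import bisect_right
from typing import List, Optional

INF = float("inf")

# Fused posting for word c: (interval, positions, timestamps, signatures).
fused_c = ((1, INF), [3, 5, 6, 7, 8, 9], [2, 3], ["100000", "101100", "110011"])


def positions_at(fused, ts: float) -> Optional[List[int]]:
    """Return the word-positions of the version valid at time ts, if any."""
    (begin, end), positions, timestamps, signatures = fused
    if not (begin <= ts < end):
        return None
    # Number of version boundaries <= ts gives the index of the version.
    version = bisect_right(timestamps, ts)
    signature = signatures[version]
    # Every fused position is touched, even if the version uses only a few.
    return [p for p, bit in zip(positions, signature) if bit == "1"]


print(positions_at(fused_c, 1.5))
print(positions_at(fused_c, 2))
print(positions_at(fused_c, 10))
```

For the first version, six fused positions are scanned to return one. Controlled FUSION bounds this kind of overhead.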
Controlled FUSION
• Compute a set of fusions over contiguous versions s.t.
– It takes minimal storage for word positions
– For any version, the maximum worst-case query processing overhead is within η
• Can be set up as an optimization problem
• Optimal solution computable in O(n^3) time and O(n) space
– Assumption: storage cost is monotone:
$[t_b, t_e) \subseteq [t_b', t_e') \Rightarrow \mathrm{cost}([t_b, t_e)) \le \mathrm{cost}([t_b', t_e'))$
– In practice, we found it close to O(n^2)
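One way to picture this optimization problem is a simple dynamic program over contiguous groups of versions. The sketch below is not the authors' algorithm. The cost model (`POSITION_BITS` per distinct position plus one signature bit per version and position) and the overhead measure (fused positions divided by the smallest version's own positions) are assumptions made for this example. Versions are assumed non-empty.

```python
from typing import List, Sequence, Set, Tuple

POSITION_BITS = 8  # hypothetical cost of storing one position


def group_cost(group: Sequence[Set[int]]) -> int:
    """Assumed storage cost of fusing the given consecutive versions."""
    union = set().union(*group)
    return POSITION_BITS * len(union) + len(group) * len(union)


def group_overhead(group: Sequence[Set[int]]) -> float:
    """Assumed worst-case overhead: fused size over smallest version size."""
    union = set().union(*group)
    return len(union) / min(len(v) for v in group)


def controlled_fusion(versions: List[Set[int]], eta: float) -> Tuple[List[Tuple[int, int]], float]:
    """Split versions into contiguous groups of minimal total cost,
    keeping every group's overhead within eta (requires eta >= 1)."""
    if eta < 1:
        raise ValueError("eta must be at least 1")
    n = len(versions)
    best = [0.0] + [float("inf")] * n   # best[i]: min cost of versions[:i]
    cut = [0] * (n + 1)                 # cut[i]: start of the last group
    for end in range(1, n + 1):
        for start in range(end - 1, -1, -1):
            group = versions[start:end]
            if group_overhead(group) > eta:
                break  # overhead only grows when the group is extended
            candidate = best[start] + group_cost(group)
            if candidate < best[end]:
                best[end], cut[end] = candidate, start
    groups, end = [], n
    while end > 0:
        groups.append((cut[end], end))
        end = cut[end]
    groups.reverse()
    return groups, best[n]


word_b = [{2, 4}, {2, 4}, {2, 6}]
print(controlled_fusion(word_b, 1.0))
print(controlled_fusion(word_b, 1.5))
```

With η = 1.0 only the two identical versions of b are fused. Relaxing η to 1.5 allows all three to be fused at a lower total cost under this cost model.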
Experimental Evaluation
• English Wikipedia
– Revision history (2004-2005)
– 10% sample (~35,000 docs, ~900,000 ver.)
• Baseline:
– Elias- code: 97.51 GBytes
– Elias- code: 97.77 GBytes
• FUSION:
– η between 1.1 and 10
– Elias- & Elias- for compressing word-positions in each fused posting
Conclusions
• Time-travel Search
– Key to archive search & analysis
– An interesting and important problem!
• Our Time-machine: FluxCapacitor/TTIX
– Builds on inverted index framework
– Tunable index-size reduction
• FUSION
– Adds phrase-querying to FluxCapacitor/TTIX
– More than 50% space reduction over baseline
• With 50% worst-case overhead in query proc.