Advanced Indexing Techniques with Apache Lucene - Payloads

Michael Busch (buschmi@apache.org)


Agenda

• Part 1: Inverted Index 101
  – Posting Lists
  – Stored Fields vs. Payloads

• Part 2: Use cases for Payloads
  – BoostingTermQuery
  – Simple facet counting


Lucene’s data structures

[Diagram: a search runs against the inverted index and yields hits; stored fields for the results are then retrieved from the store.]


c:\docs\shakespeare.txt:

To be or not to be.

c:\docs\einstein.txt:

The important thing is not to stop questioning.

Query: not

String comparison slow!

Solution: Inverted index

Query: not

Inverted index

Term          Document IDs
be            0
important     1
is            1
not           0, 1
or            0
questioning   1
stop          1
to            0, 1
the           1
thing         1

(Document 0 is shakespeare.txt, document 1 is einstein.txt.)
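The term-to-document-IDs mapping can be sketched in plain Java. This is a minimal illustration, not Lucene's implementation: the class name `InvertedIndexSketch`, the method `build`, and the simple letters-only tokenizer are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class InvertedIndexSketch {
    /** Maps each lower-cased term to the ascending IDs of the documents that contain it. */
    static Map<String, List<Integer>> build(List<String> docs) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // Hypothetical tokenizer: lower-case the text and split on anything that is not a-z.
            for (String term : docs.get(docId).toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                if (term.isEmpty()) continue;
                List<Integer> postings = index.computeIfAbsent(term, t -> new ArrayList<>());
                // Documents are visited in ID order, so checking the last entry avoids duplicates.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index = build(Arrays.asList(
                "To be or not to be.",
                "The important thing is not to stop questioning."));
        // A term query is now a single map lookup instead of a scan over every document.
        System.out.println("not -> " + index.get("not"));
    }
}
```

Looking up `not` returns the posting list for that term directly, which is the speed-up over comparing the query string against every document.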

Query: "not to"

Token positions:

c:\docs\shakespeare.txt:

To  be  or  not  to  be.
0   1   2   3    4   5

c:\docs\einstein.txt:

The  important  thing  is  not  to  stop  questioning.
0    1          2      3   4    5   6     7

The inverted index still holds only document IDs per term.

Query: "not to"

Inverted index with positions

Term          Document ID: positions
be            0: 1, 5
important     1: 1
is            1: 3
not           0: 3;  1: 4
or            0: 2
questioning   1: 7
stop          1: 6
to            0: 0, 4;  1: 5
the           1: 0
thing         1: 2
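A phrase query such as "not to" can be answered by checking whether the second term occurs at the position directly after the first. The following is a minimal sketch under the same assumptions as a plain map-based index; `PositionalIndexSketch` and `phrase` are hypothetical names, and only two-term phrases with lower-case query terms are handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class PositionalIndexSketch {
    /** term -> (document ID -> token positions of the term in that document) */
    private final Map<String, Map<Integer, List<Integer>>> index = new TreeMap<>();

    PositionalIndexSketch(List<String> docs) {
        for (int docId = 0; docId < docs.size(); docId++) {
            int position = 0;
            for (String term : docs.get(docId).toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                if (term.isEmpty()) continue;
                index.computeIfAbsent(term, t -> new TreeMap<>())
                     .computeIfAbsent(docId, d -> new ArrayList<>())
                     .add(position++);
            }
        }
    }

    /** Returns the IDs of documents in which `second` occurs directly after `first`. */
    List<Integer> phrase(String first, String second) {
        List<Integer> hits = new ArrayList<>();
        Map<Integer, List<Integer>> firstDocs = index.get(first);
        Map<Integer, List<Integer>> secondDocs = index.get(second);
        if (firstDocs == null || secondDocs == null) return hits;
        for (Map.Entry<Integer, List<Integer>> entry : firstDocs.entrySet()) {
            List<Integer> secondPositions = secondDocs.get(entry.getKey());
            if (secondPositions == null) continue;   // second term missing from this document
            for (int p : entry.getValue()) {
                if (secondPositions.contains(p + 1)) {   // adjacent positions form the phrase
                    hits.add(entry.getKey());
                    break;
                }
            }
        }
        return hits;
    }
}
```

For the two example documents, `phrase("not", "to")` matches both: positions 3 and 4 in document 0, and positions 4 and 5 in document 1.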

Inverted index with Payloads

Each position in a posting list can carry a payload, shown here as [P]:

Term          Document ID: positions with payloads
be            0: 1 [P], 5 [P]
important     1: 1 [P]
is            1: 3 [P]
not           0: 3 [P];  1: 4 [P]
or            0: 2 [P]
questioning   1: 7 [P]
stop          1: 6 [P]
to            0: 0 [P], 4 [P];  1: 5 [P]
the           1: 0 [P]
thing         1: 2 [P]


So far…

• String comparison slow

• Inverted index used to accelerate search

• Store positions in posting lists to allow phrase searches

• Store payloads in posting lists to store arbitrary data with each position


Lucene’s data structures

[Diagram: a search runs against the inverted index and yields hits; stored fields for the results are then retrieved from the store.]


Store

Field 1: title
Field 2: content
Field 3: hashvalue

Documents:

D0: F1 F2 F3 | D1: F1 F2 | D2: F1 F2 F3


Store

D0: F1 F2 F3 | D1: F1 F2 | D2: F1 F2 F3

• Optimized for random access

• Document-locality


Store

D0: F1 F2 F3 | D1: F1 F2 | D2: F1 F2 F3

Posting list with Payloads

[Diagram: a posting list whose entries each hold a document ID, a position (0), and the document's F3 value as payload.]

• Optimized for scanning and skipping

• Value-locality
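The difference in access pattern can be illustrated with a small sketch of a posting list that keeps one value per document. Everything here is an assumption for illustration: `PayloadPostingSketch` is a hypothetical class, payloads are simplified to `int` values instead of byte arrays, and a binary search stands in for the skip data Lucene maintains.

```java
import java.util.Arrays;
import java.util.OptionalInt;

class PayloadPostingSketch {
    private final int[] docIds;   // ascending IDs of the documents that have a value
    private final int[] values;   // values[i] is the payload stored with docIds[i]

    PayloadPostingSketch(int[] docIds, int[] values) {
        if (docIds.length != values.length) {
            throw new IllegalArgumentException("docIds and values must have the same length");
        }
        this.docIds = docIds.clone();
        this.values = values.clone();
    }

    /** Index of the first posting with docId >= target, or -1 if no such posting exists. */
    int skipTo(int target) {
        int i = Arrays.binarySearch(docIds, target);
        if (i < 0) i = -i - 1;   // not found: binarySearch encodes the insertion point
        return i < docIds.length ? i : -1;
    }

    /** The value stored for docId, if that document has a posting. */
    OptionalInt valueFor(int docId) {
        int i = skipTo(docId);
        return (i >= 0 && docIds[i] == docId) ? OptionalInt.of(values[i]) : OptionalInt.empty();
    }
}
```

Because all values of one field sit next to each other in document-ID order, a scorer or collector that walks hits in ascending ID order can read them sequentially and skip ahead, instead of loading each document's full set of stored fields.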


Agenda

• Part 1: Inverted Index 101
  – Posting Lists
  – Stored Fields vs. Payloads

• Part 2: Use cases for Payloads
  – BoostingTermQuery
  – Simple facet counting


Payloads - API

org.apache.lucene.analysis.Token

    void setPayload(Payload payload)

org.apache.lucene.index.TermPositions

    int getPayloadLength();
    byte[] getPayload(byte[] data, int offset)


Example: BoostingTermQuery

Analyzer:

    final byte BoldBoost = 5;
    …
    Token token = new Token(…);
    …
    // Tokens that were bold in the source text carry the boost as a one-byte payload.
    if (isBold) {
        token.setPayload(new Payload(new byte[] {BoldBoost}));
    }
    …
    return token;


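Example: BoostingTermQuery

Similarity:

    Similarity boostingSimilarity = new DefaultSimilarity() {
        // overrides Similarity.scorePayload
        public float scorePayload(byte[] payload, int offset, int length) {
            // A one-byte payload is interpreted as the boost factor.
            if (length == 1) return payload[offset];
            // Assumption (not on the slide): any other payload gets a neutral score.
            return 1.0f;
        }
    };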


Example: BoostingTermQuery

BoostingTermQuery:

    Query btq = new BoostingTermQuery(new Term("field", "searchterm"));

Searching:

    Searcher searcher = new IndexSearcher(…);
    // Install the payload-aware Similarity on the searcher instance.
    searcher.setSimilarity(boostingSimilarity);
    …
    Hits hits = searcher.search(btq);
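How the one-byte payloads could feed into a document's score can be sketched without Lucene. This is an illustration under a stated assumption: that the payload scores of a term's occurrences in a document are averaged and multiplied into the base term score. `PayloadBoostSketch`, `boostedScore`, and the `null` convention for occurrences without a payload are hypothetical and not part of the Lucene API.

```java
class PayloadBoostSketch {
    static final byte BOLD_BOOST = 5;

    /** Same rule as the slide's Similarity: a one-byte payload is the boost, anything else is neutral. */
    static float scorePayload(byte[] payload, int offset, int length) {
        return length == 1 ? payload[offset] : 1.0f;
    }

    /**
     * Assumed combination rule: base score times the average payload score.
     * payloadsPerOccurrence holds one entry per occurrence of the term in the document;
     * null marks an occurrence that carries no payload and is ignored.
     */
    static float boostedScore(float baseScore, byte[][] payloadsPerOccurrence) {
        float sum = 0;
        int seen = 0;
        for (byte[] payload : payloadsPerOccurrence) {
            if (payload == null) continue;
            sum += scorePayload(payload, 0, payload.length);
            seen++;
        }
        return seen > 0 ? baseScore * (sum / seen) : baseScore;
    }

    public static void main(String[] args) {
        // One bold occurrence and one plain occurrence of the search term.
        byte[][] occurrences = { {BOLD_BOOST}, null };
        System.out.println(boostedScore(2.0f, occurrences));
    }
}
```

Under this assumption, a document in which the search term appears in bold ranks above an otherwise identical document where it appears only in plain text.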


Example: Simple facet counting

Analyzer:

    public TokenStream tokenStream(String fieldName, Reader reader) {
        if (fieldName.equals("_facet")) {
            // Emit exactly one token whose payload identifies the document's site.
            return new TokenStream() {
                boolean done = false;

                public Token next() {
                    if (done) return null;
                    Token token = new Token(…);
                    token.setPayload(new Payload(computeHash(url)));
                    done = true;
                    return token;
                }
            };
        }
        …  // handling of all other fields is not shown on the slide
    }


Example: Simple facet counting

HitCollector:

• Use different PriorityQueues for different sites

• Instead of returning top-n results of the whole data set, return top-n results per site
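The per-site collection described above can be sketched with one bounded min-heap per site. This is a library-free illustration, not a Lucene `HitCollector`: `PerSiteTopN`, `Hit`, `collect`, and `topDocs` are hypothetical names, and the site hash is assumed to have been decoded from the `_facet` payload already.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class PerSiteTopN {
    static final class Hit {
        final int docId;
        final float score;
        Hit(int docId, float score) { this.docId = docId; this.score = score; }
    }

    private static final Comparator<Hit> BY_SCORE = Comparator.comparingDouble((Hit h) -> h.score);

    private final int n;
    // One min-heap per site hash: the head is the weakest hit currently kept for that site.
    private final Map<Integer, PriorityQueue<Hit>> queues = new HashMap<>();

    PerSiteTopN(int n) { this.n = n; }

    /** Called once per matching document. */
    void collect(int docId, float score, int siteHash) {
        PriorityQueue<Hit> queue =
                queues.computeIfAbsent(siteHash, s -> new PriorityQueue<Hit>(BY_SCORE));
        queue.add(new Hit(docId, score));
        if (queue.size() > n) queue.poll();   // drop the lowest-scoring hit of this site
    }

    /** Document IDs of the hits kept for one site, highest score first. */
    List<Integer> topDocs(int siteHash) {
        List<Integer> ids = new ArrayList<>();
        PriorityQueue<Hit> queue = queues.get(siteHash);
        if (queue == null) return ids;
        List<Hit> hits = new ArrayList<>(queue);
        hits.sort(BY_SCORE.reversed());
        for (Hit hit : hits) ids.add(hit.docId);
        return ids;
    }
}
```

Each queue never grows beyond n entries, so memory use is bounded by n times the number of distinct sites seen among the hits.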


Example: Simple facet counting

Summary:

• In this example: facet (site) used for scoring, but extendable for facet counting

• Good performance due to locality of facet values


Conclusion

• Payloads offer great flexibility

• Payloads are stored very space-efficiently

• Sophisticated data structures enable efficient skipping over payloads

• Payloads should be used whenever special data is required for finding hits and scoring


Outlook

• Finalize API (currently Beta)

• Add more out-of-the-box query types

• Per-document Payloads

Questions?