Lucene Boot Camp Grant Ingersoll Lucid Imagination Nov. 4, 2008 New Orleans, LA


Lucene Boot Camp

Grant Ingersoll, Lucid Imagination

Nov. 4, 2008 New Orleans, LA

Schedule

• In-depth Indexing/Searching
  – Performance, Internals
  – Filters, Sorting
• Terms and Term Vectors
• Class Project
• Q & A

Day I Recap

• Indexing
  – IndexWriter
  – Document/Field
  – Analyzer
• Searching
  – IndexSearcher
  – IndexReader
  – QueryParser
• Analysis
• Contrib

Indexing In-Depth

• Deletions and Updates
• Optimize
• Important Internals
  – File Formats
  – Segments, Commits, Merging
  – Compound File System
• Performance

Lucene File Formats and Structures

• http://lucene.apache.org/java/2_4_0/fileformats.html

• A Lucene index is made up of one or more Segments

• Lucene tracks Documents internally by an int “id”

• This id may change across index operations
  – You should not rely on it unless you know your index isn’t changing

• You can ask for a Document by this id on the IndexReader

Segments

• Each Segment is an independent index containing:
  – Field Names
  – Stored Field values
  – Term Dictionary, proximity info and normalization factors
  – Term Vectors (optional)
  – Deleted Docs

• Compound File System (CFS) stores all of these logical pieces in a single file

How Lucene Indexes

• Lucene indexes Documents into memory
  – At certain trigger points, the in-memory segments are committed/flushed to the Directory
    • Can be forced by calling commit()
  – Segments are periodically merged (more in a moment)

Segments and Merging

• May be created when new documents are added

• Are merged from time to time based on segment size in relation to:
  – MergePolicy
  – MergeScheduler
  – Optimization

Merge Policy

• Identifies Segments to be merged

• Two Current Implementations
  – LogDocMergePolicy
  – LogByteSizeMergePolicy

• mergeFactor - Max # of segments allowed before merging
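To make the effect of mergeFactor concrete, here is a minimal sketch of a logarithmic merge scheme. It is a toy model, not Lucene's LogMergePolicy code: every flushed segment is assumed to have size 1, and the class and method names (MergeFactorSketch, segmentSizes) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a logarithmic merge scheme; NOT Lucene's LogMergePolicy implementation. */
final class MergeFactorSketch {

    /**
     * Flushes {@code flushes} equal-sized segments one at a time. After each flush,
     * the newest {@code mergeFactor} segments are merged into one whenever they all
     * have the same size. Returns the resulting segment sizes, oldest first.
     */
    static List<Integer> segmentSizes(int flushes, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        List<Integer> sizes = new ArrayList<>();
        for (int i = 0; i < flushes; i++) {
            sizes.add(1); // a newly flushed segment
            while (tailIsMergeable(sizes, mergeFactor)) {
                int merged = 0;
                for (int j = 0; j < mergeFactor; j++) {
                    merged += sizes.remove(sizes.size() - 1);
                }
                sizes.add(merged); // the merged segment may trigger a cascading merge
            }
        }
        return sizes;
    }

    /** True when the newest mergeFactor segments all have the same size. */
    private static boolean tailIsMergeable(List<Integer> sizes, int mergeFactor) {
        if (sizes.size() < mergeFactor) {
            return false;
        }
        int last = sizes.get(sizes.size() - 1);
        for (int j = 2; j <= mergeFactor; j++) {
            if (!sizes.get(sizes.size() - j).equals(last)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(segmentSizes(25, 10)); // many small segments remain
        System.out.println(segmentSizes(25, 2));  // fewer segments, but more merge work
    }
}
```

In this model a low mergeFactor leaves fewer segments to search at the cost of merging more often, while a high mergeFactor does less merge work during indexing but leaves more segments behind.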

MergeScheduler

• Responsible for performing the merge

• Two Implementations:
  – Serial (SerialMergeScheduler) - blocking
  – Concurrent (ConcurrentMergeScheduler) - new, background

Optimize

• Optimize is the process of merging segments down into a single segment

• This process can yield significant speedups in search

• Can be slow
• Can also do partial optimizes

Final Thoughts On Merging

• Usually don’t have to think about it, except when to optimize

• In high-update, performance-critical environments, you may need to dig into it more, as merging can sometimes cause long pauses
• Good to optimize when you can; otherwise, keep a low mergeFactor

Deletion

• A deletion only marks the Document as deleted
  – Doesn’t get physically removed until a merge
• Deletions can be a bit confusing
  – Both IndexReader and IndexWriter have delete methods
    • By: id, term(s), Query(s)

Task

– Build your index from yesterday and then try some deletes
  • Id, term, Query
– Also try out an optimize on an FSDirectory against the full Reuters sample

– 15-20 minutes

Updates

• Updates are always a delete and an add

• Updates are always a delete and an add
  – Yes, that is a repeat!
  – Nature of data structures used in search

• See IndexWriter.updateDocument()
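The delete-then-add behaviour can be sketched with a toy single-segment model. This is a teaching aid under simplifying assumptions, not Lucene's data structures: ToySegment and its methods are hypothetical names, documents are plain strings, and a "merge" simply rewrites the segment. It shows why a deleted Document still occupies a slot until a merge, and why internal ids can change afterwards.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy single-segment index; a teaching model, not Lucene's data structures. */
final class ToySegment {
    private final List<String> docs = new ArrayList<>(); // list position == internal doc id
    private final BitSet deleted = new BitSet();

    /** Adds a document and returns its internal id. */
    int add(String doc) {
        docs.add(doc);
        return docs.size() - 1;
    }

    /** Marks the document as deleted; nothing is physically removed yet. */
    void delete(int id) {
        if (id < 0 || id >= docs.size()) {
            throw new IndexOutOfBoundsException("no such doc id: " + id);
        }
        deleted.set(id);
    }

    /** An update is a delete of the old version followed by an add of the new one. */
    int update(int id, String newDoc) {
        delete(id);
        return add(newDoc);
    }

    /** Slots in use, including deleted ones (compare IndexReader.maxDoc()). */
    int maxDoc() {
        return docs.size();
    }

    /** Live documents only (compare IndexReader.numDocs()). */
    int numDocs() {
        return docs.size() - deleted.cardinality();
    }

    /** Returns the document stored under the given internal id. */
    String doc(int id) {
        return docs.get(id);
    }

    /** Rewrites the segment without deleted docs; surviving ids are renumbered. */
    ToySegment merge() {
        ToySegment merged = new ToySegment();
        for (int id = 0; id < docs.size(); id++) {
            if (!deleted.get(id)) {
                merged.add(docs.get(id));
            }
        }
        return merged;
    }
}
```

After an update, maxDoc() grows by one while numDocs() stays the same; only merge() reclaims the slot, and it shifts the ids of the documents that followed the deleted one.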

Performance Factors

• setRAMBufferSizeMB
  – New model for automagically controlling indexing factors based on the amount of memory in use
  – Obsoletes setMaxBufferedDocs
• maxBufferedDocs
  – Minimum # of docs buffered in memory before they are flushed as a new segment
  – Usually, Larger == faster, but more RAM

More Factors

• mergeFactor
  – How often segments are merged
  – Smaller == less RAM, better for incremental updates
  – Larger == faster, better for batch indexing
• maxFieldLength
  – Limit the number of terms indexed per Field in a Document
• Analysis
• Reuse
  – Document, TokenStream, Token

Index Threading

• IndexWriter and IndexReader are thread-safe and can be shared between threads without external synchronization

• One open IndexWriter per Directory

• Parallel Indexing
  – Index to separate Directory instances
  – Merge using IndexWriter.addIndexes
  – Could also distribute and collect

Benchmarking Indexing

• contrib/benchmark
• Try out different algorithms between Lucene 2.2 and 2.3
  – contrib/benchmark/conf:
    • indexing.alg
    • indexing-multithreaded.alg
• Info:
  – Mac Pro 2 x 2GHz Dual-Core Xeon
  – 4 GB RAM
  – ant run-task -Dtask.alg=./conf/indexing.alg -Dtask.mem=1024M

Benchmarking Results

                Records/Sec   Avg. T Mem
2.2             421           39M
Trunk           2,122         52M
Trunk-mt (4)    3,680         57M

Your results will depend on analysis, etc.

Searching

• Earlier we touched on basics of search using the QueryParser

• Now look at:
  – Searcher/IndexReader Lifecycle
  – Query classes
  – More details on the QueryParser
  – Filters
  – Sorting

Lifecycle

• Recall that the IndexReader loads a snapshot of the index into memory
  – This means updates made since loading the index will not be seen
• Business rules are needed to define how often to reload the index, if at all
  – IndexReader.isCurrent() can help
• Loading an index is an expensive operation
  – Do not open a Searcher/IndexReader for every search

Reopen

• It is possible to have an IndexReader reopen only new or changed segments (see IndexReader.reopen())
  – Saves some of the cost of loading a new index
• Does not close the old reader, so the application must

• See DeletionsUpdatesTest.testReopen()

Query Classes

• TermQuery is the basis for all non-span queries
• BooleanQuery combines multiple Query instances as clauses
  – should
  – required
• PhraseQuery finds terms occurring near each other, position-wise
  – “slop” is an edit distance measured in position moves of the query terms (0 means an exact phrase)
• Take 2-3 minutes to explore Query implementations

Spans

• Spans provide information about where matches took place

• Not supported by the QueryParser

• Can be used in BooleanQuery clauses

• Take 2-3 minutes to explore SpanQuery classes
  – SpanNearQuery is useful for doing phrase matching

QueryParser

• MultiFieldQueryParser
• Boolean operators cause confusion
  – Better to think in terms of required (+ operator) and not allowed (- operator)
• Check JIRA for QueryParser issues
• http://www.gossamer-threads.com/lists/lucene/java-user/40945

• Most applications either modify QP, create their own, or restrict to a subset of the syntax

• Your users may not need all the “flexibility” of the QP

Sorting

• Lucene default sort is by score
• Searcher has several methods that take in a Sort object
• Sorting should be addressed during indexing
• Sorting is done on Fields containing a single term that can be used for comparison
• The SortField defines the different sort types available
  – AUTO, STRING, INT, FLOAT, CUSTOM, SCORE, DOC

Sorting II

• Look at Searcher, Sort and SortField

• Custom sorting is done with a SortComparatorSource

• Sorting can be very expensive
  – Terms are cached in the FieldCache

Filters

• Filters restrict the search space to a subset of Documents

• Use Cases
  – Search within a Search
  – Restrict by date
  – Rating
  – Security
  – Author

Filter Classes

• QueryWrapperFilter (QueryFilter)
  – Restrict to subset of Documents that match a Query
• RangeFilter
  – Restrict to Documents that fall within a range
  – Better alternative to RangeQuery
• CachingWrapperFilter
  – Wrap another Filter and provide caching
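Conceptually, a filter is a set of permitted document ids that is intersected with the query matches. The sketch below models that idea with java.util.BitSet. It does not use Lucene's Filter API: FilterSketch, yearRange, restrict and Cached are hypothetical names, and the per-document "year" array stands in for an indexed field. The cache here holds a single entry, which is a simplification of what a caching wrapper would need for multiple readers.

```java
import java.util.BitSet;

/** Toy filter model: a filter is a set of permitted doc ids. Not Lucene's Filter API. */
final class FilterSketch {

    /** Hypothetical range filter over a per-document "year" value (inclusive bounds). */
    static BitSet yearRange(int[] yearByDoc, int from, int to) {
        BitSet bits = new BitSet(yearByDoc.length);
        for (int doc = 0; doc < yearByDoc.length; doc++) {
            if (yearByDoc[doc] >= from && yearByDoc[doc] <= to) {
                bits.set(doc);
            }
        }
        return bits;
    }

    /** Restricts the query matches to the documents the filter permits. */
    static BitSet restrict(BitSet queryMatches, BitSet filter) {
        BitSet result = (BitSet) queryMatches.clone(); // leave both inputs untouched
        result.and(filter);
        return result;
    }

    /** Computes the range filter's bits once and reuses them on later calls. */
    static final class Cached {
        private final int[] yearByDoc;
        private final int from;
        private final int to;
        private BitSet bits;
        int computations; // how often the underlying filter was actually evaluated

        Cached(int[] yearByDoc, int from, int to) {
            this.yearByDoc = yearByDoc;
            this.from = from;
            this.to = to;
        }

        BitSet bits() {
            if (bits == null) {
                bits = yearRange(yearByDoc, from, to);
                computations++;
            }
            return bits;
        }
    }
}
```

Because the filter is just a bit set, applying it to a second or third query costs only an intersection, which is why caching a frequently used filter (date ranges, security groups) tends to pay off.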

Task

• Modify your program to sort by a field and to filter by a query or some other criteria
  – ~15 minutes

Searchers

• MultiSearcher
  – Search over multiple Searchables, including remote
• MultiReader
  – Not a Searcher, but can be used with IndexSearcher to achieve the same results for local indexes
• ParallelMultiSearcher
  – Like MultiSearcher, but threaded
• RemoteSearchable
  – RMI-based remote searching
• Look at MultiSearcherTest in example code

Expert Results

• Searcher has several “expert” methods

• HitCollector allows low-level access to all Documents as they are scored

Search Performance

• Search speed is based on a number of factors:
  – Query Type(s)
  – Query Size
  – Analysis
  – Occurrences of Query Terms
  – Optimize
  – Index Size
  – Index type (RAMDirectory, other)
  – Usual Suspects
    • CPU
    • Memory
    • I/O
    • Business Needs

Query Types

• Be careful with WildcardQuery as it rewrites to a BooleanQuery containing all the terms that match the wildcards

• Avoid starting a WildcardQuery with a wildcard
• Use ConstantScoreRangeQuery instead of RangeQuery
• Be careful with range queries and dates
  – User mailing list and Wiki have useful tips for optimizing date handling
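The cost of wildcard rewriting can be illustrated with a toy term dictionary. This sketch is not Lucene's WildcardQuery code: it assumes the dictionary is simply a sorted set of strings, handles only the "prefix*" and "*suffix" cases, and uses made-up names (WildcardSketch, Expansion). Each matching term would become one clause of the rewritten BooleanQuery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Toy wildcard expansion over a sorted term dictionary; not Lucene's WildcardQuery. */
final class WildcardSketch {

    /** Result of an expansion: the matching terms and how many terms were visited. */
    static final class Expansion {
        final List<String> clauses = new ArrayList<>();
        int examined;
    }

    /** Expands "prefix*": seeks to the prefix and stops at the first non-matching term. */
    static Expansion expandPrefix(TreeSet<String> dictionary, String prefix) {
        Expansion e = new Expansion();
        for (String term : dictionary.tailSet(prefix)) {
            e.examined++;
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can match
            }
            e.clauses.add(term);
        }
        return e;
    }

    /** Expands "*suffix": sort order does not help, so every term is visited. */
    static Expansion expandSuffix(TreeSet<String> dictionary, String suffix) {
        Expansion e = new Expansion();
        for (String term : dictionary) {
            e.examined++;
            if (term.endsWith(suffix)) {
                e.clauses.add(term);
            }
        }
        return e;
    }

    public static void main(String[] args) {
        TreeSet<String> dictionary = new TreeSet<>(
                Arrays.asList("apache", "index", "lucene", "lucid", "luke", "solr"));
        Expansion prefix = expandPrefix(dictionary, "luc");
        Expansion suffix = expandSuffix(dictionary, "ene");
        System.out.println(prefix.clauses + " after examining " + prefix.examined + " terms");
        System.out.println(suffix.clauses + " after examining " + suffix.examined + " terms");
    }
}
```

Two costs show up even in this small model: a leading wildcard forces a scan of the whole dictionary, and a broad pattern on a real index can expand into a very large number of clauses (BooleanQuery enforces a maximum clause count for that reason).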

Query Size

• Stopword removal

• Search an “all” field instead of many fields with the same terms

• Disambiguation
  – May be useful when doing synonym expansion
  – Difficult to automate and may be slower
  – Some applications may allow the user to disambiguate
• Relevance Feedback/More Like This
  – Use most important words
  – “Important” can be defined in a number of ways

Usual Suspects

• CPU
  – Profile your application
• Memory
  – Examine your heap size, garbage collection approach
• I/O
  – Cache your Searcher
    • Define business logic for refreshing based on indexing needs
  – Warm your Searcher before going live -- See Solr
• Business Needs
  – Do you really need to support Wildcards?
  – What about date range queries down to the millisecond?
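The millisecond question matters because every distinct indexed value is a distinct term, and a range query has to visit the terms inside the range. A minimal sketch, assuming timestamps are indexed as truncated long values (DateResolutionSketch and distinctTerms are hypothetical names; Lucene's DateTools offers comparable resolutions for real indexes):

```java
import java.util.HashSet;
import java.util.Set;

/** Counts how many distinct terms a set of timestamps yields at a given resolution. */
final class DateResolutionSketch {
    static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    /** Truncates each timestamp to the resolution and counts the distinct results. */
    static int distinctTerms(long[] timestamps, long resolutionMillis) {
        Set<Long> terms = new HashSet<>();
        for (long ts : timestamps) {
            terms.add(Math.floorDiv(ts, resolutionMillis));
        }
        return terms.size();
    }

    public static void main(String[] args) {
        // 1,000 documents, one roughly every ten minutes (about a week of data)
        long[] timestamps = new long[1000];
        for (int i = 0; i < timestamps.length; i++) {
            timestamps[i] = i * 600_001L;
        }
        System.out.println("millisecond terms: " + distinctTerms(timestamps, 1));
        System.out.println("day terms:         " + distinctTerms(timestamps, MILLIS_PER_DAY));
    }
}
```

With this made-up data, millisecond resolution produces one term per document while day resolution produces a handful, so a range query over the same period touches far fewer terms. Whether day resolution is acceptable is a business decision, which is the point of the slide.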

FieldSelector

• Prior to version 2.1, Lucene always loaded all Fields in a Document

• FieldSelector API addition allows Lucene to skip large Fields
  – Options: Load, Lazy Load, No Load, Load and Break, Load for Merge, Size, Size and Break

• Makes storage of original content more viable without large cost of loading it when not used

• FieldSelectorTest in example code

Relevance

• At some point along your journey, you will get results that you think are “bad”

• Is it a big deal?
  – Content, Content, Content!
  – Relevance Judgments
  – Don’t break other queries just to “fix” one
• Hardcode it!
  – A query doesn’t always have to result in a “search”

Scoring and Similarity

• Lucene has a sophisticated scoring mechanism designed to meet most needs
• Has hooks for modifying scores
• Scoring is handled by the Query, Weight and Scorer classes

Explanations

• explain(Query, int) method is useful for understanding why a Document scored the way it did

• Shows all the pieces that went into scoring the result:
  – TF, DF, boosts, etc.
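To help read an explanation, the sketch below recomputes the main pieces for a single-term query. The formulas are the ones documented for Lucene 2.x's DefaultSimilarity as best understood here (tf = sqrt(freq), idf = ln(numDocs / (docFreq + 1)) + 1, lengthNorm = 1 / sqrt(numTerms)); treat them as an assumption and check them against your version. The sketch omits coord and queryNorm and ignores the lossy one-byte encoding of norms, so its numbers will not match explain() output exactly. ScoreSketch and termScore are hypothetical names.

```java
/** Recomputes TF/IDF score pieces for one term; a simplified sketch, not Lucene's Scorer. */
final class ScoreSketch {

    /** Term frequency factor: grows with the square root of the raw count. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Inverse document frequency: rarer terms score higher. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /** Length normalization: shorter fields score higher. */
    static double lengthNorm(int numTermsInField) {
        return 1.0 / Math.sqrt(numTermsInField);
    }

    /** Single-term score without coord and queryNorm: tf * idf^2 * boost * norm. */
    static double termScore(int freq, int docFreq, int numDocs, int numTermsInField, double boost) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf * boost * lengthNorm(numTermsInField);
    }

    public static void main(String[] args) {
        // A term occurring 4 times in a 16-term field, found in 9 of 10 documents
        System.out.println("tf    = " + tf(4));
        System.out.println("idf   = " + idf(9, 10));
        System.out.println("norm  = " + lengthNorm(16));
        System.out.println("score = " + termScore(4, 9, 10, 16, 1.0));
    }
}
```

Comparing these pieces across two documents usually shows quickly whether a surprising ranking comes from term frequency, a rare term, field length or a boost.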

Tuning Relevance

• FunctionQuery from Solr (variation in Lucene)

• Override Similarity
• Implement own Query and related classes
• Payloads
• Boosts

Task

• Open Luke and try some queries and then use the “explain” button

• Or, write some code to do explains on a query and some documents

• See how Query type, boosting, other factors play a role in the score

Terms and Term Vectors

• Sometimes you need access to the Term Dictionary:
  – Auto suggest
  – Frequency information
• Sometimes you need a Document-centric view of terms, frequencies, positions and offsets
  – Term Vectors

Term Information

• TermEnum gives access to terms and how many Documents they occur in
  – IndexReader.terms()
• TermDocs gives access to the frequency of a term in a Document
  – IndexReader.termDocs()
  – TermPositions extends TermDocs and provides access to position and payload info
    • IndexReader.termPositions()
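The three views above map onto the layers of an inverted index. The following toy postings structure is only a model of that idea (ToyPostings and its methods are hypothetical names, and tokenization is plain whitespace splitting): docFreq corresponds to what TermEnum reports, freq to TermDocs, and positions to TermPositions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

/** Toy inverted index: term -> (doc id -> positions). Not Lucene's file format. */
final class ToyPostings {
    private final TreeMap<String, TreeMap<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Indexes whitespace-separated, lowercased tokens with their positions. */
    void addDocument(int docId, String text) {
        String[] tokens = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].isEmpty()) {
                continue; // empty input text
            }
            postings.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Number of documents containing the term (the TermEnum-level view). */
    int docFreq(String term) {
        TreeMap<Integer, List<Integer>> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    /** Positions of the term within one document (the TermPositions-level view). */
    List<Integer> positions(String term, int docId) {
        TreeMap<Integer, List<Integer>> docs = postings.get(term);
        if (docs == null || !docs.containsKey(docId)) {
            return Collections.emptyList();
        }
        return docs.get(docId);
    }

    /** Occurrences of the term within one document (the TermDocs-level view). */
    int freq(String term, int docId) {
        return positions(term, docId).size();
    }
}
```

Because the outer map is sorted by term, the same structure also supports dictionary-style uses such as auto suggest by seeking to a prefix.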

Term Vectors

• Term Vectors give access to term frequency information in a given Document
  – IndexReader.getTermFreqVector

• TermVectorMapper provides callbacks for working with Term Vectors

TermsTest

• Provides samples of working with terms and term vectors

Lunch ?

1-2:30

Recap

• Indexing
• Searching
• Performance
• Odds and Ends
  – Explains
  – FieldSelector
  – Relevance
  – Terms and Term Vectors

Class Project

• Your chance to really dig in and get your hands dirty

• Ask Questions
• Options…

Option I

• Start building out your Lucene Application!
  – Index your Data (or any data)
    • Threading/Updates/Deletions
    • Analysis
  – Search
    • Caching/Warming
    • Dealing with Updates
    • Multi-threaded
  – Display

Option II

• Dig deeper into an area of interest
  – Performance
    • How fast can you index?
    • Search? Queries per Second?
  – Analysis
  – Query Parsing
  – Scoring
  – Contrib

Option III

• Dig into JIRA issues and find something to fix in Lucene

• https://issues.apache.org/jira/secure/Dashboard.jspa

• http://wiki.apache.org/lucene-java/HowToContribute

Option IV

• Try out Solr
• http://lucene.apache.org/solr

Option V

• Other?
  – Architecture Review/Discussion
  – Use Case Discussion

Project Post-Mortem

• Volunteers to share?

Open Discussion

• Multilingual Best Practices
  – UNICODE
  – One Index versus many
• Advanced Analysis
• Distributed Lucene
• Crawling
• Hadoop
• Nutch
• Solr

Resources

• [email protected]
• Lucid Imagination
  – Support
  – Training
  – Value Add
  – [email protected]

Finally…

• Please take the time to fill out a survey to help me improve this training
  – Located in base directory of source
  – Email it to me at [email protected]
• There are several Lucene-related talks on Wednesday