October 11-14, 2016 • Boston, MA
Solr Highlighting at Full Speed
Timothy M. Rodriguez, Verticals Search Team Lead, Bloomberg
David Smiley, Search Developer/Consultant
Agenda
§ Legal Search
§ Business Requirements
§ Highlighters Overview
§ Improving the Standard Highlighter
§ Unified Highlighter
§ Questions
Bloomberg Law
§ Suite of legal business tools for lawyers and legal professionals
§ Business development
§ Drafting
§ Analytics
§ Search
Legal Search
§ Recall Matters
§ Large Documents
§ Citizens United is 130 pages long
§ Some in the 100s of MB
§ Researchers rely on highlighting to help them decide if they should read a document
Requirements
Accuracy
§ Legal users issue detailed searches
§ “cafeteria plan” AND tax
§ Custom Span Queries
§ “insurance fraud” /s conviction
“Just Right” Digest Sizing
Full Document Highlighting
Zone Highlighting
Speed
Solr Highlighters Overview
Highlighter                   Offset Source                        Accuracy  Speed
Default/Standard Highlighter  Analysis                             Better    Slowest
Default/Standard Highlighter  Term Vectors                         Better    Slow
Fast Vector Highlighter       Term Vectors                         Good      Medium
Postings Highlighter          Postings (+ Analysis for wildcards)  Okay      Fast*
* But poor wildcard performance
Offset Source & Index Size
§ Analysis requires no extra data on-disk
§ But analyzing text on the fly is expensive
§ Term vectors are heavy
§ Adding offsets to postings is much lighter than TVs
[Chart: index size of each component, in multiples of the stored value (axis 0 to 3). Components: Stored Value, Terms, Positions, Offsets, TV Terms, TV Positions, TV Offsets.]
Initial Attempt
§ Chose the default highlighter and added customizations as needed
§ Added payload support to the MemoryIndex - LUCENE-6155 v5.0
§ We investigated using the PostingsHighlighter and FastVectorHighlighter but the accuracy trade-offs were not acceptable for our users
§ Ran into performance problems as highlighting was taking the bulk of our execution time
Make it Faster
§ Improvements to the Standard Highlighter
§ Fast uninverting of term vectors to a token stream – LUCENE-6031 v5.0 (remove expensive sort)
§ Rely on Term Vectors for phrase highlighting instead of analyzing into the MemoryIndex – LUCENE-6034 v5.0
Still not fast enough…
Multithreaded Highlighting
§ Highlighting each returned doc is easily parallelizable
§ Greatly improved performance
§ Greatly increased memory consumption
Still a sizeable fraction of our query times…
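A minimal sketch of per-document parallelism, using only the JDK. The class `ParallelHighlightSketch` and its `highlight` method are hypothetical stand-ins for a real highlighter; this is not the Bloomberg implementation or a Solr API. Each returned document is submitted to a fixed thread pool, and results are collected in the original order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: names here are illustrative, not Solr or Lucene APIs.
class ParallelHighlightSketch {

    // Stand-in for a real highlighter: wraps each occurrence of term in <em> tags.
    static String highlight(String docText, String term) {
        return docText.replace(term, "<em>" + term + "</em>");
    }

    // Highlights every document on a fixed-size pool, preserving input order.
    static List<String> highlightAll(List<String> docs, String term, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String doc : docs) {
                Callable<String> task = () -> highlight(doc, term);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // blocks until that document is done
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> docs = List.of("tax on a cafeteria plan", "no match here");
        System.out.println(highlightAll(docs, "tax", 2));
    }
}
```

In a design like this, every in-flight task holds its own document text and intermediate structures at the same time, which is one plausible reason for the memory growth noted on this slide.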
Re-evaluating
§ Didn’t look like we could make the Standard Highlighter much faster
§ Perhaps we could federate to one of the highlighters based on the query?
§ Our customizations would have to be ported to each of the highlighters
§ Work would need to be repeated 3x
§ Increased disk utilization from adding postings to the main index
Enhance the Postings Highlighter?
§ Fastest of the bunch
§ Add accuracy at least as good as the standard highlighter
§ Add support for the other offset sources too
§ (supports our full-doc-highlighting use-case)
§ But it’s a big job with major internal highlighter surgery…
Let’s do it!
Offsets Overview
§ Getting character offsets is key to highlighting. 3 ways:
§ Analysis:
§ Analyzer → TokenStream → OffsetAttribute, oa.startOffset()
§ Term vectors:
§ IndexReader.getTermVector(docId, field) → Terms → TermsEnum, te.postings(…, PostingsEnum.OFFSETS) → PostingsEnum → pe.startOffset()
§ Postings:
§ IndexReader → LeafReader → Terms → … (see above)
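To illustrate what a character offset is in the analysis case, here is a minimal JDK-only sketch. It assumes a simple regex word tokenizer in place of a Lucene Analyzer; the class `OffsetSketch` and the method `offsetsOf` are hypothetical and do not appear in Lucene. For each token that equals the query term, it records the start (inclusive) and end (exclusive) offsets in the original text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: a regex tokenizer stands in for a Lucene Analyzer.
class OffsetSketch {

    // Returns [startOffset, endOffset) pairs for every token equal to queryTerm.
    static List<int[]> offsetsOf(String text, String queryTerm) {
        List<int[]> offsets = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+").matcher(text);
        while (m.find()) {
            // Lowercase each token, roughly what a lowercasing analyzer would do.
            String token = m.group().toLowerCase(Locale.ROOT);
            if (token.equals(queryTerm)) {
                offsets.add(new int[] {m.start(), m.end()});
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        for (int[] o : offsetsOf("Tax law and tax code", "tax")) {
            System.out.println(o[0] + "-" + o[1]);
        }
    }
}
```

Term vectors and postings with offsets store these same start/end pairs at index time, so the text does not have to be re-tokenized at query time.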
PostingsHighlighter Algorithm
1. Fetches all stored-value content needed up-front
2. Highlight in field sorted order, then doc sorted order loop:
   1. Get PostingsEnum from a Terms for each query term
   2. MTQs: Fake PostingsEnum around filtered TokenStream
   3. Process PostingsEnum[ ] into Passage[ ]
      java.text.BreakIterator: for passage delineation
      PassageScorer: for passage scoring (BM25 default)
   4. PassageFormatter: for formatting/mark-up
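The passage steps above can be sketched with the JDK alone. This is a simplified illustration, not the PostingsHighlighter code: `PassageSketch` and `bestPassage` are hypothetical names, match offsets are passed in directly instead of coming from a PostingsEnum, and the score is a plain match count standing in for the BM25-based PassageScorer. `java.text.BreakIterator` is the real JDK class used for passage delineation.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical sketch of passage delineation, scoring and formatting.
class PassageSketch {

    // matches: [start, end) character offsets, sorted by start and non-overlapping.
    // Returns the sentence containing the most matches, with each match wrapped
    // in <b> tags, or "" when no sentence contains a match.
    static String bestPassage(String content, int[][] matches) {
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.ROOT);
        bi.setText(content);
        int bestStart = -1, bestEnd = -1, bestScore = 0;
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            int score = 0; // simplified score: number of matches in this sentence
            for (int[] m : matches) {
                if (m[0] >= start && m[1] <= end) score++;
            }
            if (score > bestScore) {
                bestScore = score;
                bestStart = start;
                bestEnd = end;
            }
        }
        if (bestScore == 0) return "";
        // Formatting step: copy the passage, marking up each match inside it.
        StringBuilder sb = new StringBuilder();
        int pos = bestStart;
        for (int[] m : matches) {
            if (m[0] >= bestStart && m[1] <= bestEnd) {
                sb.append(content, pos, m[0]).append("<b>")
                  .append(content, m[0], m[1]).append("</b>");
                pos = m[1];
            }
        }
        sb.append(content, pos, bestEnd);
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String content = "The court met. The tax plan is a tax shelter. It ended.";
        System.out.println(bestPassage(content, new int[][] {{19, 22}, {33, 36}}));
    }
}
```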
UnifiedHighlighter
§ Forked PH; given new name agnostic of offset source
§ Mostly same PH API; internals re-arranged and expanded
§ Solr adapter is nearly identical too
§ Untouched: Passage, PassageScorer, PassageFormatter
§ Re-uses some standard-Highlighter code too:
§ WeightedSpanTermExtractor (for phrase accuracy)
§ TokenStreamFromTermVector (for wildcards/MTQs)
UH: Accurate Phrases (including any SpanQuery)
§ Convert position-sensitive Queries to SpanQueries
§ Re-use WeightedSpanTermExtractor (WSTE) for this
§ Wrap PostingsEnum for position-sensitive words with one that filters by position-span extracted from span queries
§ Custom: WSTE is not used for this, although it’s similar
§ Note: not 100% accurate with query but very good
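The position-filtering idea can be sketched as follows, assuming an exact phrase with no slop. `PhraseFilterSketch`, `phraseSpans` and `filterBySpans` are hypothetical names; the real highlighter wraps a PostingsEnum and derives spans from span queries, which this JDK-only sketch does not attempt. It first finds the position spans where the phrase terms occur consecutively, then keeps only the occurrences of a term that fall inside one of those spans.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: filter term positions by phrase position spans.
class PhraseFilterSketch {

    // termPositions.get(i) holds the positions of the i-th phrase term in one document.
    // Returns [start, end] position spans where the terms occur consecutively.
    static List<int[]> phraseSpans(List<List<Integer>> termPositions) {
        List<int[]> spans = new ArrayList<>();
        for (int start : termPositions.get(0)) {
            boolean ok = true;
            for (int i = 1; i < termPositions.size() && ok; i++) {
                ok = termPositions.get(i).contains(start + i);
            }
            if (ok) {
                spans.add(new int[] {start, start + termPositions.size() - 1});
            }
        }
        return spans;
    }

    // Keeps only the positions that fall inside at least one span.
    static List<Integer> filterBySpans(List<Integer> positions, List<int[]> spans) {
        List<Integer> kept = new ArrayList<>();
        for (int p : positions) {
            for (int[] span : spans) {
                if (p >= span[0] && p <= span[1]) {
                    kept.add(p);
                    break;
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Integer> cafeteria = List.of(2, 9); // positions of "cafeteria"
        List<Integer> plan = List.of(3, 15);     // positions of "plan"
        List<int[]> spans = phraseSpans(List.of(cafeteria, plan));
        System.out.println(filterBySpans(plan, spans));
    }
}
```

With the sample positions, only the occurrence of "plan" that directly follows "cafeteria" survives the filter, so a stray "plan" elsewhere in the document would not be highlighted for the phrase query.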
UH: Analysis Offset Source
§ The most difficult offset source…
§ Honor positionIncrementGap for multi-valued data
§ Populates a MemoryIndex when query has phrases
§ But smartly filters irrelevant terms! (new trick)
§ Wildcards/MTQs too? Uninvert MemoryIndex with re-used TokenStreamFromTermVector
§ If just terms, treat them like wildcards to avoid MemoryIndex usage
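Two of the points above, honoring the position increment gap and skipping terms the query does not need, can be illustrated with a small JDK-only sketch. It assumes whitespace tokenization and a simplified position model in which the gap is added before every value after the first; `AnalysisFilterSketch` and `positionsOfQueryTerms` are hypothetical names, and no MemoryIndex is involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch: record positions only for terms the query cares about.
class AnalysisFilterSketch {

    static Map<String, List<Integer>> positionsOfQueryTerms(
            List<String> values, Set<String> queryTerms, int positionIncrementGap) {
        Map<String, List<Integer>> positions = new TreeMap<>();
        int position = -1;
        boolean firstValue = true;
        for (String value : values) {
            if (!firstValue) {
                // Assumed model: the gap separates consecutive values of the field.
                position += positionIncrementGap;
            }
            firstValue = false;
            for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (token.isEmpty()) continue;
                position++;
                // Irrelevant terms still advance the position but are not stored.
                if (queryTerms.contains(token)) {
                    positions.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
                }
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        List<String> values = List.of("insurance fraud", "fraud conviction");
        System.out.println(positionsOfQueryTerms(values, Set.of("fraud"), 100));
    }
}
```

Because the gap pushes the second value's positions far from the first, a phrase query cannot accidentally match across two values of a multi-valued field.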
UH: Postings Plus Light TVs
§ Postings offset source is great, but not for MTQs (wildcards)
§ MTQs need to see all terms in just the document
§ A plain term vector (no offsets or positions) has that!
§ Trick:
§ Wrap the main index with a term vector’s TermsEnum
§ Then TokenStreamFromTermVector for MTQ
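The trick can be approximated in a JDK-only sketch under loud assumptions: the document's "light" term vector is modeled as a sorted set of its terms, the main index postings as a map from term to document ID to offsets, and the wildcard supports only `*`. `LightTermVectorSketch` and its methods are hypothetical and do not mirror the Lucene classes named above. The term vector tells us which terms in this one document match the wildcard; the postings then supply the offsets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical sketch: terms from a light term vector, offsets from postings.
class LightTermVectorSketch {

    // Converts a wildcard that uses only '*' into a regex pattern.
    static Pattern wildcardToPattern(String wildcard) {
        return Pattern.compile(Pattern.quote(wildcard).replace("*", "\\E.*\\Q"));
    }

    // docTerms: terms present in the document (the "light" term vector).
    // postings: term -> docId -> [start, end) offsets in that document.
    static List<int[]> matchingOffsets(SortedSet<String> docTerms,
            Map<String, Map<Integer, int[][]>> postings, int docId, String wildcard) {
        Pattern p = wildcardToPattern(wildcard);
        List<int[]> result = new ArrayList<>();
        for (String term : docTerms) {
            if (!p.matcher(term).matches()) continue;
            Map<Integer, int[][]> byDoc = postings.get(term);
            if (byDoc == null || !byDoc.containsKey(docId)) continue;
            for (int[] offset : byDoc.get(docId)) {
                result.add(offset);
            }
        }
        result.sort((int[] a, int[] b) -> Integer.compare(a[0], b[0]));
        return result;
    }

    public static void main(String[] args) {
        SortedSet<String> docTerms = new TreeSet<>(List.of("convicted", "conviction", "fraud"));
        Map<String, Map<Integer, int[][]>> postings = Map.of(
                "conviction", Map.of(7, new int[][] {{10, 20}}),
                "convicted", Map.of(7, new int[][] {{30, 39}}),
                "fraud", Map.of(7, new int[][] {{0, 5}}));
        for (int[] o : matchingOffsets(docTerms, postings, 7, "convict*")) {
            System.out.println(o[0] + "-" + o[1]);
        }
    }
}
```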
Benchmark Results
§ Unified Highlighter performed similarly or better than peers
§ Best performance: Postings with “light” Term Vectors
§ No use case for full term vectors anymore?
§ Caveats
§ Substantial variability in test runs (YMMV)
§ Depends on the specifics of your use case
§ Benchmark code available
Highlighter              Offset Source                Terms  Phrases  Wildcards
(search)                 N/A                          1.0x   1.0x     1.0x
Standard Highlighter     Analysis                     4.6x   4.7x     7.4x
Unified Highlighter      Analysis                     2.8x   2.4x     3.7x
Standard Highlighter     Term Vectors                 2.7x   2.3x     3.7x
Fast Vector Highlighter  Term Vectors                 1.8x   2.1x     2.6x
Unified Highlighter      Term Vectors                 1.7x   1.8x     2.3x
Postings Highlighter     Postings                     1.8x   1.5x     3.8x
Unified Highlighter      Postings                     1.6x   1.3x     3.8x
Unified Highlighter      Postings with Term Vectors*  1.5x   1.3x     2.2x
Times shown in multiples of the original search time (top row).
Future Potential Improvements
§ Accuracy
§ Switch from WSTE approach to SpanCollector API
§ Honor conjunctions “(X AND Y) OR Z”
§ Relevancy
§ Consider term diversity across top-X passages
§ Incorporate query boosts in passage scores
§ Support “requireFieldMatch=false”
Summary
§ Importance of highlighting in Legal Search
§ Overview of the existing Highlighters
§ Improvements to the Standard Highlighter
§ UnifiedHighlighter
§ Contributed to Lucene/Solr! LUCENE-7438
§ Your new favorite highlighter?
Questions?
Timothy M. Rodriguez, Verticals Search Team Lead, Bloomberg
@Timothy055
David Smiley, Search Developer/Consultant
@DavidWSmiley