Solr Highlighting at Full Speed: Presented by Timothy Rodriguez, Bloomberg & David Smiley, D W Smiley, LLC

October 11-14, 2016 • Boston, MA

Solr Highlighting at Full Speed

Timothy M. Rodriguez, Verticals Search Team Lead, Bloomberg

David Smiley, Search Developer/Consultant


Agenda

§  Legal Search

§  Business Requirements

§  Highlighters Overview

§  Improving the Standard Highlighter

§  Unified Highlighter

§  Questions


Bloomberg Law

§  Suite of legal business tools for lawyers and legal professionals

§  Business development

§  Drafting

§  Analytics

§  Search


Legal Search

§  Recall Matters

§  Large Documents

§  Citizens United is 130 pages long

§  Some in the 100s of MB

§  Researchers rely on highlighting to help them decide if they should read a document


Requirements


Accuracy

§  Legal users issue detailed searches

§  “cafeteria plan” AND tax

§  Custom Span Queries

§  “insurance fraud” /s conviction

“Just Right” Digest Sizing


Full Document Highlighting

Zone Highlighting


Speed


Solr Highlighters Overview

Highlighter                   Offset Source                        Accuracy  Speed
Default/Standard Highlighter  Analysis                             Better    Slowest
Default/Standard Highlighter  Term Vectors                         Better    Slow
Fast Vector Highlighter       Term Vectors                         Good      Medium
Postings Highlighter          Postings (+ Analysis for wildcards)  Okay      Fast*

* But poor wildcard performance

Offset Source & Index Size

§  Analysis requires no extra data on-disk

§  But analyzing text on the fly is expensive

§  Term vectors are heavy

§  Adding offsets to postings is much lighter than TVs

[Chart: index size in multiples of the stored value (axis 0 to 3) for Stored Value, Terms, Positions, Offsets, TV Terms, TV Positions, and TV Offsets]


Initial Attempt

§  Chose the default highlighter and added in customizations as needed

§  Added payload support to the MemoryIndex - LUCENE-6155 v5.0

§  We investigated using the PostingsHighlighter and FastVectorHighlighter but the accuracy trade-offs were not acceptable for our users

§  Ran into performance problems as highlighting was taking the bulk of our execution time


Make it Faster

§  Improvements to the Standard Highlighter

§  Fast uninverting of term vectors to a token stream – LUCENE-6031 v5.0 (remove expensive sort)

§  Rely on Term Vectors for phrase highlighting instead of analyzing into the MemoryIndex – LUCENE-6034 v5.0

Still not fast enough…


Multithreaded Highlighting

§  Highlighting each returned doc is easily parallelizable

§  Greatly improved performance

§  Greatly increased memory consumption
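The per-document fan-out described above can be sketched as follows. This is a minimal illustration in Python, not the Solr code from the talk: `highlight` and `highlight_all` are hypothetical names, and the regex-based marker stands in for a real highlighter. Every in-flight document is held in memory at once, which matches the memory trade-off noted on the slide.

```python
import re
from concurrent.futures import ThreadPoolExecutor

def highlight(text, terms):
    """Hypothetical stand-in highlighter: wrap whole-word term matches in <em>."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE)
    return pattern.sub(r"<em>\1</em>", text)

def highlight_all(docs, terms, workers=4):
    """Highlight each returned document on its own worker thread."""
    # Documents are independent of each other, so the work fans out cleanly.
    # pool.map yields results in input order, so result ranking is preserved.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda doc: highlight(doc, terms), docs))

docs = [
    "The cafeteria plan reduced tax liability.",
    "No match here.",
    "Tax rules apply.",
]
results = highlight_all(docs, ["tax", "cafeteria"])
```

In CPython, threads will not speed up CPU-bound work like this; the sketch only shows the structure of the approach.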

Still a sizeable fraction of our query times…


Re-evaluating

§  Didn’t look like we could make the Standard Highlighter much faster

§  Perhaps we could federate to one of the highlighters based on the query?

§  Our customizations would have to be ported to each of the highlighters

§  Work would need to be repeated 3x

§  Increased disk utilization from adding postings to the main index


Enhance the Postings Highlighter?

§  Fastest of the bunch

§  Add accuracy at least as good as the standard highlighter

§  Add support for the other offset sources too

§  (supports our full-doc-highlighting use-case)

§  But it’s a big job with major internal highlighter surgery…

Let’s do it!


Offsets Overview

§  Getting character offsets is key to highlighting. 3 ways:

§  Analysis:

§  Analyzer → TokenStream → OffsetAttribute, oa.startOffset()

§  Term vectors:

§  IndexReader.getTermVector(docId,field) → Terms → TermsEnum, te.postings(…, PostingsEnum.OFFSETS) → PostingsEnum → pe.startOffset()

§  Postings:

§  IndexReader → LeafReader → Terms → … (see above)
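To make the difference between the offset sources concrete, here is a small Python sketch. It does not use the Lucene API; `analyze` and `index_offsets` are hypothetical helpers. It contrasts re-analyzing the text at highlight time with looking up offsets that were recorded ahead of time, which is what term vectors and postings with offsets provide.

```python
import re
from collections import defaultdict

def analyze(text):
    """Yield (term, start_offset, end_offset) per token, as an analyzer would."""
    for m in re.finditer(r"\w+", text):
        yield m.group().lower(), m.start(), m.end()

def index_offsets(text):
    """Record the offsets of every term once, at 'index time'."""
    postings = defaultdict(list)
    for term, start, end in analyze(text):
        postings[term].append((start, end))
    return dict(postings)

text = "Insurance fraud led to a fraud conviction."

# Analysis source: re-tokenize at highlight time, touching every token.
from_analysis = [(s, e) for t, s, e in analyze(text) if t == "fraud"]

# Term vector / postings sources: look up offsets stored ahead of time.
stored = index_offsets(text)
from_index = stored["fraud"]
```

Both routes give the same offsets; the difference is whether the cost is paid per query (analysis) or once at index time in exchange for extra disk space.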

PostingsHighlighter Algorithm

1.  Fetches all stored-value content needed up-front

2.  Highlight in field sorted order, then doc sorted order loop:

    1.  Get PostingsEnum from a Terms for each query term

    2.  MTQs: Fake PostingsEnum around filtered TokenStream

    3.  Process PostingsEnum[ ] into Passage[ ]

        java.text.BreakIterator: for passage delineation

        PassageScorer: for passage scoring (BM25 default)

    4.  PassageFormatter: for formatting/mark-up
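The inner loop above (offsets, then passages, then scoring, then formatting) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the PostingsHighlighter implementation: a regex stands in for java.text.BreakIterator, offsets come from a regex scan rather than a PostingsEnum, and passages are scored by hit count instead of BM25. All function names are hypothetical.

```python
import re

def sentences(text):
    """Yield (start, end) passage spans; crude stand-in for BreakIterator."""
    for m in re.finditer(r"[^.!?]+[.!?]?\s*", text):
        yield m.start(), m.end()

def term_offsets(text, terms):
    """Stand-in for reading offsets from postings: (start, end) per match."""
    pattern = r"\b(?:" + "|".join(map(re.escape, terms)) + r")\b"
    return [(m.start(), m.end())
            for m in re.finditer(pattern, text, re.IGNORECASE)]

def best_passage(text, terms):
    """Assign offsets to passages and keep the highest-scoring passage."""
    offsets = term_offsets(text, terms)
    best = None
    for p_start, p_end in sentences(text):
        hits = [(s, e) for s, e in offsets if p_start <= s and e <= p_end]
        # Simplification: score by hit count only, not BM25.
        if hits and (best is None or len(hits) > len(best[2])):
            best = (p_start, p_end, hits)
    return best

def format_passage(text, passage, pre="<b>", post="</b>"):
    """Mark up each hit inside the passage."""
    p_start, p_end, hits = passage
    out, cursor = [], p_start
    for s, e in hits:
        out += [text[cursor:s], pre, text[s:e], post]
        cursor = e
    out.append(text[cursor:p_end])
    return "".join(out).strip()

text = "The court met. Insurance fraud led to a fraud conviction. Appeal denied."
snippet = format_passage(text, best_passage(text, ["fraud", "conviction"]))
```

Keeping delineation, scoring, and formatting as separate functions mirrors the pluggable pieces named on the slide.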

UnifiedHighlighter

§  Forked PH; given new name agnostic of offset source

§  Mostly same PH API; internals re-arranged and expanded

§  Solr adapter is nearly identical too

§  Untouched: Passage, PassageScorer, PassageFormatter

§  Re-uses some standard-Highlighter code too:

§  WeightedSpanTermExtractor (for phrase accuracy)

§  TokenStreamFromTermVector (for wildcards/MTQs)

UH: Accurate Phrases (including any SpanQuery)

§  Convert position-sensitive Queries to SpanQueries

§  Re-use WeightedSpanTermExtractor (WSTE) for this

§  Wrap PostingsEnum for position-sensitive words with one that filters by position-span extracted from span queries

§  Custom: WSTE is not used for this, although it’s similar

§  Note: not 100% accurate with query but very good
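The position-filtering idea can be illustrated with a short Python sketch. It is a conceptual model only, assuming an exact in-order phrase with no slop; `phrase_spans` and `filter_by_spans` are hypothetical names and do not correspond to Lucene classes. Occurrences of a phrase term that fall outside any matching span are dropped, so they are not highlighted.

```python
def phrase_spans(positions, phrase):
    """Return (start_pos, end_pos) for each exact, in-order phrase occurrence."""
    spans = []
    for p in positions.get(phrase[0], []):
        if all(p + i in positions.get(t, []) for i, t in enumerate(phrase)):
            spans.append((p, p + len(phrase) - 1))
    return spans

def filter_by_spans(term_positions, spans):
    """Keep only the positions that fall inside some matching span."""
    return [p for p in term_positions if any(a <= p <= b for a, b in spans)]

tokens = "insurance fraud led to a fraud conviction".split()
positions = {}
for pos, tok in enumerate(tokens):
    positions.setdefault(tok, []).append(pos)

# Query: the phrase "insurance fraud"
spans = phrase_spans(positions, ["insurance", "fraud"])
kept = filter_by_spans(positions["fraud"], spans)
```

Here "fraud" occurs at positions 1 and 5, but only position 1 is part of the phrase, so only that occurrence survives the filter.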

UH: Analysis Offset Source

§  The most difficult offset source…

§  Honor positionIncrementGap for multi-valued data

§  Populates a MemoryIndex when query has phrases

§  But smartly filters irrelevant terms! (new trick)

§  Wildcards/MTQs too? Uninvert MemoryIndex with re-used TokenStreamFromTermVector

§  If just terms, treat them like wildcards to avoid MemoryIndex usage
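Two of the points above, honoring the position gap between values and indexing only the terms the query needs, can be sketched in Python. This is an assumption-laden illustration rather than the UnifiedHighlighter code: `build_filtered_index` is a hypothetical helper, the gap of 100 is an arbitrary example value, and values are assumed to be joined by a single separator character when computing offsets.

```python
import re

def build_filtered_index(values, query_terms, position_gap=100):
    """Index only query terms, yet advance positions for every token."""
    index = {}  # term -> list of (position, start_offset, end_offset)
    wanted = {t.lower() for t in query_terms}
    position, char_base = 0, 0
    for i, value in enumerate(values):
        if i > 0:
            # The gap keeps phrases from matching across value boundaries.
            position += position_gap
        for m in re.finditer(r"\w+", value):
            term = m.group().lower()
            if term in wanted:
                index.setdefault(term, []).append(
                    (position, char_base + m.start(), char_base + m.end()))
            # Skipped terms still consume a position, so phrase
            # distances between the indexed terms stay correct.
            position += 1
        # Assumption: values are joined by one separator character.
        char_base += len(value) + 1
    return index

values = ["a case of insurance", "fraud was alleged"]
index = build_filtered_index(values, ["insurance", "fraud"])
```

Because "insurance" ends one value and "fraud" starts the next, their positions are far apart, so the phrase "insurance fraud" would not match across the two values.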

UH: Postings Plus Light TVs

§  Postings offset source is great, but not for MTQs (wildcards)

§  MTQs need to see all terms in just the document

§  A plain term vector (no offsets or positions) has that!

§  Trick:

§  Wrap the main index with a term vector’s TermsEnum

§  Then TokenStreamFromTermVector for MTQ
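The intent of the trick can be modeled in Python. This is a conceptual sketch, not how Lucene wraps a TermsEnum: the data structures and the names `expand_wildcard` and `wildcard_offsets` are hypothetical. The "light" term vector supplies only the list of terms in one document, which is enough to expand a wildcard cheaply; the offsets themselves still come from the main index's postings.

```python
from fnmatch import fnmatchcase

def expand_wildcard(doc_terms, pattern):
    """Match the pattern against this one document's terms only."""
    return sorted(t for t in doc_terms if fnmatchcase(t, pattern))

def wildcard_offsets(doc_id, pattern, term_vectors, postings):
    """Expand via the light term vector, then read offsets from postings."""
    offsets = []
    for term in expand_wildcard(term_vectors[doc_id], pattern):
        offsets.extend(postings[term].get(doc_id, []))
    return sorted(offsets)

# Hypothetical toy data. Offsets live in the main index's postings;
# each per-document term vector holds terms only ("light").
postings = {
    "convict":    {1: [(0, 7)]},
    "conviction": {0: [(31, 41)], 1: [(20, 30)]},
    "fraud":      {0: [(10, 15), (25, 30)]},
}
term_vectors = {
    0: {"conviction", "fraud"},
    1: {"convict", "conviction"},
}

hits_doc1 = wildcard_offsets(1, "convict*", term_vectors, postings)
hits_doc0 = wildcard_offsets(0, "convict*", term_vectors, postings)
```

The wildcard is only ever compared against the handful of terms in the document being highlighted, rather than against the whole index's term dictionary.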


Benchmark Results

§  Unified Highlighter performed similarly or better than peers

§  Best performance: Postings with “light” Term Vectors

§  No use case for full term vectors anymore?

§  Caveats

§  Substantial variability in test runs (YMMV)

§  Depends on the specifics of your use case

§  Benchmark code available

Highlighter              Offset Source                Terms  Phrases  Wildcards
(search)                 N/A                          1.0x   1.0x     1.0x
Standard Highlighter     Analysis                     4.6x   4.7x     7.4x
Unified Highlighter      Analysis                     2.8x   2.4x     3.7x
Standard Highlighter     Term Vectors                 2.7x   2.3x     3.7x
Fast Vector Highlighter  Term Vectors                 1.8x   2.1x     2.6x
Unified Highlighter      Term Vectors                 1.7x   1.8x     2.3x
Postings Highlighter     Postings                     1.8x   1.5x     3.8x
Unified Highlighter      Postings                     1.6x   1.3x     3.8x
Unified Highlighter      Postings with Term Vectors*  1.5x   1.3x     2.2x

Times shown in multiples of the original search time (top row).


Future Potential Improvements

§  Accuracy

§  Switch from WSTE approach to SpanCollector API

§  Honor conjunctions “(X AND Y) OR Z”

§  Relevancy

§  Consider term diversity across top-X passages

§  Incorporate query boosts in passage scores

§  Support “requireFieldMatch=false”


Summary

§  Importance of highlighting in Legal Search

§  Overview of the existing Highlighters

§  Improvements to the Standard Highlighter

§  UnifiedHighlighter

§  Contributed to Lucene/Solr! LUCENE-7438

§  Your new favorite highlighter?


Questions?

Timothy M. Rodriguez, Verticals Search Team Lead, Bloomberg

@Timothy055

David Smiley, Search Developer/Consultant

@DavidWSmiley
