Search Engine-Building with Lucene and Solr, Part 2

Kai Chan

SoCal Code Camp LA, November 2013

Overview

● indexing process
● searching process
● advanced features
● scaling/redundancy
● resources
● demo
● questions/answers

Indexing Process

● request handler
  ○ data are read to create documents
● update request processor chain
  ○ optional document-wide processing
  ○ fields can be added, changed, removed
  ○ analysis
  ○ creation of indexed and stored fields
● update handler
  ○ the index is updated

Update Request Processor Chain

● de-duplication
  ○ creates a signature (hash) for each document to be added
  ○ replaces (deletes) existing documents with the same signature
  ○ MD5Signature
    ■ exact hashing
  ○ Lookup3Signature
    ■ faster calculation and smaller hash than MD5
  ○ TextProfileSignature
    ■ fuzzy hashing, near-duplicate detection
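The signature idea above can be sketched in a few lines of Python. This is a conceptual model only, not Solr's implementation: the in-memory dict standing in for the index, the helper names `signature` and `add_document`, and the choice of `title` and `body` as signature fields are all assumptions made for illustration.

```python
import hashlib


def signature(doc, fields):
    """Return an MD5 hex digest computed over the chosen fields (exact hashing)."""
    digest = hashlib.md5()
    for name in fields:
        # Separate field names and values with NUL bytes to avoid ambiguity.
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(str(doc.get(name, "")).encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()


def add_document(index, doc, fields=("title", "body")):
    """Add doc to a dict keyed by signature; report whether it replaced a duplicate."""
    sig = signature(doc, fields)
    replaced = sig in index
    index[sig] = dict(doc, signature=sig)
    return replaced


index = {}
add_document(index, {"id": "1", "title": "Solr", "body": "search server"})
# Same title and body, different id: treated as a duplicate and replaces document 1.
add_document(index, {"id": "2", "title": "Solr", "body": "search server"})
```

Because the hash is exact, changing a single character in `title` or `body` produces a different signature; near-duplicate detection needs a fuzzy signature instead.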

Update Request Processor Chain

● language detection
  ○ detects the language used in field(s)
  ○ adds a language field to the document
  ○ TikaLanguageIdentifierUpdateProcessorFactory
    ■ uses Apache Tika
  ○ LangDetectLanguageIdentifierUpdateProcessorFactory
    ■ uses language-detection library
  ○ external programs
    ■ e.g. Chromium Compact Language Detector

See Also: Language detection with Google's Compact Language Detector <http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html>
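To illustrate the "detect, then add a language field" step, here is a toy sketch that counts stop words per language. It is a deliberately crude heuristic and not how Tika, the language-detection library, or the Compact Language Detector work; the word lists, the field names `text` and `language`, and both function names are assumptions.

```python
import re

# Tiny, hand-picked stop word lists; real detectors use far richer models.
STOP_WORDS = {
    "en": {"the", "and", "of", "to", "is", "in"},
    "es": {"el", "la", "y", "de", "es", "en", "que"},
    "fr": {"le", "la", "et", "de", "est", "les", "un"},
}


def detect_language(text):
    """Return the language whose stop words occur most often, or None."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in STOP_WORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def add_language_field(doc, source_field="text", language_field="language"):
    """Return a copy of doc with a language field added when detection succeeds."""
    updated = dict(doc)
    lang = detect_language(doc.get(source_field, ""))
    if lang is not None:
        updated[language_field] = lang
    return updated
```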

Analysis

● analyzed
  ○ tokenization, i.e. breaking down the content to be searched into smaller units (“tokens”)
  ○ manipulation of tokens
● not analyzed
  ○ the whole content is treated as 1 unit for searching
● analyzed vs. not analyzed
  ○ are individual tokens meaningful on their own?
  ○ are individual tokens used in queries?

Example 1: book title

Lucene in Action, Second Edition: Covers Apache Lucene 3.0

● makes more sense to tokenize
● if not tokenized, search for “Lucene”: no match

Example 2: ISBN

1-933-98817-7

● makes more sense to not tokenize
● if tokenized (1 933 98817 7), search for “933”: match
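The two examples can be reproduced with a small sketch. The regex tokenizer and the two matching helpers are assumptions for illustration; they only mimic the difference between matching on individual tokens and matching on the whole value.

```python
import re


def tokenize(value):
    """Split a value into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", value.lower())


def matches_analyzed(value, query):
    """Analyzed field: the query matches if it equals any single token."""
    return query.lower() in tokenize(value)


def matches_not_analyzed(value, query):
    """Non-analyzed field: the query must equal the whole value."""
    return value == query


title = "Lucene in Action, Second Edition: Covers Apache Lucene 3.0"
isbn = "1-933-98817-7"

print(matches_analyzed(title, "Lucene"))      # tokenized title
print(matches_not_analyzed(title, "Lucene"))  # title kept as one unit
print(matches_analyzed(isbn, "933"))          # tokenized ISBN: a partial number matches
print(matches_not_analyzed(isbn, "933"))      # ISBN kept as one unit
```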

Analysis

analyzed:
● text

not analyzed:
● number
● serial number
● GUID
● checksum

How about URL?

Analysis

● character filter(s)
  ○ character replacement
  ○ e.g. replacing accented characters with their base forms
    ■ café → cafe
    ■ jalapeño → jalapeno
● tokenizer
  ○ create tokens (“words”) from characters
  ○ sometimes straightforward
  ○ many unusual cases: e-mail address, URL, code, etc.
● token filter(s)
  ○ token replacement
    ■ change case, remove apostrophe
    ■ remove stop words (a, and, the, for)
    ■ split/join words (ice-cream, ice cream, icecream)
    ■ stemming (importing, imported → import)
    ■ synonym (nation → country)

Field value:
Let's sign up for the amazing So-Cal Code Camp® at http://bit.ly/oZiZsu. Free Wi-Fi!

Tokens (text_en):
position 1    2     3   6     7   8    9     10    12    13      14      15    16  17
token    let  sign  up  amaz  so  cal  code  camp  http  bit.li  ozizsu  free  wi  fi

Tokens (text_en_splitting):
position 1    2     3   6     7   8    9     10    12    13   14  15  16  17   18    19  20
token    let  sign  up  amaz  so  cal  code  camp  http  bit  ly  o   zi  zsu  free  wi  fi
also: socal (position 8), httpbitlyozizsu (position 17), wifi (position 20)

Tokens (text_general):
position 1      2     3   4    5    6        7   8    9     10    11  12    13      14      15    16  17
token    let's  sign  up  for  the  amazing  so  cal  code  camp  at  http  bit.ly  oZiZsu  free  wi  fi
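The three analysis stages can be chained as in the sketch below. It is a simplified stand-in for a Solr field type, not a reproduction of `text_en`: the stop word list, the regex tokenizer, and the function names are assumptions, and stemming, word splitting, and synonyms are left out. Positions of removed stop words are skipped, which is why gaps appear in the position numbers.

```python
import re
import unicodedata

STOP_WORDS = {"a", "and", "the", "for", "at"}  # assumed, abbreviated list


def char_filter(text):
    """Character filter: fold accented characters to their base forms."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenizer(text):
    """Tokenizer: split on anything that is not a letter, digit, or apostrophe."""
    return re.findall(r"[A-Za-z0-9']+", text)


def token_filters(tokens):
    """Token filters: lowercase, drop possessives/apostrophes, remove stop words."""
    result = []
    for position, token in enumerate(tokens, start=1):
        token = token.lower()
        token = re.sub(r"'s$", "", token).replace("'", "")
        if not token or token in STOP_WORDS:
            continue  # the position is consumed, leaving a gap
        result.append((position, token))
    return result


def analyze(text):
    """Run the full chain: character filter, tokenizer, token filters."""
    return token_filters(tokenizer(char_filter(text)))


print(analyze("Let's sign up for the amazing café"))
```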

Searching Process

● query parsing
● analysis
● scoring
● sorting
● loading of stored fields
● optional search components
  ○ faceting
  ○ term vector
  ○ More Like This
  ○ highlighting

Scoring

● for a given query, each document not filtered out gets a score (float)

● higher score: higher in the results
● scoring algorithms
  ○ default: TF-IDF
  ○ other: Okapi BM25, etc.
  ○ very customizable

See Also: Lucene/Solr Revolution 2013 presentation “Beyond TF-IDF: Why, What and How”

Scoring - TF-IDF

● term frequency (TF)
  ○ how many times does this term appear in this document?
● inverse document frequency (IDF)
  ○ how many documents contain this term?
  ○ score proportional to the inverse of document frequency
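A textbook version of TF-IDF fits in a few lines. This sketch uses raw term counts and `log(N / df) + 1` as the IDF; that weighting is an assumption chosen for readability and is not Lucene's exact formula, which also applies the coordination and normalization factors.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, documents):
    """Score each tokenized document against the query with a simple TF-IDF sum."""
    doc_count = len(documents)
    term_counts = [Counter(doc) for doc in documents]
    # Document frequency: in how many documents does each query term appear?
    doc_freq = {
        term: sum(1 for counts in term_counts if term in counts)
        for term in query_terms
    }
    scores = []
    for counts in term_counts:
        score = 0.0
        for term in query_terms:
            if doc_freq[term] == 0:
                continue  # term appears nowhere in the collection
            idf = math.log(doc_count / doc_freq[term]) + 1.0
            score += counts[term] * idf  # TF times IDF
        scores.append(score)
    return scores


docs = [
    ["lucene", "search", "lucene"],
    ["search", "engine"],
    ["cooking", "recipes"],
]
print(tf_idf_scores(["lucene"], docs))
```

In this toy collection, "lucene" appears in fewer documents than "search", so each occurrence of "lucene" contributes more to the score.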

Scoring - Other Factors

● coordination factor (coord)
  ○ documents that contain all or most query terms get higher scores
● normalizing factor (norm)
  ○ adjusts for field length and query complexity

Scoring - Boost

● manual override: ask Lucene/Solr to give a higher score to some particular thing(s)

● index-time
  ○ per document
  ○ per field (of a particular document)
● search-time
  ○ per query

More Like This

● finds documents similar in content (of one field) to those matched

● constructs a query based on the highest scoring terms in a document

● requires the field to:
  ○ have stored term vectors (recommended), or
  ○ be stored

Credit: How MoreLikeThis Works in Lucene <http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/>
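The "construct a query from the highest scoring terms" step might look roughly like this. It is a sketch of the idea, not the MoreLikeThis implementation: the TF-IDF weighting, the `text` field name, the `max_terms` cut-off, and the function name are assumptions.

```python
import math
from collections import Counter


def more_like_this_query(doc_tokens, corpus, field="text", max_terms=3):
    """Build an OR query from the document's highest-weighted terms."""
    doc_count = len(corpus)
    term_freq = Counter(doc_tokens)

    def idf(term):
        doc_freq = sum(1 for other in corpus if term in other)
        return math.log(doc_count / (doc_freq + 1)) + 1.0

    # Highest TF-IDF first; ties are broken alphabetically for stable output.
    ranked = sorted(term_freq, key=lambda t: (-term_freq[t] * idf(t), t))
    return " OR ".join(f"{field}:{term}" for term in ranked[:max_terms])


corpus = [
    ["solr", "search", "the"],
    ["lucene", "search", "the"],
    ["cooking", "the"],
]
print(more_like_this_query(["solr", "solr", "search", "the"], corpus, max_terms=2))
```

Terms that occur in every document (such as "the" here) get the lowest weight and drop out of the generated query first.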

Spell Checking

● typos in queries happen
● returns spell checking suggestions (if any) within the same result
● can also be used for auto-complete
  ○ treating a prefix as a spelling mistake
  ○ returning full words as suggestions

/select?q=text:"busness comunication"&spellcheck=true&wt=xml

<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="busness">
      <int name="numFound">1</int>
      <int name="startOffset">6</int>
      <int name="endOffset">13</int>
      <arr name="suggestion">
        <str>business</str>
      </arr>
    </lst>
    <lst name="comunication">
      <int name="numFound">1</int>
      <int name="startOffset">14</int>
      <int name="endOffset">26</int>
      <arr name="suggestion">
        <str>communication</str>
      </arr>
    </lst>
  </lst>
</lst>
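On the client side, the request and response shown above could be handled along these lines. The base URL `http://localhost:8983/solr/select` is an assumption about the local setup, and both helper names are hypothetical; the sketch only builds the URL and parses a response string, and does not contact a server.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

SAMPLE_RESPONSE = """
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="busness">
      <int name="numFound">1</int>
      <arr name="suggestion"><str>business</str></arr>
    </lst>
    <lst name="comunication">
      <int name="numFound">1</int>
      <arr name="suggestion"><str>communication</str></arr>
    </lst>
  </lst>
</lst>
"""


def spellcheck_url(query, base="http://localhost:8983/solr/select"):
    """Build a select URL with spell checking enabled (base URL is an assumption)."""
    return base + "?" + urlencode({"q": query, "spellcheck": "true", "wt": "xml"})


def parse_suggestions(spellcheck_xml):
    """Map each misspelled word to its list of suggested corrections."""
    root = ET.fromstring(spellcheck_xml)
    suggestions = {}
    for entry in root.findall("./lst[@name='suggestions']/lst"):
        words = [s.text for s in entry.findall("./arr[@name='suggestion']/str")]
        suggestions[entry.get("name")] = words
    return suggestions


print(spellcheck_url('text:"busness comunication"'))
print(parse_suggestions(SAMPLE_RESPONSE))
```

In a full Solr response the `spellcheck` element sits inside the top-level `<response>` element, so a real client would locate it there before calling a parser like this one.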

Query Elevation

● a.k.a. “sponsored search”
● make sure certain documents appear at the top of the results for a certain query

Credit: Google Web Search <http://www.google.com/>

Query Elevation

● configure the elevator search component in solrconfig.xml

● in elevate.xml, specify the queries and the list of documents (by id) to elevate or exclude

● enable query elevation: enableElevation=true
● (optional) override the sort parameter: forceElevation=true
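The effect of elevation on a result list can be modeled as below. This is a conceptual sketch rather than the elevator component itself: the `rules` dict loosely mirrors what elevate.xml expresses (per query, ids to elevate and ids to exclude), and its shape, the sample ids, and the function name are assumptions.

```python
def apply_elevation(query, ranked_ids, rules):
    """Move elevated ids to the top, drop excluded ids, keep the rest in order."""
    rule = rules.get(query.lower())
    if rule is None:
        return list(ranked_ids)  # no rule for this query: results unchanged
    elevated = list(rule.get("elevate", []))
    excluded = set(rule.get("exclude", []))
    rest = [
        doc_id for doc_id in ranked_ids
        if doc_id not in excluded and doc_id not in elevated
    ]
    return elevated + rest


# Hypothetical rules: for the query "ipod", put doc-7 first and hide doc-2.
rules = {"ipod": {"elevate": ["doc-7"], "exclude": ["doc-2"]}}
print(apply_elevation("ipod", ["doc-1", "doc-2", "doc-7", "doc-9"], rules))
```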

Function Query

● like formulas in Excel
● apply functions to field values for filtering and scoring

Function Query

● query: q={!func} cos(angle)
● query (range): q={!frange l=0.5 u=1} cos(angle)
● field: fl=angle,cos(angle)
● sort: sort=cos(angle) desc
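What these four uses compute can be imitated on plain Python data. The sketch below is only an analogy for the function query behavior, assuming `angle` is stored in radians and that range bounds are inclusive; the sample documents and helper names are made up.

```python
import math

docs = [
    {"id": "a", "angle": 0.0},
    {"id": "b", "angle": 1.0},
    {"id": "c", "angle": 2.0},
]


def cos_angle(doc):
    """The function applied to a field value, like cos(angle)."""
    return math.cos(doc["angle"])


def frange(documents, func, lower, upper):
    """Keep documents whose function value lies in [lower, upper]."""
    return [doc for doc in documents if lower <= func(doc) <= upper]


def with_function_field(documents, func, name):
    """Return copies of the documents with the function value as an extra field."""
    return [dict(doc, **{name: func(doc)}) for doc in documents]


def sort_desc(documents, func):
    """Sort by function value, highest first."""
    return sorted(documents, key=func, reverse=True)


print([d["id"] for d in frange(docs, cos_angle, 0.5, 1.0)])
print([d["id"] for d in sort_desc(docs, cos_angle)])
```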

Spatial Search

● data: contains locations (longitudes, latitudes)
  ○ e.g. merchants with store locations

● search: filter and/or sort by location

Credit: Google Maps <http://maps.google.com/>

Spatial Search

● geofilt
  ○ circle centered at a given point
  ○ distance from a given point
  ○ fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5
● bbox
  ○ square (“bounding box”) centered at a given point
  ○ distance from a given point + corners
  ○ fq={!bbox sfield=store}&pt=45.15,-93.85&d=5

Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>

[Figure: geofilt (circle) and bbox (square), each with d = 5 km, centered at (45.15, -93.85); sample locations marked x and o]


Spatial Search

● geodist
  ○ returns the distance between the location given in a field and a certain coordinate
  ○ e.g. sort by ascending distance from (45.15,-93.85), and return the distances as the score:
    q={!func}geodist()&sfield=store&pt=45.15,-93.85&sort=score+asc
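The difference between the circle and the bounding box can be sketched with the haversine formula. This is an illustrative approximation, not Solr's spatial code: the Earth radius constant, the sample store locations, and the way the box is derived from degree offsets are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def geodist(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def geofilt(docs, center, d):
    """Keep documents within d km of the center (circle)."""
    return [doc for doc in docs if geodist(*center, *doc["store"]) <= d]


def bbox(docs, center, d):
    """Keep documents inside the box that encloses the circle of radius d km."""
    lat, lon = center
    d_lat = math.degrees(d / EARTH_RADIUS_KM)
    d_lon = math.degrees(d / (EARTH_RADIUS_KM * math.cos(math.radians(lat))))
    return [
        doc for doc in docs
        if abs(doc["store"][0] - lat) <= d_lat and abs(doc["store"][1] - lon) <= d_lon
    ]


center = (45.15, -93.85)
stores = [
    {"id": "near", "store": (45.16, -93.86)},
    {"id": "corner", "store": (45.19, -93.80)},  # inside the box, outside the circle
    {"id": "far", "store": (45.30, -93.85)},
]
print([s["id"] for s in geofilt(stores, center, 5)])
print([s["id"] for s in bbox(stores, center, 5)])
```

The box check needs only comparisons on latitude and longitude, while the circle check computes a distance per document; the price of the cheaper check is that locations near the corners of the box are accepted even though they are farther than `d` from the center.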


Scaling/Redundancy - Problems

● collection too large for a single machine
● too many requests for a single machine
● a machine can go down

Scaling/Redundancy - Solutions

● collection too large for a single machine
  ○ distribution
    ■ spread the collection across multiple machines
● too many requests for a single machine
  ○ distribution
    ■ spread the requests across multiple machines
● a machine can go down
  ○ replication
    ■ copy data and configuration across multiple machines
    ■ make sure there is no single point of failure

SolrCloud

● Solr instances
  ○ collection (logical index) divided into one or more partial collections (“shards”)
  ○ for each shard, one or more Solr instances keep copies of the data
    ■ one as leader - handles reads and writes
    ■ others as replicas - handle reads
● ZooKeeper instances
  ○ management of Solr instances
  ○ leader election
  ○ node discovery

[Figure sequence: a collection (i.e. logical index) divided into shards, each holding ⅓ of the collection, with one leader and one or more replicas per shard; after a leader goes offline, a replica of that shard takes over as leader]
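The shard and leader roles described above can be modeled with a toy sketch. It is not how SolrCloud is implemented: real leader election goes through ZooKeeper, and the MD5-based routing, the "first live node becomes leader" rule, the node names, and the `Shard` class are all assumptions for illustration.

```python
import hashlib


class Shard:
    """One partial collection, served by a leader and zero or more replicas."""

    def __init__(self, name, nodes):
        self.name = name
        self.nodes = list(nodes)
        self.offline = set()
        self.leader = self.nodes[0]

    def live_nodes(self):
        return [node for node in self.nodes if node not in self.offline]

    def mark_offline(self, node):
        """Record a failed node; if it was the leader, promote a live replica."""
        self.offline.add(node)
        if node == self.leader:
            live = self.live_nodes()
            self.leader = live[0] if live else None


def shard_for(doc_id, shards):
    """Route a document to a shard by hashing its id (assumed routing scheme)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:4], "big") % len(shards)]


shards = [
    Shard("shard1", ["node1", "node4"]),
    Shard("shard2", ["node2", "node5"]),
    Shard("shard3", ["node3", "node6"]),
]
target = shard_for("doc-42", shards)
target.mark_offline(target.leader)  # the leader goes down
print(target.name, "is now led by", target.leader)
```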

Resources - Books

● Lucene in Action
  ○ written by 3 committers and PMC members
  ○ somewhat outdated (2010; covers Lucene 3.0)
  ○ http://www.manning.com/hatcher3/
● Solr in Action
  ○ early access; coming out later this year
  ○ http://www.manning.com/grainger/
● Apache Solr 4 Cookbook
  ○ common problems and useful tips
  ○ http://www.packtpub.com/apache-solr-4-cookbook/book




Resources - Books

● Introduction to Information Retrieval
  ○ not specific to Lucene/Solr, but about IR concepts
  ○ free e-book
  ○ http://nlp.stanford.edu/IR-book/
● Managing Gigabytes
  ○ indexing, compression and other topics
  ○ accompanied by MG4J - full-text search software
  ○ http://mg4j.di.unimi.it/


Resources - Web

● official websites
  ○ Lucene Core - http://lucene.apache.org/core/
  ○ Solr - http://lucene.apache.org/solr/
● mailing lists
● Wiki sites
  ○ Lucene Core - http://wiki.apache.org/lucene-java/
  ○ Solr - http://wiki.apache.org/solr/
● reference guides
  ○ API Documentation for Lucene and Solr
  ○ Apache Solr Reference Guide


Getting Started

● download Solr
  ○ requires Java 6 or newer to run
● Solr comes bundled/configured with Jetty
  ○ <Solr directory>/example/start.jar
● "exampledocs" directory contains sample documents
  ○ <Solr directory>/example/exampledocs/post.jar
  ○ java -Durl=http://localhost:8983/solr/update -jar post.jar *.xml
● use the Solr admin interface
  ○ http://localhost:8983/solr/
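As an alternative to post.jar, documents could also be sent to the update URL from a script. The sketch below builds an XML add message of the kind found in the sample documents and posts it with the standard library. The field names, the `commit=true` parameter, and both function names are assumptions, and `post_to_solr` is only defined here, not called, since it needs a running Solr instance.

```python
import urllib.request
import xml.etree.ElementTree as ET

UPDATE_URL = "http://localhost:8983/solr/update"  # same URL as in the post.jar command


def build_add_message(docs):
    """Build an <add> message with one <doc> element per document dict."""
    add = ET.Element("add")
    for doc in docs:
        doc_element = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_element, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def post_to_solr(xml_message, url=UPDATE_URL):
    """Send the message to Solr and commit (requires a running instance)."""
    request = urllib.request.Request(
        url + "?commit=true",
        data=xml_message.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


message = build_add_message([{"id": "book-1", "title": "Lucene in Action"}])
print(message)
```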