Search Engine-Building with Lucene and Solr
Part 2
Kai Chan
SoCal Code Camp, November 2013
Overview
● indexing process
● searching process
● advanced features
● scaling/redundancy
● resources
● demo
● questions/answers
Indexing Process
● request handler
  ○ data are read to create documents
● update request processor chain
  ○ optional document-wide processing
  ○ fields can be added, changed, removed
  ○ analysis
  ○ creation of indexed and stored fields
● update handler
  ○ the index is updated
Update Request Processor Chain
● de-duplication
  ○ creates a signature (hash) for each document to be added
  ○ replaces (deletes) existing documents with the same signature
  ○ MD5Signature
    ■ exact hashing
  ○ Lookup3Signature
    ■ faster calculation and smaller hash than MD5
  ○ TextProfileSignature
    ■ fuzzy hashing, near-duplicate detection
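The de-duplication idea above can be sketched in a few lines of Python. This is only an illustration of exact hashing with MD5, not Solr's implementation: the `signature` and `add_document` helpers and the dict-based `index` are hypothetical names made up for this sketch.

```python
import hashlib


def signature(doc, fields):
    """Return an MD5 hex digest over the chosen field values (exact hashing)."""
    h = hashlib.md5()
    for name in fields:
        h.update(str(doc.get(name, "")).encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    return h.hexdigest()


def add_document(index, doc, fields):
    """Add doc to index (a dict keyed by signature); report whether it replaced one."""
    sig = signature(doc, fields)
    replaced = sig in index
    index[sig] = doc  # an existing document with the same signature is overwritten
    return replaced


index = {}
add_document(index, {"id": "1", "title": "Lucene in Action"}, ["title"])
add_document(index, {"id": "2", "title": "Lucene in Action"}, ["title"])  # same signature
add_document(index, {"id": "3", "title": "Solr in Action"}, ["title"])
print(len(index), sorted(d["id"] for d in index.values()))
```

Because the signature is computed only from the listed fields, two documents with different ids but the same title count as duplicates here; which fields feed the signature is the main design choice.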
Update Request Processor Chain
● language detection
  ○ detects the language used in field(s)
  ○ adds a language field to the document
  ○ TikaLanguageIdentifierUpdateProcessorFactory
    ■ uses Apache Tika
  ○ LangDetectLanguageIdentifierUpdateProcessorFactory
    ■ uses the language-detection library
  ○ external programs
    ■ e.g. Chromium Compact Language Detector
See Also: Language detection with Google's Compact Language Detector <http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html>
Analysis
● analyzed
  ○ tokenization, i.e. breaking down the content to be searched into smaller units (“tokens”)
  ○ manipulation of tokens
● not analyzed
  ○ the whole content is treated as one unit for searching
● analyzed vs. not analyzed
  ○ are individual tokens meaningful on their own?
  ○ are individual tokens used in queries?
Example 1: book title
Lucene in Action, Second Edition: Covers Apache Lucene 3.0
  ○ makes more sense to tokenize
  ○ if not tokenized, search for “Lucene”: no match
Example 2: ISBN
1-933-98817-7
  ○ makes more sense to not tokenize
  ○ if tokenized (1 933 98817 7), search for “933”: match
Analysis
analyzed:
● text
not analyzed:
● number
● serial number
● GUID
● checksum
How about URL?
Analysis
● character filter(s)
  ○ character replacement
  ○ e.g. replacing accented characters with their base forms
    café → cafe
    jalapeño → jalapeno
● tokenizer
● token filter(s)
Analysis
● character filter(s)
● tokenizer
  ○ creates tokens (“words”) from characters
  ○ sometimes straightforward
  ○ many unusual cases: e-mail address, URL, code, etc.
● token filter(s)
Analysis
● character filter(s)
● tokenizer
● token filter(s)
  ○ token replacement
    ■ change case, remove apostrophe
    ■ remove stop words (a, and, the, for)
    ■ split/join words (ice-cream, ice cream, icecream)
    ■ stemming (importing, imported → import)
    ■ synonym (nation → country)
Field value:
Let's sign up for the amazing So-Cal Code Camp® at http://bit.ly/oZiZsu. Free Wi-Fi!

Tokens (text_en):
positions: 1 2 3 6 7 8 9 10 12 13 14 15 16 17
tokens: let sign up amaz so cal code camp http bit.li ozizsu free wi fi

Tokens (text_en_splitting):
positions: 1 2 3 6 7 8 9 10 12 13 14 15 16 17 18 19 20
tokens: let sign up amaz so cal code camp http bit ly o zi zsu free wi fi
joined tokens (position): socal (8), httpbitlyozizsu (17), wifi (20)

Tokens (text_general):
positions: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
tokens: let's sign up for the amazing so cal code camp at http bit.ly oZiZsu free wi fi
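The three analysis stages (character filter, tokenizer, token filters) can be sketched as a small pipeline. This is a simplified illustration and does not reproduce any of the Solr field types shown above: the function names are hypothetical, the tokenizer is a plain regular expression, and stemming and synonyms are left out.

```python
import re
import unicodedata

STOP_WORDS = {"a", "and", "the", "for"}  # the stop words listed on the slide


def char_filter(text):
    """Replace accented characters with their base forms (café -> cafe)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def tokenize(text):
    """Split into runs of ASCII letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


def token_filters(tokens):
    """Lowercase, drop apostrophes, then drop stop words."""
    out = []
    for tok in tokens:
        tok = tok.lower().replace("'", "")
        if tok and tok not in STOP_WORDS:
            out.append(tok)
    return out


def analyze(text):
    """Run the stages in order: character filter, tokenizer, token filters."""
    return token_filters(tokenize(char_filter(text)))


print(analyze("The café's jalapeño ice-cream"))
```

The same `analyze` function would have to be applied to both the indexed content and the query text, otherwise the tokens on the two sides would not line up.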
Searching Process
● query parsing
● analysis
● scoring
● sorting
● loading of stored fields
● optional search components
  ○ faceting
  ○ term vector
  ○ More Like This
  ○ highlighting
Scoring
● for a given query, each document not filtered out gets a score (float)
● higher score: higher in the results
● scoring algorithms
  ○ default: TF-IDF
  ○ other: Okapi BM25, etc.
  ○ very customizable
See Also: Lucene/Solr Revolution 2013 presentation “Beyond TF-IDF: Why, What and How”
Scoring - TF-IDF
● term frequency (TF)
  ○ how many times does this term appear in this document?
● inverse document frequency (IDF)
  ○ how many documents contain this term?
  ○ score proportional to the inverse of document frequency
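A rough sketch of how the two factors combine is shown below. The square-root damping of TF and the `1 + ln(N / (df + 1))` form of IDF follow the shape of Lucene's classic TF-IDF similarity as I understand it, but this is a simplification: the real score also involves coord, norms and boosts, and the function names here are made up for the example.

```python
import math


def tf(term, doc_tokens):
    """Term frequency, dampened with a square root."""
    return math.sqrt(doc_tokens.count(term))


def idf(term, docs):
    """Inverse document frequency: rarer terms get larger values."""
    doc_freq = sum(1 for d in docs if term in d)
    return 1.0 + math.log(len(docs) / (doc_freq + 1))


def score(query_terms, doc_tokens, docs):
    """Sum of tf * idf over the query terms (simplified)."""
    return sum(tf(t, doc_tokens) * idf(t, docs) for t in query_terms)


docs = [
    ["lucene", "in", "action"],
    ["solr", "in", "action"],
    ["lucene", "lucene", "search"],
]
for d in docs:
    print(d, round(score(["lucene"], d, docs), 3))
```

In this toy collection the third document ranks first for "lucene" because the term appears twice in it, and a term such as "search" that occurs in only one document has a higher IDF than "lucene".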
Scoring - Other Factors
● coordination factor (coord)
  ○ documents that contain all or most query terms get higher scores
● normalizing factor (norm)
  ○ adjusts for field length and query complexity
Scoring - Boost
● manual override: ask Lucene/Solr to give a higher score to some particular thing(s)
● index-time
  ○ per document
  ○ per field (of a particular document)
● search-time
  ○ per query
More Like This
● finds documents similar in content (of one field) to those matched
● constructs a query based on the highest scoring terms in a document
● requires the field to:
  ○ have stored term vectors (recommended), or
  ○ be stored
Credit: How MoreLikeThis Works in Lucene <http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/>
Spell Checking
● typos in queries happen
● returns spell checking suggestions (if any) within the same result
● can also be used for auto-complete
  ○ treating a prefix as a spelling mistake
  ○ returning full words as suggestions
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="busness">
      <int name="numFound">1</int>
      <int name="startOffset">6</int>
      <int name="endOffset">13</int>
      <arr name="suggestion">
        <str>business</str>
      </arr>
    </lst>
    <lst name="comunication">
      <int name="numFound">1</int>
      <int name="startOffset">14</int>
      <int name="endOffset">26</int>
      <arr name="suggestion">
        <str>communication</str>
      </arr>
    </lst>
  </lst>
</lst>
/select?q=text:"busness comunication"&spellcheck=true&wt=xml
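A client might read the suggestions out of such a response roughly as follows. The sketch parses only the `spellcheck` fragment shown above (a full Solr response wraps it in further elements), and `suggestions` and `correct` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# The spellcheck fragment from the slide, trimmed to the parts used here.
RESPONSE = """
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="busness">
      <int name="numFound">1</int>
      <arr name="suggestion"><str>business</str></arr>
    </lst>
    <lst name="comunication">
      <int name="numFound">1</int>
      <arr name="suggestion"><str>communication</str></arr>
    </lst>
  </lst>
</lst>
"""


def suggestions(xml_text):
    """Map each misspelled word to its list of suggested corrections."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for entry in root.findall("./lst[@name='suggestions']/lst"):
        words = [s.text for s in entry.findall("./arr[@name='suggestion']/str")]
        result[entry.get("name")] = words
    return result


def correct(query, sugg):
    """Replace each word that has a suggestion with the first suggestion."""
    return " ".join(sugg.get(w, [w])[0] for w in query.split())


found = suggestions(RESPONSE)
print(found)
print(correct("busness comunication", found))
```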
Query Elevation
● a.k.a. “sponsored search”
● makes sure certain documents appear at the top of the results for a certain query
Credit: Google Web Search <http://www.google.com/>
Query Elevation
● configure the elevator search component in solrconfig.xml
● in elevate.xml, specify the queries and the list of documents (by id) to elevate or exclude
● enable query elevation: enableElevation=true
● (optional) override the sort parameter: forceElevation=true
Function Query
● like formulas in Excel
● apply functions to field values for filtering and scoring
Function Query
● query: q={!func} cos(angle)
● query (range): q={!frange l=0.5 u=1} cos(angle)
● field: fl=angle,cos(angle)
● sort: sort=cos(angle) desc
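Function query parameters contain braces, spaces and parentheses, so they need URL encoding when sent over HTTP. The sketch below only builds the request URL and sends nothing; the base URL is the local example instance mentioned under Getting Started, the `/select` path is taken from the spell checking example, and `select_url` is a hypothetical helper.

```python
from urllib.parse import urlencode, parse_qs

BASE = "http://localhost:8983/solr"  # assumed local example instance


def select_url(base, **params):
    """Build a /select URL with properly encoded query parameters."""
    return base + "/select?" + urlencode(params)


url = select_url(
    BASE,
    q="{!frange l=0.5 u=1} cos(angle)",  # keep documents with 0.5 <= cos(angle) <= 1
    fl="angle,cos(angle)",               # return the field and the function value
    sort="cos(angle) desc",              # highest cosine first
)
print(url)

# Decoding the query string gives back the original parameter values.
decoded = parse_qs(url.split("?", 1)[1])
print(decoded["q"][0])
```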
Spatial Search
● data: contains locations (longitudes, latitudes)
  ○ e.g. merchants with store locations
● search: filter and/or sort by location
Credit: Google Maps <http://maps.google.com/>
Spatial Search
● geofilt
  ○ circle centered at a given point
  ○ distance from a given point
  ○ fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5
● bbox
  ○ square (“bounding box”) centered at a given point
  ○ distance from a given point + corners
  ○ fq={!bbox sfield=store}&pt=45.15,-93.85&d=5
Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
[Figure: geofilt (circle, 5 km) and bbox (square, 5 km), both centered at (45.15, -93.85)]
[Figure: the same geofilt circle and bbox square, with sample points marked “o” and “x”]
Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
Spatial Search
● geodist
  ○ returns the distance between the location given in a field and a certain coordinate
  ○ e.g. sort by ascending distance from (45.15,-93.85), and return the distances as the score:
    q={!func}geodist()&sfield=store&pt=45.15,-93.85&sort=score+asc
Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
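What geofilt and geodist compute can be approximated outside Solr with the haversine formula. The following is a sketch under stated assumptions, not Solr's code: the Earth radius is a rounded mean value, the store data are invented, and `within` and `by_distance` are hypothetical helpers that mimic a circular filter and a sort by distance.

```python
import math

EARTH_RADIUS_KM = 6371.0  # approximate mean Earth radius (assumption of this sketch)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within(stores, pt, d_km):
    """geofilt-like filter: keep stores within d_km of pt."""
    return [s for s in stores if haversine_km(*pt, *s["store"]) <= d_km]


def by_distance(stores, pt):
    """geodist-like sort: (distance, store) pairs, nearest first."""
    pairs = [(haversine_km(*pt, *s["store"]), s) for s in stores]
    return sorted(pairs, key=lambda p: p[0])


PT = (45.15, -93.85)
STORES = [  # invented sample data
    {"name": "C", "store": (45.30, -93.85)},
    {"name": "A", "store": (45.15, -93.85)},
    {"name": "B", "store": (45.17, -93.87)},
]
print([s["name"] for s in within(STORES, PT, 5)])
for dist, s in by_distance(STORES, PT):
    print(s["name"], round(dist, 2))
```

A bbox-style filter would instead compare latitudes and longitudes against the corners of the enclosing square, which is cheaper to evaluate but also admits points in the corners that lie farther than the given distance.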
Scaling/Redundancy - Problems
● collection too large for a single machine
● too many requests for a single machine
● a machine can go down
Scaling/Redundancy - Solutions
● collection too large for a single machine
  ○ distribution
    ■ spread the collection across multiple machines
● too many requests for a single machine
  ○ distribution
    ■ spread the requests across multiple machines
● a machine can go down
  ○ replication
    ■ copy data and configuration across multiple machines
    ■ make sure there is no single point of failure
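Spreading a collection across machines comes down to deciding, for each document, which partition it belongs to. The sketch below hashes the document id and takes it modulo the number of partitions. It is only an illustration of the idea and should not be read as how Solr routes documents; the function names and the MD5-based hash are choices made for this example.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Pick a shard index from a stable hash of the document id."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def distribute(doc_ids, num_shards):
    """Group document ids by the shard they are assigned to."""
    shards = [[] for _ in range(num_shards)]
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


doc_ids = ["doc-%d" % i for i in range(10)]
shards = distribute(doc_ids, 3)
for i, ids in enumerate(shards):
    print("shard", i + 1, ids)
```

The hash is deterministic, so any machine can work out where a given document lives without consulting a lookup table. With a plain modulo, changing the number of shards reassigns most documents, which is one reason real systems tend to use more elaborate schemes.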
SolrCloud
● Solr instances
● ZooKeeper instances
SolrCloud
● Solr instances
  ○ collection (logical index) divided into one or more partial collections (“shards”)
  ○ for each shard, one or more Solr instances keep copies of the data
    ■ one as leader - handles reads and writes
    ■ others as replicas - handle reads
● ZooKeeper instances
SolrCloud
● Solr instances
● ZooKeeper instances
  ○ management of Solr instances
  ○ leader election
  ○ node discovery
[Figure sequence: a collection (i.e. logical index) split into three shards, each holding ⅓ of the collection, with one leader and one or more replicas per shard. The successive diagrams show (1) the initial layout, (2) an additional replica, (3) one node marked offline while another node in its shard is the leader, and (4) that node shown as a replica again.]
Resources - Books
● Lucene in Action
  ○ written by three committers and PMC members
  ○ somewhat outdated (2010; covers Lucene 3.0)
  ○ http://www.manning.com/hatcher3/
● Solr in Action
  ○ early access; coming out later this year
  ○ http://www.manning.com/grainger/
● Apache Solr 4 Cookbook
  ○ common problems and useful tips
  ○ http://www.packtpub.com/apache-solr-4-cookbook/book
Resources - Books
● Introduction to Information Retrieval
  ○ not specific to Lucene/Solr, but about IR concepts
  ○ free e-book
  ○ http://nlp.stanford.edu/IR-book/
● Managing Gigabytes
  ○ indexing, compression and other topics
  ○ accompanied by MG4J - a full-text search software
  ○ http://mg4j.di.unimi.it/
Resources - Web
● official websites
  ○ Lucene Core - http://lucene.apache.org/core/
  ○ Solr - http://lucene.apache.org/solr/
● mailing lists
● wiki sites
  ○ Lucene Core - http://wiki.apache.org/lucene-java/
  ○ Solr - http://wiki.apache.org/solr/
● reference guides
  ○ API Documentation for Lucene and Solr
  ○ Apache Solr Reference Guide
Getting Started
● download Solr
  ○ requires Java 6 or newer to run
● Solr comes bundled/configured with Jetty
  ○ <Solr directory>/example/start.jar
● "exampledocs" directory contains sample documents
  ○ <Solr directory>/example/exampledocs/post.jar
  ○ java -Durl=http://localhost:8983/solr/update -jar post.jar *.xml
● use the Solr admin interface
  ○ http://localhost:8983/solr/