lucenerevolution
DESCRIPTION
Presented by Amit Nithianandan, Lead Engineer Search/Analytics New Platforms, Zvents/Stubhub. Zvents has been a user of Apache Solr since 2007, when the project was still very young. Since then, the team has made extensive use of its various features and most recently completed an overhaul of the search engine to Solr 4.0. We'll touch on a variety of development/operational topics, including how we manage the build lifecycle of the search application using Maven, release the deployment package using Capistrano, and monitor using NewRelic, as well as the extensive use of virtual machines to simplify node management. We'll also talk about application-level details such as our unique federated search product, and the integration of technologies such as Hypertable, RabbitMQ, and EHCache to power more real-time ranking and filtering based on traffic statistics and ticket inventory.
Solr @ Zvents: 6 years later
Amit Nithianandan Lead Engineer – Search and Analytics
About Me
My “Street Cred”
• Joined Zvents in Aug 2008 as a member of the search engineering team. – Knew nothing about Solr/Lucene (Lucene… isn't that a misspelling of the Safeway-brand milk?)
• Worked on small features early on – New ranking configuration for the “hot tickets” module on the site.
• Worked on larger initiatives – Multiple re-writes of the federated search component – Recent upgrade to Solr 4.0
• Contribute to the community – Authored a few articles/blog posts, most notably on running Solr in Eclipse. – Wrote a Chrome extension for easily editing long (Solr) API URLs
Overview
• About Zvents
• Why Solr?
• Search @ Zvents Details
• Federated Search discussion
• Integration with external data stores
• Development/Deployment
• Operations/Performance Details
About Zvents
• Helping people find fun things to do since 2005!
• Content sourced from a variety of places:
– Normal end users
– Internal content editors
– External content editors @ local newspapers
– Feeds
• Powers the events guide section of hundreds of local newspaper sites around the nation.
Technologies used (including but not limited to).
Why Solr?
• Flexible, Powerful, Customizable
• RESTful query API
• Scales reasonably well without hassle.
• Fast and easy to get started given the samples.
• Strong and active community
– The mailing list is amazing; conferences and meetups help too
Zvents Search at a quick glance…
• 2 masters / 10 slaves; not sharded
• Solr 4.x running on Jetty
• Six cores – five host actual data, the sixth is used for federated search
• Federated search across eight different document types (e.g. venues, restaurants, movies…)
• Total number of documents: ~5M
• We allow blank text (“what”) searches so people can look for things based on date and location alone
• How to surface the most relevant things to do?
Search Challenges
Document Design Notes
• Venues and artists are indexed as you would expect.
• For movies, index each showtime: pk = {theater_id, movie_id, time} triple.
– When searching, filter by location, collapse on movie_id, sort by time asc.
• For events, index each occurrence (time).
– When searching, collapse on sequence_id and sort by time asc to show the soonest upcoming occurrence.
• Avoid showing visual “duplicates”
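The collapsing described above maps onto Solr's standard result grouping parameters; a hedged sketch for the movie case (field names taken from the slide):

```
&group=true&group.field=movie_id&group.sort=time+asc&group.limit=1
```

group.limit=1 keeps only the soonest showtime per movie, which avoids the visual duplicates.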
Request Flow
Zvents Search Service API
• Essentially the Solr API with a few changes.
• A ServletFilter and a custom QueryComponent translate URL parameters into proper Solr parameter “syntax”
– E.g. latitude/longitude/radius is converted into a geospatial query with the distance in km.
• Federated search executed using ThreadPool
– Parallel searches, results blended together.
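A minimal, self-contained sketch of that fan-out, assuming a hypothetical search() method in place of the real per-core Solr requests:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class Federator {
    // Placeholder for the real per-product Solr request (hypothetical).
    static List<String> search(String product) {
        return List.of(product + "-result-1", product + "-result-2");
    }

    // Fan out one search per product on a thread pool, then merge.
    public static List<String> federate(List<String> products) {
        ExecutorService pool = Executors.newFixedThreadPool(products.size());
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String p : products) {
                futures.add(pool.submit(() -> search(p)));
            }
            List<String> blended = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                try {
                    blended.addAll(f.get());
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
            }
            return blended; // the real blender interleaves by normalized score
        } finally {
            pool.shutdown();
        }
    }
}
```

The real component blends by normalized score rather than simply concatenating; the concatenation above just shows the parallel execution shape.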
Sample Query
http://localhost:8983/map_prod/select?qt=zvents&trim=1&start=0&start_spn=0&rows_spn=6&rows=10&zsort=0&rcity=San+Francisco&latitude=37.7752&longitude=-122.419&radius=75.0&category=event,event_spn,venue&sd=201212190000&fq:event:=has_city:true&fq:event_spn:=has_city:true&wt=ruby&q=the%20fillmore&fl=id,name,score&facet=true&indent=on
Category specific fq parameter
Lat/Long Params
Collapse results (grouping)
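As an illustration of the ServletFilter translation mentioned earlier, a hedged sketch of rewriting the latitude/longitude/radius parameters above into Solr geofilt syntax; the sfield name "latlon" is an assumption, and d is the distance in km:

```java
import java.util.Locale;

public class GeoParamTranslator {
    // Rewrite service-level lat/long/radius params into a geofilt clause.
    // "latlon" is an assumed field name, not the actual Zvents schema.
    public static String toGeoFilter(double lat, double lon, double radiusKm) {
        return String.format(Locale.ROOT,
                "{!geofilt pt=%s,%s sfield=latlon d=%s}", lat, lon, radiusKm);
    }
}
```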
Sample Response (abbr)
{
  'organic' => {
    'response' => {'numFound'=>379, 'start'=>0, 'maxScore'=>75.485054, 'docs'=>[…]},
    'facet_counts' => {
      'facet_fields' => {
        'category' => {
          'event' => 67,
          'venue' => 312}}}},
  'sponsored' => {
    'response' => {'numFound'=>0, 'start'=>0, 'maxScore'=>0.0, 'docs'=>[…]},
    'facet_counts' => {
      'facet_fields' => {
        'category' => {
          'event_spn' => 0}}}}
}
Federated Search
Federated Search (notice movies + events mixed)
Federated Search (cont’d)
• The Zvents federator component executes multiple concurrent searches and blends the results.
• Raw scores are meaningless across products, so they must be normalized into something comparable.
• Dividing by the max to yield a 0-1 scale throws away differences in the score distributions.
• We chose the Z-score: (score - avg) / stddev.
• Getting statistics like the average and standard deviation of the results is not trivial.
• Initially thought to hack the handler to plug in my own Collector/Scorer.
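A self-contained sketch of the Z-score normalization described above, using the population standard deviation computed from running sums:

```java
public class ZScore {
    // Normalize scores to (score - avg) / stddev so different products
    // become comparable. Uses the population standard deviation.
    public static double[] normalize(double[] scores) {
        int n = scores.length;
        double sum = 0, sumSq = 0;
        for (double s : scores) {
            sum += s;
            sumSq += s * s;
        }
        double avg = sum / n;
        double stdDev = Math.sqrt(sumSq / n - avg * avg);
        double[] z = new double[n];
        for (int i = 0; i < n; i++) {
            z[i] = stdDev == 0 ? 0 : (scores[i] - avg) / stdDev;
        }
        return z;
    }
}
```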
PostFilter to the rescue!
• PostFilters allow you to (as the name suggests) execute filtering logic *after* the main query and all other filters have executed.
• Lucene filters and the main query execute together in a leap-frog manner. Some filters (e.g. filtering by distance to the user) are expensive to generate up front for all documents.
• You can create a delegating Collector that optionally calls “super.collect()” when some condition is true.
• Since I'm now effectively at the lowest level of Lucene (Collector/Scorer), I can store distribution information about the scores as they pass through the custom collector and scorer!
Example Result Snippet
<lst name="score_stats">
<float name="min">1.3786081E-6</float>
<float name="max">10.416486</float>
<float name="avg">1.8479956</float>
<float name="stdDev">1.544854</float>
<long name="numDocs">561</long>
<float name="sumSquaredScores">3254.7324</float>
<float name="sumScores">1036.7256</float>
</lst>
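The avg and stdDev in the snippet above can be cross-checked from numDocs, sumScores, and sumSquaredScores with the population formula stdDev = sqrt(sumSq/n - avg²); a small sketch:

```java
public class ScoreStats {
    // avg = sumScores / numDocs
    public static double avg(double sumScores, long numDocs) {
        return sumScores / numDocs;
    }

    // Population standard deviation from the two running sums the
    // collector accumulates, so no second pass over scores is needed.
    public static double stdDev(double sumSquaredScores, double sumScores, long numDocs) {
        double a = avg(sumScores, numDocs);
        return Math.sqrt(sumSquaredScores / numDocs - a * a);
    }
}
```

Plugging in the snippet's values (numDocs=561, sumScores=1036.7256, sumSquaredScores=3254.7324) reproduces avg≈1.848 and stdDev≈1.545.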
Federated Search – Victory!
• Now the federator, when executing the product-specific searches, can extract this information to produce a “normal” score.
• Results from different products can be blended based on how good individual results are relative to their peers.
Ranking/Filtering using (highly) volatile data…
• Store the data in a field and constantly re-index the document with the updated field value
• Atomic updates? A Solr 4.0 feature
– I'll claim ignorance here; I don't know the performance impact or usage.
• Use functions/FunctionQuery + pseudo-fields
– Instead of an indexed click field, use a clk() function.
• Use a PostFilter to support filtering documents based on this volatile data
Solr + External Data Store == Sweet!
[Diagram: log-processing pipeline. Solr functions running in the Jetty container pull volatile data from EhCache (e.g. log(clk(EVENT,sequence_id))), while a separate thread updates EhCache from Hypertable.]
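A hedged sketch of that lookup: a plain ConcurrentHashMap stands in for EhCache, and all names are illustrative rather than the actual Zvents code. The +1 inside the log is an added guard against log(0):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ClickFunction {
    // Stand-in for EhCache; a background thread would keep this
    // refreshed from Hypertable with the latest click counts.
    static final Map<String, Long> clickCache = new ConcurrentHashMap<>();

    // clk(docType, id): current click count for a document, 0 if unknown.
    static double clk(String docType, long id) {
        return clickCache.getOrDefault(docType + ":" + id, 0L);
    }

    // log(clk(...)) boost as in the diagram; +1 guards against log(0).
    static double clickBoost(String docType, long id) {
        return Math.log(clk(docType, id) + 1);
    }
}
```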
Filtering events based on ticket availability
Example: &fq={!ticket_filter idField=id}
[Diagram: a ticket availability publisher pushes ticket information via AMQP into an EHCache running inside the Jetty process; the cache stores {event_id => ticket_count}. For each candidate document id, the Solr PostFilter 1) fetches the ticket information and 2) filters out the document if ticket_count == 0.]
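The PostFilter's decision logic can be sketched in isolation; here a plain map stands in for the AMQP-fed EHCache, and the names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TicketFilter {
    // Stand-in for the EHCache of {event_id => ticket_count} that the
    // AMQP consumer keeps up to date.
    static final Map<Long, Integer> ticketCounts = new ConcurrentHashMap<>();

    // Keep only events that still have tickets; unknown ids are treated
    // as sold out, mirroring the "filter out if ticket_count == 0" rule.
    static List<Long> filter(List<Long> eventIds) {
        List<Long> kept = new ArrayList<>();
        for (long id : eventIds) {
            if (ticketCounts.getOrDefault(id, 0) > 0) {
                kept.add(id);
            }
        }
        return kept;
    }
}
```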
Development and Deployment
Production Environment
• Java 1.7
• Quad Core 2.8 GHz
• 10 GB RAM
– 8GB dedicated to JVM heap.
• All provisioned as VMs on VMWare ESX servers.
– Significantly simplifies cluster growth. Simply add servers and go!
• 10 Slaves, 2 Masters
– From a configuration standpoint, masters == slaves, except masters have a 4GB JVM heap instead of 8GB.
Solr Project Configuration
• Maven based – Treat Solr as a dependency, *not* as the application.
• Other dependencies are specified in the POM and bundled into the war during the assembly phase.
• Build a tarball that is pushed to Nexus
– The tarball contains configuration scripts, the Jetty jar, etc.
• Jetty is bundled with the app for all-in-one deployment.
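A minimal sketch of what "Solr as a dependency" might look like in the POM; the coordinates and version shown are assumptions, not the actual Zvents configuration:

```xml
<!-- Illustrative only: coordinates/version are assumptions -->
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr</artifactId>
  <version>4.0.0</version>
  <type>war</type>
</dependency>
```

Pulling in the war as a dependency (rather than deploying Solr itself) is what lets the custom components, Jetty, and configuration ship as one self-contained artifact.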
Advantages of using Maven
• Solr version upgrades are as simple as bumping the dependency version in pom.xml.
– Of course, run tests before deploying!
• All dependencies are managed by pom.xml and bundled into the deployment artifact
– No management of the classpath via solrconfig.xml
• Take advantage of standard release management practices. Everything self contained.
Deployment via Capistrano
• Capistrano: a framework/utility for executing commands in parallel via SSH on multiple servers (https://github.com/capistrano/capistrano)
• capistrano-nexus gem: a Zvents-built gem to deploy a tarball hosted on a Nexus server out to staging/production.
Examples
• Staging/Development deploy:
– mvn deploy
– RELEASE="2.10-SNAPSHOT" cap staging deploy
• Production deploy:
– mvn release:prepare
– mvn release:perform
– RELEASE="2.10" cap production deploy
Monitoring- NewRelic
Monitoring- NewRelic (cont’d)
CONTACT
Amit Nithianandan
Anithian-at-gmail.com