Solr @ Zvents: 6 years later Amit Nithianandan Lead Engineer Search and Analytics

Solr at zvents 6 years later & still going strong


DESCRIPTION

Presented by Amit Nithianandan, Lead Engineer Search/Analytics New Platforms, Zvents/Stubhub. Zvents has been a user of Apache Solr since 2007, when the project was still very young. Since then, the team has made extensive use of its features and most recently completed an overhaul of the search engine to Solr 4.0. We’ll touch on a variety of development/operational topics, including how we manage the build lifecycle of the search application using Maven, release the deployment package using Capistrano, and monitor using NewRelic, as well as the extensive use of virtual machines to simplify node management. We’ll also talk about application-level details such as our unique federated search product and the integration of technologies such as Hypertable, RabbitMQ, and EHCache to power more real-time ranking and filtering based on traffic statistics and ticket inventory.


Page 1: Solr at zvents   6 years later & still going strong

Solr @ Zvents: 6 years later

Amit Nithianandan Lead Engineer – Search and Analytics


About Me


My “Street Cred”

• Joined Zvents in Aug 2008 as a member of the search engineering team.

– Knew nothing about Solr/Lucene (Lucene... isn’t that a misspelling of the Safeway brand milk?)

• Worked on small features early on

– New ranking configuration for the “hot tickets” module on the site.

• Worked on larger initiatives

– Multiple rewrites of the federated search component

– Recent upgrade to Solr 4.0

• Contribute to the community

– Authored a few articles/blog posts, most notably on running Solr in Eclipse.

– Wrote a Chrome extension for easily editing long (Solr) API URLs


Overview

• About Zvents

• Why Solr?

• Search @ Zvents Details

• Federated Search discussion

• Integration with external data stores

• Development/Deployment

• Operations/Performance Details


About Zvents

• Helps people find fun things to do since 2005!

• Content sourced from a variety of places:

– Normal end users

– Internal content editors

– External content editors @ local newspapers

– Feeds

• Powers the events guide section of hundreds of local newspaper sites around the nation.


Technologies used (including but not limited to).


Why Solr?

• Flexible, Powerful, Customizable

• RESTful query API

• Scales reasonably well without hassle.

• Fast and easy to get started given the samples.

• Strong and active community

– Mailing list amazing. Conferences and meetups help too


Zvents Search at a quick glance…

• 2 masters/10 slaves, not sharded.

• Solr 4.x running on Jetty

• Six cores

– Five host actual data; the sixth is used for federated search

• Federated search among eight different document types (e.g. venues, restaurants, movies…)

• Total number of documents: ~5M


Search Challenges

• We allow blank text (“what”) searches so people can look for stuff based on date and location.

• How to surface the most relevant things to do?


Document Design Notes

• Venues, Artists are as you would expect

• For movies, index each showtime; pk = {theater_id, movie_id, time} triple.

– When searching, filter by location, collapse on movie_id, and sort by time asc.

• For events, index each occurrence (time).

– When searching, collapse on sequence_id and sort by time asc to show the next upcoming occurrence.

• Avoid showing visual “duplicates”
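The collapse-and-sort idea for showtimes can be sketched roughly as below. This is an illustration only, not Solr's grouping implementation: the `Showtime` record and `collapse_by_movie` helper are hypothetical names, the location filter is left out, and dropping showtimes earlier than `now` is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Showtime:
    theater_id: int
    movie_id: int
    time: int  # yyyymmddHHMM as an integer, e.g. 201212191930


def collapse_by_movie(showtimes, now):
    """Keep the earliest upcoming showtime per movie_id, sorted by time ascending."""
    earliest = {}
    for s in showtimes:
        if s.time < now:
            continue  # assumption: past showtimes are never shown
        best = earliest.get(s.movie_id)
        if best is None or s.time < best.time:
            earliest[s.movie_id] = s
    return sorted(earliest.values(), key=lambda s: s.time)


docs = [
    Showtime(1, 100, 201212191930),
    Showtime(1, 100, 201212192200),
    Showtime(2, 100, 201212191800),
    Showtime(2, 200, 201212192100),
    Showtime(1, 200, 201212181900),  # already in the past
]
results = collapse_by_movie(docs, now=201212190000)
```

Each movie appears once in `results`, represented by its soonest showtime, which is what avoids the visual "duplicates".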


Request Flow


Zvents Search Service API

• Essentially Solr API with a few changes.

• ServletFilter and custom QueryComponent used to translate URL parameters to proper Solr parameter “syntax”

– E.g. latitude/longitude/radius converted to geospatial query and distance in km.

• Federated search executed using ThreadPool

– Parallel searches, results blended together.
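A minimal sketch of the parameter translation step, written in Python rather than as a ServletFilter. It rewrites latitude/longitude/radius into a Solr `geofilt` filter query. The field name `location`, the helper name `translate_geo_params`, and the assumption that the incoming radius is in miles are all hypothetical; the talk only says the distance ends up in km.

```python
MILES_TO_KM = 1.609344


def translate_geo_params(params, sfield="location"):
    """Return a copy of params with latitude/longitude/radius replaced by a geofilt fq."""
    geo_keys = ("latitude", "longitude", "radius")
    out = {k: v for k, v in params.items() if k not in geo_keys}
    if all(k in params for k in geo_keys):
        # Assumption: the public API takes miles, Solr's geofilt takes km.
        km = float(params["radius"]) * MILES_TO_KM
        fq = "{!geofilt sfield=%s pt=%s,%s d=%.2f}" % (
            sfield, params["latitude"], params["longitude"], km)
        out["fq"] = list(params.get("fq", [])) + [fq]
    return out


translated = translate_geo_params({
    "q": "the fillmore",
    "latitude": "37.7752",
    "longitude": "-122.419",
    "radius": "75.0",
})
```

Keeping the translation in one place means callers never need to know Solr's local-params syntax.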


Sample Query

http://localhost:8983/map_prod/select?qt=zvents&trim=1&start=0&start_spn=0&rows_spn=6&rows=10&zsort=0&rcity=San+Francisco&latitude=37.7752&longitude=-122.419&radius=75.0&category=event,event_spn,venue&sd=201212190000&fq:event:=has_city:true&fq:event_spn:=has_city:true&wt=ruby&q=the%20fillmore&fl=id,name,score&facet=true&indent=on

Category specific fq parameter

Lat/Long Params

Collapse results (grouping)


Sample Response (abbr)

{
  'organic'=>{
    'response'=>{'numFound'=>379,'start'=>0,'maxScore'=>75.485054,'docs'=>[…]},
    'facet_counts'=>{
      'facet_fields'=>{
        'category'=>{
          'event'=>67,
          'venue'=>312}}}},
  'sponsored'=>{
    'response'=>{'numFound'=>0,'start'=>0,'maxScore'=>0.0,'docs'=>[…]},
    'facet_counts'=>{
      'facet_fields'=>{
        'category'=>{
          'event_spn'=>0}}}}
}


Federated Search

Federated Search (notice movies + events mixed)


Federated Search (cont’d)

• Zvents federator component executes multiple concurrent searches and blends the results.

• Raw scores are not comparable across products, so they must be normalized before results can be blended.

• Dividing by the max to yield a 0-1 scale throws out differences in the score distributions.

• We chose to use the z-score: (score – avg)/stddev.

• Getting stats like average and standard deviation on the results is not trivial.

• Initially thought to hack the handler to plug in my own collector/scorer.
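To see why dividing by the max loses information, here is a small Python comparison of the two normalizations on made-up scores for two products. The score lists are invented for illustration, and using the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def max_normalize(scores):
    """Scale scores to 0-1 by dividing by the top score."""
    top = max(scores)
    return [s / top for s in scores]


def z_normalize(scores):
    """Express each score as standard deviations from its product's mean."""
    avg, sd = mean(scores), pstdev(scores)
    if sd == 0:
        return [0.0 for _ in scores]
    return [(s - avg) / sd for s in scores]


# Hypothetical raw scores: venues are tightly clustered, events have one standout.
venue_scores = [75.0, 74.0, 73.0, 72.0]
event_scores = [9.0, 2.0, 1.5, 1.0]

venue_top_max = max_normalize(venue_scores)[0]
event_top_max = max_normalize(event_scores)[0]
venue_top_z = z_normalize(venue_scores)[0]
event_top_z = z_normalize(event_scores)[0]
```

With max normalization both top hits land on exactly 1.0, so nothing distinguishes them. With the z-score, the standout event ranks above the top venue because it is further ahead of its own peers.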


PostFilter to the rescue!

• PostFilters allow you to (as the name suggests) execute filtering logic *after* the main query and all other filters have executed.

• Lucene filters + main query execute in parallel in a leap-frog manner. Some filters (e.g. filter by distance to user) are expensive to generate up front for all documents.

• You can create a delegate Collector to optionally call “super.collect()” if some condition is true.

• Since I am now effectively at the lowest level of Lucene (Collector/Scorer), I can store distribution information about the scores as they pass through the collector and custom scorer!
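The delegating-collector idea can be sketched in Python as below. This is an analogy, not the Solr/Lucene API: `ListCollector`, `StatsDelegatingCollector`, and the `accept` callback are hypothetical names, and recording statistics only for hits that pass the condition is an assumption.

```python
import math


class ListCollector:
    """Stand-in for the downstream collector that gathers the final hits."""

    def __init__(self):
        self.hits = []

    def collect(self, doc_id, score):
        self.hits.append((doc_id, score))


class StatsDelegatingCollector:
    """Forwards a hit only if accept(doc_id) is true, tracking score statistics."""

    def __init__(self, delegate, accept):
        self.delegate = delegate
        self.accept = accept
        self.num_docs = 0
        self.sum_scores = 0.0
        self.sum_squared = 0.0

    def collect(self, doc_id, score):
        if not self.accept(doc_id):
            return  # filtered out: never reaches the delegate
        self.num_docs += 1
        self.sum_scores += score
        self.sum_squared += score * score
        self.delegate.collect(doc_id, score)

    def stats(self):
        if self.num_docs == 0:
            return {"numDocs": 0}
        avg = self.sum_scores / self.num_docs
        variance = max(self.sum_squared / self.num_docs - avg * avg, 0.0)
        return {"numDocs": self.num_docs, "avg": avg, "stdDev": math.sqrt(variance)}


sink = ListCollector()
collector = StatsDelegatingCollector(sink, accept=lambda doc_id: doc_id != 3)
for doc_id, score in [(1, 2.0), (2, 4.0), (3, 100.0), (4, 6.0)]:
    collector.collect(doc_id, score)
summary = collector.stats()
```

Only running sums are kept, so the statistics cost a few additions per hit and no extra pass over the results.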


Example Result Snippet

<lst name="score_stats">

<float name="min">1.3786081E-6</float>

<float name="max">10.416486</float>

<float name="avg">1.8479956</float>

<float name="stdDev">1.544854</float>

<long name="numDocs">561</long>

<float name="sumSquaredScores">3254.7324</float>

<float name="sumScores">1036.7256</float>

</lst>
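The avg and stdDev values in this snippet can be reproduced from numDocs, sumScores, and sumSquaredScores. The short Python check below does so, assuming the population form of the standard deviation, which matches the reported numbers. The `z_score` helper is a hypothetical name.

```python
import math

# Values copied from the score_stats snippet.
num_docs = 561
sum_scores = 1036.7256
sum_squared_scores = 3254.7324

avg = sum_scores / num_docs
variance = sum_squared_scores / num_docs - avg * avg
std_dev = math.sqrt(variance)


def z_score(score):
    """Normalize a raw score from this result set."""
    return (score - avg) / std_dev


top_hit_z = z_score(10.416486)  # the reported max
```

The top hit in this result set sits roughly 5.5 standard deviations above the mean.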


Federated Search – Victory!

• Now the federator, when executing the product-specific searches, can extract this information to produce a normalized score.

• Results from different products can be blended based on how good individual results are relative to their peers.
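Putting the pieces together, a federator along these lines might look like the Python sketch below: run the product searches concurrently in a thread pool, convert raw scores to z-scores using each product's own statistics, and sort the combined list. The canned `FAKE_RESPONSES` data and the `search_product` function are hypothetical stand-ins for real Solr requests.

```python
from concurrent.futures import ThreadPoolExecutor

# Canned per-product results (hypothetical). Each entry carries hits plus the
# avg/stdDev that a score_stats block would report for that product.
FAKE_RESPONSES = {
    "event": {"hits": [("e1", 9.0), ("e2", 2.0)], "avg": 3.0, "stdDev": 2.0},
    "movie": {"hits": [("m1", 1.0), ("m2", 0.75)], "avg": 0.5, "stdDev": 0.25},
}


def search_product(product, query):
    """Hypothetical product-specific search; real code would query a Solr core."""
    return product, FAKE_RESPONSES[product]


def federated_search(query, products, limit=10):
    # Run the product searches concurrently.
    with ThreadPoolExecutor(max_workers=len(products)) as pool:
        futures = [pool.submit(search_product, p, query) for p in products]
        responses = dict(f.result() for f in futures)

    # Convert each raw score to a z-score relative to its own product.
    blended = []
    for product, resp in responses.items():
        for doc_id, score in resp["hits"]:
            sd = resp["stdDev"]
            z = (score - resp["avg"]) / sd if sd > 0 else 0.0
            blended.append((z, product, doc_id))

    # Best results first, regardless of product.
    blended.sort(key=lambda t: t[0], reverse=True)
    return [(product, doc_id, z) for z, product, doc_id in blended[:limit]]


results = federated_search("the fillmore", ["event", "movie"])
```

Here a movie with a raw score of 1.0 outranks an event with a raw score of 2.0, because the movie is further above its own product's average.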


Ranking/Filtering using (highly) volatile data…

• Store data in field, re-index document constantly with updated field value

• Atomic updates? Solr 4.0 feature

– Claiming ignorance here: I don’t know the performance impact or usage.

• Use functions/FunctionQuery + pseudo-fields

– Instead of indexed click field, use clk() function.

• Use PostFilter to support filtering of documents based on this volatile data


Solr + External Data Store == Sweet!

[Diagram: Log Processing, Hypertable, EhCache, Jetty Container]

• Solr functions pull volatile data from EhCache.

– Example: log(clk(EVENT,sequence_id))

• Separate thread updates EhCache from Hypertable.
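The pattern in this diagram, a background thread refreshing an in-memory cache that a scoring function reads from, can be sketched in Python as below. `ClickCache`, `refresh_loop`, `fetch_counts`, and `boost` are hypothetical names. The cache is a plain dict standing in for EhCache, and `fetch_counts` stands in for a Hypertable read. Solr's built-in `log` function is base 10, so `log10` is used; treating zero clicks as a boost of 0 is an assumption.

```python
import math
import threading


class ClickCache:
    """In-memory stand-in for the cache; keys are (doc_type, sequence_id)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    def replace_all(self, counts):
        with self._lock:
            self._counts = dict(counts)

    def clk(self, doc_type, sequence_id):
        with self._lock:
            return self._counts.get((doc_type, sequence_id), 0)


def refresh_loop(cache, fetch_counts, interval_s, stop):
    """Reload the cache from the external store until stop is set."""
    while True:
        cache.replace_all(fetch_counts())
        if stop.wait(interval_s):  # returns True once stop is set
            break


def boost(cache, doc_type, sequence_id):
    """Rough equivalent of log(clk(EVENT,sequence_id)), guarded against zero."""
    return math.log10(max(cache.clk(doc_type, sequence_id), 1))


cache = ClickCache()
stop = threading.Event()
worker = threading.Thread(
    target=refresh_loop,
    args=(cache, lambda: {("EVENT", 42): 100}, 60.0, stop),
    daemon=True,
)
worker.start()
stop.set()      # shut down after the first refresh
worker.join()
```

Because ranking reads only from memory, click counts can change constantly without re-indexing any documents.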


Filtering events based on ticket availability

Example: &fq={!ticket_filter idField=id}

[Diagram: Ticket availability publisher, EHCache, Jetty, Solr PostFilter applied to a column of document ids including 1245 and 5678]

• Ticket availability publisher publishes ticket information via AMQP.

• Cache stores: {Event_id=>ticket_count}

• Solr PostFilter:

1) Fetch ticket information.

2) Filter out document if ticket_count == 0
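A minimal Python sketch of the two filter steps, under stated assumptions: `TicketCache` is a dict standing in for EHCache, `on_message` is a hypothetical handler for the published ticket messages (the message shape is invented), and letting events with no cache entry pass through is an assumption rather than something the slide states.

```python
class TicketCache:
    """Stand-in for the cache holding {event_id: ticket_count}."""

    def __init__(self):
        self._counts = {}

    def on_message(self, message):
        """Apply one ticket availability message (hypothetical dict shape)."""
        self._counts[message["event_id"]] = message["ticket_count"]

    def ticket_count(self, event_id):
        return self._counts.get(event_id)  # None when nothing is known


def ticket_filter(doc_ids, cache):
    """Yield document ids, skipping those known to have zero tickets."""
    for doc_id in doc_ids:
        # 1) Fetch ticket information.
        count = cache.ticket_count(doc_id)
        # 2) Filter out the document if ticket_count == 0.
        if count == 0:
            continue
        yield doc_id


cache = TicketCache()
cache.on_message({"event_id": 1245, "ticket_count": 3})
cache.on_message({"event_id": 5678, "ticket_count": 0})

visible = list(ticket_filter([1245, 5678, 9999], cache))
```

Running the check as a post filter means the cache lookup happens only for documents that already matched the query and the cheaper filters.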


Development and Deployment


Production Environment

• Java 1.7

• Quad Core 2.8 GHz

• 10 GB RAM

– 8GB dedicated to JVM heap.

• All provisioned as VMs on VMWare ESX servers.

– Significantly simplifies cluster growth. Simply add servers and go!

• 10 Slaves, 2 Masters

– From a configuration standpoint, masters == slaves, except masters have a 4GB JVM heap instead of 8GB.


Solr Project Configuration

• Maven based

– Treat Solr as a dependency, *not* as an application.

• Other dependencies specified in the POM, bundled into the war during the assembly phase.

• Build a tarball that is pushed to Nexus

– Tarball contains configuration scripts + Jetty jar etc.

• Bundle Jetty with the app for all-in-one deployment.


Advantages of using Maven

• Solr version upgrades are as simple as increasing the dependency version in pom.xml.

– Of course, run tests before deploying!

• All dependencies managed by pom.xml and bundled into the deployment artifact

– No management of classpath via solrconfig.xml

• Take advantage of standard release management practices. Everything is self-contained.


Deployment via Capistrano

• Capistrano: framework/utility for executing commands in parallel via SSH on multiple servers (https://github.com/capistrano/capistrano)

• Capistrano-Nexus gem: Zvents-built gem to deploy a tarball hosted on a Nexus server out to staging/production.


Examples

• Staging/Development Deploy:

– mvn deploy

– RELEASE="2.10-SNAPSHOT" cap staging deploy

• Production Deploy:

– mvn release:prepare

– mvn release:perform

– RELEASE="2.10" cap production deploy


Monitoring- NewRelic


Monitoring- NewRelic (cont’d)


CONTACT

Amit Nithianandan

Anithian-at-gmail.com