lucenerevolution
DESCRIPTION
Presented by Amit Nithianandan, Lead Engineer Search/Analytics New Platforms, Zvents/Stubhub. Zvents has been a user of Apache Solr since 2007, when the project was still very young. Since then, the team has made extensive use of its various features and most recently completed an overhaul of the search engine to Solr 4.0. We'll touch on a variety of development/operational topics, including how we manage the build lifecycle of the search application using Maven, release the deployment package using Capistrano, and monitor using NewRelic, as well as the extensive use of virtual machines to simplify node management. We'll also talk about application-level details such as our unique federated search product, and the integration of technologies such as Hypertable, RabbitMQ, and EHCache to power more real-time ranking and filtering based on traffic statistics and ticket inventory.
Solr @ Zvents: 6 years later
Amit Nithianandan Lead Engineer – Search and Analytics
About Me
My “Street Cred”
• Joined Zvents in Aug 2008 as a member of the search engineering team. – Knew nothing about Solr/Lucene (Lucene… isn't that a misspelling of the Safeway-brand milk?)
• Worked on small features early on – New ranking configuration for the “hot tickets” module on the site.
• Worked on larger initiatives – Multiple re-writes of the federated search component – Recent upgrade to Solr 4.0
• Contribute to the community – Authored a few articles/blog posts, most notably on running Solr in Eclipse. – Wrote a Chrome extension for easily editing long (Solr) API URLs
Overview
• About Zvents
• Why Solr?
• Search @ Zvents Details
• Federated Search discussion
• Integration with external data stores
• Development/Deployment
• Operations/Performance Details
About Zvents
• Helping people find fun things to do since 2005!
• Content sourced from a variety of places:
– Normal end users
– Internal content editors
– External content editors @ local newspapers
– Feeds
• Powers the events guide section of hundreds of local newspaper sites around the nation.
Technologies used (including but not limited to).
Why Solr?
• Flexible, Powerful, Customizable
• RESTful query API
• Scales reasonably well without hassle.
• Fast and easy to get started given the samples.
• Strong and active community
– The mailing list is amazing; conferences and meetups help too
Zvents Search at a quick glance…
• 2 masters / 10 slaves; not sharded
• Solr 4.x running on Jetty
• Six cores – five host actual data, the sixth is used for federated search
• Federated search across eight different document types (e.g. venues, restaurants, movies…)
• Total number of documents: ~5M
• We allow blank text (“what”) searches so people can look for things based on date and location alone
• How to surface the most relevant things to do?
Search Challenges
Document Design Notes
• Venues and artists are indexed as you would expect.
• For movies, index each showtime: pk = {theater_id, movie_id, time} triple.
– When searching, filter by location, collapse on movie_id, sort by time asc.
• For events, index each occurrence (time).
– When searching, collapse on sequence_id and sort by time asc to show the soonest upcoming occurrence.
• Avoid showing visual “duplicates”
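The collapsing described above maps onto Solr's standard result grouping parameters; a hedged sketch for the movie case (field names taken from the slide):

```
&group=true&group.field=movie_id&group.sort=time+asc&group.limit=1
```

group.limit=1 keeps only the soonest showtime per movie, which avoids the visual duplicates.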
Request Flow
Zvents Search Service API
• Essentially the Solr API with a few changes.
• A ServletFilter and a custom QueryComponent translate URL parameters into proper Solr parameter “syntax”
– E.g. latitude/longitude/radius is converted into a geospatial query with the distance in km.
• Federated search executed using ThreadPool
– Parallel searches, results blended together.
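A minimal, self-contained sketch of that fan-out, assuming a hypothetical search() method in place of the real per-core Solr requests:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class Federator {
    // Placeholder for the real per-product Solr request (hypothetical).
    static List<String> search(String product) {
        return List.of(product + "-result-1", product + "-result-2");
    }

    // Fan out one search per product on a thread pool, then merge.
    public static List<String> federate(List<String> products) {
        ExecutorService pool = Executors.newFixedThreadPool(products.size());
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String p : products) {
                futures.add(pool.submit(() -> search(p)));
            }
            List<String> blended = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                try {
                    blended.addAll(f.get());
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
            }
            return blended; // the real blender interleaves by normalized score
        } finally {
            pool.shutdown();
        }
    }
}
```

The real component blends by normalized score rather than simply concatenating; the concatenation above just shows the parallel execution shape.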
Sample Query
http://localhost:8983/map_prod/select?qt=zvents&trim=1&start=0&start_spn=0&rows_spn=6&rows=10&zsort=0&rcity=San+Francisco&latitude=37.7752&longitude=-122.419&radius=75.0&category=event,event_spn,venue&sd=201212190000&fq:event:=has_city:true&fq:event_spn:=has_city:true&wt=ruby&q=the%20fillmore&fl=id,name,score&facet=true&indent=on
Category specific fq parameter
Lat/Long Params
Collapse results (grouping)
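As an illustration of the ServletFilter translation mentioned earlier, a hedged sketch of rewriting the latitude/longitude/radius parameters above into Solr geofilt syntax; the sfield name "latlon" is an assumption, and d is the distance in km:

```java
import java.util.Locale;

public class GeoParamTranslator {
    // Rewrite service-level lat/long/radius params into a geofilt clause.
    // "latlon" is an assumed field name, not the actual Zvents schema.
    public static String toGeoFilter(double lat, double lon, double radiusKm) {
        return String.format(Locale.ROOT,
                "{!geofilt pt=%s,%s sfield=latlon d=%s}", lat, lon, radiusKm);
    }
}
```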
Sample Response (abbr)
{
  'organic' => {
    'response' => {'numFound'=>379, 'start'=>0, 'maxScore'=>75.485054, 'docs'=>[…]},
    'facet_counts' => {
      'facet_fields' => {
        'category' => {
          'event' => 67,
          'venue' => 312}}}},
  'sponsored' => {
    'response' => {'numFound'=>0, 'start'=>0, 'maxScore'=>0.0, 'docs'=>[…]},
    'facet_counts' => {
      'facet_fields' => {
        'category' => {
          'event_spn' => 0}}}}
}
Federated Search
Federated Search (notice movies + events mixed)
Federated Search (cont’d)
• The Zvents federator component executes multiple concurrent searches and blends the results.
• Raw scores are meaningless across products, so they must be normalized into something comparable.
• Dividing by the max to yield a 0-1 scale throws away differences in the score distributions.
• We chose the Z-score: (score - avg) / stddev.
• Getting statistics like the average and standard deviation of the results is not trivial.
• Initially thought to hack the handler to plug in my own Collector/Scorer.
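A self-contained sketch of the Z-score normalization described above, using the population standard deviation computed from running sums:

```java
public class ZScore {
    // Normalize scores to (score - avg) / stddev so different products
    // become comparable. Uses the population standard deviation.
    public static double[] normalize(double[] scores) {
        int n = scores.length;
        double sum = 0, sumSq = 0;
        for (double s : scores) {
            sum += s;
            sumSq += s * s;
        }
        double avg = sum / n;
        double stdDev = Math.sqrt(sumSq / n - avg * avg);
        double[] z = new double[n];
        for (int i = 0; i < n; i++) {
            z[i] = stdDev == 0 ? 0 : (scores[i] - avg) / stdDev;
        }
        return z;
    }
}
```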
PostFilter to the rescue!
• PostFilters allow you to (as the name suggests) execute filtering logic *after* the main query and all other filters have executed.
• Lucene filters and the main query execute together in a leap-frog manner. Some filters (e.g. filtering by distance to the user) are expensive to generate up front for all documents.
• You can create a delegating Collector that optionally calls “super.collect()” when some condition is true.
• Since I'm now effectively at the lowest level of Lucene (Collector/Scorer), I can store distribution information about the scores as they pass through the custom collector and scorer!
Example Result Snippet
<lst name="score_stats">
<float name="min">1.3786081E-6</float>
<float name="max">10.416486</float>
<float name="avg">1.8479956</float>
<float name="stdDev">1.544854</float>
<long name="numDocs">561</long>
<float name="sumSquaredScores">3254.7324</float>
<float name="sumScores">1036.7256</float>
</lst>
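The avg and stdDev in the snippet above can be cross-checked from numDocs, sumScores, and sumSquaredScores with the population formula stdDev = sqrt(sumSq/n - avg²); a small sketch:

```java
public class ScoreStats {
    // avg = sumScores / numDocs
    public static double avg(double sumScores, long numDocs) {
        return sumScores / numDocs;
    }

    // Population standard deviation from the two running sums the
    // collector accumulates, so no second pass over scores is needed.
    public static double stdDev(double sumSquaredScores, double sumScores, long numDocs) {
        double a = avg(sumScores, numDocs);
        return Math.sqrt(sumSquaredScores / numDocs - a * a);
    }
}
```

Plugging in the snippet's values (numDocs=561, sumScores=1036.7256, sumSquaredScores=3254.7324) reproduces avg≈1.848 and stdDev≈1.545.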
Federated Search – Victory!
• Now the federator, when executing the product-specific searches, can extract this information to produce a “normal” score.
• Results from different products can be blended based on how good individual results are relative to their peers.
Ranking/Filtering using (highly) volatile data…
• Store the data in a field and constantly re-index the document with the updated field value
• Atomic updates? A Solr 4.0 feature
– I'll claim ignorance here; I don't know the performance impact or usage.
• Use functions/FunctionQuery + pseudo-fields
– Instead of an indexed click field, use a clk() function.
• Use a PostFilter to support filtering documents based on this volatile data
Solr + External Data Store == Sweet!
[Diagram: log-processing pipeline. Solr functions running in the Jetty container pull volatile data from EhCache (e.g. log(clk(EVENT,sequence_id))), while a separate thread updates EhCache from Hypertable.]
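A hedged sketch of that lookup: a plain ConcurrentHashMap stands in for EhCache, and all names are illustrative rather than the actual Zvents code. The +1 inside the log is an added guard against log(0):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ClickFunction {
    // Stand-in for EhCache; a background thread would keep this
    // refreshed from Hypertable with the latest click counts.
    static final Map<String, Long> clickCache = new ConcurrentHashMap<>();

    // clk(docType, id): current click count for a document, 0 if unknown.
    static double clk(String docType, long id) {
        return clickCache.getOrDefault(docType + ":" + id, 0L);
    }

    // log(clk(...)) boost as in the diagram; +1 guards against log(0).
    static double clickBoost(String docType, long id) {
        return Math.log(clk(docType, id) + 1);
    }
}
```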
Filtering events based on ticket availability
Example: &fq={!ticket_filter idField=id}
[Diagram: a ticket availability publisher pushes ticket information via AMQP into an EHCache running inside the Jetty process; the cache stores {event_id => ticket_count}. For each candidate document id, the Solr PostFilter 1) fetches the ticket information and 2) filters out the document if ticket_count == 0.]
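The PostFilter's decision logic can be sketched in isolation; here a plain map stands in for the AMQP-fed EHCache, and the names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TicketFilter {
    // Stand-in for the EHCache of {event_id => ticket_count} that the
    // AMQP consumer keeps up to date.
    static final Map<Long, Integer> ticketCounts = new ConcurrentHashMap<>();

    // Keep only events that still have tickets; unknown ids are treated
    // as sold out, mirroring the "filter out if ticket_count == 0" rule.
    static List<Long> filter(List<Long> eventIds) {
        List<Long> kept = new ArrayList<>();
        for (long id : eventIds) {
            if (ticketCounts.getOrDefault(id, 0) > 0) {
                kept.add(id);
            }
        }
        return kept;
    }
}
```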
Development and Deployment
Production Environment
• Java 1.7
• Quad Core 2.8 GHz
• 10 GB RAM
– 8GB dedicated to JVM heap.
• All provisioned as VMs on VMWare ESX servers.
– Significantly simplifies cluster growth. Simply add servers and go!
• 10 Slaves, 2 Masters
– From a configuration standpoint, masters == slaves, except masters have a 4GB JVM heap instead of 8GB.
Solr Project Configuration
• Maven based – Treat Solr as a dependency, *not* as the application.
• Other dependencies are specified in the POM and bundled into the war during the assembly phase.
• Build a tarball that is pushed to Nexus
– The tarball contains configuration scripts, the Jetty jar, etc.
• Jetty is bundled with the app for all-in-one deployment.
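A minimal sketch of what "Solr as a dependency" might look like in the POM; the coordinates and version shown are assumptions, not the actual Zvents configuration:

```xml
<!-- Illustrative only: coordinates/version are assumptions -->
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr</artifactId>
  <version>4.0.0</version>
  <type>war</type>
</dependency>
```

Pulling in the war as a dependency (rather than deploying Solr itself) is what lets the custom components, Jetty, and configuration ship as one self-contained artifact.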
Advantages of using Maven
• Solr version upgrades are as simple as bumping the dependency version in pom.xml.
– Of course, run tests before deploying!
• All dependencies are managed by pom.xml and bundled into the deployment artifact
– No management of the classpath via solrconfig.xml
• Take advantage of standard release management practices. Everything self contained.
Deployment via Capistrano
• Capistrano: a framework/utility for executing commands in parallel via SSH on multiple servers (https://github.com/capistrano/capistrano)
• capistrano-nexus gem: a Zvents-built gem to deploy a tarball hosted on a Nexus server out to staging/production.
Examples
• Staging/Development deploy:
– mvn deploy
– RELEASE="2.10-SNAPSHOT" cap staging deploy
• Production deploy:
– mvn release:prepare
– mvn release:perform
– RELEASE="2.10" cap production deploy
Monitoring- NewRelic
Monitoring- NewRelic (cont’d)
CONTACT
Amit Nithianandan
Anithian-at-gmail.com