From a November 2010 training session with CMG Digital.
Getting The Most Out Of Haystack
Daniel Lindsley
Pragmatic Badger, LLC
Tuesday, December 28, 2010
Terminology

“Engine”
• The actual search engine
• Here be interesting computer science problems
• Examples: Solr, Xapian, Whoosh

“Document”
• A single record in the index
• Usually accompanied by 1+ fields of metadata
• Heavily processed

“Corpus”
• The collection of indexed documents
• Latin for “body”

“Stemming”
• Find the root of the word
• Part of the “magic” of search
• More on this later...

“Relevance”
• A metric of how well a document matches the query
• Search’s killer feature
• Hard to get 100% right

“Faceting”
• Count of docs meeting certain criteria within your result set
• Drill down!
• Think Amazon/eBay
• More on this later...
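
A rough illustration of the faceting idea, not how any particular engine implements it: count how many documents in a result set share each value of a field, then drill down on one value. The documents and field names below are made up for the example.

```python
from collections import Counter

# Hypothetical result set: each document is a flat dictionary.
results = [
    {"id": 1, "title": "Django tips", "author": "daniel"},
    {"id": 2, "title": "Solr tuning", "author": "sally"},
    {"id": 3, "title": "Haystack intro", "author": "daniel"},
]


def facet_counts(docs, field):
    """Count how many documents carry each value of `field`."""
    return Counter(doc[field] for doc in docs if field in doc)


def drill_down(docs, field, value):
    """Narrow the result set to documents whose `field` equals `value`."""
    return [doc for doc in docs if doc.get(field) == value]


author_facets = facet_counts(results, "author")
narrowed = drill_down(results, "author", "daniel")
```

The counts are what a sidebar would show ("daniel (2)", "sally (1)"); clicking one applies the drill-down.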

“Boost”
• A way to artificially increase the relevance of a document
• Types: Document/Field/Term

Introduction to Search

Search != RDBMS
• The sooner you get over that, the easier everything that follows will be.
• Think “document store”.

Stemming
• Porter-Stemmer or Snowball
• The engine takes terms & hacks them down to the root word.
• Examples:
“testing” → “test”
“searchers” → “searcher” → “search”
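
To make the "hack it down to the root" idea concrete, here is a deliberately naive suffix stripper. It is NOT the Porter or Snowball algorithm (those use much richer rules); the suffix list and minimum stem length are assumptions picked only so the two examples above work.

```python
# Toy suffix list, tried in order. Real stemmers are far more careful.
SUFFIXES = ("ing", "er", "s")
MIN_STEM_LENGTH = 3  # assumed floor so short words are left alone


def strip_one_suffix(word):
    """Remove at most one known suffix from the end of `word`."""
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word


def stem_steps(word):
    """Return every intermediate form, ending at the toy 'root'."""
    steps = [word.lower()]
    while True:
        shorter = strip_one_suffix(steps[-1])
        if shorter == steps[-1]:
            return steps
        steps.append(shorter)
```

Calling `stem_steps("searchers")` walks through the same chain as the slide. A word like "searches" would come out wrong here, which is exactly why engines ship real stemmers.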

Inverted Index
• The power of the engine starts here
• Basically a reverse mapping from the stemmed form of a term to the collection of documents containing that term
...
“search”: [3, 104, 238],
...

Inverted Index
• Very fast lookups
• NOT a “contains” or “like” lookup unless you say so (slower)
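
A minimal sketch of an inverted index in plain Python, assuming the terms are already stemmed. The corpus is invented (the document ids echo the example above). It shows why an exact term lookup is a single dictionary hit while a "contains" lookup has to walk every term.

```python
from collections import defaultdict

# Hypothetical corpus keyed by document id; terms assumed already stemmed.
docs = {
    3: "search engine basics",
    104: "django search",
    238: "search relevance",
    7: "research notes",
}


def build_inverted_index(documents):
    """Map each term to the sorted list of document ids containing it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        for term in set(documents[doc_id].lower().split()):
            index[term].append(doc_id)
    return index


def exact_lookup(index, term):
    """One dictionary lookup: the fast path."""
    return index.get(term, [])


def contains_lookup(index, fragment):
    """Scan every term for a substring match: the slow path."""
    matches = set()
    for term, doc_ids in index.items():
        if fragment in term:
            matches.update(doc_ids)
    return sorted(matches)


index = build_inverted_index(docs)
```

Note that `exact_lookup(index, "search")` does not return document 7 ("research"), while `contains_lookup` does.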

Document Store
• Flat structure
• Generally free-form/schema-less
• Easiest to think about each record as a dictionary
• No relations built-in
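
One way to picture "each record as a dictionary" with no relations: flatten the related data into the document before it goes into the index. The tables, field names, and id format below are hypothetical.

```python
# Hypothetical relational-style data: entries point at users by id.
users = {1: {"username": "johndoe", "first_name": "John"}}
entries = [{"id": 5, "title": "Hello", "content": "First post", "user_id": 1}]


def to_document(entry, users_by_id):
    """Build one flat dictionary; the user relation is copied in, not linked."""
    user = users_by_id[entry["user_id"]]
    return {
        "id": "myapp.entry.%d" % entry["id"],
        "text": "%s\n%s" % (entry["title"], entry["content"]),
        "author": user["username"],
    }


documents = [to_document(entry, users) for entry in entries]
```

The trade-off is the usual one for denormalized data: if the username changes, the document has to be re-indexed.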

Why custom search?
...or...
“Isn’t this what Google is for?”

Why custom search?
• You control what is (and is not) indexed
• Better quality data goes into the index
• Information-specific handling
• Provide context-specific search
Introduction to Haystack

What is Haystack?
At its simplest, Haystack is an abstraction layer for integrating Django with a search engine.

Why Haystack?
• Familiar API
• Declarative
• “Looks” like Django

Why Haystack?
• Pluggable Backends
• Support Solr & Whoosh out of the box, Xapian with a third-party backend (boo GPL!)
• Your code stays the same regardless of backend.

Why Haystack?
• Advanced Features
• Faceting
• More Like This
• Highlighting
• Boost

Why Haystack?
• Integration with third-party apps
• No need to fork their code
• Put the indexes in your code & register them
• Applies to django.contrib as well.

Why Haystack?
• Real Live Documentation™!
• http://docs.haystacksearch.org/dev/
• Test Coverage!
• Decent coverage
• No new commits without tests

Enough shameless self-promotion already!

Using Haystack

Two Phase Approach
• The “Data In” is SearchIndex
• The “Data Out” is SearchQuerySet
• Note: There’s a disconnect between your database & the search index

SearchIndex

SearchIndex
• Provides the means to get data into the index
• Something of a cross between a Form (the data preparation aspects) and Model (the persistence)

SearchIndex
import datetime

from haystack import indexes, site
from myapp.models import Entry


class EntrySearchIndex(indexes.SearchIndex):
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='user__username')
    created = indexes.DateTimeField()

    def get_queryset(self):
        # Only published entries are indexed.
        return Entry.objects.published()

    def prepare_created(self, obj):
        # Fall back to the current time when there is no publication date.
        return obj.pub_date or datetime.datetime.now()


site.register(Entry, EntrySearchIndex)

`use_template=True`?
• Use Django templates to prep the data
• Example:
# search/indexes/myapp/entry_text.txt
{{ object.title }}
{{ object.user.get_full_name }}
{{ object.tease }}
{{ object.content }}

SearchQuerySet

SearchQuerySet
• The reason to use Haystack
• Very powerful
• Forget views, forms, etc. They’re all thin wrappers around SearchQuerySet

SearchQuerySet
• Fetches data from the index
• Very similar to QuerySet
• Intentional, to reduce conceptual overhead
• Lazily evaluated
• Chain methods
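
A toy model of "lazily evaluated" plus "chain methods", written in plain Python so it runs without Django or Haystack. `LazyResults` and `fake_backend` are invented names; this is a sketch of the pattern, not SearchQuerySet's actual implementation.

```python
class LazyResults:
    """Toy stand-in for a lazily evaluated, chainable result set."""

    def __init__(self, backend, filters=None):
        self._backend = backend            # callable that runs the query
        self._filters = dict(filters or {})
        self._cache = None                 # filled on first evaluation

    def filter(self, **kwargs):
        # Chaining returns a new object; nothing is executed yet.
        combined = dict(self._filters)
        combined.update(kwargs)
        return LazyResults(self._backend, combined)

    def __iter__(self):
        # The query runs once, on first iteration, then the cache is reused.
        if self._cache is None:
            self._cache = list(self._backend(self._filters))
        return iter(self._cache)


calls = []


def fake_backend(filters):
    """Hypothetical engine: logs each call and filters a tiny corpus."""
    calls.append(dict(filters))
    corpus = [
        {"author": "johndoe", "year": 2010},
        {"author": "daniel", "year": 2010},
        {"author": "johndoe", "year": 2009},
    ]
    return [d for d in corpus if all(d.get(k) == v for k, v in filters.items())]


results = LazyResults(fake_backend).filter(author="johndoe").filter(year=2010)
queries_before_iteration = len(calls)
first_pass = list(results)
second_pass = list(results)
queries_after_iteration = len(calls)
```

Building the chain costs nothing; the backend is only hit when the results are actually consumed, and only once.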

SearchQuerySet
• By default, searches across all models
• Can limit using SearchQuerySet.models
• Caches where possible

SearchQuerySet
>>> import datetime
>>> from haystack.query import SearchQuerySet
>>> from myapp.models import Entry
>>> sqs = SearchQuerySet().models(Entry)
>>> sqs = sqs.filter(created__lte=datetime.datetime.now())
>>> sqs = sqs.exclude(author='daniel')

# Lazily performs the query when asked for results.
>>> sqs
[<SearchResult: myapp.entry (pk=u'5')>, <SearchResult: myapp.entry (pk=u'3')>, <SearchResult: myapp.entry (pk=u'2')>]

# Iterable interface.
# Still hasn’t hit the DB.
>>> [result.author for result in sqs]
['johndoe', 'sally1982', 'bob_the_third']

SearchQuerySet
# Hits the database once per result.
>>> [result.object.user.first_name for result in sqs]
['John', 'Sally', 'Bob']

# More efficient loading from database (one query total).
>>> [result.object.user.first_name for result in sqs.load_all()]
['John', 'Sally', 'Bob']

SearchView

SearchView
• Class-based view
• Hits 80% of the regular usage
• A guideline to more advanced use
• Relies heavily on SearchForm

SearchForm

SearchForm
• Outside of using SearchQuerySet, it’s a standard Django form
• Defines a search method that does the necessary actions

SearchForm
from django import forms
from haystack.forms import SearchForm
from myapp.models import Entry


class EntrySearchForm(SearchForm):
    # Additional fields go here.
    author = forms.CharField(max_length=255, required=False)

    def search(self):
        sqs = super(EntrySearchForm, self).search()

        if self.cleaned_data.get('author'):
            sqs = sqs.filter(author=self.cleaned_data['author'])

        return sqs

SearchSite

SearchSite
• Registry pattern
• Collects all registered SearchIndex classes
• Used by SearchQuerySet to limit results to only things Haystack knows about
• Think django.contrib.admin.site.
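
For anyone who hasn't met the registry pattern, here is a bare-bones version in plain Python. `ToySite`, `Entry`, and `EntryIndex` are stand-ins invented for this sketch; the method names only loosely mirror the idea and should not be read as SearchSite's exact API.

```python
class AlreadyRegistered(Exception):
    """Raised when a model is registered twice."""


class ToySite:
    """Minimal registry: one index instance per registered model class."""

    def __init__(self):
        self._registry = {}

    def register(self, model, index_class):
        if model in self._registry:
            raise AlreadyRegistered("%s is already registered" % model.__name__)
        self._registry[model] = index_class()

    def get_index(self, model):
        return self._registry[model]

    def get_indexed_models(self):
        # Anything not listed here is unknown to the search layer.
        return list(self._registry)


class Entry:
    """Placeholder for a model class."""


class EntryIndex:
    """Placeholder for an index class."""


toy_site = ToySite()
toy_site.register(Entry, EntryIndex)
```

The list of registered models is what lets a query layer restrict results to "only things it knows about".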

Haystack Best Practices

Common Fields
• Try to find common fields as much as possible
• Reuse where it makes sense
• But don’t shoehorn if it doesn’t work

It’s Just Python
• When the out-of-the-box pieces don't work for you, use SearchQuerySet & write what you need.
• It's just Django & Python.

load_all
• Appropriate use of SearchQuerySet.load_all
• One hit to the DB per content type
• But do you need to hit the DB?
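
The "one hit per content type" idea can be sketched without a real database: group the result keys by type, fetch each group in bulk, then put the objects back in result order. `FAKE_DB` and `fetch_many` are hypothetical stand-ins, and this is not Haystack's own code.

```python
from collections import defaultdict

# Hypothetical "database": one table per content type, keyed by primary key.
FAKE_DB = {
    "myapp.entry": {2: "Entry #2", 3: "Entry #3", 5: "Entry #5"},
    "myapp.comment": {9: "Comment #9"},
}
query_log = []


def fetch_many(content_type, pks):
    """Stand-in for one bulk query, e.g. a single `pk IN (...)` lookup."""
    query_log.append(content_type)
    return {pk: FAKE_DB[content_type][pk] for pk in pks}


def load_all(results):
    """Resolve (content_type, pk) pairs with one fetch per content type."""
    pks_by_type = defaultdict(list)
    for content_type, pk in results:
        pks_by_type[content_type].append(pk)
    loaded = {ct: fetch_many(ct, pks) for ct, pks in pks_by_type.items()}
    # Preserve the original (relevance) ordering of the results.
    return [loaded[ct][pk] for ct, pk in results]


hits = [("myapp.entry", 5), ("myapp.comment", 9), ("myapp.entry", 3)]
objects = load_all(hits)
```

Three results, two content types, two queries. If the fields stored in the index already cover what the template needs, even those can be skipped.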

More Like This
• Cheap & very worth it
• LJWorld saw a 30% jump in traffic by adding it solely on story detail views.
• Cache it!

“Third Party” Apps
• queued_search
• https://github.com/toastdriven/queued_search
• saved_searches
• https://github.com/toastdriven/saved_searches

Other Ideas
• Admin Integration
• Integration with API
• Search “grouping”
• Vertical search

Solr Best Practices

Tomcat vs. Jetty
• Very close performance-wise
• Tomcat better when busy
• Jetty is smaller on RAM & easier to run

Tune JVM settings
• -Xms (Minimum size)
• -Xmx (Maximum size)
# Something close to...
java -Xms1G -Xmx12G -jar start.jar
• -XX:+PrintGCDetails (print GC info)
• -XX:+PrintGCTimeStamps (print GC info + timestamps)

JMX Console
• java -Dcom.sun.management.jmxremote -jar start.jar
• Then jconsole
• Find jetty in the process list.
• Lots of instrumentation

Tune solrconfig
• Proper query warming
• The default “solr rocks” doesn’t.
• Remove unused handlers (like partition)

Tune solrconfig
• Tuning the mergeFactor
• Not too high, not too low
• Big trade-off

Schema
• Use omitNorms where possible
• Norms are only needed on full-text fields
• Same goes for indexed & stored
• The fewer fields, the better

Optimize!
• Seriously.
• Goes back through existing indexes & cleans up
• Takes a while to run, so make sure your timeout is high (custom settings file)

Commits
• Commit as infrequently as is reasonable
• Commit as much as you can at once
• queued_search shines here
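
The batching idea in miniature: hold updates in memory and hand them to the engine in chunks rather than one commit per document. `BatchingIndexer` and its `commit` callback are invented for this sketch; it is not how queued_search is implemented, and the batch size is arbitrary.

```python
class BatchingIndexer:
    """Collects documents and commits them in batches."""

    def __init__(self, commit, batch_size=100):
        self._commit = commit          # callable that receives a list of docs
        self._batch_size = batch_size
        self._pending = []

    def add(self, doc):
        self._pending.append(doc)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        # Commit whatever is pending; a no-op when nothing is queued.
        if self._pending:
            self._commit(list(self._pending))
            self._pending = []


commits = []  # stands in for the search engine; each item is one commit
indexer = BatchingIndexer(commits.append, batch_size=3)
for doc_id in range(7):
    indexer.add({"id": doc_id})
indexer.flush()  # push the final partial batch
```

Seven documents end up as three commits instead of seven. The cost is that new content shows up in search a little later.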

Debugging
• Use &debugQuery=on to debug queries
• Use the browser interface!
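
A small helper for building a debug URL to paste into the browser. Only the `debugQuery=on` parameter comes from the slide; the host, port, and `/solr/select` path are assumptions about a default local install and will likely differ on yours.

```python
from urllib.parse import urlencode

# Assumed Solr location; adjust host, port, and path for your install.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def debug_query_url(query, base_url=SOLR_SELECT_URL):
    """Return a select URL with debugQuery=on appended."""
    params = [("q", query), ("debugQuery", "on")]
    return base_url + "?" + urlencode(params)


url = debug_query_url("haystack")
```

The response then carries the engine's explanation of how each document was scored, which is usually the quickest way to see why a result ranks where it does.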

Advanced Bits
• Learn & love the Solr stats page
• Replication
• n-gram based autocomplete
• Spelling suggestions
• the (Haystack) documented config sucks
• Dismax Handler
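
To show what n-gram based autocomplete boils down to, here is a plain-Python sketch using edge n-grams (prefixes), which is one common variant. In Solr this would be done by an analyzer in the schema rather than application code; the function names and size limits here are assumptions for illustration.

```python
from collections import defaultdict


def edge_ngrams(term, min_size=2, max_size=15):
    """Return the prefixes of `term` between min_size and max_size long."""
    term = term.lower()
    upper = min(len(term), max_size)
    return [term[:size] for size in range(min_size, upper + 1)]


def build_autocomplete_index(terms):
    """Index every prefix so a partial word is an exact-match lookup."""
    index = defaultdict(set)
    for term in terms:
        for gram in edge_ngrams(term):
            index[gram].add(term.lower())
    return index


def suggest(index, prefix):
    """Return the known terms starting with `prefix`, alphabetically."""
    return sorted(index.get(prefix.lower(), ()))


ac_index = build_autocomplete_index(["haystack", "hayride", "solr"])
```

The index gets bigger because every prefix is stored, but the lookup for "hay" is as cheap as any other exact term match.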

Resources
• https://gist.github.com/215331
• http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr
• http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/
• http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html
• http://wiki.apache.org/solr/SolrJmx
• http://wiki.apache.org/solr/LargeIndexes
• http://wiki.apache.org/solr/SolrPerformanceFactors
• http://wiki.apache.org/solr/SolrReplication
• http://www.yashh.com/blog/2010/nov/03/autocomplete-solr/
• http://charlesleifer.com/blog/search-on-djangosnippetsorg/
• http://wiki.apache.org/solr/SpellCheckComponent

Enough Talk. Let’s Go Work With It.

A Big Thanks To CMG Digital & @cmheisel For Having Me!

More Information
http://haystacksearch.org/
http://github.com/toastdriven/django-haystack
#haystack on irc.freenode.net
http://groups.google.com/group/django-haystack/
@daniellindsley on Twitter