Haystack Training

Getting The Most Out Of Haystack

Daniel Lindsley

Pragmatic Badger, LLC

Tuesday, December 28, 2010

Terminology


“Engine”

• The actual search engine

• Here be interesting computer science problems

• Examples: Solr, Xapian, Whoosh


“Document”

• A single record in the index

• Usually accompanied by 1+ fields of metadata

• Heavily processed


“Corpus”

• The collection of indexed documents

• Latin for “body”


“Stemming”

• Find the root of the word

• Part of the “magic” of search

• More on this later...


“Relevance”

• A metric of how well a document matches the query

• Search’s killer feature

• Hard to get 100% right


“Faceting”

• Count of docs meeting certain criteria within your result set

• Drill down!

• Think Amazon/eBay

• More on this later...
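To make the idea concrete, here is a minimal pure-Python sketch of facet counting over a result set. The `results` data and its `author`/`category` fields are made up for illustration, and a real engine computes these counts inside the index rather than by looping in Python.

```python
from collections import Counter

# Hypothetical result set: each hit is a plain dict of stored fields.
results = [
    {"title": "Intro to Solr", "author": "daniel", "category": "search"},
    {"title": "Whoosh tips", "author": "sally", "category": "search"},
    {"title": "Django forms", "author": "daniel", "category": "django"},
]


def facet_counts(hits, field):
    """Count how many hits carry each distinct value of ``field``."""
    return Counter(hit[field] for hit in hits if field in hit)


author_facets = facet_counts(results, "author")
category_facets = facet_counts(results, "category")

# "Drilling down" is filtering by one facet value and counting again.
narrowed = [hit for hit in results if hit["author"] == "daniel"]
narrowed_categories = facet_counts(narrowed, "category")
```

Each drill-down narrows the result set, and the counts shown next to the remaining choices are recomputed against that narrower set.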


“Boost”

• A way to artificially increase the relevance of a document

• Types: Document/Field/Term


Introduction to Search


Search != RDBMS

• The sooner you get over that, the easier everything that follows will be.

• Think “document store”.


Stemming

• Porter-Stemmer or Snowball

• The engine takes terms & hacks them down to the root word.

• Examples:

“testing” → “test”

“searchers” → “searcher” → “search”
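The examples above can be reproduced with a toy suffix stripper. This is only a sketch to show the shape of the idea: it is not the Porter or Snowball algorithm, and the suffix list and `min_root` cutoff are arbitrary assumptions that will mangle plenty of real words.

```python
# Toy suffix stripper: NOT Porter or Snowball, just an illustration.
# Suffixes are tried in order, and stripping repeats until nothing matches.
SUFFIXES = ("ing", "s", "er")


def toy_stem(word, min_root=3):
    """Repeatedly strip a known suffix while at least ``min_root`` letters remain."""
    word = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_root:
                word = word[: -len(suffix)]
                changed = True
                break
    return word


stemmed = [toy_stem(w) for w in ("testing", "searchers", "searcher", "search")]
```

Because the same function runs at index time and at query time, a query for “searchers” and a document containing “search” end up on the same term.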


Inverted Index

• The power of the engine starts here

• Basically a reverse mapping from the stemmed form of a term to the collection of documents containing that term

...
“search”: [3, 104, 238],
...


Inverted Index

• Very fast lookups

• NOT a “contains” or “like” lookup unless you say so (slower)
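A minimal sketch of the structure, assuming whitespace tokenizing and lowercase-only normalization in place of real analysis and stemming. The document ids and texts are invented to match the mapping shown earlier.

```python
from collections import defaultdict


def normalize(term):
    """Stand-in for real analysis: lowercase and trim punctuation only."""
    return term.lower().strip(".,!?\"'")


def build_inverted_index(documents):
    """Map each normalized term to the sorted ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.split():
            term = normalize(token)
            if term:
                postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def lookup(index, *terms):
    """Return ids of documents containing every given term (exact term match)."""
    id_sets = [set(index.get(normalize(t), ())) for t in terms]
    return sorted(set.intersection(*id_sets)) if id_sets else []


documents = {
    3: "Search is fun.",
    104: "Custom search for Django",
    238: "Why search matters",
    500: "Researching databases",
}
index = build_inverted_index(documents)
```

A lookup is a dictionary access per term followed by a set intersection, which is why it is fast. It is also why document 500 is not returned for “search”: “researching” is a different term, and matching it would need a slower “contains”-style scan.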


Document Store

• Flat structure

• Generally free-form/schema-less

• Easiest to think about each record as a dictionary

• No relations built-in
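As a rough illustration of “each record as a dictionary”, the sketch below uses two invented documents. The field names and the `field_values` helper are hypothetical and not part of any engine's API.

```python
# Hypothetical documents: flat dictionaries with no enforced schema.
corpus = [
    {"id": "myapp.entry.5", "title": "Haystack tips", "author": "johndoe"},
    {"id": "myapp.entry.3", "title": "Solr tuning", "author": "sally1982",
     "tags": ["solr", "jvm"]},  # extra field; nothing requires it elsewhere
]


def field_values(documents, field):
    """Collect a field from every document that has it, skipping the rest."""
    return [doc[field] for doc in documents if field in doc]


# No relations: the author's username is copied into each document
# instead of being joined in from a users table at query time.
authors = field_values(corpus, "author")
tags = field_values(corpus, "tags")
```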


Why custom search?

...or...

“Isn’t this what Google is for?”


Why custom search?

• You control what is (and is not) indexed

• Better quality data goes into the index

• Information-specific handling

• Provide context-specific search


Introduction to Haystack


What is Haystack?

At its simplest, Haystack is an abstraction layer for integrating Django with a search engine.


Why Haystack?

• Familiar API

• Declarative

• “Looks” like Django


Why Haystack?

• Pluggable Backends

• Supports Solr & Whoosh out of the box, Xapian with a third-party backend (boo GPL!)

• Your code stays the same regardless of backend.


Why Haystack?

• Advanced Features

• Faceting

• More Like This

• Highlighting

• Boost


Why Haystack?

• Integration with third-party apps

• No need to fork their code

• Put the indexes in your code & register them

• Applies to django.contrib as well.


Why Haystack?

• Real Live Documentation™!

• http://docs.haystacksearch.org/dev/

• Test Coverage!

• Decent coverage

• No new commits without tests



Enough shameless self-promotion already!


Using Haystack


Two Phase Approach

• The “Data In” is SearchIndex

• The “Data Out” is SearchQuerySet

• Note: There’s a disconnect between your database & the search index


SearchIndex


SearchIndex

• Provides the means to get data into the index

• Something of a cross between a Form (the data preparation aspects) and Model (the persistence)


SearchIndex

import datetime

from haystack import indexes, site
from myapp.models import Entry

class EntrySearchIndex(indexes.SearchIndex):
    # The primary document field, rendered from a template.
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='user__username')
    created = indexes.DateTimeField()

    def get_queryset(self):
        # Only published entries are indexed.
        return Entry.objects.published()

    def prepare_created(self, obj):
        # Fall back to the current time when there is no pub_date.
        return obj.pub_date or datetime.datetime.now()

site.register(Entry, EntrySearchIndex)


`use_template=True`?

• Use Django templates to prep the data

• Example:

# search/indexes/myapp/entry_text.txt
{{ object.title }}
{{ object.author.get_full_name }}
{{ object.tease }}
{{ object.content }}


SearchQuerySet


SearchQuerySet

• The reason to use Haystack

• Very powerful

• Forget views, forms, etc. They’re all thin wrappers around SearchQuerySet


SearchQuerySet

• Fetches data from the index

• Very similar to QuerySet

• Intentional, to reduce conceptual overhead

• Lazily evaluated

• Chain methods


SearchQuerySet

• By default, searches across all models

• Can limit using SearchQuerySet.models

• Caches where possible


SearchQuerySet

>>> import datetime
>>> from haystack.query import SearchQuerySet
>>> from myapp.models import Entry
>>> sqs = SearchQuerySet().models(Entry)
>>> sqs = sqs.filter(created__lte=datetime.datetime.now())
>>> sqs = sqs.exclude(author='daniel')

# Lazily performs the query when asked for results.
>>> sqs
[<SearchResult: myapp.entry (pk=u'5')>, <SearchResult: myapp.entry (pk=u'3')>, <SearchResult: myapp.entry (pk=u'2')>]

# Iterable interface.
# Still hasn't hit the DB.
>>> [result.author for result in sqs]
['johndoe', 'sally1982', 'bob_the_third']


SearchQuerySet

# Hits the database once per result.
>>> [result.object.user.first_name for result in sqs]
['John', 'Sally', 'Bob']

# More efficient loading from database (one query total).
>>> [result.object.user.first_name for result in sqs.load_all()]
['John', 'Sally', 'Bob']


SearchView


SearchView

• Class-based view

• Covers about 80% of regular usage

• A guideline to more advanced use

• Relies heavily on SearchForm


SearchForm


SearchForm

• Outside of using SearchQuerySet, it’s a standard Django form

• Defines a search method that does the necessary actions


SearchForm

from django import forms
from haystack.forms import SearchForm
from myapp.models import Entry

class EntrySearchForm(SearchForm):
    # Additional fields go here.
    author = forms.CharField(max_length=255, required=False)

    def search(self):
        sqs = super(EntrySearchForm, self).search()

        if self.cleaned_data.get('author'):
            sqs = sqs.filter(author=self.cleaned_data['author'])

        return sqs


SearchSite


SearchSite

• Registry pattern

• Collects all registered SearchIndex classes

• Used by SearchQuerySet to limit results to only things Haystack knows about

• Think django.contrib.admin.site.
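For readers new to the registry pattern, here is a small sketch of the idea. `ToyRegistry`, `Entry`, and `EntryIndex` are hypothetical stand-ins written for this example; this is not Haystack's SearchSite implementation.

```python
class ToyRegistry:
    """Illustrative registry; not Haystack's actual SearchSite."""

    def __init__(self):
        self._registry = {}

    def register(self, model, index_class):
        if model in self._registry:
            raise ValueError("%s is already registered" % model.__name__)
        # One index instance per model, created at registration time.
        self._registry[model] = index_class(model)

    def unregister(self, model):
        self._registry.pop(model, None)

    def get_index(self, model):
        return self._registry[model]

    def get_indexed_models(self):
        return list(self._registry)


class Entry:  # hypothetical stand-in for a Django model
    pass


class EntryIndex:  # hypothetical stand-in for a SearchIndex subclass
    def __init__(self, model):
        self.model = model


site = ToyRegistry()
site.register(Entry, EntryIndex)
```

Because every index is registered in one place, anything that needs “the set of models search knows about” can ask the registry instead of scanning the project.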


Haystack Best Practices


Common Fields

• Try to find common fields as much as possible

• Reuse where it makes sense

• But don’t shoehorn if it doesn’t work


It’s Just Python

• When an out-of-the-box component doesn't work for you, use SearchQuerySet & write what you need.

• It's just Django & Python.


load_all

• Appropriate use of SearchQuerySet.load_all

• One hit to the DB per content type

• But do you need to hit the DB?


More Like This

• Cheap & very worth it

• LJWorld saw a 30% jump in traffic by adding it solely on story detail views.

• Cache it!
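One way to picture “Cache it!” is a small time-based cache in front of the similarity lookup. This is a sketch under stated assumptions: `TTLCache` is written for this example, `fetch_similar` is a hypothetical stand-in for a More Like This query, and the key format and 300-second lifetime are arbitrary. In a Django project the built-in cache framework would normally play this role.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ``ttl_seconds``."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip the expensive call
        value = compute()
        self._store[key] = (now + self._ttl, value)
        return value


calls = []


def fetch_similar(pk):
    """Hypothetical stand-in for a More Like This query against the engine."""
    calls.append(pk)
    return ["story-%s-a" % pk, "story-%s-b" % pk]


cache = TTLCache(ttl_seconds=300)
first = cache.get_or_compute("mlt:42", lambda: fetch_similar(42))
second = cache.get_or_compute("mlt:42", lambda: fetch_similar(42))
```

The second call is served from the cache, so the engine is queried once per story per cache lifetime rather than once per page view.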


“Third Party” Apps

• queued_search

• https://github.com/toastdriven/queued_search

• saved_searches

• https://github.com/toastdriven/saved_searches


Other Ideas

• Admin Integration

• Integration with API

• Search “grouping”

• Vertical search


Solr Best Practices


Tomcat vs. Jetty

• Very close performance-wise

• Tomcat better when busy

• Jetty is smaller on RAM & easier to run


Tune JVM settings

• -Xms (Minimum size)

• -Xmx (Maximum size)

# Something close to...
java -Xms1G -Xmx12G -jar start.jar

• -XX:+PrintGCDetails (print GC info)

• -XX:+PrintGCTimeStamps (print GC info + timestamps)


JMX Console

• java -Dcom.sun.management.jmxremote -jar start.jar

• Then jconsole

• Find jetty in the process list.

• Lots of instrumentation


Tune solrconfig

• Proper query warming

• The default “solr rocks” doesn’t.

• Remove unused handlers (like partition)


Tune solrconfig

• Tuning the mergeFactor

• Not too high, not too low

• Big trade-off


Schema

• Use omitNorms where possible

• Only needed on full-text fields

• Same goes for indexed & stored

• The fewer fields, the better


Optimize!

• Seriously.

• Goes back through existing indexes & cleans up

• Takes a while to run, so make sure your timeout is high (custom settings file)


Commits

• Commit as infrequently as is reasonable

• Commit as much as you can at once

• queued_search shines here
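The batching idea can be sketched as a small buffer that only talks to the engine when enough documents have piled up. `BatchingIndexer` and its `send_batch` callback are hypothetical names for this example. This is not how queued_search is implemented, and the batch size of 2 is only there to keep the demo short.

```python
class BatchingIndexer:
    """Buffer documents and hand them off in large batches."""

    def __init__(self, send_batch, batch_size=1000):
        self._send_batch = send_batch  # hypothetical: posts docs, then commits once
        self._batch_size = batch_size
        self._pending = []

    def add(self, document):
        self._pending.append(document)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered as a single batch (one commit)."""
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self._send_batch(batch)


sent_batches = []
indexer = BatchingIndexer(send_batch=sent_batches.append, batch_size=2)
for pk in (1, 2, 3):
    indexer.add({"id": pk})
indexer.flush()  # push the leftover partial batch
```

Three documents result in two commits here instead of three. With a realistic batch size the ratio is far larger.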


Debugging

• Use &debugQuery=on to debug queries

• Use the browser interface!
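A small helper for building such a URL to paste into the browser might look like the sketch below. The base URL is an assumption about a default local install, and `debug_query_url` is a hypothetical helper, so adjust the host, port, and path to match your setup.

```python
from urllib.parse import urlencode

# Assumed local Solr select URL; adjust to match your install.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def debug_query_url(query, base_url=SOLR_SELECT_URL, **extra):
    """Build a select URL with debugQuery=on plus any extra parameters."""
    params = {"q": query, "debugQuery": "on"}
    params.update(extra)
    # Sort the parameters so the URL is stable and easy to compare.
    return base_url + "?" + urlencode(sorted(params.items()))


url = debug_query_url("title:haystack", rows=5)
```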


Advanced Bits

• Learn & love the Solr stats page

• Replication

• n-gram based autocomplete

• Spelling suggestions

• The (Haystack) documented config sucks

• Dismax Handler


Resources

• https://gist.github.com/215331

• http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr

• http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/

• http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html

• http://wiki.apache.org/solr/SolrJmx

• http://wiki.apache.org/solr/LargeIndexes

• http://wiki.apache.org/solr/SolrPerformanceFactors



Resources

• http://wiki.apache.org/solr/SolrReplication

• http://www.yashh.com/blog/2010/nov/03/autocomplete-solr/

• http://charlesleifer.com/blog/search-on-djangosnippetsorg/

• http://wiki.apache.org/solr/SpellCheckComponent



Enough Talk. Let’s Go Work With It.


A Big Thanks To CMG Digital & @cmheisel For Having Me!


http://twitter.com/cmheisel

More Information

http://haystacksearch.org/

http://github.com/toastdriven/django-haystack

#haystack on irc.freenode.net

http://groups.google.com/group/django-haystack/

@daniellindsley on Twitter

