Getting The Most Out Of Haystack Daniel Lindsley Pragmatic Badger, LLC Tuesday, December 28, 2010

Haystack Training


DESCRIPTION

From a November 2010 training session with CMG Digital.


Page 1: Haystack Training

Getting The Most Out Of Haystack

Daniel Lindsley
Pragmatic Badger, LLC

Tuesday, December 28, 2010

Page 2: Haystack Training

Terminology

Page 3: Haystack Training

“Engine”

• The actual search engine

• Here be interesting computer science problems

• Examples: Solr, Xapian, Whoosh

Page 4: Haystack Training

“Document”

• A single record in the index

• Usually accompanied by 1+ fields of metadata

• Heavily processed

Page 5: Haystack Training

“Corpus”

• The collection of indexed documents

• Latin for “body”

Page 6: Haystack Training

“Stemming”

• Find the root of the word

• Part of the “magic” of search

• More on this later...

Page 7: Haystack Training

“Relevance”

• A metric of how well a document matches the query

• Search’s killer feature

• Hard to get 100% right

Page 8: Haystack Training

“Faceting”

• Count of docs meeting certain criteria within your result set

• Drill down!

• Think Amazon/eBay

• More on this later...
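A minimal sketch of the faceting idea above, in plain Python rather than through any search engine. The result set, the field names and the helpers `facet_counts` and `drill_down` are all hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical result set: each hit is a plain dict of fields.
results = [
    {"title": "Django tips", "author": "johndoe", "year": 2010},
    {"title": "Solr tuning", "author": "sally1982", "year": 2010},
    {"title": "Whoosh intro", "author": "johndoe", "year": 2009},
]


def facet_counts(docs, field):
    """Count how many documents in the result set share each value of `field`."""
    return Counter(doc[field] for doc in docs)


def drill_down(docs, field, value):
    """Narrow the result set to the documents whose `field` equals `value`."""
    return [doc for doc in docs if doc[field] == value]


author_facets = facet_counts(results, "author")
narrowed = drill_down(results, "author", "johndoe")
year_facets = facet_counts(narrowed, "year")
```

The counts are what a sidebar would display next to each author; choosing one narrows the results, and the facets are then recomputed on the smaller set.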

Page 9: Haystack Training

“Boost”

• A way to artificially increase the relevance of a document

• Types: Document/Field/Term
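A toy illustration of how the three boost types could feed into a score. This is not how Solr, Xapian or Whoosh actually compute relevance; the `score` function, its parameters and the document layout are assumptions made up for this sketch.

```python
def score(doc, query_terms, field_boosts=None, term_boosts=None):
    """Toy relevance: count term matches, weighted by field, term and document boosts."""
    field_boosts = field_boosts or {}
    term_boosts = term_boosts or {}
    total = 0.0
    for field, text in doc["fields"].items():
        words = text.lower().split()
        for term in query_terms:
            matches = words.count(term)
            # Field boost and term boost multiply each individual match.
            total += matches * field_boosts.get(field, 1.0) * term_boosts.get(term, 1.0)
    # Document boost scales the whole document's score.
    return total * doc.get("boost", 1.0)


doc = {
    "fields": {"title": "Haystack search", "body": "search with haystack and solr"},
    "boost": 1.0,
}

plain = score(doc, ["search"])
title_boosted = score(doc, ["search"], field_boosts={"title": 2.0})
term_boosted = score(doc, ["search", "solr"], term_boosts={"solr": 3.0})
doc_boosted = score(dict(doc, boost=1.5), ["search"])
```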

Page 10: Haystack Training

Introduction to Search

Page 11: Haystack Training

Search != RDBMS

• The sooner you get over that, the easier everything that follows will be.

• Think “document store”.

Page 12: Haystack Training

Stemming

• Porter-Stemmer or Snowball

• The engine takes terms & hacks them down to the root word.

• Examples:

“testing” → “test”

“searchers” → “searcher” → “search”
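A deliberately naive suffix-stripping sketch that reproduces the two examples above. It is not the Porter or Snowball algorithm, which use many more rules and conditions; the suffix list and the minimum stem length are assumptions chosen only to make these examples work.

```python
SUFFIXES = ("ing", "er", "s")  # assumed, far from a complete rule set
MIN_STEM = 3                   # never cut a word below this many characters


def strip_suffix(word):
    """Remove one known suffix, if doing so leaves a long enough stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word


def stem(word):
    """Keep stripping suffixes until the word stops changing."""
    word = word.lower()
    while True:
        shorter = strip_suffix(word)
        if shorter == word:
            return word
        word = shorter


stems = {word: stem(word) for word in ("testing", "searchers", "search")}
```

A rule set this small over-stems plenty of real words (for example, it would turn "class" into "clas"), which is why engines rely on established stemmers.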

Page 13: Haystack Training

Inverted Index

• The power of the engine starts here

• Basically a reverse mapping from the stemmed form of a term to the collection of documents containing that term

...
“search”: [3, 104, 238],
...

Page 14: Haystack Training

Inverted Index

• Very fast lookups

• NOT a “contains” or “like” lookup unless you say so (slower)
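A small sketch of an inverted index built with plain dictionaries, to show why an exact term lookup is fast while a "contains" lookup is slower. The documents are invented, and lowercasing stands in for real stemming here (an assumption to keep the sketch short).

```python
from collections import defaultdict


def build_inverted_index(docs, normalize=str.lower):
    """Map each normalized term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index


# Hypothetical corpus keyed by document id.
docs = {
    3: "Search made simple",
    7: "Researching engines",
    104: "Full text search",
    238: "Why search matters",
}
index = build_inverted_index(docs)

# Exact term lookup: a single dictionary access.
exact = sorted(index.get("search", set()))

# "Contains" lookup: has to scan every term in the index.
contains = sorted(
    {doc_id for term, ids in index.items() if "search" in term for doc_id in ids}
)
```

The exact lookup touches one key; the "contains" variant walks the whole term dictionary, which is why engines only do it when asked.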

Page 15: Haystack Training

Document Store

• Flat structure

• Generally free-form/schema-less

• Easiest to think about each record as a dictionary

• No relations built-in
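To make the "each record is a dictionary" idea concrete, here is a rough sketch of flattening a relational row into a document. The row layout, the id format and the `to_document` helper are hypothetical.

```python
# Hypothetical relational data: an entry row pointing at a user row.
users = {1: {"username": "johndoe", "full_name": "John Doe"}}
entry_row = {"id": 5, "title": "Hello, search", "user_id": 1}


def to_document(entry, users):
    """Flatten an entry and its related user into one flat dictionary."""
    user = users[entry["user_id"]]
    return {
        "id": "entry.%d" % entry["id"],
        "title": entry["title"],
        # The relation is resolved at index time and stored as plain text.
        "author": user["username"],
    }


entry_doc = to_document(entry_row, users)

# Schema-less: a different kind of record with different keys can sit alongside it.
photo_doc = {"id": "photo.9", "caption": "Sunset over the lake"}
store = [entry_doc, photo_doc]
```

Because the author name is copied into the document, a later change to the user row does not reach the index until the entry is indexed again.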

Page 16: Haystack Training

Why custom search?

...or...

“Isn’t this what Google is for?”


Page 20: Haystack Training

Why custom search?

• You control what is (and is not) indexed

• Better quality data goes into the index

• Information-specific handling

• Provide context-specific search

Page 21: Haystack Training

Introduction to Haystack

Page 22: Haystack Training

What is Haystack?

At its simplest, Haystack is an abstraction layer for integrating Django with a search engine.

Page 23: Haystack Training

Why Haystack?

• Familiar API

• Declarative

• “Looks” like Django

Page 24: Haystack Training

Why Haystack?

• Pluggable Backends

• Supports Solr & Whoosh out of the box, Xapian with a third-party backend (boo GPL!)

• Your code stays the same regardless of backend.

Page 25: Haystack Training

Why Haystack?

• Advanced Features

• Faceting

• More Like This

• Highlighting

• Boost

Page 26: Haystack Training

Why Haystack?

• Integration with third-party apps

• No need to fork their code

• Put the indexes in your code & register them

• Applies to django.contrib as well.

Page 27: Haystack Training

Why Haystack?

• Real Live Documentation™!

• http://docs.haystacksearch.org/dev/

• Test Coverage!

• Decent coverage

• No new commits without tests

Page 28: Haystack Training

Enough shameless self-promotion already!

Page 29: Haystack Training

Using Haystack

Page 30: Haystack Training

Two Phase Approach

• The “Data In” is SearchIndex

• The “Data Out” is SearchQuerySet

• Note: There’s a disconnect between your database & the search index


Page 31: Haystack Training

SearchIndex

Page 32: Haystack Training

SearchIndex

• Provides the means to get data into the index

• Something of a cross between a Form (the data preparation aspects) and Model (the persistence)

Page 33: Haystack Training

SearchIndex

    import datetime

    from haystack import indexes, site
    from myapp.models import Entry


    class EntrySearchIndex(indexes.SearchIndex):
        text = indexes.CharField(document=True, use_template=True)
        author = indexes.CharField(model_attr='user__username')
        created = indexes.DateTimeField()

        def get_queryset(self):
            # Only published entries go into the index.
            return Entry.objects.published()

        def prepare_created(self, obj):
            # Fall back to the current time when there is no publication date.
            return obj.pub_date or datetime.datetime.now()


    site.register(Entry, EntrySearchIndex)

Page 34: Haystack Training

`use_template=True`?

• Use Django templates to prep the data

• Example:

    # search/indexes/myapp/entry_text.txt
    {{ object.title }}
    {{ object.user.get_full_name }}
    {{ object.tease }}
    {{ object.content }}


Page 35: Haystack Training

SearchQuerySet

Page 36: Haystack Training

SearchQuerySet

• The reason to use Haystack

• Very powerful

• Forget views, forms, etc. They’re all thin wrappers around SearchQuerySet

Page 37: Haystack Training

SearchQuerySet

• Fetches data from the index

• Very similar to QuerySet

• Intentional, to reduce conceptual overhead

• Lazily evaluated

• Chain methods

Page 38: Haystack Training

SearchQuerySet

• By default, searches across all models

• Can limit using SearchQuerySet.models

• Caches where possible
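The lazy, chainable and cached behaviour described on the last two slides can be sketched with a toy class. This is not Haystack's `SearchQuerySet` implementation; `LazyQuery`, `fake_backend` and the sample documents are hypothetical and only illustrate the pattern.

```python
class LazyQuery:
    """Toy chainable query: collects filters and runs the backend only when needed."""

    def __init__(self, backend, filters=()):
        self._backend = backend
        self._filters = tuple(filters)
        self._cache = None

    def filter(self, **kwargs):
        # Chaining returns a new object, so earlier queries stay reusable.
        return LazyQuery(self._backend, self._filters + tuple(kwargs.items()))

    def _fetch(self):
        # The backend is hit at most once per query object.
        if self._cache is None:
            self._cache = self._backend(dict(self._filters))
        return self._cache

    def __iter__(self):
        return iter(self._fetch())

    def __len__(self):
        return len(self._fetch())


calls = []  # records every time the fake engine is actually queried


def fake_backend(filters):
    """Hypothetical engine: filters a fixed list of documents."""
    calls.append(filters)
    docs = [
        {"author": "johndoe", "year": 2010},
        {"author": "sally1982", "year": 2010},
    ]
    return [d for d in docs if all(d.get(k) == v for k, v in filters.items())]


query = LazyQuery(fake_backend).filter(year=2010).filter(author="johndoe")
calls_before_iteration = len(calls)  # building the query has not run anything
results = list(query)                # first iteration runs the query
again = list(query)                  # second iteration is served from the cache
```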

Page 39: Haystack Training

SearchQuerySet

    >>> import datetime
    >>> from haystack.query import SearchQuerySet
    >>> from myapp.models import Entry
    >>> sqs = SearchQuerySet().models(Entry)
    >>> sqs = sqs.filter(created__lte=datetime.datetime.now())
    >>> sqs = sqs.exclude(author='daniel')

    # Lazily performs the query when asked for results.
    >>> sqs
    [<SearchResult: myapp.entry (pk=u'5')>, <SearchResult: myapp.entry (pk=u'3')>, <SearchResult: myapp.entry (pk=u'2')>]

    # Iterable interface.
    # Still hasn't hit the DB.
    >>> [result.author for result in sqs]
    ['johndoe', 'sally1982', 'bob_the_third']

Page 40: Haystack Training

SearchQuerySet

    # Hits the database once per result.
    >>> [result.object.user.first_name for result in sqs]
    ['John', 'Sally', 'Bob']

    # More efficient loading from database (one query total).
    >>> [result.object.user.first_name for result in sqs.load_all()]
    ['John', 'Sally', 'Bob']


Page 41: Haystack Training

SearchView

Page 42: Haystack Training

SearchView

• Class-based view

• Handles 80% of the regular usage

• A guideline to more advanced use

• Relies heavily on SearchForm


Page 43: Haystack Training

SearchForm

Page 44: Haystack Training

SearchForm

• Outside of using SearchQuerySet, it’s a standard Django form

• Defines a search method that does the necessary actions

Page 45: Haystack Training

SearchForm

    from django import forms
    from haystack.forms import SearchForm
    from myapp.models import Entry


    class EntrySearchForm(SearchForm):
        # Additional fields go here.
        author = forms.CharField(max_length=255, required=False)

        def search(self):
            sqs = super(EntrySearchForm, self).search()

            # Narrow by author only when one was supplied.
            if self.cleaned_data.get('author'):
                sqs = sqs.filter(author=self.cleaned_data['author'])

            return sqs


Page 46: Haystack Training

SearchSite

Page 47: Haystack Training

SearchSite

• Registry pattern

• Collects all registered SearchIndex classes

• Used by SearchQuerySet to limit results to only things Haystack knows about

• Think django.contrib.admin.site.
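A bare-bones sketch of the registry pattern named above. It is not Haystack's `SearchSite` code; `ToySite`, `ToyIndex`, the model classes and the `searchable` helper are hypothetical stand-ins.

```python
class AlreadyRegistered(Exception):
    """Raised when a model is registered twice."""


class ToyIndex:
    """Stand-in for an index class; it just remembers its model."""

    def __init__(self, model):
        self.model = model


class ToySite:
    """Registry mapping each model class to one index instance."""

    def __init__(self):
        self._registry = {}

    def register(self, model, index_class):
        if model in self._registry:
            raise AlreadyRegistered(model.__name__)
        self._registry[model] = index_class(model)

    def index_for(self, model):
        return self._registry[model]

    def indexed_models(self):
        return list(self._registry)


# Hypothetical models; Comment is never registered.
class Entry: pass
class Photo: pass
class Comment: pass


site = ToySite()
site.register(Entry, ToyIndex)
site.register(Photo, ToyIndex)


def searchable(models, site):
    """Keep only the models the registry knows about."""
    known = set(site.indexed_models())
    return [model for model in models if model in known]


allowed = searchable([Entry, Comment], site)
```

This mirrors the bullet about limiting results to things the registry knows about: `Comment` is dropped because nothing registered it.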

Page 48: Haystack Training

Haystack Best Practices

Page 49: Haystack Training

Common Fields

• Try to find common fields as much as possible

• Reuse where it makes sense

• But don’t shoehorn if it doesn’t work

Page 50: Haystack Training

It’s Just Python

• When an out-of-the-box component doesn't work for you, use SearchQuerySet & write what you need.

• It's just Django & Python.

Page 51: Haystack Training

load_all

• Appropriate use of SearchQuerySet.load_all

• One hit to the DB per content type

• But do you need to hit the DB?
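The "one hit per content type" point can be illustrated with a fake database that counts queries. This is a toy model of the idea, not Haystack's `load_all` implementation; `FakeTable`, the helper functions and the sample hits are hypothetical.

```python
from collections import defaultdict


class FakeTable:
    """Stand-in for a database table that counts the queries it receives."""

    def __init__(self, rows):
        self.rows = rows
        self.queries = 0

    def get(self, pk):
        self.queries += 1
        return self.rows[pk]

    def in_bulk(self, pks):
        self.queries += 1
        return {pk: self.rows[pk] for pk in pks}


def make_tables():
    return {
        "entry": FakeTable({5: "Entry 5", 3: "Entry 3"}),
        "photo": FakeTable({2: "Photo 2"}),
    }


def load_one_by_one(hits, tables):
    """One query per search result."""
    return [tables[content_type].get(pk) for content_type, pk in hits]


def load_in_bulk(hits, tables):
    """One query per content type, results returned in the original order."""
    pks_by_type = defaultdict(list)
    for content_type, pk in hits:
        pks_by_type[content_type].append(pk)
    loaded = {ct: tables[ct].in_bulk(pks) for ct, pks in pks_by_type.items()}
    return [loaded[ct][pk] for ct, pk in hits]


def total_queries(tables):
    return sum(table.queries for table in tables.values())


# Hypothetical search hits as (content type, primary key) pairs.
hits = [("entry", 5), ("photo", 2), ("entry", 3)]

slow_tables = make_tables()
slow = load_one_by_one(hits, slow_tables)

fast_tables = make_tables()
fast = load_in_bulk(hits, fast_tables)
```

Both approaches return the same objects; the bulk version just groups primary keys first. If the fields you display are already stored in the index, neither query is needed.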

Page 52: Haystack Training

More Like This

• Cheap & very worth it

• LJWorld saw a 30% jump in traffic by adding it solely on story detail views.

• Cache it!
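One way to picture "Cache it!" is a small time-based cache in front of the expensive lookup. In a Django project the cache framework would normally play this role; `TTLCache` and `find_similar` below are hypothetical stand-ins so the sketch runs on its own.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh, skip the expensive call
        value = compute(key)
        self._entries[key] = (now + self.ttl, value)
        return value


lookups = []  # records every real lookup


def find_similar(doc_id):
    """Hypothetical stand-in for an expensive "more like this" engine query."""
    lookups.append(doc_id)
    return ["related-to-%s-a" % doc_id, "related-to-%s-b" % doc_id]


cache = TTLCache(ttl=300)
first = cache.get_or_compute("entry.5", find_similar)
second = cache.get_or_compute("entry.5", find_similar)  # served from the cache
```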

Page 54: Haystack Training

Other Ideas

• Admin Integration

• Integration with API

• Search “grouping”

• Vertical search

Page 55: Haystack Training

Solr Best Practices

Page 56: Haystack Training

Tomcat vs. Jetty

• Very close performance-wise

• Tomcat better when busy

• Jetty is smaller on RAM & easier to run

Page 57: Haystack Training

Tune JVM settings

- -Xms (Minimum size)
- -Xmx (Maximum size)

# Something close to...
- ``java -Xms1G -Xmx12G -jar start.jar``

- -XX:+PrintGCDetails (print GC info)
- -XX:+PrintGCTimeStamps (print GC info + timestamps)

Page 58: Haystack Training

JMX Console

• java -Dcom.sun.management.jmxremote -jar start.jar

• Then jconsole

• Find jetty in the process list.

• Lots of instrumentation

Page 59: Haystack Training

Tune solrconfig

• Proper query warming

• The default “solr rocks” doesn’t.

• Remove unused handlers (like partition)

Page 60: Haystack Training

Tune solrconfig

• Tuning the mergeFactor

• Not too high, not too low

• Big trade-off

Page 61: Haystack Training

Schema

• Use omitNorms where possible

• Only needed on full-text fields

• Same goes for indexed & stored

• The fewer fields, the better

Page 62: Haystack Training

Optimize!

• Seriously.

• Goes back through existing indexes & cleans up

• Takes a while to run, so make sure your timeout is high (custom settings file)

Page 63: Haystack Training

Commits

• Commit as infrequently as is reasonable

• Commit as much as you can at once

• queued_search shines here
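The "commit as much as you can at once" advice can be sketched as a buffer that flushes in batches. This is not how queued_search works internally; `FakeEngine`, `BatchingIndexer` and the batch size are assumptions for illustration.

```python
class FakeEngine:
    """Stand-in for a search engine that counts commits."""

    def __init__(self):
        self.docs = {}
        self.commits = 0

    def add_and_commit(self, docs):
        for doc in docs:
            self.docs[doc["id"]] = doc
        self.commits += 1


class BatchingIndexer:
    """Buffer documents and send them to the engine one batch per commit."""

    def __init__(self, engine, batch_size=100):
        self.engine = engine
        self.batch_size = batch_size
        self.pending = []

    def add(self, doc):
        self.pending.append(doc)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Commit whatever is buffered; a no-op when the buffer is empty.
        if self.pending:
            self.engine.add_and_commit(self.pending)
            self.pending = []


engine = FakeEngine()
indexer = BatchingIndexer(engine, batch_size=100)
for n in range(250):
    indexer.add({"id": "entry.%d" % n})
indexer.flush()  # push the final partial batch
```

With these numbers, 250 documents reach the engine in three commits instead of 250.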

Page 64: Haystack Training

Debugging

• Use &debugQuery=on to debug queries

• Use the browser interface!
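A small helper for building a debug URL to paste into the browser. The base URL below is an assumption (a local Solr with its select handler at the usual path); adjust it to your own host, port and core. `debug_url` is a hypothetical helper, not part of Haystack.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr select handler; change to match your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def debug_url(query, base_url=SOLR_SELECT_URL):
    """Build a Solr query URL with debugQuery=on appended."""
    return base_url + "?" + urlencode({"q": query, "debugQuery": "on"})


url = debug_url("author:daniel")
```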


Page 69: Haystack Training

Advanced Bits

• Learn & love the Solr stats page

• Replication

• n-gram based autocomplete

• Spelling suggestions

• the (Haystack) documented config sucks

• Dismax Handler
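For the n-gram based autocomplete bullet, one common variant is edge n-grams (prefixes of each term). The sketch below shows that idea in plain Python; in practice the engine generates the grams at index time. The function names, gram sizes and sample terms are assumptions.

```python
from collections import defaultdict


def edge_ngrams(word, min_size=2, max_size=15):
    """Return the prefixes of `word` between `min_size` and `max_size` characters."""
    word = word.lower()
    return [word[:n] for n in range(min_size, min(len(word), max_size) + 1)]


def build_autocomplete_index(terms):
    """Map every edge n-gram to the set of terms it can complete to."""
    index = defaultdict(set)
    for term in terms:
        for gram in edge_ngrams(term):
            index[gram].add(term.lower())
    return index


def autocomplete(index, typed):
    """Exact lookup of what the user has typed so far."""
    return sorted(index.get(typed.lower(), set()))


grams = edge_ngrams("search")
index = build_autocomplete_index(["search", "searcher", "solr", "stemming"])
suggestions = autocomplete(index, "sea")
```

Because the prefixes are precomputed, each keystroke is an exact term lookup rather than a slower "contains" scan.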

Page 70: Haystack Training

Resources

• https://gist.github.com/215331

• http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr

• http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/

• http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html

• http://wiki.apache.org/solr/SolrJmx

• http://wiki.apache.org/solr/LargeIndexes

• http://wiki.apache.org/solr/SolrPerformanceFactors

Page 71: Haystack Training

Resources

• http://wiki.apache.org/solr/SolrReplication

• http://www.yashh.com/blog/2010/nov/03/autocomplete-solr/

• http://charlesleifer.com/blog/search-on-djangosnippetsorg/

• http://wiki.apache.org/solr/SpellCheckComponent

Page 72: Haystack Training

Enough Talk.

Let’s Go Work With It.

Page 73: Haystack Training

A Big Thanks To CMG Digital & @cmheisel For Having Me!

Page 74: Haystack Training

More Information

http://haystacksearch.org/
http://github.com/toastdriven/django-haystack

#haystack on irc.freenode.net
http://groups.google.com/group/django-haystack/

@daniellindsley on Twitter