Full Text Search on Google App Engine with BigTable Search
at Austin PUG February 10, 2010
Percy Wegmann presenting
The Problem
You want to be able to do full-text search (you know, like on Google.com)
Against data stored in a Python Google App Engine application
Without using an external server/service
Full-text Search Basic Features
Let's say that you want to search a repository of 2 documents containing the following text:
swan lake performed live at the Met
swans are crowding ducks out of the local lake
A basic search engine should respond to queries as follows:
swan lake - returns both documents (inexact matching)
swan dive - returns both documents (boolean OR matching)
swan lake duck - returns document 2 first (ranking)
crowds - returns document 2 (stemming)
of the - returns neither (stopword removal)
And it should do all of this quickly
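Stemming and stopword removal both happen before anything is indexed or queried. A minimal sketch, assuming a tiny hand-picked stopword list and a crude suffix stripper (toy_stem is a hypothetical stand-in for a real stemmer such as Porter2, not what any of the libraries here actually ship):

```python
import re

# Hypothetical stopword list, just large enough for the two sample documents.
STOPWORDS = {"are", "at", "of", "out", "the"}


def toy_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter2."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, split into words, drop stopwords, and stem what is left."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [toy_stem(w) for w in words if w not in STOPWORDS]


print(normalize("swans are crowding ducks out of the local lake"))
print(normalize("crowds"))
print(normalize("of the"))
```

With the same normalization applied to documents and queries, crowds and crowding reduce to the same term, and a query made only of stopwords reduces to nothing.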
More Advanced Features
Starts-with matching (for type-ahead completion)
Indexing of non-text fields (numeric, datetime, references, etc.)
Term weighting (e.g. rank matches on title higher than on body)
Faceted Search (like Amazon or Cnet.com)
Background indexing (to speed up inserts)
Thesaurus (mallard would match duck)
Phrase matching (exact phrases rank higher than disjointed combinations of words)
The Contenders
                        Stemming & Stopword Removal    Boolean OR    Ranking
Datastore Query         -                              -             -
SearchableModel         stopword removal only          -             -
Bill Katz' Searchable   x                              -             -
gae-search              x                              -             -
BigTable Search         x                              x             x
What The Others Are Missing
Boolean OR/Ranking: without these, multi-term queries are almost pointless
Faceted Search: users are accustomed to this from sites like Amazon
Scalability: no one uses inverted indexes!
Introducing BigTable Search
Switch to demo
How it Works: Inverted Index
Index is organized by search term. This is how the big boys (Lucene, Sphinx, etc.) do it.
Example from Wikipedia
Documents
1: it is what it is
2: what is it
3: it is a banana
Index (stores pointers to documents)
a: {3}
banana: {3}
is: {1, 2, 3}
it: {1, 2, 3}
what: {1, 2}
To search for "it", we only have to grab a single row from the index, yielding {1, 2, 3}
To search for "what" OR "banana", we grab two rows and take the union, yielding {1, 2, 3}
To search for "what" AND "banana", we grab two rows and take the intersection, yielding {}
To rank a search for "banana" OR "it", we take the union and count how many terms each document matches, yielding the order 3, 1, 2
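The four lookups above can be sketched in a few lines of plain Python. This is an in-memory illustration only; the function names are made up for the example and are not BigTable Search's API.

```python
from collections import Counter, defaultdict

DOCS = {
    1: "it is what it is",
    2: "what is it",
    3: "it is a banana",
}


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index


def search_or(index, terms):
    """Union of the index rows for the given terms."""
    result = set()
    for term in terms:
        result |= index.get(term, set())
    return result


def search_and(index, terms):
    """Intersection of the index rows for the given terms."""
    rows = [index.get(term, set()) for term in terms]
    return set.intersection(*rows) if rows else set()


def search_ranked(index, terms):
    """OR search, ordered by how many query terms each document matches."""
    hits = Counter()
    for term in terms:
        hits.update(index.get(term, set()))
    return sorted(hits, key=lambda doc_id: (-hits[doc_id], doc_id))


index = build_index(DOCS)
print(sorted(search_or(index, ["it"])))              # [1, 2, 3]
print(sorted(search_or(index, ["what", "banana"])))  # [1, 2, 3]
print(sorted(search_and(index, ["what", "banana"]))) # []
print(search_ranked(index, ["banana", "it"]))        # [3, 1, 2]
```

Ties in the ranked search are broken by document id, which is an arbitrary choice made here to keep the output stable.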
The Pain of Updating
Remember our documents:
Documents
1: it is what it is
2: what is it
3: it is a banana
To add the first document, we have to update 3 index entries (is, it, what), one per distinct term. The bigger the documents get, the worse it gets.
Worse, multiple documents are represented in a single index entry, so concurrency becomes a problem too. Try locking on the index entry for "the", and your entire system becomes effectively single-threaded!
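To see where the write cost and the contention come from, count the index rows each document touches and how many documents write to the same row. A small sketch over the three sample documents (plain Python, no datastore involved):

```python
from collections import Counter

DOCS = {
    1: "it is what it is",
    2: "what is it",
    3: "it is a banana",
}


def rows_to_write(text):
    """Each distinct term in a document is one index row to update."""
    return sorted(set(text.split()))


# Index rows written when each document is added.
for doc_id, text in DOCS.items():
    print(doc_id, rows_to_write(text))

# How many documents write to the same row: the contention hot spots.
writers_per_row = Counter()
for text in DOCS.values():
    writers_per_row.update(rows_to_write(text))
print(writers_per_row.most_common(2))
```

Even in this tiny corpus, the rows for the most common terms are written by every document, which is exactly where a lock would serialize all inserts.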
The Solution to Updating
Asynchronous Updates
1.1: put doc into the DataStore
1.2: request indexing (calc queue)
2: queue terms (merge queues)
3: merge to inverted index
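The flow above can be simulated with in-memory queues. This is a sketch under stated assumptions: the deques stand in for App Engine task queues, the dict stands in for the DataStore, and routing each term to a merge queue by hash is a guess at one reasonable design, not necessarily how BigTable Search does it.

```python
import zlib
from collections import defaultdict, deque

NUM_MERGE_QUEUES = 3  # assumption: the diagram shows three merge queues

datastore = {}                     # stand-in for the DataStore
calc_queue = deque()               # documents waiting to be analyzed
merge_queues = [deque() for _ in range(NUM_MERGE_QUEUES)]
inverted_index = defaultdict(set)  # term -> document ids


def put_document(doc_id, text):
    """Steps 1.1 and 1.2: store the document, then request indexing."""
    datastore[doc_id] = text
    calc_queue.append(doc_id)


def run_calc_worker():
    """Step 2: split each queued document into terms and queue them."""
    while calc_queue:
        doc_id = calc_queue.popleft()
        for term in set(datastore[doc_id].split()):
            # Hypothetical routing: a stable hash picks the merge queue.
            shard = zlib.crc32(term.encode("utf-8")) % NUM_MERGE_QUEUES
            merge_queues[shard].append((term, doc_id))


def run_merge_worker(shard):
    """Step 3: merge one queue's pending terms into the inverted index."""
    queue = merge_queues[shard]
    while queue:
        term, doc_id = queue.popleft()
        inverted_index[term].add(doc_id)


put_document(1, "it is what it is")
put_document(2, "what is it")
put_document(3, "it is a banana")
run_calc_worker()                  # inserts returned before any index write
for shard in range(NUM_MERGE_QUEUES):
    run_merge_worker(shard)
print({term: sorted(ids) for term, ids in sorted(inverted_index.items())})
```

Because a given term always lands in the same merge queue, only one worker ever writes a given index row, so no locking on hot rows like "the" is needed. The price is that a new document is not searchable until its terms have been merged.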
Code (at a Glance)
Data Model
Queues
Code
The Better Answer?
BigTable Search suffers from some significant limitations:
- Fast search engines use custom file storage formats for performance; BigTable Search does not have this option and is consequently not fast
- No phrase matching
- No synonym or semantic matching
Google is working on a full-text search solution (feature 217 on Issues List, In Progress, no ETA, session scheduled for Google I/O in May)
Resources
pyporter2 (used by BigTable Search and others for stemming): http://github.com/mdirolf/pyporter2
SearchableModel: http://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/ext/search/__init__.py
Bill Katz' Simple Full-text Search for App Engine: http://www.billkatz.com/2009/6/Simple-Full-Text-Search-for-App-Engine
gae-search: http://gae-full-text-search.appspot.com/
BigTable Search: http://code.google.com/p/bigtablesearch/
Google's Upcoming Full-text Search (feature 217): http://code.google.com/p/googleappengine/issues/detail?id=217