Smart Search and Beyond
Who?
Chris Davenport
Production Leadership Team
Solving the search problem
Old Joomla Search Sucks!
‣ Cannot rank by relevance across content types
‣ Only very crude filtering
‣ Can be slow to search
Table of Contents
01 Smart Search so far
02 Smart Search in action
03 Smart Search under the hood
04 Smart Search tips and tricks
05 Smart Search where next?
A Short History
‣ Old Joomla Search
• Introduced in Mambo
• Largely unchanged since
‣ JXTended Finder for Joomla 1.5
‣ Finder Integration Working Group
• Smart Search for Joomla 2.5
‣ Search Working Group
Smart Search for Joomla 2.5
‣ Separate index
‣ Auto-completion
‣ Facetted search
‣ Relevancy ordering
‣ Did you mean?
‣ ...and more besides
Auto-completion
Another example
Under the hood
A problem in two halves
First half: Indexing
Raw data → INDEX
Second half: Querying
Search queries → INDEX → Search results
Search results
Search results are rendered purely from data in the index, not the raw data.
Indexing
‣ Parsing
‣ Tokenisation
‣ Token aggregation
‣ Filtration
‣ Stemming
‣ Analysis
‣ Term weighting
‣ Classification
Terms index
Parsing
‣ Extract plain text from raw data
• HTML, RTF supported out-of-the-box
• PDF, MS Word could be supported
‣ For example, HTML
• Essentially the same as PHP strip_tags
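As a rough illustration of the HTML case, the sketch below keeps text nodes and throws away tags and their attributes. It is written in Python rather than PHP, and the `TextExtractor` class and `strip_tags` helper are hypothetical names for this example, not part of Smart Search.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes; tags and their attributes are discarded."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags, never for attribute values.
        self.chunks.append(data)


def strip_tags(html):
    """Return the plain text of an HTML fragment (illustrative helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


plain = strip_tags('<p>Smart <a href="/search" title="ignored">Search</a></p>')
print(plain)
```

Note that the `title` attribute text never reaches the output, so anything stored only in attributes would not be indexed by a parser of this kind.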
Tokenisation
‣ Fold to lowercase
‣ Special handling for plus, dash, comma, dot and quotes
‣ Remove non-alphanumerics
‣ Replace multiple spaces with one space
‣ Special support for Chinese
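A minimal Python sketch of the simpler steps (lowercasing, dropping non-alphanumerics, collapsing spaces) might look like this. The `tokenise` function is a hypothetical name, and the special handling for plus, dash, comma, dot, quotes and Chinese text is deliberately left out.

```python
import re


def tokenise(text):
    """Split text into lowercase word tokens (simplified illustration)."""
    text = text.lower()
    # Replace every character that is not a letter, digit or whitespace.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Replace multiple spaces with one space.
    text = re.sub(r"\s+", " ", text).strip()
    return text.split(" ") if text else []


tokens = tokenise("On a clear disk,  you can SEEK forever!")
print(tokens)
```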
Token aggregation
On a clear disk you can seek forever
on         on a            on a clear
a          a clear         a clear disk
clear      clear disk      clear disk you
disk       disk you        disk you can
you        you can         you can seek
can        can seek        can seek forever
seek       seek forever
forever
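The aggregation shown above can be sketched in a few lines of Python. The `aggregate` function and its `max_words` parameter are hypothetical; the limit of three words per phrase is taken from the example on this slide.

```python
def aggregate(tokens, max_words=3):
    """Build every run of 1..max_words consecutive tokens, in order."""
    phrases = []
    for start in range(len(tokens)):
        for length in range(1, max_words + 1):
            if start + length <= len(tokens):
                phrases.append(" ".join(tokens[start:start + length]))
    return phrases


tokens = "on a clear disk you can seek forever".split()
phrases = aggregate(tokens)
for phrase in phrases:
    print(phrase)
```

Indexing the two-word and three-word phrases alongside single terms is what lets a phrase query be matched directly against the index.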
Filtration
‣ “Stop word removal”
• Not removed, just given a low weight
‣ Common words are listed in the jos_finder_terms_common table
‣ English only
• Other languages need to add their common words to the table
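The idea of down-weighting rather than removing common words can be sketched as follows. The word list, the `COMMON_WEIGHT` value and the `filter_weight` function are all assumptions made for illustration; the slides do not state the actual weight Smart Search applies.

```python
# Illustrative stand-in for the contents of the common words table.
COMMON_WORDS = {"a", "an", "and", "on", "the", "you", "can"}

# Hypothetical low weight; the real value is not given in the slides.
COMMON_WEIGHT = 0.1


def filter_weight(term, base_weight=1.0):
    """Keep every term, but scale common words down instead of dropping them."""
    if term in COMMON_WORDS:
        return base_weight * COMMON_WEIGHT
    return base_weight


for term in ["on", "clear", "disk"]:
    print(term, filter_weight(term))
```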
Stemming
fishing, fished, fish, fisher → fish
‣ “Snowball” is used by default
• Danish, German, English, Spanish, Finnish, French, Hungarian, Italian, Norwegian, Dutch, Portuguese, Romanian, Russian, Swedish and Turkish
• BUT it requires a PHP extension
‣ “English only” uses a pure PHP stemmer
• Recommended for all English sites
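To show what stemming does to the "fish" example, here is a deliberately naive suffix stripper in Python. It is not Snowball and not the pure PHP stemmer mentioned above; the suffix list and the `naive_stem` function are invented for this sketch and will mis-stem many real words.

```python
# Toy suffix list, checked in order; real stemmers use far richer rules.
SUFFIXES = ("ing", "ed", "er", "s")


def naive_stem(word):
    """Strip one known suffix, provided at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["fishing", "fished", "fish", "fisher"]:
    print(word, "->", naive_stem(word))
```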
Morphological analysis
‣ Currently uses Soundex
‣ Not used in search as such
‣ Used for the “Did you mean?” feature
‣ If no search results found, then...
• Match on Soundex code
• Return nearest term/phrase by Levenshtein distance
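The two-step fallback above can be sketched in Python: narrow the indexed terms to those sharing a Soundex code with the query, then pick the closest one by Levenshtein distance. This uses standard American Soundex and a textbook edit distance; the function names and the sample term list are assumptions for illustration, not the Smart Search code.

```python
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Four-character Soundex code, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        if ch not in "hw":  # h and w do not separate repeated codes
            previous = digit
    return (code + "000")[:4]


def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def did_you_mean(query, indexed_terms):
    """Suggest the indexed term with the same Soundex code and smallest distance."""
    target = soundex(query)
    candidates = [t for t in indexed_terms if soundex(t) == target]
    if not candidates:
        return None
    return min(candidates, key=lambda t: levenshtein(query, t))


suggestion = did_you_mean("serch", ["search", "sirius", "smart"])
print(suggestion)
```

Matching on the Soundex code first keeps the expensive distance calculation to a short list of candidates.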
Term weighting
Context          Multiplier
Title            1.7
Text             0.7
Meta             1.2
Path             2.0
Miscellaneous    0.3
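As a sketch of how context multipliers could feed into a term's weight, the Python below adds the multiplier of each context a term appears in. Only the multiplier values come from the slide; summing them per occurrence is an assumption, as are the `term_weights` function and the sample item.

```python
# Multipliers from the slide, keyed by context.
MULTIPLIERS = {"title": 1.7, "text": 0.7, "meta": 1.2, "path": 2.0, "misc": 0.3}


def term_weights(fields):
    """Map each term to the sum of the multipliers of its occurrences.

    `fields` maps a context name to the list of tokens found there.
    Summation is an assumed formula, used here only for illustration.
    """
    weights = {}
    for context, tokens in fields.items():
        for token in tokens:
            weights[token] = weights.get(token, 0.0) + MULTIPLIERS[context]
    return weights


item = {"title": ["smart", "search"], "text": ["search", "index"]}
weights = term_weights(item)
for term, weight in sorted(weights.items()):
    print(term, round(weight, 2))
```

Under this assumption a term in the title outweighs the same term in the body text, which is the effect the table is meant to produce.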
Classification
Taxonomies
‣ “Content maps” in Administrator
‣ Basis for facetted search
‣ Multi-level taxonomies not fully supported (yet)
Taxonomies - drop-downs
Taxonomies - checkboxes
Taxonomies - links
Database ERD
Smart Search Plug-ins
/plugins
    /content
        /finder
    /finder
        /categories
        /contacts
        /content
        /newsfeeds
        /weblinks
    /system
        /highlight
content/finder              finder/[type]
onContentBeforeSave         onFinderBeforeSave
onContentAfterSave          onFinderAfterSave
onContentAfterDelete        onFinderAfterDelete
onContentChangeState        onFinderChangeState
onCategoryChangeState       onFinderCategoryChangeState
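One way to read the two columns is as a relay: the content/finder plug-in listens for the standard content events and re-triggers the matching finder events for the finder/[type] plug-ins. The Python below sketches that pattern only. The `Dispatcher` class, `install_relay` function and the sample arguments are hypothetical and are not the Joomla plug-in API.

```python
# Event names from the slide: content event -> finder event.
EVENT_MAP = {
    "onContentBeforeSave": "onFinderBeforeSave",
    "onContentAfterSave": "onFinderAfterSave",
    "onContentAfterDelete": "onFinderAfterDelete",
    "onContentChangeState": "onFinderChangeState",
    "onCategoryChangeState": "onFinderCategoryChangeState",
}


class Dispatcher:
    """Very small event dispatcher used only for this illustration."""

    def __init__(self):
        self.listeners = {}

    def register(self, event, callback):
        self.listeners.setdefault(event, []).append(callback)

    def trigger(self, event, *args):
        return [callback(*args) for callback in self.listeners.get(event, [])]


def install_relay(dispatcher):
    """Forward each content event to its finder counterpart."""
    for content_event, finder_event in EVENT_MAP.items():
        dispatcher.register(
            content_event,
            lambda *args, event=finder_event: dispatcher.trigger(event, *args),
        )


dispatcher = Dispatcher()
install_relay(dispatcher)

indexed = []
# Stand-in for a finder/[type] plug-in that would update the index.
dispatcher.register("onFinderAfterSave",
                    lambda context, item: indexed.append((context, item)))
dispatcher.trigger("onContentAfterSave", "com_content.article", {"id": 1})
print(indexed)
```

A relay like this is why content changes that bypass the standard content events never reach the index.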
Query parsing
                       URI argument       Query string
Terms                  q=Some+text        Some text
Phrases                q="Some+text"      "Some text"
Logical operators      q=This+and+that    This and that
Before a date          d1=2012-05-16      before:2012-05-16
After a date           d2=2012-05-18      after:2012-05-18
Content type filter    t[]=98233          type:Articles
Taxonomy filter        t[]=30922          author:Chris Davenport
Static filter          f=2
Highlight              qh=Some+text
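To make the URI arguments concrete, the Python sketch below pulls them out of a query string with the standard library. The argument names (`q`, `d1`, `d2`, `t[]`, `f`, `qh`) come from the table; the `parse_search_uri` function and the shape of the result are assumptions for illustration.

```python
from urllib.parse import parse_qs


def parse_search_uri(query):
    """Turn a Smart Search style query string into a plain dict."""
    args = parse_qs(query)  # decodes '+' to a space, keeps repeated keys as lists
    static = args.get("f", [None])[0]
    return {
        "terms": args.get("q", [""])[0],
        "before": args.get("d1", [None])[0],
        "after": args.get("d2", [None])[0],
        "filters": [int(value) for value in args.get("t[]", [])],
        "static_filter": int(static) if static is not None else None,
        "highlight": args.get("qh", [None])[0],
    }


parsed = parse_search_uri("q=Some+text&d1=2012-05-16&t[]=98233&t[]=30922&f=2")
print(parsed)
```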
Results rendering
‣ com_finder (search results page)
  • search
    ‣ default.php
    ‣ form.php
    ‣ default_results.php
    ‣ default_result.php
    ‣ default_[type].php (for custom types)
‣ mod_finder (search module)
    ‣ default.php
Layout overrides example
Alternative override
Tips and tricks
‣ HTML Parser
• Invalid HTML can confuse the parser
• Invalid UTF-8 is ignored
• Text in attributes is ignored
When to do a purge
‣ Indexing is incremental, so most of the time you don't need to.
‣ Purge and re-index after:
• Changes to taxonomies that do not involve changes to content items
• Changes to term weights
• Changing the stemmer
• Changes to content items that do not trigger the standard content events
‣ IMPORTANT
• If you have static filters they will be lost when you do a purge.
Tuning Smart Search
‣ Use the CLI for indexing
• http://docs.joomla.org/Setting_up_automatic_Smart_Search_indexing
‣ Out of memory issues
• Please report out of memory issues so we can understand them better.
• Reduce batch size
‣ Default is 50. Drop it to 5 or even 1.
• Terms per batch
‣ Can be increased BUT NEEDS APACHE SERVER CONFIG CHANGE
Where next?
Search Working Group
‣ Meeting at J and Beyond
• 19 May 2012 11:30 AM
‣ Stable ready for merge July 2012
‣ Joomla 3.0 release September 2012
‣ Meeting at Joomla World Conference
• San Jose, California, November 2012
Improved language support
‣ Improve common word support
‣ Improve stemmer support
• Native PHP stemmers?
‣ Improve morphological coding
• Non-English alternatives to Soundex
‣ Mixed language content items
• Language tagging of tokens/terms?
Other possibilities
‣ Preserve static filters on purge/index
‣ Decouple indexing via message queues
‣ Easier support for range queries
‣ Search logging via JLog
‣ Variable-length token aggregation
‣ Multi-level taxonomies
‣ Add parsers for PDF, MS Word
Search API
‣ Very important going forward
‣ Too big a leap for Joomla 3.0
‣ Develop in parallel during 3.x cycle
‣ Use in Smart Search for Joomla 4.0
Documentation
http://docs.joomla.org/Category:Smart_Search
Questions?
Don't forget
Search Working Group Meeting
Saturday 19 May 2012
11:30 AM
Image Credits
Haystack - Mark Duncan CC-BY-SA 2.0 Generic http://commons.wikimedia.org/wiki/File%3AHaystack_-_geograph.org.uk_-_462934.jpg
Under the hood - ilovebutter CC-BY 2.0 Generic http://commons.wikimedia.org/wiki/File:Trabant_601_S_of_Trabi_Safari_in_Dresden_8.jpg
Child sucking thumb - Thahira CC-BY-SA 3.0 Unported http://commons.wikimedia.org/wiki/File:Sucking_finger.jpg
Future car - Arthur C. Bade (1899–1975), Science and Mechanics Publishing - Public domain http://commons.wikimedia.org/wiki/File:Car_of_the_Future_1950_unrestored.jpg
Magician - Kellar: Levitation, magician poster, ca. 1894 - CC-BY 2.0 Generic http://commons.wikimedia.org/wiki/File:Flickr_-_%E2%80%A6trialsanderrors_-_Kellar,_Levitation,_magician_poster,_ca._1894.jpg
Index pages - Starbäck (1828-1885) and Föreningens Boktryckeri, Norrköping, Sweden (scanned by Ristesson Ent.) - Public domain http://commons.wikimedia.org/wiki/File:Index_Pages.jpg
Twenty Questions - DuMont Television/Rosen Studios, New York-photographer. - Public domain http://commons.wikimedia.org/wiki/File:20_questions_1954.JPG
Linnaeus taxonomy - Public domain http://commons.wikimedia.org/wiki/File:Linnaeus_-_Regnum_Animale_%281735%29.png
All other images are Copyright (C) 2012 Chris Davenport unless I've accidentally missed crediting them.