JAB2012 Smart Search Presentation


Smart Search and Beyond
Presentation given at J and Beyond, Bad Nauheim, Germany, May 2012.


1. Smart Search and Beyond

2. Who?
  • Chris Davenport
  • Production Leadership Team

3. Solving the search problem

4. Old Joomla Search Sucks!
  • Cannot rank by relevance across content types
  • Only very crude filtering
  • Can be slow to search

5. Table of Contents
  • 01 Smart Search so far
  • 02 Smart Search in action
  • 03 Smart Search under the hood
  • 04 Smart Search tips and tricks
  • 05 Smart Search where next?

6. A Short History
  • Old Joomla Search
  • Introduced in Mambo
  • Largely unchanged since
  • JXTended Finder for Joomla 1.5
  • Finder Integration Working Group
  • Smart Search for Joomla 2.5
  • Search Working Group

7. Smart Search for Joomla 2.5
  • Separate index
  • Auto-completion
  • Facetted search
  • Relevancy ordering
  • Did you mean?
  • ...and more besides

8. Table of Contents (repeats slide 5)

9. Auto-completion

10. Another example

11. Another example

12. Table of Contents (repeats slide 5)

13. Under the hood

14. A problem in two halves

15. First half: Indexing
  • Diagram labels: Raw data, INDEX

16. Second half: Querying
  • Diagram labels: Search queries, INDEX, Search results

17. Search results
  • Search results are rendered purely from data in the index, not the raw data.

18. Indexing

19. Indexing
  • Parsing
  • Tokenisation
  • Token aggregation
  • Filtration
  • Stemming
  • Analysis
  • Term weighting
  • Classification

20. Terms index

21. Parsing
  • Extract plain text from raw data
  • HTML, RTF supported out-of-the-box
  • PDF, MS Word could be supported
  • For example, HTML
  • Essentially the same as PHP strip_tags

22. Tokenisation
  • Fold to lowercase
  • Special handling for plus, dash, comma, dot and quotes
  • Remove non-alphanumerics
  • Replace multiple spaces with one space
  • Special support for Chinese

23. Token aggregation
  • Example text: On a clear disk you can seek forever
  • Single tokens: on, a, clear, disk, you, can, seek, forever
  • Two-token phrases: on a, a clear, clear disk, disk you, you can, can seek, seek forever
  • Three-token phrases: on a clear, a clear disk, clear disk you, disk you can, you can seek, can seek forever

24. Filtration
  • Stop word removal
  • Not removed, just given a low weight
  • jos_finder_terms_common
  • English only
  • Other languages need to add their common words to the table

25. Stemming
  • fishing, fished, fish, fisher → fish

26. Stemming
  • Snowball is used by default
  • Danish, German, English, Spanish, Finnish, French, Hungarian, Italian, Norwegian, Dutch, Portuguese, Romanian, Russian, Swedish and Turkish
  • BUT it requires PHP extension
  • English only uses a pure PHP stemmer
  • Recommended for all English sites

27. Morphological analysis
  • Currently uses Soundex
  • Not used in search as such
  • Used for the Did you mean? feature
  • If no search results found, then...
  • Match on Soundex code
  • Return nearest term/phrase by Levenshtein distance

28. Term weighting
  Context         Multiplier
  Title           1.7
  Text            0.7
  Meta            1.2
  Path            2.0
  Miscellaneous   0.3

29. Classification

30. Taxonomies
  • Content maps in Administrator
  • Basis for facetted search
  • Multi-level taxonomies not fully supported (yet)

31. Taxonomies - drop-downs

32. Taxonomies - checkboxes

33. Taxonomies - links

34. Database ERD

35. Smart Search Plug-ins
  /plugins
    /content
      /finder
    /finder
      /categories
      /contacts
      /content
      /newsfeeds
      /weblinks
    /system
      /highlight

36. Smart Search Plug-ins
  content/finder           finder/[type]
  onContentBeforeSave      onFinderBeforeSave
  onContentAfterSave       onFinderAfterSave
  onContentAfterDelete     onFinderAfterDelete
  onContentChangeState     onFinderChangeState
  onCategoryChangeState    onFinderCategoryChangeState

37. Query parsing
                         URI argument       Query string
  Terms                  q=Some+text        Some text
  Phrases                q=Some+text        Some text
  Logical operators      q=This+and+that    This and that
  Before a date          d1=2012-05-16      before:2012-05-16
  After a date           d2=2012-05-18      after:2012-05-18
  Content type filter    t[]=98233          type:Articles
  Taxonomy filter        t[]=30922          author:Chris Davenport
  Static filter          f=2
  Highlight              qh=Some+text

38. Results rendering
  • com_finder search (Search results page): default.php, form.php, default_results.php, default_result.php
  • For custom types: default_[type].php
  • mod_finder (Search module): default.php

39. Layout overrides example

40. Alternative override

41. Table of Contents (repeats slide 5)

42. Tips and tricks

43. Tips and tricks
  • HTML Parser
  • Invalid HTML can confuse the parser
  • Invalid UTF8 is ignored
  • Text in attributes is ignored

44. When to do a purge
  • Indexing is incremental so most of the time you don't need to.
  • Changes to taxonomies that do not involve changes to content items
  • Changes to term weights
  • Changing the stemmer
  • Changes to content items that do not trigger the standard content events
  • IMPORTANT: If you have static filters they will be lost when you do a purge.

45. Tuning Smart Search
  • Use the CLI for indexing: http://docs.joomla.org/Setting_up_automatic_Smart_Search_indexing
  • Out of memory issues: Please report out of memory issues so we can understand them better.
  • Reduce batch size: Default is 50. Drop it to 5 or even 1.
  • Terms per batch: Can be increased BUT NEEDS APACHE SERVER CONFIG CHANGE

46. Table of Contents (repeats slide 5)

47. Where next?

48. Search Working Group
  • Meeting at J and Beyond, 19 May 2012, 11:30 AM
  • Stable ready for merge, July 2012
  • Joomla 3.0 release, September 2012
  • Meeting at Joomla World Conference, San Jose, California, November 2012

49. Improved language support
  • Improve common word support
  • Improve stemmer support: Native PHP stemmers?
  • Improve morphological coding: Non-English alternatives to Soundex
  • Mixed language content items: Language tagging of tokens/terms?

50. Other possibilities
  • Preserve static filters on purge/index
  • Decouple indexing via message queues
  • Easier support for range queries
  • Search logging via JLog
  • Variable-length token aggregation
  • Multi-level taxonomies
  • Add parsers for PDF, MS Word

51. Search API
  • Very important going forward
  • Too big a leap for Joomla 3.0
  • Develop in parallel during 3.x cycle
  • Use in Smart Search for Joomla 4.0

52. Documentation
  • http://docs.joomla.org/Category:Smart_Search

53. Questions?

54. Don't forget
  • Search Working Group Meeting
  • Saturday 19 May 2012, 11:30 AM

55. Image Credits
  • Haystack - Mark Duncan, CC-BY-SA 2.0 Generic, http://commons.wikimedia.org/wiki/File%3AHaystack_-_geograph.org.uk_-_462934.jpg
  • Under the hood - ilovebutter, CC-BY 2.0 Generic, http://commons.wikimedia.org/wiki/File:Trabant_601_S_of_Trabi_Safari_in_Dresden_8.jpg
  • Child sucking thumb - Thahira, CC-BY-SA 3.0 Unported, http://commons.wikimedia.org/wiki/File:Sucking_finger.jpg
  • Future car - Arthur C. Bade (1899-1975), Science and Mechanics Publishing, Public domain, http://commons.wikimedia.org/wiki/File:Car_of_the_Future_1950_unrestored.jpg
  • Magician - Kellar: Levitation, magician poster, ca. 1894, CC-BY 2.0 Generic, http://commons.wikimedia.org/wiki/File:Flickr_-_%E2%80%A6trialsanderrors_-_Kellar,_Levitation,_magician_poster,_ca._1894.jpg
  • Index pages - Starbäck (1828-1885) and Föreningens Boktryckeri, Norrköping, Sweden (scanned by Ristesson Ent.), Public domain, http://commons.wikimedia.org/wiki/File:Index_Pages.jpg
  • Twenty Questions - DuMont Television/Rosen Studios, New York-photographer, Public domain, http://commons.wikimedia.org/wiki/File:20_questions_1954.JPG
  • Linnaeus taxonomy - Public domain, http://commons.wikimedia.org/wiki/File:Linnaeus_-_Regnum_Animale_%281735%29.png
  • All other images are Copyright (C) 2012 Chris Davenport unless I've accidentally missed crediting them.
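
The "problem in two halves" from slides 14 to 17 can be illustrated with a toy inverted index. This is only a sketch in Python, not the Smart Search implementation: the class name `TinyIndex`, its methods, and the whitespace tokeniser are all assumptions made for illustration. The point it shows is that search results are built from what the index stored, without going back to the raw data.

```python
from collections import defaultdict


class TinyIndex:
    """Hypothetical toy index: first half indexes, second half queries."""

    def __init__(self):
        self.links = {}                    # link id -> data kept for rendering
        self.postings = defaultdict(set)   # term -> ids of links containing it

    def add(self, link_id, title, url, body):
        # First half: indexing. Only title and url are stored for display;
        # the body is reduced to terms and then discarded.
        self.links[link_id] = {"title": title, "url": url}
        for term in (title + " " + body).lower().split():
            self.postings[term].add(link_id)

    def search(self, query):
        # Second half: querying. Every term must match (logical AND).
        terms = query.lower().split()
        if not terms:
            return []
        matches = set.intersection(
            *(self.postings.get(term, set()) for term in terms)
        )
        # Results are rendered purely from data held in the index.
        return [self.links[link_id] for link_id in sorted(matches)]


index = TinyIndex()
index.add(1, "Smart Search", "/smart", "facetted search for Joomla")
index.add(2, "Old Search", "/old", "only very crude filtering")
results = index.search("smart search")
```

Here `results` holds only the title and URL that were saved at indexing time, which is why changes that bypass the indexer are not reflected until the item is indexed again.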
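
The token aggregation example on slide 23 and the multipliers on slide 28 can be sketched together. This is a simplified Python illustration under stated assumptions: the tokeniser is ASCII-only and skips the special handling for plus, dash, comma, dot, quotes and Chinese mentioned on slide 22, and the scoring rule (occurrences times context multiplier) is a guess, since the slides list the multipliers but not the formula. The function names are hypothetical.

```python
import re

# Context multipliers as listed on slide 28.
CONTEXT_MULTIPLIERS = {
    "title": 1.7,
    "text": 0.7,
    "meta": 1.2,
    "path": 2.0,
    "misc": 0.3,
}


def tokenise(text):
    """Fold to lowercase, replace non-alphanumerics, collapse spaces."""
    cleaned = re.sub(r"[^0-9a-z]+", " ", text.lower())
    return cleaned.split()


def aggregate(tokens, max_len=3):
    """Return every run of 1..max_len consecutive tokens as a phrase."""
    phrases = []
    for size in range(1, max_len + 1):
        for start in range(len(tokens) - size + 1):
            phrases.append(" ".join(tokens[start:start + size]))
    return phrases


def weigh(text, context):
    """Assumed scoring: each occurrence adds the context multiplier."""
    multiplier = CONTEXT_MULTIPLIERS[context]
    weights = {}
    for phrase in aggregate(tokenise(text)):
        weights[phrase] = weights.get(phrase, 0.0) + multiplier
    return weights


phrases = aggregate(tokenise("On a clear disk you can seek forever"))
title_weights = weigh("Smart Search and Beyond", "title")
```

For the slide's eight-word sentence this yields 8 single tokens, 7 two-token phrases and 6 three-token phrases, matching the groups shown on slide 23.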
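
The "Did you mean?" steps on slide 27 (match on Soundex code, then return the nearest term by Levenshtein distance) can be sketched as follows. PHP provides `soundex()` and `levenshtein()` as built-ins; the Python versions below are hand-written stand-ins using the standard American Soundex rules, and `did_you_mean` is a hypothetical helper, not the Smart Search code.

```python
# Soundex digit for each consonant group.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(word):
    """American Soundex: first letter plus three digits, zero padded."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


def levenshtein(a, b):
    """Edit distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def did_you_mean(query, indexed_terms):
    """Suggest the indexed term with the same Soundex code nearest to query."""
    target = soundex(query)
    candidates = [t for t in indexed_terms if soundex(t) == target]
    if not candidates:
        return None
    return min(candidates, key=lambda t: levenshtein(query.lower(), t.lower()))


terms = ["search", "stemming", "taxonomy", "joomla"]
suggestion = did_you_mean("serch", terms)
```

Because Soundex is tuned to English pronunciation, a sketch like this also shows why slide 49 lists non-English alternatives as future work.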
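
The URI arguments listed on slide 37 can be read with a few lines of Python. This is a sketch only: the argument names (`q`, `d1`, `d2`, `t[]`, `f`, `qh`) and the `before:` / `after:` syntax come from the slide, while the function names and the shape of the returned dictionary are assumptions. Mapping `t[]` ids back to labels such as `type:Articles` would need the taxonomy tables and is left out.

```python
from urllib.parse import parse_qs


def parse_search_uri(query_string):
    """Split a Smart Search style query string into its parts."""
    args = parse_qs(query_string)

    def first(name):
        values = args.get(name)
        return values[0] if values else None

    static_filter = first("f")
    return {
        "terms": first("q"),
        "before": first("d1"),
        "after": first("d2"),
        "taxonomy_ids": [int(value) for value in args.get("t[]", [])],
        "static_filter": int(static_filter) if static_filter else None,
        "highlight": first("qh"),
    }


def to_query_text(parsed):
    """Render the terms and date limits in the typed query syntax."""
    parts = []
    if parsed["terms"]:
        parts.append(parsed["terms"])
    if parsed["before"]:
        parts.append("before:" + parsed["before"])
    if parsed["after"]:
        parts.append("after:" + parsed["after"])
    return " ".join(parts)


parsed = parse_search_uri(
    "q=Some+text&d1=2012-05-16&d2=2012-05-18&t[]=98233&t[]=30922&f=2&qh=Some+text"
)
query_text = to_query_text(parsed)
```

The sketch treats the URI form and the typed form as two spellings of the same query, which is how the two columns on the slide line up.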