Transcript
Page 1: JAB2012 Smart Search Presentation

Smart Search and Beyond

Page 2: JAB2012 Smart Search Presentation

Who?

Chris Davenport

Production Leadership Team

Page 3: JAB2012 Smart Search Presentation

Solving the search problem

Page 4: JAB2012 Smart Search Presentation

Old Joomla Search Sucks!

Cannot rank by relevance across content types

Only very crude filtering

Can be slow to search

Page 5: JAB2012 Smart Search Presentation

Table of Contents

01 Smart Search so far

02 Smart Search in action

03 Smart Search under the hood

04 Smart Search tips and tricks

05 Smart Search where next?

Page 6: JAB2012 Smart Search Presentation

A Short History

‣ Old Joomla Search

• Introduced in Mambo

• Largely unchanged since

‣ JXTended Finder for Joomla 1.5

‣ Finder Integration Working Group

• Smart Search for Joomla 2.5

‣ Search Working Group

Page 7: JAB2012 Smart Search Presentation

Smart Search for Joomla 2.5

‣ Separate index

‣ Auto-completion

‣ Facetted search

‣ Relevancy ordering

‣ Did you mean?

‣ ...and more besides

Page 8: JAB2012 Smart Search Presentation

Table of Contents

01 Smart Search so far

02 Smart Search in action

03 Smart Search under the hood

04 Smart Search tips and tricks

05 Smart Search where next?

Page 9: JAB2012 Smart Search Presentation

Auto-completion

Page 10: JAB2012 Smart Search Presentation

Another example

Page 11: JAB2012 Smart Search Presentation

Another example

Page 12: JAB2012 Smart Search Presentation

Table of Contents

01 Smart Search so far

02 Smart Search in action

03 Smart Search under the hood

04 Smart Search tips and tricks

05 Smart Search where next?

Page 13: JAB2012 Smart Search Presentation

Under the hood

Page 14: JAB2012 Smart Search Presentation

A problem in two halves

Page 15: JAB2012 Smart Search Presentation

First half: Indexing

Raw data → INDEX

Page 16: JAB2012 Smart Search Presentation

Second half: Querying

Search queries → INDEX → Search results

Page 17: JAB2012 Smart Search Presentation

Search results

Search results are rendered purely from data in the index, not the raw data.
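
The two halves can be sketched as a tiny inverted index in Python. This is only an illustration, not the Smart Search schema: the `index_item` and `search` names and the stored fields are assumptions. The point it shows is that `search` builds its results purely from what was stored at index time.

```python
from collections import defaultdict

# Hypothetical in-memory index: term -> set of item ids,
# plus the display data stored for each item at index time.
terms = defaultdict(set)
stored = {}

def index_item(item_id, title, text):
    """First half: read the raw data once and write what is needed into the index."""
    stored[item_id] = {"title": title, "summary": text[:40]}
    for word in (title + " " + text).lower().split():
        terms[word].add(item_id)

def search(query):
    """Second half: answer queries from the index alone, never from the raw data."""
    words = query.lower().split()
    if not words:
        return []
    # An item matches only if it contains every query word.
    ids = set.intersection(*(terms.get(w, set()) for w in words))
    return [stored[i] for i in sorted(ids)]

index_item(1, "Haystacks", "Finding a needle in a haystack")
index_item(2, "Needles", "A needle is small")
```

Because the results come from `stored`, a change to the raw data is invisible to searches until the item is indexed again.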

Page 18: JAB2012 Smart Search Presentation

Indexing

Page 19: JAB2012 Smart Search Presentation

Indexing

Parsing

Tokenisation

Token aggregation

Filtration

Stemming

Analysis

Term weighting

Classification

Page 20: JAB2012 Smart Search Presentation

Terms index

Page 21: JAB2012 Smart Search Presentation

Parsing

‣ Extract plain text from raw data

• HTML, RTF supported out-of-the-box

• PDF, MS Word could be supported

‣ For example, HTML

• Essentially the same as PHP strip_tags
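
A rough Python equivalent of the HTML case, using the standard library's `html.parser`. It is a sketch of the idea (keep text nodes, drop tags and attributes), not the Smart Search parser; `TextExtractor` and `html_to_text` are illustrative names.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes and drops tags, roughly like PHP's strip_tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; attribute values never arrive here.
        self.chunks.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Join the text nodes and normalise runs of whitespace.
    return " ".join("".join(parser.chunks).split())
```

For example, `html_to_text('<a href="x" title="ignored">link</a>')` keeps only the word `link`, because text held in attributes is never passed to `handle_data`.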

Page 22: JAB2012 Smart Search Presentation

Tokenisation

‣ Fold to lowercase

‣ Special handling for plus, dash, comma, dot and quotes

‣ Remove non-alphanumerics

‣ Replace multiple spaces with one space

‣ Special support for Chinese
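
A minimal Python sketch of the generic steps (lowercase, drop non-alphanumerics, collapse spaces). The special handling of plus, dash, comma, dot and quotes and the Chinese support are not modelled, and replacing punctuation with a space rather than deleting it outright is an assumption.

```python
import re

def tokenise(text):
    """Lowercase, drop non-alphanumerics, collapse whitespace, split into tokens."""
    text = text.lower()
    # Replace anything that is not a letter, digit or whitespace with a space.
    # Assumption: punctuation becomes a separator instead of vanishing.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Replace multiple spaces with one space.
    text = re.sub(r"\s+", " ", text).strip()
    return text.split(" ") if text else []
```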

Page 23: JAB2012 Smart Search Presentation

Token aggregation

On a clear disk you can seek forever

on | on a | on a clear
a | a clear | a clear disk
clear | clear disk | clear disk you
disk | disk you | disk you can
you | you can | you can seek
can | can seek | can seek forever
seek | seek forever
forever
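
The aggregation shown on this slide amounts to taking every run of one, two or three consecutive tokens. A short Python sketch, with `aggregate` and `max_len` as illustrative names:

```python
def aggregate(tokens, max_len=3):
    """Return every run of 1..max_len consecutive tokens as a phrase."""
    phrases = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            # Skip runs that would extend past the end of the token list.
            if start + length <= len(tokens):
                phrases.append(" ".join(tokens[start:start + length]))
    return phrases

tokens = "on a clear disk you can seek forever".split()
phrases = aggregate(tokens)
```

For the eight tokens of the example sentence this produces 21 phrases, starting with `on`, `on a`, `on a clear` and ending with `seek`, `seek forever`, `forever`.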

Page 24: JAB2012 Smart Search Presentation

Filtration

‣ “Stop word removal”

• Not removed, just given a low weight

‣ jos_finder_terms_common

‣ English only

• Other languages need to add their common words to the table
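
The "not removed, just given a low weight" idea can be sketched in Python as follows. The word list stands in for the contents of jos_finder_terms_common, and both weight values are invented for illustration; the slide does not state the real ones.

```python
# Stand-in for the rows of jos_finder_terms_common (illustrative subset).
COMMON_WORDS = {"a", "on", "you", "can"}

# Assumed weights, chosen only to show the idea.
COMMON_WEIGHT = 0.1
NORMAL_WEIGHT = 1.0

def weigh(tokens):
    """Keep every token, but give common words a much lower weight."""
    return [(t, COMMON_WEIGHT if t in COMMON_WORDS else NORMAL_WEIGHT)
            for t in tokens]
```

Keeping common words in the index means a phrase such as "on a clear disk" can still be matched as a whole.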

Page 25: JAB2012 Smart Search Presentation

Stemming

fishing, fished, fish, fisher → fish

Page 26: JAB2012 Smart Search Presentation

Stemming

‣ “Snowball” is used by default

• Danish, German, English, Spanish, Finnish, French, Hungarian, Italian, Norwegian, Dutch, Portuguese, Romanian, Russian, Swedish and Turkish

• BUT it requires a PHP extension

‣ “English only” uses a pure PHP stemmer

• Recommended for all English sites
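
To make the idea of stemming concrete, here is a toy suffix stripper in Python. It is deliberately crude and is neither Snowball nor the pure PHP stemmer mentioned above; the suffix list and the minimum stem length are assumptions picked so that the "fish" example works.

```python
# Toy suffix list, far cruder than a real stemming algorithm.
SUFFIXES = ("ing", "ed", "er", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = {w: toy_stem(w) for w in ("fishing", "fished", "fish", "fisher")}
```

All four words map to the same stem, so a search for any one of them can match content containing the others.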

Page 27: JAB2012 Smart Search Presentation

Morphological analysis

‣ Currently uses Soundex

‣ Not used in search as such

‣ Used for the “Did you mean?” feature

‣ If no search results found, then...

• Match on Soundex code

• Return nearest term/phrase by Levenshtein distance
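
The two steps of the fallback can be sketched in Python: filter the indexed terms by Soundex code, then pick the closest one by Levenshtein distance. This uses the standard American Soundex rules and a textbook edit-distance routine; the `did_you_mean` function and its term list are hypothetical and do not reproduce the Smart Search code.

```python
# Letter -> Soundex digit; vowels, h, w and y have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Four-character Soundex code: first letter plus three digits."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = code
    return (result + "000")[:4]

def levenshtein(a, b):
    """Edit distance computed one row at a time."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        row = [i]
        for j, cb in enumerate(b, 1):
            row.append(min(prev_row[j] + 1,                 # deletion
                           row[j - 1] + 1,                  # insertion
                           prev_row[j - 1] + (ca != cb)))   # substitution
        prev_row = row
    return prev_row[-1]

def did_you_mean(query, indexed_terms):
    """Suggest the indexed term that sounds alike and is nearest in spelling."""
    target = soundex(query)
    candidates = [t for t in indexed_terms if soundex(t) == target]
    if not candidates:
        return None
    return min(candidates, key=lambda t: levenshtein(query, t))
```

With the hypothetical term list `["joomla", "jamalia", "jumble", "search"]`, a query for `jomla` shares its Soundex code with the first two terms, and `joomla` wins because it is only one edit away.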

Page 28: JAB2012 Smart Search Presentation

Term weighting

Context | Multiplier
Title | 1.7
Text | 0.7
Meta | 1.2
Path | 2.0
Miscellaneous | 0.3
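
One way to picture how the multipliers might be used is a weighted sum over the contexts in which a term occurs. The multiplier values come from the slide; combining them as a simple sum of multiplier times occurrence count is an assumption made for this Python sketch, and `score` is an illustrative name.

```python
# Multipliers from the slide, keyed by context.
MULTIPLIERS = {
    "title": 1.7,
    "text": 0.7,
    "meta": 1.2,
    "path": 2.0,
    "misc": 0.3,
}

def score(occurrences):
    """occurrences maps a context name to how often the term appears there.

    Assumption: the contributions are simply added together.
    """
    return sum(MULTIPLIERS[ctx] * count for ctx, count in occurrences.items())

in_title = score({"title": 1})
in_text = score({"text": 2})
```

Under this assumption a single occurrence in the title outweighs two occurrences in the body text.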

Page 29: JAB2012 Smart Search Presentation

Classification

Page 30: JAB2012 Smart Search Presentation

Taxonomies

‣ “Content maps” in Administrator

‣ Basis for facetted search

‣ Multi-level taxonomies not fully supported (yet)

Page 31: JAB2012 Smart Search Presentation

Taxonomies - drop-downs

Page 32: JAB2012 Smart Search Presentation

Taxonomies - checkboxes

Page 33: JAB2012 Smart Search Presentation

Taxonomies - links

Page 34: JAB2012 Smart Search Presentation

Database ERD

Page 35: JAB2012 Smart Search Presentation

Smart Search Plug-ins

/plugins
    /content
        /finder
    /finder
        /categories
        /contacts
        /content
        /newsfeeds
        /weblinks
    /system
        /highlight

Page 36: JAB2012 Smart Search Presentation

Smart Search Plug-ins

content/finder | finder/[type]
onContentBeforeSave | onFinderBeforeSave
onContentAfterSave | onFinderAfterSave
onContentAfterDelete | onFinderAfterDelete
onContentChangeState | onFinderChangeState
onCategoryChangeState | onFinderCategoryChangeState

Page 37: JAB2012 Smart Search Presentation

Query parsing

Feature | URI argument | Query string
Terms | q=Some+text | Some text
Phrases | q="Some+text" | "Some text"
Logical operators | q=This+and+that | This and that
Before a date | d1=2012-05-16 | before:2012-05-16
After a date | d2=2012-05-18 | after:2012-05-18
Content type filter | t[]=98233 | type:Articles
Taxonomy filter | t[]=30922 | author:Chris Davenport
Static filter | f=2 |
Highlight | qh=Some+text |
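
The URI arguments in the table can be assembled with Python's `urllib.parse.urlencode`. The argument names (`q`, `d1`, `d2`, `t[]`, `f`) are taken from the slide; the `build_search_query` helper and its parameter names are hypothetical.

```python
from urllib.parse import urlencode

def build_search_query(terms, before=None, after=None,
                       taxonomy_ids=(), static_filter=None):
    """Build the query-string part of a search URL from the slide's arguments."""
    params = [("q", terms)]
    if before:
        params.append(("d1", before))   # before a date
    if after:
        params.append(("d2", after))    # after a date
    # One t[] entry per content type or taxonomy filter.
    params.extend(("t[]", str(node_id)) for node_id in taxonomy_ids)
    if static_filter is not None:
        params.append(("f", str(static_filter)))
    return urlencode(params)

query = build_search_query("Some text", before="2012-05-16",
                           taxonomy_ids=[98233])
```

Note that `urlencode` percent-encodes the square brackets, so `t[]` appears as `t%5B%5D` in the result; PHP decodes both spellings to the same array argument.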

Page 38: JAB2012 Smart Search Presentation

Results rendering

‣ com_finder (search results page)

    • search

        ‣ default.php

        ‣ form.php

        ‣ default_results.php

        ‣ default_result.php

        ‣ default_[type].php (for custom types)

‣ mod_finder (search module)

    ‣ default.php

Page 39: JAB2012 Smart Search Presentation

Layout overrides example

Page 40: JAB2012 Smart Search Presentation

Alternative override

Page 41: JAB2012 Smart Search Presentation

Table of Contents

01 Smart Search so far

02 Smart Search in action

03 Smart Search under the hood

04 Smart Search tips and tricks

05 Smart Search where next?

Page 42: JAB2012 Smart Search Presentation

Tips and tricks

Page 43: JAB2012 Smart Search Presentation

Tips and tricks

‣ HTML Parser

• Invalid HTML can confuse the parser

• Invalid UTF8 is ignored

• Text in attributes is ignored

Page 44: JAB2012 Smart Search Presentation

When to do a purge

‣ Indexing is incremental so most of the time you don't need to.

‣ Changes to taxonomies that do not involve changes to content items

‣ Changes to term weights

‣ Changing the stemmer

‣ Changes to content items that do not trigger the standard content events

‣ IMPORTANT

• If you have static filters they will be lost when you do a purge.

Page 45: JAB2012 Smart Search Presentation

Tuning Smart Search

‣ Use the CLI for indexing

• http://docs.joomla.org/Setting_up_automatic_Smart_Search_indexing

‣ Out of memory issues

• Please report out of memory issues so we can understand them better.

• Reduce batch size

‣ Default is 50. Drop it to 5 or even 1.

• Terms per batch

‣ Can be increased BUT NEEDS APACHE SERVER CONFIG CHANGE
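
Why a smaller batch size helps with memory can be seen from a generic batching loop: only one batch of items is held at a time. This Python sketch is not the Smart Search indexer; `batches`, `index_all` and the `index_one` callback are illustrative, and only the default of 50 comes from the slide.

```python
def batches(items, size=50):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def index_all(items, index_one, size=50):
    """Index items batch by batch; returns the number of batches processed."""
    count = 0
    for batch in batches(items, size):
        for item in batch:
            index_one(item)
        count += 1
    return count
```

With 120 items, the default size gives 3 batches, while dropping the size to 5 gives 24 smaller ones: more round trips, but less held in memory per step.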

Page 46: JAB2012 Smart Search Presentation

Table of Contents

01 Smart Search so far

02 Smart Search in action

03 Smart Search under the hood

04 Smart Search tips and tricks

05 Smart Search where next?

Page 47: JAB2012 Smart Search Presentation

Where next?

Page 48: JAB2012 Smart Search Presentation

Search Working Group

‣ Meeting at J and Beyond

• 19 May 2012 11:30 AM

‣ Stable ready for merge July 2012

‣ Joomla 3.0 release September 2012

‣ Meeting at Joomla World Conference

• San Jose, California, November 2012

Page 49: JAB2012 Smart Search Presentation

Improved language support

‣ Improve common word support

‣ Improve stemmer support

• Native PHP stemmers?

‣ Improve morphological coding

• Non-English alternatives to Soundex

‣ Mixed language content items

• Language tagging of tokens/terms?

Page 50: JAB2012 Smart Search Presentation

Other possibilities

‣ Preserve static filters on purge/index

‣ Decouple indexing via message queues

‣ Easier support for range queries

‣ Search logging via JLog

‣ Variable-length token aggregation

‣ Multi-level taxonomies

‣ Add parsers for PDF, MS Word

Page 51: JAB2012 Smart Search Presentation

Search API

‣ Very important going forward

‣ Too big a leap for Joomla 3.0

‣ Develop in parallel during 3.x cycle

‣ Use in Smart Search for Joomla 4.0

Page 52: JAB2012 Smart Search Presentation

Documentation

http://docs.joomla.org/Category:Smart_Search

Page 53: JAB2012 Smart Search Presentation

Questions?

Page 54: JAB2012 Smart Search Presentation

Don't forget

Search Working Group Meeting

Saturday 19 May 2012

11:30 AM

Page 55: JAB2012 Smart Search Presentation

Image Credits

Haystack - Mark Duncan CC-BY-SA 2.0 Generic: http://commons.wikimedia.org/wiki/File%3AHaystack_-_geograph.org.uk_-_462934.jpg

Under the hood - ilovebutter CC-BY 2.0 Generic: http://commons.wikimedia.org/wiki/File:Trabant_601_S_of_Trabi_Safari_in_Dresden_8.jpg

Child sucking thumb - Thahira CC-BY-SA 3.0 Unported: http://commons.wikimedia.org/wiki/File:Sucking_finger.jpg

Future car - Arthur C. Bade (1899–1975), Science and Mechanics Publishing - Public domain: http://commons.wikimedia.org/wiki/File:Car_of_the_Future_1950_unrestored.jpg

Magician - Kellar: Levitation, magician poster, ca. 1894 - CC-BY 2.0 Generic: http://commons.wikimedia.org/wiki/File:Flickr_-_%E2%80%A6trialsanderrors_-_Kellar,_Levitation,_magician_poster,_ca._1894.jpg

Index pages - Starbäck (1828-1885) and Föreningens Boktryckeri, Norrköping, Sweden (scanned by Ristesson Ent.) - Public domain: http://commons.wikimedia.org/wiki/File:Index_Pages.jpg

Twenty Questions - DuMont Television/Rosen Studios, New York-photographer. - Public domain: http://commons.wikimedia.org/wiki/File:20_questions_1954.JPG

Linnaeus taxonomy - Public domain: http://commons.wikimedia.org/wiki/File:Linnaeus_-_Regnum_Animale_%281735%29.png

All other images are Copyright (C) 2012 Chris Davenport unless I've accidentally missed crediting them.