Building an easy to use search solution

(for different languages)

Ivo Lukač @ J.Boye Aarhus 13: Web & Intranet Conference

“Making search work” track

www.netgenlabs.com

Speaker

•Co-owner of Netgen, a web development agency in Zagreb, Croatia

•Started as a developer 11 years ago

•Now I do a variety of things, but I can best be described as an International Business Developer

So I am still a developer! :)

Use case

•Regulatory reform project: cutting unneeded legislation, laws and/or procedures

•Netgen is the technology implementation partner

•Project led by Sense Consulting

•Croatia, Egypt, Vietnam, Armenia, Iraq - mostly “exotic” countries

We would rather work in Denmark, but it seems that it doesn’t need such a solution :(

How we use search

Solution

•In 2006: a simple filter

•Today: a flexible information architecture powered by eZ Publish CMS, with Solr for search

•Usually 70% common features, 30% customisation

•Aiming for 90%/10%

•If you are interested in tech specifics, ask me later…

Search features

• Simple (default) and advanced search (with filters)

• Full text search on complex data, boosting on attribute level

• Filtering with multilevel tags/taxonomies

• Stopwords

• Search-time spelling suggestions based on indexed data

• Sometimes using faceting on result set
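A minimal sketch of how several of these features could map onto a single Solr request, assuming the eDisMax query parser and a spellcheck component configured on the request handler. The field names (`title`, `summary`, `body`, `tag_ids`), the boost values and the helper `build_solr_query` are hypothetical and only illustrate the idea; they are not taken from the actual project.

```python
from urllib.parse import urlencode


def build_solr_query(phrase, tag_ids=(), facet_field=None):
    """Build the query string for a Solr eDisMax request (illustrative only)."""
    params = [
        ("q", phrase),
        ("defType", "edismax"),
        # Attribute-level boosting: a match in the title counts more
        # than a match in the summary or body (hypothetical fields).
        ("qf", "title^5 summary^2 body"),
        # Ask Solr for spelling suggestions built from the indexed data.
        ("spellcheck", "true"),
        ("rows", "10"),
    ]
    # Advanced search: each selected tag becomes a filter query.
    for tag_id in tag_ids:
        params.append(("fq", f"tag_ids:{tag_id}"))
    # Optional faceting on the result set.
    if facet_field:
        params.append(("facet", "true"))
        params.append(("facet.field", facet_field))
    return urlencode(params)


query_string = build_solr_query("building permit", tag_ids=[12], facet_field="tag_ids")
print(query_string)
```

Keeping the simple search as the default and adding `fq` and facet parameters only when the user asks for them lets one code path serve both the simple and the advanced form.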

Additional features

•Sometimes using multi search

•Typing suggestions

•Latest search phrase list

Challenges

Characters

•At the beginning we didn’t have Unicode - it was a mess!

•Unicode solved a lot of problems but not all

•The same character can be encoded as more than one byte sequence, and this is not normalised by default
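To illustrate the normalisation problem: the letter "č" can be stored either as one code point or as "c" followed by a combining caron. The two look identical on screen but do not compare equal, so a search for one form misses the other. The sketch below uses Python's standard `unicodedata` module and assumes NFC as the target form; the helper name `normalise` is made up for the example, and the same normalisation would have to be applied both at index time and at query time.

```python
import unicodedata


def normalise(text: str) -> str:
    """Return the NFC (composed) form so equivalent strings compare equal."""
    return unicodedata.normalize("NFC", text)


composed = "Luka\u010d"     # "č" as a single code point
decomposed = "Lukac\u030c"  # "c" followed by a combining caron

print(composed == decomposed)                        # False: different code points
print(len(composed.encode("utf-8")),
      len(decomposed.encode("utf-8")))               # 6 7: different byte sequences
print(normalise(composed) == normalise(decomposed))  # True after normalisation
```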

Indexing

•Indexing files such as Word or PDF proved problematic because of character encoding issues

•token delimiter configuration could be language specific

•stemming sometimes supported, sometimes not

Searching

•search phrase input problems

Blind work

•the biggest challenge is that developers don’t know the language

•first level of testing is very hard

•still can’t trust Google Translate

What vehicle would you use to transport 10 cases of Heineken?

How to overcome this?

Main idea

•let’s try to assess search result quality

•use editors for rating (not the public)

•use most frequently searched terms (we can’t test all)

•rate results above the fold
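Picking the terms to rate could look roughly like the sketch below, assuming the search phrases are available as a plain list of strings (for example, exported from a search log). The function name `top_search_terms`, the log format and the lower-casing rule are assumptions made for the illustration.

```python
from collections import Counter


def top_search_terms(search_log, limit=7):
    """Return the `limit` most frequent search phrases with their counts."""
    counts = Counter(
        phrase.strip().lower()   # treat "CMS" and "cms " as the same term
        for phrase in search_log
        if phrase.strip()        # ignore empty searches
    )
    return counts.most_common(limit)


log = ["CMS", "cms ", "solr", "", "Solr", "cms", "search"]
print(top_search_terms(log, limit=2))
```

Editors would then rate only the results for these terms, which keeps the rating effort bounded while covering the searches people actually run.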

The tool

•integrated in the public site

•thumbs up/down buttons added for the first X results, shown only to editors

Demo

•imported articles about CMS topics from various sources into a test instance

•rating result quality of 7 search terms

•thumbs up/down for 3 suggested search results

•Test periods are used for framing test data

Rating side

Analysing side

Rate measures

•Discounted Cumulative Gain (DCG) - rate sum discounted based on position in search results

•Normalised Discounted Cumulative Gain (NDCG) - discounted rate sum normalised against best possible outcome (to get percentage as the unit)

•Popularity-based NDCG - takes into account the popularity of the search term

http://en.wikipedia.org/wiki/Discounted_cumulative_gain
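The three measures can be sketched as follows, assuming a thumbs up counts as 1 and a thumbs down as 0, and using the common log2 position discount from the DCG definition. Two points are assumptions rather than details from the slides: the "best possible outcome" is taken to be the same ratings sorted best-first, and the popularity-based variant is computed as an average of per-term NDCG weighted by how often each term was searched.

```python
import math


def dcg(ratings):
    """Sum of ratings, each discounted by log2(position + 1)."""
    return sum(r / math.log2(pos + 1) for pos, r in enumerate(ratings, start=1))


def ndcg(ratings):
    """DCG divided by the DCG of the ideal (best-first) ordering."""
    ideal = dcg(sorted(ratings, reverse=True))
    # With no good result at all the measure is undefined; report 0.0 here.
    return dcg(ratings) / ideal if ideal > 0 else 0.0


def popularity_weighted_ndcg(rated_terms):
    """Average NDCG over terms, weighted by search count.

    rated_terms is a list of (search_count, ratings) pairs.
    """
    total = sum(count for count, _ in rated_terms)
    if total == 0:
        return 0.0
    return sum(count * ndcg(ratings) for count, ratings in rated_terms) / total


# Ratings for the first three results of one search term: up, down, up.
print(dcg([1, 0, 1]))             # 1.5
print(round(ndcg([1, 0, 1]), 3))  # 0.92
# A popular term with perfect results and a rare term with none.
print(popularity_weighted_ndcg([(30, [1, 1, 1]), (10, [0, 0, 0])]))  # 0.75
```

Returning 0.0 when every result is rated down is one possible convention; it makes "no good result" show up as a bad score instead of an error.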

Known problems

•What if good results are not showing? - something bad is going on with the search engine

•what if there is no good result?

•what about new content added over time?

•at the end of the day, measurements are good for comparing between test periods, not meaningful by themselves

Improvements

•opening rating to public users

•using clicks as rates

•implement a “did you find what you were looking for?” feature

•integrate with analytics

•use rating data to boost particular items in search!

Questions now or later

ivo@netgen.hr ilukac.com/twitter ilukac.com/facebook ilukac.com/gplus ilukac.com/linkedin
