Search explained T3DD15

Search explained

My name is Hans Höchtl

Technical director @ Onedrop Solutions

PHP, Java, Ruby Developer

Participation in TYPO3 Solr

SELECT * FROM mytable WHERE field LIKE '%searchword%'

SELECT * FROM mytable WHERE field SOUNDS LIKE 'searchword'

Appearance of a word inside a text can be determined easily.

But is it relevant?

Relevance is subjective and depends on the judgement of users.

We use „scoring“ to predict relevance.

Scoring is computed by a function applied on our indexed documents using the search term as input parameter.

TF-IDF: Term frequency-inverse document frequency

BM25: Okapi BM25 (Best Matching)

DFR: Divergence from randomness

and many more

All those scoring calculations should fulfill these two requirements:

1. Precision: Are the results relevant to the user?

2. Recall: Have we found all relevant content in the index?
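
As a rough illustration of these two requirements, precision and recall can be computed from the set of returned documents and the set of documents a user would judge relevant. The document IDs below are made up for the example, and the helper name `precision_recall` is only illustrative.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: the engine returned 4 documents, 5 are relevant in total.
retrieved = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc1", "doc2", "doc5", "doc6", "doc7"}
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Two of the four returned documents are relevant (precision), and two of the five relevant documents were found (recall).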

How to store documents for efficient computing of scoring?

Vector Space Model (default in Solr, Elasticsearch)

Document: A vector of terms

Term: A „word“ inside a document

Each unique term is a dimension

Vector Space Model

The best match is the narrowest angle between query and document

Document 1

„unique unique bag“

Document 2

„unique bag bag“

Query

„unique bag“

[Figure: the vectors v(d1), v(q) and v(d2) drawn in a plane with the axes „unique“ and „bag“]

The calculation of the cosine of the angle between the vectors is much easier than the calculation of the angle itself. (CPU cycles)

cos(θ) = (d2 * q) / (||d2|| * ||q||)

Where d2 * q is the dot product of the document and the query vectors.

||q|| is the norm (length) of the vector q

A cosine value of zero means that the query and document vector are orthogonal and have no match.
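
A minimal sketch of this calculation for the two example documents and the query, assuming raw term counts as vector components and plain whitespace tokenization. The function names are illustrative only.

```python
import math
from collections import Counter


def term_vector(text):
    """Count how often each term occurs (whitespace tokenization only)."""
    return Counter(text.split())


def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


d1 = term_vector("unique unique bag")
d2 = term_vector("unique bag bag")
q = term_vector("unique bag")

print(cosine(d1, q), cosine(d2, q))
print(cosine(term_vector("house"), q))  # no shared term: orthogonal vectors
```

With raw counts both documents are equally close to the query, because each one repeats exactly one of the two query terms. A weighting such as TF-IDF is what separates them.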

TF-IDF

In the vector space model (VSM), the weight of a term t in the vector of a document d is now represented as:

w(t,d) = tf(t,d) * idf(t)

tf(t,d): Term frequency

idf(t): Inverse document frequency

TF-IDF

Now we have everything together to calculate the similarity between documents using TF-IDF:
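
The combination can be sketched as follows, assuming the textbook variant tf * log(N / df) for the term weights. The four-document corpus is hypothetical and tokenization is a plain whitespace split; real engines use refined versions of these formulas.

```python
import math
from collections import Counter

# Hypothetical mini corpus.
corpus = {
    "d1": "unique unique bag",
    "d2": "unique bag bag",
    "d3": "bag of words",
    "d4": "words words words",
}
docs = {name: Counter(text.split()) for name, text in corpus.items()}


def idf(term):
    """Inverse document frequency: log(N / number of docs containing the term)."""
    df = sum(1 for counts in docs.values() if term in counts)
    return math.log(len(docs) / df) if df else 0.0


def tfidf_vector(counts):
    """Weight every term of a document (or query) by tf * idf."""
    return {term: freq * idf(term) for term, freq in counts.items()}


def cosine(a, b):
    """Cosine similarity of two sparse weight vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = tfidf_vector(Counter("unique bag".split()))
scores = {name: cosine(tfidf_vector(counts), query) for name, counts in docs.items()}
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(name, round(score, 3))
```

In this corpus d1 ranks above d2: „unique“ occurs in fewer documents than „bag“, so it carries more weight, and d1 contains it twice.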

TF-IDF

PROs

- Simple model based on linear algebra

- Term weights not binary

- Allows computing a continuous degree of similarity between queries and documents

- Allows ranking of documents according to their possible relevance

- Allows partial matching

CONs

- Long documents have poor similarity values (small scalar product and large dimensionality)

- Search keywords must precisely match terms

- Missing semantic sensitivity

- Order of terms in document not taken into account

- Terms are usually not statistically independent (as this model assumes)

TF-IDF - The Lucene way

Coord: Boosts documents that match more of the search terms (multiple words) => 3/4 vs 4/4

Norm: Length normalization boosts fields that are shorter
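
A small sketch of these two factors, assuming the formulas of Lucene's classic TF-IDF similarity (coord as the matched fraction of query terms, length norm as 1 / sqrt(number of terms)). Index-time boosts and the lossy encoding of the stored norm are left out.

```python
import math


def coord(overlap, max_overlap):
    """Fraction of the query terms that the document matches."""
    return overlap / max_overlap


def length_norm(num_terms):
    """Shorter fields get a larger factor: 1 / sqrt(number of terms)."""
    return 1.0 / math.sqrt(num_terms)


print(coord(3, 4), coord(4, 4))            # 3 of 4 terms matched vs. all 4
print(length_norm(4), length_norm(100))    # short field vs. long field
```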

TF-IDF - Multiple fields

TF-IDF expects a document to be just one field containing text. But in reality we have semi-structured documents containing fields like author, subtitle, etc.

TF-IDF - Multiple fields

Solr Solution: DisMax Query Parser (Disjunction Max)

Searchterm: „my funny house“

Documents matching query in field title

Documents matching query in field subtitle

Documents matching query in field content

TF-IDF is calculated for every field independently. The score of a document is the highest of its field scores.
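
The idea can be sketched in a few lines. The per-field scores are invented for the example, and `dismax_score` is an illustrative name; the `tie` argument mirrors DisMax's tie parameter, which lets the non-winning fields contribute a fraction of their score (0.0 means pure maximum).

```python
def dismax_score(field_scores, tie=0.0):
    """Highest field score plus `tie` times the sum of the remaining ones."""
    if not field_scores:
        return 0.0
    best = max(field_scores.values())
    others = sum(field_scores.values()) - best
    return best + tie * others


# Hypothetical per-field TF-IDF scores of one document.
field_scores = {"title": 0.8, "subtitle": 0.3, "content": 0.5}
print(dismax_score(field_scores))           # only the best field counts
print(dismax_score(field_scores, tie=0.1))  # other fields add a little
```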

Natural languages

Adjectives, Adverbs, Nouns, Verbs, Conjunctions, Prepositions, Predicates, Compounds, Plurals, Past tense, Declension, Semantics, etc.

Language families

Indo-European languages

Sino-Tibetan languages

TF-IDF Problem

Only exact term matches are considered a hit.

„Car“ is not the same term as „Cars“

Handling human languages (Analyzers)

Tokenizers: Split a stream of characters into a series of tokens.

Filters: The generated tokens are passed through a series of filters that add, change or remove tokens.

Index Analyzers vs. Query Analyzers

Index Analyzers: Perform their analysis chain on the token stream during indexing. The generated tokens will be indexed.

Query Analyzers: Perform their analysis chain on the entered search query during query execution. Otherwise the query would hit just an exact match.
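
A toy analyzer chain shows why the same analysis has to run at index time and at query time. This is only a sketch: the tokenizer, the stopword list and the naive plural filter are stand-ins for the far more capable components in Solr and Elasticsearch, and all names are illustrative.

```python
import re


def tokenize(text):
    """Tokenizer: split a character stream into word tokens."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    return [t.lower() for t in tokens]


def stop_filter(tokens, stopwords=frozenset({"the", "a", "is"})):
    return [t for t in tokens if t not in stopwords]


def plural_filter(tokens):
    """Very naive stemming: strip a trailing 's' from longer tokens."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def analyze(text, filters=(lowercase_filter, stop_filter, plural_filter)):
    """Run the tokenizer, then pass the tokens through every filter in order."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


# Index time: analyze each document and build an inverted index.
documents = {1: "The Cars are fast", 2: "A car is parked"}
index = {}
for doc_id, text in documents.items():
    for token in analyze(text):
        index.setdefault(token, set()).add(doc_id)

# Query time: the same chain turns "Cars" into the indexed token "car".
query_tokens = analyze("Cars")
print(query_tokens, [sorted(index.get(t, ())) for t in query_tokens])
```

Without the query-side analysis, the raw query „Cars“ would not be found in the index at all, since only the token „car“ was stored.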

Beware of Synonyms!

Available analyzers

Solr (https://goo.gl/TXEjZK) Language best practices (https://goo.gl/11O2Qz)

Elasticsearch (https://goo.gl/QR1IYb) Language best practices (https://goo.gl/6FQt7A)

FieldTypes

Solr and Elasticsearch use fieldTypes assigned to fields for defining the analyzer chain that should be performed

Let’s take a look at the configuration of TYPO3 Solr and Neos Elasticsearch

Let’s test the analyzer chain

Solr and Elasticsearch

Display score calculation

Solr: /solr/core_de/select?q=test&debugQuery=1

Elasticsearch: /_explain instead of /_search

Let’s take a look at

0.51602894 = (MATCH) sum of:
  0.51602894 = (MATCH) max of:
    0.51602894 = (MATCH) weight(content:sony^40.0 in 5) [DefaultSimilarity], result of:
      0.51602894 = fieldWeight in 5, product of:
        2.0 = tf(freq=4.0), with freq of:
          4.0 = termFreq=4.0
        3.3025851 = idf(docFreq=4, maxDocs=50)
        0.078125 = fieldNorm(doc=5)
    0.16512926 = (MATCH) weight(keywords:sony^2.0 in 5) [DefaultSimilarity], result of:
      0.16512926 = score(doc=5,freq=1.0 = termFreq=1.0 ), product of:
        0.05 = queryWeight, product of:
          2.0 = boost
          3.3025851 = idf(docFreq=4, maxDocs=50)
          0.0075698276 = queryNorm
        3.3025851 = fieldWeight in 5, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          3.3025851 = idf(docFreq=4, maxDocs=50)
          1.0 = fieldNorm(doc=5)
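
The numbers in this explain output can be recomputed by hand. The sketch below assumes the formulas of Lucene's classic similarity, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), and takes fieldNorm and queryWeight directly from the output instead of deriving them.

```python
import math


def tf(freq):
    """Term frequency factor: square root of the raw count."""
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq, doc_freq, max_docs, field_norm):
    """Product of tf, idf and the stored field norm."""
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


# Values copied from the explain output for the "content" field.
content = field_weight(freq=4.0, doc_freq=4, max_docs=50, field_norm=0.078125)
# "keywords" field: queryWeight (0.05) times fieldWeight.
keywords = 0.05 * field_weight(freq=1.0, doc_freq=4, max_docs=50, field_norm=1.0)

print(round(content, 6), round(keywords, 6))
print(max(content, keywords))  # "max of": the better field decides the score
```

The results agree with the explain output up to single-precision rounding, which is what Lucene uses internally for these values.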

Product-Codes

„AS1134-B“

„131555813“

„EOS 500D“

„13 S24 36-G“

Product-Codes

Index the code in multiple fields to have different analyzers and boost them from strict to fuzzy.

Make use of N-Grams, Edge N-Grams, WordDelimiter, Trim, etc.
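
To show what such filters emit, here is a minimal sketch of N-gram and edge N-gram generation for one of the product codes. The normalization step (lowercasing, dropping the hyphen) is a crude stand-in for filters like WordDelimiter, and the function names are illustrative.

```python
def ngrams(token, min_size, max_size):
    """All substrings of the token with a length between min_size and max_size."""
    return [
        token[start:start + size]
        for size in range(min_size, max_size + 1)
        for start in range(len(token) - size + 1)
    ]


def edge_ngrams(token, min_size, max_size):
    """Only the prefixes of the token (useful for search-as-you-type)."""
    return [token[:size] for size in range(min_size, min(max_size, len(token)) + 1)]


code = "AS1134-B".lower().replace("-", "")  # crude normalization: "as1134b"
print(edge_ngrams(code, 2, 5))
print(ngrams(code, 3, 3))
```

A strict field would index only the exact code, while fields filled with these grams let partial input such as „as11“ or „134“ still match, at a lower boost.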

Use the knowledge you gain from your customers to improve your search, … like Google does.

- Use Google Analytics during index time (preAddModifyDocuments hook)

- Use recency of news (boostfunction)

- Analyze the search behavior of your customers (popularity of pages)

- Track search result clicks
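
For the recency boost, a common approach is a reciprocal function of the document age, as provided by Solr's `recip(x,m,a,b)` function, which computes a / (m * x + b). The sketch below only illustrates the shape of such a boost; the choice of constants (a = b = 1, m = one over the milliseconds in a year) is an assumption, not a setting taken from the talk.

```python
def recip(x, m, a, b):
    """Reciprocal function a / (m * x + b)."""
    return a / (m * x + b)


MS_PER_YEAR = 365 * 24 * 60 * 60 * 1000


def recency_boost(age_ms):
    """Boost of 1.0 for brand-new documents, falling to 0.5 after one year."""
    return recip(age_ms, 1.0 / MS_PER_YEAR, 1.0, 1.0)


for years in (0, 1, 5):
    print(years, round(recency_boost(years * MS_PER_YEAR), 3))
```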

Some more interesting things

- Facets

- Spellchecking

- Phonetics

- Spatial

Thank you

Mail: hhoechtl@1drop.de or jhoechtl@gmail.com

Twitter: @hhoechtl

Blog: http://blog.1drop.de
