Architecture of Search Systems and Measuring the Search Effectiveness

12 April 2023

SEARCH SYSTEM ARCHITECTURE & MEASURING SEARCH EFFECTIVENESS

Introduction to text mining – Warsaw University of Technology

Plan

Findwise – who we are, what we do.

General architecture of search engines

Data sources

Content processing

Search index

Query and result processing

Security in search engines

Applications based on search

Leading search technologies

The concept of Findability

Differences in online and enterprise search

Measuring search effectiveness

Questions and answers

Findwise – Search Driven Solutions

• Founded in 2005

• Offices in Sweden, Denmark, Norway and Poland

• 75+ employees

Our objective is to be a leading provider of Findability solutions utilising the full potential of search technology to create customer business value.

• Paweł Wróblewski – search enthusiast

General architecture of search engines

Important terms

Feeding, Indexing, Searching

Latency

Data sources

Anything that holds information is a good source!

We need a connector to feed the data into a search system:

Take the content

Take the metadata

Take the security information

Different strategies to feed the data:

Push – external applications invoke the search system connector’s API to feed the content (e.g. transactional systems)

Pull – the connector periodically scans the source and takes the data (e.g. web crawler, file system)

Hybrid – external systems dump the data, which is then pulled by a connector
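
The three feeding strategies can be sketched roughly as follows. This is a minimal illustration, not the API of any real product: `SearchSystem`, `PushConnector`, `PullConnector` and all field names are hypothetical, and the "source" is just an in-memory dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    content: str
    metadata: dict = field(default_factory=dict)
    allowed_groups: set = field(default_factory=set)  # security information


class SearchSystem:
    """Hypothetical stand-in for the feeding API of a search system."""

    def __init__(self):
        self.received = {}

    def feed(self, doc: Document) -> None:
        self.received[doc.doc_id] = doc


class PushConnector:
    """Push: the source application calls the connector when data changes."""

    def __init__(self, search: SearchSystem):
        self.search = search

    def on_change(self, doc: Document) -> None:
        self.search.feed(doc)


class PullConnector:
    """Pull: the connector scans the source and feeds new or changed items."""

    def __init__(self, search: SearchSystem, source: dict):
        self.search = search
        self.source = source  # doc_id -> Document (stands in for a file system)
        self.seen = {}        # doc_id -> content that was fed last time

    def scan(self) -> int:
        fed = 0
        for doc_id, doc in self.source.items():
            if self.seen.get(doc_id) != doc.content:
                self.search.feed(doc)
                self.seen[doc_id] = doc.content
                fed += 1
        return fed


search = SearchSystem()

# Push: a transactional system reports a change immediately.
PushConnector(search).on_change(
    Document("order-1", "new order", {"type": "order"}, {"sales"})
)

# Hybrid: the external system writes into a dump, the connector pulls from it.
dump = {"report-7": Document("report-7", "quarterly report", {"type": "report"}, {"finance"})}
puller = PullConnector(search, dump)
first_scan = puller.scan()   # feeds the new document
second_scan = puller.scan()  # nothing changed, nothing fed
```

In a real deployment `scan` would run on a schedule; here it is called by hand to keep the sketch self-contained.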

Content Processing – the idea

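[Diagram: a Document passes through a set of processing stages on its way to the index.]

Stages shown in the diagram: Lemmas (tenses, forms), Spell checking, Synonyms, Format conversion, Language detection, Entities, Custom plug-in, Taxonomy classification, Vectorizer, Geography, Companies, People, Scopifier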

PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million French Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes.

The Wimbledon and U.S. Open champion, seeded second, breezed past the German on a blustery center court to become the first seed to advance at Roland Garros. "I love being here, I love the French Open and more than anything I'd love to do well here," the American said.

Input: byte stream

Output: structured document ready to be indexed

Content Processing – the implementation

Hydra is used to refine content before it reaches the index. Every document fetched from a source runs through a targeted pipeline made up of a number of stages. A stage can be thought of as an “app” in the App Store or the Android Market. Findwise has created a large number of such stages, each with one small purpose: to enhance the content of the item. Additional stages can be created to provide customer-specific functionality.
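
The idea of a pipeline of small, single-purpose stages can be illustrated with a short sketch. It does not use or reproduce Hydra's actual API; the stage functions and `run_pipeline` are hypothetical names chosen for this example.

```python
from typing import Callable, Dict, List

Doc = Dict[str, object]
Stage = Callable[[Doc], Doc]


def strip_whitespace(doc: Doc) -> Doc:
    """Stage: collapse runs of whitespace in the content field."""
    doc["content"] = " ".join(str(doc.get("content", "")).split())
    return doc


def add_word_count(doc: Doc) -> Doc:
    """Stage: add a metadata field with the number of words."""
    doc["word_count"] = len(str(doc.get("content", "")).split())
    return doc


def run_pipeline(doc: Doc, stages: List[Stage]) -> Doc:
    """Pass the document through every stage, in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


processed = run_pipeline(
    {"id": "doc-1", "content": "  Venus Williams   raced into the second round  "},
    [strip_whitespace, add_word_count],
)
```

Because every stage has the same shape (document in, document out), a customer-specific stage is just one more function added to the list.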

Hydra - example

Select the stages to use in the pipeline: the left column corresponds to the “market”, and the right column lists the stages in use.

Hydra - example

Modify the date format to include only the year.

Hydra - example

The new year metadata can be used as a facet.
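
A rough sketch of this step, assuming the source date arrives as an ISO string (`YYYY-MM-DD`) in a field called `date`; both the field names and the sample documents are made up for illustration.

```python
from collections import Counter
from datetime import date


def year_stage(doc: dict) -> dict:
    """Add a 'year' field derived from the ISO-formatted 'date' field."""
    out = dict(doc)
    out["year"] = date.fromisoformat(doc["date"]).year
    return out


docs = [
    {"id": 1, "date": "2011-07-14"},
    {"id": 2, "date": "2011-12-01"},
    {"id": 3, "date": "2012-03-30"},
]
indexed = [year_stage(d) for d in docs]

# A facet is a count of documents per distinct value of the field.
year_facet = Counter(d["year"] for d in indexed)
```

The facet can then be shown next to the result list so that users can narrow the results to a single year.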

Hydra - example

Map every author field to a metadata field called author.

Pipeline A

Pipeline B
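
One way to picture the two pipelines: each source names its author field differently, and each pipeline contains a mapping stage that renames it to `author`. The source field names `creator` and `written_by` are assumptions made for this sketch.

```python
def make_author_mapper(source_field: str):
    """Build a stage that renames source_field to the common 'author' field."""

    def stage(doc: dict) -> dict:
        out = {k: v for k, v in doc.items() if k != source_field}
        if source_field in doc:
            out["author"] = doc[source_field]
        return out

    return stage


def run(doc: dict, stages: list) -> dict:
    for stage in stages:
        doc = stage(doc)
    return doc


pipeline_a = [make_author_mapper("creator")]     # e.g. documents from source A
pipeline_b = [make_author_mapper("written_by")]  # e.g. documents from source B

doc_a = run({"title": "Spec", "creator": "Anna"}, pipeline_a)
doc_b = run({"title": "Memo", "written_by": "Jan"}, pipeline_b)
```

After this step both documents can be searched and faceted on the same `author` field, regardless of where they came from.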

Hydra - example

In the search result…

Search index – the problem

Input: structured document (content + metadata)

Output: binary representation of an inverted index, optimised for speed and accuracy

The search index has a flat structure – no internal relations

Changes to the index structure usually require an index rebuild (re-indexing)
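
As a reminder of what an inverted index is, here is a minimal in-memory sketch. It keeps plain Python dictionaries instead of an optimised binary format, tokenises by whitespace only, and uses hypothetical function names.

```python
from collections import defaultdict


def build_inverted_index(docs: dict) -> dict:
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all(index: dict, query: str) -> list:
    """Return ids of documents that contain every query term (AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


docs = {
    1: "Venus Williams raced into the second round",
    2: "Williams breezed past the German",
    3: "the French Open",
}
index = build_inverted_index(docs)
hits = search_all(index, "the williams")
```

Changing how terms are produced (for example, adding lemmatisation) changes every posting list, which is why structural changes normally mean rebuilding the index from the source documents.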

Search index – the problem

Inverted index

Theory in previous lectures

How to achieve

Petabytes of indexed data

Thousands of queries per second

Thousands of index updates per second

FAST Enterprise Search Platform – search cluster example

[Diagram: a search cluster drawn as a grid of Indexing / Search nodes, labelled Node 00 … Node 0M through Node N0 … Node NM. The two dimensions of the grid are labelled "Index split" and "Index mirror".]
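
The split/mirror idea can be sketched in a few lines. This is a toy model and not how FAST ESP is implemented: the `Cluster` class, the hash-based placement and the mirror rotation are all assumptions made for illustration. Splitting spreads the data volume over nodes; mirroring spreads the query load and adds redundancy.

```python
import zlib
from itertools import count


class Cluster:
    def __init__(self, splits: int, mirrors: int):
        # nodes[s][m] holds the documents of split s on mirror m
        self.nodes = [[{} for _ in range(mirrors)] for _ in range(splits)]
        self._round = count()

    def split_for(self, doc_id: str) -> int:
        """Pick the split that owns a document (stable hash of its id)."""
        return zlib.crc32(doc_id.encode("utf-8")) % len(self.nodes)

    def index(self, doc_id: str, text: str) -> None:
        """Every mirror of the owning split receives the document."""
        for node in self.nodes[self.split_for(doc_id)]:
            node[doc_id] = text

    def search(self, term: str) -> list:
        """Ask one mirror of every split, rotating mirrors between queries."""
        m = next(self._round) % len(self.nodes[0])
        hits = []
        for mirrors in self.nodes:
            hits += [d for d, text in mirrors[m].items()
                     if term in text.lower().split()]
        return sorted(hits)


cluster = Cluster(splits=3, mirrors=2)
for i, text in enumerate(["LCD monitor", "plasma TV", "TFT monitor", "flat TV"]):
    cluster.index(f"doc-{i}", text)

first = cluster.search("monitor")   # served by mirror 0 of each split
second = cluster.search("monitor")  # served by mirror 1 of each split
```

Both queries return the same hits even though different mirrors answer them, which is the property that lets mirrors share the query load.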

Search index – the implementation

In order to perform updates (index rebuilds) efficiently, several index partitions are produced.

A small partition rebuilds quickly, unlike a big one.

Rebuilding a larger partition involves merging in the index from the smaller one(s).

Rebuilds can be triggered by: number of rebuild operations, number of documents, percent of total volume.

[Diagram: index partitions, each labelled "Index".]
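
A simplified model of partitioned rebuilds, using the number of documents as the only trigger. The class name, the capacities and the cascade rule are assumptions for this sketch; a real engine would merge on-disk index structures rather than dictionaries.

```python
class PartitionedIndex:
    """Partitions of growing capacity; partition 0 is the smallest."""

    def __init__(self, capacities):
        self.capacities = list(capacities)
        self.partitions = [dict() for _ in self.capacities]
        self.rebuilds = [0] * len(self.capacities)  # rebuild count per partition

    def add(self, doc_id, text):
        # New documents always land in the smallest partition,
        # which is cheap to rebuild.
        self.partitions[0][doc_id] = text
        self.rebuilds[0] += 1
        self._cascade()

    def _cascade(self):
        # Trigger used here: number of documents in a partition.
        for level in range(len(self.partitions) - 1):
            if len(self.partitions[level]) > self.capacities[level]:
                # Rebuilding the larger partition merges in the smaller one.
                self.partitions[level + 1].update(self.partitions[level])
                self.partitions[level].clear()
                self.rebuilds[level + 1] += 1

    def sizes(self):
        return [len(p) for p in self.partitions]


idx = PartitionedIndex([2, 4, 1000])
for i in range(7):
    idx.add(f"d{i}", "some text")

sizes = idx.sizes()
rebuilds = idx.rebuilds
```

After seven additions the smallest partition has been rebuilt seven times, the middle one twice and the largest only once, which shows how the expensive rebuilds are kept rare.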

Query processing

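Query: Do you have an LCD monitor under $900?

[Diagram: the query passes through a chain of query processing stages and comes out as a modified query.]

Stages shown in the diagram: Tokenizer, Phrasing, Spell-checking, Anti-phrasing, Normalization, NLQ, Lemmas, Synonyms, Thesaurus, Custom plug-in, Adaptive evaluation, Geography

Annotations on the example query:

• Anti-phrasing: "Do you have a"

• "Under $900?" becomes price < 900

• Synonyms / thesaurus: LCD monitors, TFT monitors; Flat TV, Plasma TV

• Plug-in BUY( X ): YES! X = LCD monitor

• Use “Product” collection; Rank profile = “Profit margin”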
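
A few of these steps (anti-phrasing, turning "under $900" into a filter, synonym expansion) can be sketched as below. The anti-phrase list, the synonym table and the output structure are all invented for this example and do not reflect any particular engine's query language.

```python
import re

ANTI_PHRASES = ["do you have an", "do you have a"]  # assumed list
SYNONYMS = {"lcd monitor": ["tft monitor"]}         # assumed thesaurus


def process_query(raw: str) -> dict:
    text = raw.lower().strip().rstrip("?").strip()

    # Anti-phrasing: drop conversational wrappers that carry no search intent.
    for phrase in ANTI_PHRASES:
        if text.startswith(phrase + " "):
            text = text[len(phrase):].strip()
            break

    # Turn "under $900" into a structured range filter.
    filters = {}
    match = re.search(r"\bunder \$(\d+)", text)
    if match:
        filters["price_lt"] = int(match.group(1))
        text = (text[:match.start()] + text[match.end():]).strip()

    # Synonym expansion on the remaining phrase.
    phrases = [text] + SYNONYMS.get(text, [])
    return {"phrases": phrases, "filters": filters}


modified = process_query("Do you have an LCD monitor under $900?")
```

The result is a structured query (phrases to match plus a price filter) rather than the raw string the user typed.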

Result processing

The following issues might apply to result processing:

Ranking generation: factors that can be considered include the number of hits, proximity of hits, freshness (date), web measures (e.g. PageRank), and business and context factors (boosting or blocking).

Search federation: integration of results from multiple search engines, e.g. round robin, normalized ranks, or searchlets (multiple result lists presented in different ways).

Security trimming: filtering out the results that do not match the user’s credentials.

Last-second check
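
Two of these steps, round-robin federation and security trimming, can be sketched together. The document ids, the access-control table and the group names are made up; documents without an access-control entry are hidden by default, which is a design choice of this sketch.

```python
from itertools import zip_longest


def round_robin(*result_lists):
    """Interleave ranked lists from several engines, skipping duplicates."""
    merged, seen = [], set()
    for tier in zip_longest(*result_lists):
        for doc_id in tier:
            if doc_id is not None and doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


def security_trim(doc_ids, acl, user_groups):
    """Keep only documents whose allowed groups overlap the user's groups."""
    return [d for d in doc_ids if acl.get(d, set()) & user_groups]


intranet = ["hr-policy", "salary-2023", "lunch-menu"]  # results from engine 1
wiki = ["lunch-menu", "onboarding"]                    # results from engine 2
acl = {
    "hr-policy": {"staff"},
    "salary-2023": {"hr"},
    "lunch-menu": {"staff"},
    "onboarding": {"staff", "hr"},
}

merged = round_robin(intranet, wiki)
visible = security_trim(merged, acl, {"staff"})
```

Trimming is applied after merging, just before the results are shown, so a user in the `staff` group never sees `salary-2023` even though one engine returned it.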

Security in search solution


Secure Server Environment

Search Application Security

Content-level Security

Search Based Applications

Search Driven Solutions = Customisation of search system components

Catalogue of Search Based Applications

Corporate Search

• Intranets/portals

• Information gateways

• Expertise location

• ECM repositories

• Collaboration

• Knowledge Management

• Enterprise apps

Intelligence System

• Market intelligence

• Customer intelligence

• Surveillance

• IP protection

• Fraud detection

• eDiscovery

• Quality Management

• Information risk management

Database Offloading

• Data warehouse

• Data transformation

• Data caches

Commerce Systems

• Search merchandising

• Customer analytics

• Campaign management

• Call centre enablement

• Customer self-service

Media Systems

• Public news syndication

• Multimedia search

• Proprietary research and publications

• Libraries

Search subsystem

Data connectors – out of the box, custom made

Repositories – Web, Databases, Files, Enterprise systems

Leading search engine technologies

• HP / Autonomy IDOL

• Microsoft (SharePoint and FAST Search products)

• Google Search Appliance (GSA)

• IBM Content Analytics/OmniFind

• Oracle Secure Enterprise Search/Endeca

• Apache Lucene/Solr (Open source)

• Exalead CloudView

• and more…

http://www.autonomy.com/content/home/index.en.html






Comparison of different technology vendors

What is the goal of Enterprise Findability (EF)?

How should EF improve business?

What user groups are targeted?

What do the users want and need?

What information is available and where is it stored?

How should EF be rolled out and governed?

What costs are involved?

Are there any IT strategy considerations?

Vendor mapping answers the question of which EF platform best matches the overall requirements in the short and long term

Core search technology

Total cost of ownership

Connectivity and security

Usability

Vendor capabilities

Findability – what is it?

Findability – a holistic approach to leverage business value with search technology

Business (needs & goals)

Information (quality & structure)

Users (needs & capabilities)

Organisation (ownership & governance)

Search Technology

[Chart: use of search technology/platform (Basic to Advanced) against business value gained from search technology (Negligible to High).]

Online vs. Enterprise Search

According to Stephen E. Arnold, "The New Landscape of Enterprise Search", Pandia, July 2011


Measuring the search effectiveness

Enterprise case

Relevance of search results is highly subjective

Search must be tightly bound to the business; otherwise it is not worth considering

Increase income or reduce costs

Take into consideration all the dimensions of Findability:

Business: Needs & Goals

Users: Needs & Capabilities

Information: Quality & Structure

Organization: Ownership & Governance

Search Technology: correctness of implementation

Tools: reviews, workshops, presentations, strategy drafting, audits etc.


Online case

Relevance of search results is highly subjective

Search must be tightly bound to the business; otherwise it is not worth considering

Increase conversion rate

Verification of search functions and their impact on conversion rate

Run isolated tests for each identified feature

Create a score based on a weighted average
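
The scoring step can be sketched as a plain weighted average. The feature names, the per-feature test results (each between 0 and 1) and the weights below are invented for illustration; in practice the weights would reflect each feature's expected impact on conversion rate.

```python
def weighted_score(results: dict, weights: dict) -> float:
    """Weighted average of per-feature test scores (each in the range 0..1)."""
    total_weight = sum(weights[name] for name in results)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_sum = sum(results[name] * weights[name] for name in results)
    return weighted_sum / total_weight


# One isolated test result per identified search feature (hypothetical values).
results = {"spell_checking": 1.0, "synonyms": 0.5, "facets": 0.0}
weights = {"spell_checking": 2.0, "synonyms": 1.0, "facets": 1.0}

score = weighted_score(results, weights)  # (2.0 + 0.5 + 0.0) / 4.0
```

A single number like this makes it easier to compare two sites, or the same site before and after a change, as long as the same features and weights are used for both.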


Online case

Example

QUESTIONS?

Paweł Wróblewski