Lecture given on 19 April 2012 at the Warsaw University of Technology. This is the 9th lecture in the regular master's-level course "Introduction to text mining".
12 april 2023
SEARCH SYSTEM ARCHITECTURE & MEASURING SEARCH EFFECTIVENESS
Introduction to text mining – Warsaw University of Technology
Plan
Findwise – who we are, what we do.
General architecture of search engines
Data sources
Content processing
Search index
Query and result processing
Security in search engines
Applications based on search
Leading search technologies
The concept of Findability
Differences in online and enterprise search
Measuring search effectiveness
Questions and answers
Findwise – Search Driven Solutions
• Founded in 2005
• Offices in Sweden, Denmark, Norway and Poland
• 75+ employees
Our objective is to be a leading provider of Findability solutions utilising the full potential of search technology to create customer business value.
• Paweł Wróblewski – search enthusiast
General architecture of search engines
Important terms
Feeding, indexing, searching
Latency
Data sources
Everything that holds information is a good source!
We need a connector to feed the data into a search system:
Take the content
Take the metadata
Take the security information
Different strategies to feed the data:
Push – external applications invoke the search system connector's API to feed the content (e.g. transactional systems)
Pull – the connector periodically scans the source and takes the data (e.g. web crawler, file system)
Hybrid – external systems dump the data, which is then pulled by a connector
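The three feeding strategies can be sketched in a few lines of Python. This is only an illustration: `FeedItem`, `SearchSystem`, `push` and `pull` are hypothetical names, not the API of any real connector, and the "source" is a plain dictionary standing in for a file system or a dump directory.

```python
from dataclasses import dataclass, field


@dataclass
class FeedItem:
    """What a connector takes from a source: content, metadata, security info."""
    doc_id: str
    content: bytes
    metadata: dict
    acl: set  # groups allowed to see the document (assumed representation)


@dataclass
class SearchSystem:
    """Stand-in for the feeding API of a search system (hypothetical)."""
    received: list = field(default_factory=list)

    def feed(self, item: FeedItem) -> None:
        self.received.append(item)


def push(system: SearchSystem, item: FeedItem) -> None:
    """Push: the external application calls the feeding API itself."""
    system.feed(item)


def pull(system: SearchSystem, source: dict, seen: set) -> int:
    """Pull: the connector scans the source and feeds items it has not seen yet."""
    fed = 0
    for doc_id, item in source.items():
        if doc_id not in seen:
            system.feed(item)
            seen.add(doc_id)
            fed += 1
    return fed
```

In the hybrid case the external system only writes into a drop area (here it would be the `source` dictionary), and the connector calls `pull` on that area on its own schedule.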
Content Processing – the idea
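[Figure: content processing pipeline. Stages shown: lemmas (tenses, forms), spell checking, synonyms, format conversion, language detection, entities, custom plug-in, taxonomy, classification, vectorizer, geography, companies, people, scopifier. The processed document goes to the index.]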
Document
PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million French Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes.
The Wimbledon and U.S. Open champion, seeded second, breezed past the German on a blustery center court to become the first seed to advance at Roland Garros. "I love being here, I love the French Open and more than anything I'd love to do well here," the American said.
Input: byte stream
Output: structured document ready to be indexed
Content Processing – the implementation
Hydra is used to refine content before it hits the index. Every document fetched from a source runs through a targeted pipeline, which consists of a number of stages. A stage can be thought of as an "app" in the App Store or Android Market. Findwise has created a large number of such stages, each with a small, focused purpose: to enhance the content of the item. It is possible to create additional stages to provide customer-specific functionality.
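The idea of a pipeline assembled from small stages can be sketched as follows. This is not Hydra's actual API: `MARKET`, `build_pipeline`, `run_pipeline` and the two example stages are hypothetical names chosen for illustration, and a document is assumed to be a flat dictionary of string fields.

```python
from typing import Callable, Dict, List

Document = Dict[str, str]
Stage = Callable[[Document], Document]


def strip_whitespace(doc: Document) -> Document:
    """Stage: trim leading and trailing whitespace in every field."""
    return {name: value.strip() for name, value in doc.items()}


def add_content_length(doc: Document) -> Document:
    """Stage: add a metadata field with the length of the content."""
    out = dict(doc)
    out["content_length"] = str(len(doc.get("content", "")))
    return out


# The "market": every stage that is available for selection (assumed registry).
MARKET: Dict[str, Stage] = {
    "strip_whitespace": strip_whitespace,
    "add_content_length": add_content_length,
}


def build_pipeline(stage_names: List[str]) -> List[Stage]:
    """Pick stages from the market, in the order they should run."""
    return [MARKET[name] for name in stage_names]


def run_pipeline(pipeline: List[Stage], doc: Document) -> Document:
    """Pass the document through each stage in turn."""
    for stage in pipeline:
        doc = stage(doc)
    return doc
```

Because each stage only takes a document and returns a document, a customer-specific stage is just one more entry in the registry.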
Hydra - example
Select the stages to use in the pipeline: the left column corresponds to the “market”, and the right column lists the stages in use.
Hydra - example
Modify the format of the date to include only the year.
Hydra - example
The new year meta-data can be used as a facet
Hydra - example
Map every author field to a metadata field called author.
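The two transformations from these examples might look roughly like this when written as plain functions. The input date format, the `year` field name and the list of author aliases are assumptions made for the sketch; the real stages are configured in Hydra and are not shown in the slides.

```python
from collections import Counter
from datetime import datetime

# Field names that different sources might use for the author (hypothetical list).
AUTHOR_ALIASES = ("creator", "written_by", "dc:creator")


def date_to_year(doc: dict, field: str = "date", fmt: str = "%Y-%m-%d") -> dict:
    """Add a 'year' field derived from the date field (assumed ISO-style dates)."""
    out = dict(doc)
    if field in out:
        out["year"] = str(datetime.strptime(out[field], fmt).year)
    return out


def map_author(doc: dict) -> dict:
    """Move any alias of the author field into a single 'author' field."""
    out = dict(doc)
    for alias in AUTHOR_ALIASES:
        if alias in out:
            value = out.pop(alias)
            out.setdefault("author", value)  # keep an existing author if present
    return out


def year_facet(docs: list) -> Counter:
    """Count documents per year, which is what a facet displays."""
    return Counter(doc["year"] for doc in docs if "year" in doc)
```

Once every document carries the same `year` and `author` fields, the search front end can offer them as facets regardless of which source the document came from.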
Pipeline A
Pipeline B
Hydra - example
In the search result…
Search index – the problem
Input: structured document (content + metadata)
Output: binary representation of an inverted index optimised for speed and accuracy
Search index has a flat structure – no internal relations
Usually changes to the index structure require index rebuild (re-indexing)
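As a reminder of what the indexer produces, here is a minimal in-memory inverted index. It is a teaching sketch only: a real engine stores a compressed binary structure with positions and ranking data, and the tokenizer here is a simple assumption (lowercased word characters).

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every term to the set of document ids that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the documents that contain all query terms (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)
```

The structure is flat, as noted above: each term points straight at document ids, with no relations between documents.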
Search index – the problem
Inverted index
Theory in previous lectures
How to achieve
Petabytes of indexed data
Thousands of queries per second
Thousands of index updates per second
[Figure: FAST Enterprise Search Platform, search cluster example. A grid of indexing/search nodes (Node 00 to Node NM) forms the search cluster; one axis of the grid is the index split, the other is the index mirror.]
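The split and mirror dimensions of such a cluster can be illustrated with a toy model. It assumes that documents are assigned to a partition by hashing the document id and that queries rotate over the mirrors; `SearchCluster` and its methods are hypothetical and do not reflect the configuration of FAST ESP.

```python
import itertools
import zlib


class SearchCluster:
    """Toy cluster: the index is split into partitions, and each is mirrored."""

    def __init__(self, partitions: int, mirrors: int):
        self.partitions = partitions
        # nodes[m][p] holds the documents of partition p on mirror m.
        self.nodes = [[{} for _ in range(partitions)] for _ in range(mirrors)]
        self._next_mirror = itertools.cycle(range(mirrors))

    def partition_of(self, doc_id: str) -> int:
        """Index split: a stable hash decides which partition owns a document."""
        return zlib.crc32(doc_id.encode("utf-8")) % self.partitions

    def index(self, doc_id: str, text: str) -> None:
        """Index mirror: every mirror receives a copy of the document."""
        p = self.partition_of(doc_id)
        for mirror in self.nodes:
            mirror[p][doc_id] = text

    def search(self, term: str) -> list:
        """Pick one mirror, query all of its partitions, merge the hits."""
        mirror = self.nodes[next(self._next_mirror)]
        hits = []
        for partition in mirror:
            hits.extend(
                doc_id
                for doc_id, text in partition.items()
                if term in text.lower().split()
            )
        return sorted(hits)
```

Adding partitions spreads the data volume and the indexing load, while adding mirrors spreads the query load and provides redundancy.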
Search index – the implementation
In order to perform effective updates (index rebuilds), several index partitions are produced
A small partition rebuilds quickly, unlike a large one
Rebuilding a larger partition involves merging in the index from the smaller one(s)
Rebuilds can be triggered by: number of rebuild operations, number of documents, percent of total volume
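A simplified model of tiered partitions is sketched below. It uses only one of the triggers listed above (number of documents), and the thresholds and the `PartitionedIndex` class are assumptions for illustration rather than the behaviour of any specific engine.

```python
class PartitionedIndex:
    """Tiers of partitions; tier 0 is the smallest and is rebuilt most often."""

    def __init__(self, max_docs_per_tier=(2, 4)):
        # Hypothetical thresholds: a tier is merged upwards once it exceeds its limit.
        self.max_docs = max_docs_per_tier
        self.tiers = [dict() for _ in range(len(max_docs_per_tier) + 1)]
        self.rebuilds = [0] * len(self.tiers)

    def add(self, doc_id: str, text: str) -> None:
        """New and updated documents always land in the smallest partition."""
        self.tiers[0][doc_id] = text
        self.rebuilds[0] += 1  # cheap: only the small partition is rebuilt
        for level, limit in enumerate(self.max_docs):
            if len(self.tiers[level]) > limit:
                # Rebuilding the larger partition merges in the smaller one.
                self.tiers[level + 1].update(self.tiers[level])
                self.tiers[level].clear()
                self.rebuilds[level + 1] += 1

    def get(self, doc_id: str):
        """Smaller tiers hold the newest version, so they are checked first."""
        for tier in self.tiers:
            if doc_id in tier:
                return tier[doc_id]
        return None
```

With these thresholds, six additions cause six rebuilds of the smallest partition, two of the middle one and a single rebuild of the largest, which is the point of the design: the expensive rebuild happens rarely.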
Query processing
Query: Do you have an LCD monitor under $900?
Processing stages: tokenizer, phrasing, spell-checking, anti-phrasing, normalization, NLQ, lemmas, synonyms, thesaurus, geography, adaptive evaluation, custom plug-in
Modified query:
"Do you have a" (anti-phrase)
"Under $900?" becomes price < 900
Thesaurus entries: LCD monitors / TFT monitors, Flat TV / Plasma TV
BUY( X ): YES! X = LCD monitor
Use “Product” collection, rank profile = “Profit margin”
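A few of these query-side steps can be approximated with regular expressions and lookup tables. The anti-phrase list, the synonym table and the output shape below are assumptions made for this sketch; real engines implement these stages with dictionaries and linguistic resources.

```python
import re

# Hypothetical resources; longest anti-phrase first so it wins the prefix match.
ANTI_PHRASES = ("do you have an", "do you have a", "do you have")
SYNONYMS = {"lcd monitor": ["tft monitor"], "flat tv": ["plasma tv"]}
PRICE_PATTERN = re.compile(r"\bunder\s*\$(\d+)")


def process_query(raw: str) -> dict:
    """Turn a natural-language question into search terms and filters."""
    # Normalization: lowercase and drop the trailing question mark.
    text = raw.lower().strip().rstrip("?").strip()

    # Natural-language constraint: "under $900" becomes a numeric filter.
    filters = {}
    match = PRICE_PATTERN.search(text)
    if match:
        filters["price_lt"] = int(match.group(1))
        text = (text[:match.start()] + text[match.end():]).strip()

    # Anti-phrasing: remove a leading phrase that carries no search intent.
    for phrase in ANTI_PHRASES:
        if text.startswith(phrase + " "):
            text = text[len(phrase):].strip()
            break

    # Synonym expansion: add alternative terms for the remaining phrase.
    text = re.sub(r"\s+", " ", text)
    terms = [text] + SYNONYMS.get(text, [])
    return {"terms": terms, "filters": filters}
```

For the example query this yields the terms "lcd monitor" and "tft monitor" together with a filter equivalent to price < 900.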
Result processing
The following issues might apply to results processing:
Ranking generation
Factors that can be considered: number of hits, proximity of hits, freshness (date), web measures (e.g. page rank), business and context factors (boosting or blocking)
Search federation
Integration of results from multiple search engines: round robin, normalized ranks, searchlets (multiple result lists presented in different ways)
Security trimming
Filtering out the results that do not match the user's credentials
Last second check
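Two of these steps are simple enough to sketch directly: round-robin federation and security trimming. The hit format (a dictionary with `id` and `acl` keys) and the group-based access model are assumptions made for the example.

```python
from itertools import zip_longest


def round_robin(*result_lists):
    """Federation: interleave ranked lists from several engines, skipping duplicates."""
    merged, seen = [], set()
    for row in zip_longest(*result_lists):
        for hit in row:
            if hit is not None and hit["id"] not in seen:
                seen.add(hit["id"])
                merged.append(hit)
    return merged


def security_trim(hits, user_groups):
    """Keep only hits whose access list shares at least one group with the user."""
    return [hit for hit in hits if hit["acl"] & user_groups]
```

Trimming after the merge acts as the last check before results are shown, so a hit from any federated engine is filtered by the same rule.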
Security in search solution
Secure Server Environment
Search Application Security
Content-level Security
Search Based Applications
Search Driven Solutions = Customisation of search system components
Catalogue of Search Based Applications
Corporate Search
• Intranets/portals
• Information gateways
• Expertise location
• ECM repositories
• Collaboration
• Knowledge Management
• Enterprise apps
Intelligence System
• Market intelligence
• Customer intelligence
• Surveillance
• IP protection
• Fraud detection
• eDiscovery
• Quality Management
• Information risk management
Database Offloading
• Data warehouse
• Data transformation
• Data caches
Commerce Systems
• Search merchandising
• Customer analytics
• Campaign management
• Call centre enablement
• Customer self-service
Media Systems
• Public news syndication
• Multimedia search
• Proprietary research and publications
• Libraries
Search subsystem
Data connectors – out of the box, custom made
Repositories – Web, Databases, Files, Enterprise systems
Leading search engine technologies
• HP / Autonomy IDOL
• Microsoft (SharePoint and FAST Search products)
• Google Search Appliance (GSA)
• IBM Content Analytics/OmniFind
• Oracle Secure Enterprise Search/Endeca
• Apache Lucene/Solr (Open source)
• Exalead CloudView
• and more…
Comparison of different technology vendors
What is the goal of Enterprise Findability (EF)?
How should EF improve business?
What user groups are targeted?
What do the users want and need?
What information is available and where is it stored?
How should EF be rolled out and governed?
What costs are involved?
Are there any IT strategy considerations?
Vendor mapping shows which EF platform best matches the overall requirements in the short and long term
Core search technology
Total cost of ownership
Connectivity and security
Usability
Vendor capabilities
Findability – what is it?
A holistic approach to leverage business value with search technology
Business (needs & goals)
Information (quality & structure)
Users (needs & capabilities)
Organisation (ownership & governance)
Search Technology
[Figure: business value gained from search technology (negligible to high) against use of search technology/platform (basic to advanced).]
Online vs. Enterprise Search
According to Stephen E. Arnold, „The New Landscape of Enterprise Search”, Pandia, July 2011
Measuring the search effectiveness
Enterprise case
Relevance of search results is highly subjective
Search must be tightly bound to business goals; otherwise it is not worth considering
Increase income or reduce costs
Take into consideration all the dimensions of Findability:
Business: Needs & Goals
Users: Needs & Capabilities
Information: Quality & Structure
Organization: Ownership & Governance
Search Technology: correctness of implementation
Tools: reviews, workshops, presentations, strategy drafting, audits etc.
Measuring the search effectiveness
Online case
Relevance of search results is highly subjective
Search must be tightly bound to business goals; otherwise it is not worth considering
Increase conversion rate
Verification of search functions and their impact on conversion rate
Run isolated tests for each identified feature
Create a score based on a weighted average
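The weighted average can be computed as below. The feature names, test results and weights in the usage note are invented for illustration; in practice they come from the isolated tests and from how strongly each feature is believed to affect conversion.

```python
def weighted_score(results: dict, weights: dict) -> float:
    """Weighted average of per-feature test results (each result in 0.0 to 1.0)."""
    total_weight = sum(weights[feature] for feature in results)
    if total_weight == 0:
        raise ValueError("at least one feature needs a non-zero weight")
    weighted_sum = sum(results[feature] * weights[feature] for feature in results)
    return weighted_sum / total_weight
```

For example, with results of 0.8 for spell-checking, 0.5 for facets and 1.0 for autocomplete, and weights of 3, 2 and 1, the score is (2.4 + 1.0 + 1.0) / 6, or roughly 0.73.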
Measuring the search effectiveness
Online case
Example
QUESTIONS?
Paweł Wróblewski [email protected]