When worlds collide
Metasearching meets central indexes
Mike Taylor – [email protected]
Index Data – http://indexdata.com/
Search
When worlds collide: metasearching and central indexes Mike Taylor – [email protected]
Data
Problem solved!
Search
Data Data Data
? ?
Metasearch
Magic box
Data Data Data Data
Searching
360 Search, EHIS (EBSCO), MetaLib
Pazpar2 (open source)
A.K.A. federated search
A.K.A. distributed search
A.K.A. broadcast search
?
Back to the sad searcher
Data Data Data
? ?
Central index
Data Data Data Data
Fat database
Harvesting
Summon, WorldCat, Primo Central
MasterKey
A.K.A. local index
A.K.A. discovery services
A.K.A. vertical search
?
We need a controlled vocabulary!
Metasearch = Federated search = Distributed search = Broadcast search
Central index = Local index = Discovery services = Vertical search (if you ever heard anything so dumb)
Which approach is better?
Central indexing compared with metasearching:
- requires harvesting infrastructure
- requires lots of local storage
- requires co-operation from services to be harvested
- does not have access to all searchable data
- will always be somewhat out of date
- is faster at search time (or SHOULD be)
- allows data to be normalised (e.g. dates extracted)
- allows for better relevance ranking
- can provide pre-baked facets
- may have access to some data that is not searchable
Let's do both!
Magic box
Data Data Data Data
Searching
Data Data Data Data
Fat database
Harvesting
! “Integrated Search”
Metasearch hides the complexity
Magic box
Data Data Data Data
Searching
Nine tenths under the surface
What you see looks beautiful
Problems that need solving
A. Problems with pure metasearching
B. How those problems change when you add a central index
Problems with metasearching
Examples based on Index Data's suite:
Pazpar2 is a free metasearching engine with a stupid name
http://indexdata.com/pazpar2/
MasterKey is a non-open suite that wraps it
http://indexdata.com/masterkey/
MasterKey is only one way to use Pazpar2
Also integrated into other vendors' UIs.
Problems with metasearching #1: No data server at all!
Data is often only in a user-facing Web UI
Must be made available via a standard protocol
Option 1: build a gateway in Perl
http://indexdata.com/simpleserver/
Option 2: MasterKey Connect (non-open)
http://indexdata.com/connector-framework
Problems with metasearching #2: data server is crap^H^H^H^Hsuboptimal
Catalogs searchable using ANSI/NISO Z39.50
Support is very nominal in some cases
IRSpy probes behaviour
http://irspy.indexdata.com
MasterKey target profiles describe behaviour
Problems with metasearching #3: Data servers don't support relevance
Pazpar2 does its own relevance ranking
(Part of merging/deduplication)
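
To give a feel for what ranking as part of merging/deduplication can involve, here is a minimal Python sketch. It is not Pazpar2's actual algorithm: the record fields (`title`, `author`), the `merge_key` heuristic and the toy term-count score are all assumptions made for illustration.

```python
from collections import defaultdict


def merge_key(record):
    """Hypothetical dedup key: lower-cased title and author, whitespace collapsed."""
    title = " ".join(record.get("title", "").lower().split())
    author = " ".join(record.get("author", "").lower().split())
    return (title, author)


def score(record, query_terms):
    """Toy relevance: number of query-term occurrences in the title."""
    words = record.get("title", "").lower().split()
    return sum(words.count(term.lower()) for term in query_terms)


def merge_and_rank(result_sets, query_terms):
    """Merge records from several targets, cluster duplicates, rank the clusters."""
    clusters = defaultdict(list)
    for records in result_sets:
        for record in records:
            clusters[merge_key(record)].append(record)
    ranked = []
    for records in clusters.values():
        # A cluster scores as its best member, plus a small bonus per extra copy.
        best = max(score(r, query_terms) for r in records)
        ranked.append((best + 0.1 * (len(records) - 1), records[0]))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in ranked]
```

Because the engine sees every record anyway while clustering duplicates, scoring them in the same pass costs little, which is one plausible reading of why ranking sits inside the merge step.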
Problems with metasearching #4: Data servers don't return facets
Pazpar2 calculates its own facets
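
When a server returns no facets, the engine can only count field values in the records it has actually fetched. A minimal Python sketch of that idea follows; the field names and the record layout are assumptions for this example, not Pazpar2's internal format.

```python
from collections import Counter


def compute_facets(records, fields=("author", "subject", "date")):
    """Count occurrences of each value of each facet field in retrieved records."""
    facets = {field: Counter() for field in fields}
    for record in records:
        for field in fields:
            value = record.get(field)
            if value is None:
                continue
            # Accept either a single value or a list of values per field.
            values = value if isinstance(value, (list, tuple)) else [value]
            facets[field].update(values)
    return facets
```

Calling `compute_facets(records)["author"].most_common(5)` would give the five most frequent authors among the fetched records.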
There is a lot of magic in the magic box:
Searching
Sorting
Merging
Deduplication
Relevance
Facet generation
Time travel...
Magic box (Pazpar2)
Data Data Data Data
Remember, our engine is free:
http://indexdata.com/pazpar2/
Magic box
Data Data Data Data
Searching
Data Data Data Data
Fat database
Harvesting
! What happens when we add a central index?
Problems with integrated search #1: No data server at all!
Data is often only in a user-facing Web UI
You can't harvest Google
You just can't
Problems with integrated search #2: data server is crap^H^H^H^Hsuboptimal
Repositories harvestable using OAI-PMH (an even worse name than pazpar2)
Support is very nominal in some cases
OAI-PMH client must be very tolerant
Extensive data-cleaning is usually required
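
One small, typical data-cleaning step is pulling a usable year out of free-text date fields in harvested records. The Python sketch below is only an illustration: the `date` and `year` field names and the accepted year range (1500 to 2099) are assumptions, not part of any Index Data tool.

```python
import re

# A four-digit year between 1500 and 2099, not embedded in a longer number.
YEAR_PATTERN = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")


def extract_year(raw_date):
    """Return the first plausible four-digit year in a messy date string, or None."""
    if not raw_date:
        return None
    match = YEAR_PATTERN.search(raw_date)
    return int(match.group(1)) if match else None


def clean_record(record):
    """Return a copy with stripped string fields and a normalised 'year' field."""
    cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
    cleaned["year"] = extract_year(cleaned.get("date"))
    return cleaned
```

Returning `None` rather than raising keeps the harvester tolerant: a record with an unparseable date is still indexed, just without a date facet.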
Problems with integrated search #3: Central index does support relevance
Returned records carry relevance scores
Must be merged with records scored by engine
Requires score normalisation into same range
Existing ordering may be used in merge
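
A minimal Python sketch of score normalisation followed by a merge is shown here. Min-max scaling is just one possible way to bring scores into the same range, chosen for simplicity; the slides do not say which method is used, and the `score` field name is an assumption.

```python
def normalise_scores(records):
    """Min-max scale each record's 'score' into the range 0.0 to 1.0."""
    if not records:
        return []
    scores = [r["score"] for r in records]
    low, high = min(scores), max(scores)
    span = high - low
    return [
        # If every score is identical, treat all records as equally relevant.
        {**r, "score": (r["score"] - low) / span if span else 1.0}
        for r in records
    ]


def merge_ranked(central_records, engine_records):
    """Normalise both lists separately, then merge them best-first."""
    combined = normalise_scores(central_records) + normalise_scores(engine_records)
    # sorted() is stable, so ties keep each source's existing order.
    return sorted(combined, key=lambda r: r["score"], reverse=True)
```

The stable sort is one simple way to let each source's existing ordering break ties in the merged list.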
[Diagram labels: Unranked #1, Unranked #2, Ranked #1, Ranked #2, Solr, Sort, Sort, Merged]
Problems with integrated search #4: Central index does return facets
Lists of field values with occurrence counts:
Author: Kernighan 27, Pike 13, Ritchie 7, Thompson 4
Title: C 7, Unix 35, Programming 16
Date: 1977 5, 1978 4, 1979 2, 1981 2
Lists are returned or calculated for each server:
Server 1 (central index) (all facets from 2000 hits): Cat 68, Dinosaur 162, Fish 145, Frog 19
Server 2 (metasearch) (1000 hits, 100 records): Cat 7, Dog 10, Dinosaur 87, Fish 23
Metasearched counts normalised by total hit-count
Server 1 (central index) (all facets from 2000 hits): Cat 68, Dinosaur 162, Fish 145, Frog 19
Server 2 (metasearch) (normalised to 1000 hits): Cat 70, Dog 100, Dinosaur 870, Fish 230
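
The normalisation is a simple scale-up: counts taken from the 100 fetched records are multiplied by total hits divided by records fetched. A minimal Python sketch, in which the function and parameter names are chosen for this example only:

```python
def normalise_facet_counts(counts, total_hits, records_fetched):
    """Scale facet counts from a fetched sample up to the reported hit count."""
    if records_fetched <= 0:
        return {}
    return {
        value: round(count * total_hits / records_fetched)
        for value, count in counts.items()
    }


sampled = {"Cat": 7, "Dog": 10, "Dinosaur": 87, "Fish": 23}
print(normalise_facet_counts(sampled, total_hits=1000, records_fetched=100))
# {'Cat': 70, 'Dog': 100, 'Dinosaur': 870, 'Fish': 230}
```

The scaled numbers are estimates: they assume the fetched records are representative of the whole result set.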
Facet lists are merged
Servers 1+2 (integrated) (as though for all records in result sets):
Cat 68+70 = 138
Dog 0+100 = 100
Dinosaur 162+870 = 1032
Fish 145+230 = 375
Frog 19+0 = 19
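
Merging then amounts to summing the counts per facet value, with a missing value counting as zero. A minimal Python sketch using the figures from the slide (the function name is hypothetical):

```python
from collections import Counter


def merge_facets(*facet_lists):
    """Sum occurrence counts per facet value across all servers."""
    merged = Counter()
    for counts in facet_lists:
        merged.update(counts)
    return dict(merged)


central = {"Cat": 68, "Dinosaur": 162, "Fish": 145, "Frog": 19}
metasearch = {"Cat": 70, "Dog": 100, "Dinosaur": 870, "Fish": 230}
merged = merge_facets(central, metasearch)
# merged["Dinosaur"] is 1032 and merged["Dog"] is 100
```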
Fringe benefit: facet-count normalisation is also useful when doing pure metasearching.
Summary of search issues
Issue | Metasearch solution | Central index solution
No data server | Build gateways; MasterKey Connect | ---
Bad data server | Probe capabilities; Profile targets | Tolerant harvester; Data-cleaning
Relevance scores | Magic engine; Normalise scores | Ingest from server
Facets | Magic engine; Normalise counts | Ingest from server