Divide and Conquer: Challenges in Scaling Federated Search
Presented by Abe Lederman, President and CTO
Deep Web Technologies, LLC
Search Engine Meeting, 24 April 2006, Boston, MA
SEARCH ALL OF THESE SOURCES ONE AT A TIME
OR SEARCH THEM ALL AT ONCE
Finding the Gold Hidden in the World Wide Web
“Google-type” search engines “pan” the surface web for gold
“Deep Web” search engines go mining for gold
Challenges Overview
• Managing a large number of sources
• Searching a large number of sources in parallel
• Organizing and ranking the results returned
Challenges of Managing Thousands of Data Sources
Locate Reliable Sources
Categorize Sources by Content
Configure Sources for Searching
Maintain Sources
Challenges in Searching Thousands of Sources
Automatically Select Sources to Search
Retrieve Results from Cache
Perform Many Searches in Parallel
Bring Back Best Results
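Performing many searches in parallel can be sketched as below. This is a minimal illustration and not the product's implementation: `search_in_parallel`, the `sources` mapping of source names to search functions, and the `max_workers` default are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def search_in_parallel(sources, query, max_workers=32):
    """Run one query against every source concurrently.

    `sources` maps a source name to a callable taking the query and
    returning a list of results (a hypothetical interface).
    """
    def safe_search(search_fn):
        # A source that fails contributes no results instead of
        # aborting the whole federated search.
        try:
            return search_fn(query)
        except Exception:
            return []

    names = list(sources)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        result_lists = pool.map(safe_search, (sources[n] for n in names))
        return dict(zip(names, result_lists))
```

Threads suit this workload because each search mostly waits on a remote source; a slow or broken source costs latency but does not block the others from being queried.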
Source Selection Optimizer
Search Conductor
Source Selection Optimizer
• Source Descriptions
• Previous Results
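One way source descriptions and previous results could feed source selection is sketched below. The scoring rule (query-term overlap with the description plus a historical hit rate) is an assumption made for illustration, as are the names `select_sources`, `descriptions`, and `hit_rates`; the slides do not say how the optimizer combines its inputs.

```python
def select_sources(query, descriptions, hit_rates, top_k=3):
    """Pick the top_k sources for a query.

    descriptions: source name -> free-text description (hypothetical).
    hit_rates: source name -> fraction of past searches that returned
    good results, between 0 and 1 (hypothetical).
    """
    terms = set(query.lower().split())
    scored = []
    for name, description in descriptions.items():
        overlap = len(terms & set(description.lower().split()))
        history = hit_rates.get(name, 0.5)  # neutral prior for unseen sources
        scored.append((overlap + history, name))
    # Highest score first; ties broken by name for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]
```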
Caching of Search Results
Reduces the load (cost) of accessing sources
CHALLENGES
• Requires a large database
• Need to determine how often to update the cache
• Works best with lots of users doing similar searches
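The update-frequency challenge can be illustrated with a time-to-live (TTL) cache. This is a minimal in-memory sketch under assumptions: `ResultCache`, its `ttl_seconds` parameter, and the query normalization are hypothetical, and a production cache would sit on the large database the slide mentions.

```python
import time


class ResultCache:
    """Cache results per (source, query); entries expire after a TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested
        self._store = {}

    @staticmethod
    def _key(source, query):
        # Normalizing queries lets similar searches share an entry.
        return (source, " ".join(query.lower().split()))

    def get(self, source, query):
        key = self._key(source, query)
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, results = entry
        if self.clock() - stored_at > self.ttl:
            del self._store[key]  # stale: force a fresh search
            return None
        return results

    def put(self, source, query, results):
        self._store[self._key(source, query)] = (self.clock(), results)
```

A short TTL keeps results fresh but sends more traffic to the sources; a long TTL does the opposite. The hit rate depends on how often users repeat similar queries.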
We Address Scalability Through a Grid-Based Solution
• Uses open standards (Web Services, WSDL, SOAP, XML)
• Runs on distributed nodes
• Is platform independent (Java based)
• Very flexible, providing a framework for integration of various filtering and analysis tools
Distributing the Workload as Grid Services
Information Services
Filtering Services
Aggregation Services
Presentation Services
[Diagram: grid nodes A0, A1, IS0, IS1, IS2, IS3, P0, F0]
Search Conductor
• Select sources to search
• Perform Search
• Enough good results? (YES / NO)
• Can I get more results from “good” sources? (YES / NO)
• Get Next Results
• Deliver results to user
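The conductor's decision loop might look like the sketch below. The branch wiring is an assumption, since the flowchart's arrows are not recoverable here: a source counts as "good" while its latest page contained at least one good result, and when no good source has more to give, the loop moves on to the next batch of candidate sources. `conduct_search`, `is_good`, `wanted`, and `batch_size` are hypothetical names.

```python
def conduct_search(candidates, is_good, wanted, batch_size=2):
    """Collect up to `wanted` good results.

    candidates: iterators in priority order, each yielding pages
    (lists) of results from one source.
    is_good: predicate deciding whether a result is good enough.
    """
    good_results = []
    remaining = list(candidates)
    while len(good_results) < wanted and remaining:
        # Select sources to search: take the next batch of candidates.
        active, remaining = remaining[:batch_size], remaining[batch_size:]
        while active and len(good_results) < wanted:
            still_good = []
            for source in active:
                page = next(source, [])  # Perform Search / Get Next Results
                hits = [r for r in page if is_good(r)]
                good_results.extend(hits)
                if hits:  # the source is still "good": ask it for more
                    still_good.append(source)
            active = still_good
    # Deliver results to user.
    return good_results[:wanted]
```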
Searching a large number of sources can lead to a flood of results
Challenges in Organizing and Ranking Results
Multi-tier Relevance Ranking
User-driven Ranking
Clustering of Results
Multi-tier Relevance Ranking
• QuickRank – Ranks results based on occurrence of search terms in title, author, and snippet
• MetaRank – Ranks results utilizing custom algorithms applied to meta-data
• DeepRank – Downloads and indexes full-text documents
HEAVY LIFTING REQUIRED!
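The first tier can be sketched as a weighted count of search-term occurrences in title, author, and snippet. The field weights below are assumptions chosen for illustration, and `quick_score` and `quick_rank` are hypothetical names; the slides do not give the actual QuickRank formula.

```python
import re

# Assumed weights: a match in the title counts more than one in the snippet.
FIELD_WEIGHTS = {"title": 3.0, "author": 2.0, "snippet": 1.0}


def quick_score(result, query):
    """Score one result dict by term occurrences in its metadata fields."""
    terms = query.lower().split()
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = re.findall(r"\w+", result.get(field, "").lower())
        score += weight * sum(words.count(term) for term in terms)
    return score


def quick_rank(results, query):
    """Return results ordered from highest to lowest quick_score."""
    return sorted(results, key=lambda r: quick_score(r, query), reverse=True)
```

This tier needs only what a source returns in its result list, so it is cheap. By the slide's description, MetaRank and DeepRank cost more because they need metadata processing or full-text download and indexing.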
User-driven Ranking
• Credibility of source
• Date range
• Document length
• Document type
• Geographic proximity
• Popularity of document
• Reading level
• Relevance
Desired: Blending (weighing) of above criteria
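Blending could be as simple as a user-weighted average of per-criterion scores. The sketch assumes each criterion has already been normalized to a score between 0 and 1; `blended_score` and the criterion keys are hypothetical.

```python
def blended_score(criteria_scores, weights):
    """Weighted average of criterion scores.

    criteria_scores: criterion name -> score in [0, 1] for one document.
    weights: criterion name -> non-negative weight chosen by the user;
    criteria the user did not weight are ignored.
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    weighted = sum(w * criteria_scores.get(name, 0.0) for name, w in weights.items())
    return weighted / total_weight
```

For example, a user who weights relevance twice as heavily as credibility would pass `{"relevance": 2, "credibility": 1}`.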
Clustering
A Grand Challenge for Federated Search
Source: Walter Warnick, Ph.D., DOE OSTI. Global Discovery: Increasing the Pace of Knowledge Diffusion to Increase the Pace of Science. Presented at the Annual Meeting of the American Association for the Advancement of Science, February 16-20, 2006.
Mathematician’s Scientific Discovery
Biology Researcher’s Scientific Discovery
Physics Scientific Discovery
Math Databases: Research Papers, Correspondence, Conferences
Biology Databases: Research Papers, Correspondence, Conferences
Physics Databases: Research Papers, Correspondence, Conferences
Global Discovery
Search Portal
Math Community
Biology Community
Physics Community
Knowledge Diffusion in Action
Grid of Grids
Each circle = a portal with 10-100 sources
End result is thousands of sources in 2 hops
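The two-hop arithmetic can be checked with a short calculation. It assumes, beyond what the slide states, that each hop fans out by the same factor: a portal federates 10-100 other portals, and each of those reaches 10-100 sources. `reachable_sources` is a hypothetical helper.

```python
def reachable_sources(fan_out, hops):
    """Sources reachable when every hop multiplies reach by fan_out."""
    return fan_out ** hops


low = reachable_sources(10, 2)    # 10 portals x 10 sources each
high = reachable_sources(100, 2)  # 100 portals x 100 sources each
print(low, high)
```

Under this assumption the range runs from 100 to 10,000 sources, which brackets the slide's "thousands".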
Scaling to the Next Level
Abe Lederman
122 Longview Drive
Los Alamos, NM 87544
abe@deepwebtech.com
www.deepwebtech.com
Thank You!