Divide and Conquer: Challenges in Scaling Federated Search

Presented by Abe Lederman, President and CTO

Deep Web Technologies, LLC

Search Engine Meeting, 24 April 2006, Boston, MA

SEARCH ALL OF THESE SOURCES ONE AT A TIME

OR SEARCH THEM ALL AT ONCE

Finding the Gold Hidden in the World Wide Web

“Google-type” search engines “pan” the surface web for gold

“Deep Web” search engines go mining for gold

Challenges Overview

• Managing a large number of sources

• Searching a large number of sources in parallel

• Organizing and ranking the results returned

Challenges of Managing Thousands of Data Sources

Locate Reliable Sources

Categorize Sources by Content

Configure Sources for Searching

Maintain Sources

Challenges in Searching Thousands of Sources

Automatically Select Sources to Search

Retrieve Results from Cache

Perform Many Searches in Parallel

Bring Back Best Results
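
As a rough illustration of performing many searches in parallel, here is a minimal Python sketch. The connector interface (a callable per source that takes the query and returns a list) and the names `search_all`, `sources`, and `max_workers` are assumptions made for this example, not part of the presented system.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def search_all(sources, query, max_workers=8):
    """Query every source in parallel and collect whatever comes back.

    `sources` maps a source name to a callable that takes the query string
    and returns a list of results (a hypothetical connector interface).
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn, query): name for name, fn in sources.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # One failing source should not sink the whole search.
                failures[name] = exc
    return results, failures
```

A production system would also need per-source timeouts, which this sketch leaves out.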

Source Selection Optimizer

[Diagram labels: Source Descriptions, Previous Results, Search Conductor]

Caching of Search Results

Reduces the load (cost) of accessing sources

CHALLENGES

• Requires a large database

• Need to determine how often to update the cache

• Works best with lots of users doing similar searches
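
The trade-offs above can be made concrete with a small sketch. It assumes an in-memory dictionary in place of the large database, a single time-to-live (`ttl_seconds`) as the update policy, and query normalization so that similar searches share an entry. The class name `ResultCache` and all of its parameters are hypothetical.

```python
import time


class ResultCache:
    """In-memory cache of search results keyed by (source, normalized query)."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}

    @staticmethod
    def _key(source, query):
        # Normalizing case and whitespace lets similar searches share an entry.
        return source, " ".join(query.lower().split())

    def get(self, source, query):
        key = self._key(source, query)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, results = entry
        if self.clock() - stored_at > self.ttl:
            del self._entries[key]  # stale entry: force a fresh search
            return None
        return results

    def put(self, source, query, results):
        self._entries[self._key(source, query)] = (self.clock(), results)
```

A shorter TTL keeps results fresher at the cost of more load on the sources, which is the update-frequency question raised above.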

We Address Scalability Through a Grid-Based Solution

• Uses open standards (Web Services, WSDL, SOAP, XML)

• Runs on distributed nodes

• Is platform independent (Java based)

• Is very flexible, providing a framework for integrating various filtering and analysis tools

Distributing the Workload as Grid Services

Information Services

Filtering Services

Aggregation Services

Presentation Services

[Diagram: service nodes labeled IS0, IS1, IS2, IS3, F0, A0, A1, and P0]
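
One way to picture the four service types working together is as stages of a pipeline. The sketch below assumes the order information, filtering, aggregation, presentation, which the slide does not state explicitly. It also uses plain Python callables in place of grid services, and `run_pipeline` and its parameters are illustrative names only.

```python
def run_pipeline(query, information_services, filters, aggregate, present):
    """Chain the four service types; the stage order is an assumption."""
    # Information Services: each one fetches raw results for the query.
    batches = [fetch(query) for fetch in information_services]
    # Filtering Services: every filter is applied to every batch.
    for keep in filters:
        batches = [[r for r in batch if keep(r)] for batch in batches]
    # Aggregation Services: merge the per-source batches into one list.
    merged = aggregate(batches)
    # Presentation Services: format the merged list for the user.
    return present(merged)
```

Because each stage is just a callable here, stages could be swapped or distributed independently, which is the kind of flexibility a service-based design aims for.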

Search Conductor

[Flowchart steps: Select sources to search; Perform Search; Get Next Results; Deliver results to user]

[Flowchart decisions (YES/NO): Enough good results? Can I get more results from “good” sources?]
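
The Search Conductor flow can be sketched as a loop. The branch order used here is one plausible reading of the flowchart: search, check whether there are enough good results, and if not, ask the sources that produced good results for more. The source interface (`search`, `next_results`) and the parameters `wanted` and `is_good` are hypothetical.

```python
def conduct_search(sources, query, wanted=10, is_good=lambda r: True):
    """Loop until enough good results are found or the sources run dry.

    Each source is a hypothetical object with `search(query)` returning the
    first page of results and `next_results()` returning the next page
    (an empty list once exhausted).
    """
    good = []
    active = []
    # Perform Search: first page from every selected source.
    for source in sources:
        page = [r for r in source.search(query) if is_good(r)]
        good.extend(page)
        if page:
            active.append(source)  # a "good" source worth asking again
    # Enough good results? If not, Get Next Results from the good sources.
    while len(good) < wanted and active:
        still_active = []
        for source in active:
            page = [r for r in source.next_results() if is_good(r)]
            good.extend(page)
            if page:
                still_active.append(source)
        active = still_active
    return good[:wanted]  # Deliver results to user
```

Sources that return nothing useful are dropped from later rounds, so effort concentrates on the sources that keep producing good results.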

Searching a large number of sources can lead to a flood of results

Challenges in Organizing and Ranking Results

Multi-tier Relevance Ranking

User-driven Ranking

Clustering of Results

Multi-tier Relevance Ranking

• QuickRank – Ranks results based on occurrence of search terms in title, author, and snippet

• MetaRank – Ranks results utilizing custom algorithms applied to meta-data

• DeepRank – Downloads and indexes full-text documents

HEAVY LIFTING REQUIRED!
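
To illustrate the cheapest tier, here is a minimal sketch in the spirit of QuickRank: it counts occurrences of the search terms in the title, author, and snippet fields. The per-field weights, the function name `quick_rank`, and the dictionary layout of a result are assumptions for this example. The actual QuickRank algorithm is not described in the slides.

```python
import re


def quick_rank(results, query, weights=None):
    """Order results by how often query terms occur in title, author, snippet.

    The per-field weights are illustrative assumptions, not values from
    the presentation.
    """
    weights = weights or {"title": 3.0, "author": 2.0, "snippet": 1.0}
    terms = set(re.findall(r"\w+", query.lower()))

    def score(result):
        total = 0.0
        for field, weight in weights.items():
            words = re.findall(r"\w+", result.get(field, "").lower())
            total += weight * sum(1 for w in words if w in terms)
        return total

    # sorted() is stable, so ties keep the order the sources returned.
    return sorted(results, key=score, reverse=True)
```

This tier needs only the fields already present in a result list, which is why it is quick. MetaRank and DeepRank need progressively more data per result.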

User-driven Ranking

• Credibility of source

• Date range

• Document length

• Document type

• Geographic proximity

• Popularity of document

• Reading level

• Relevance

Desired: Blending (weighting) of the above criteria
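
One simple way to blend such criteria is a weighted average, sketched below. It assumes every criterion has already been scored on a 0 to 1 scale and that the user supplies the weights. The names `blend_scores`, `rank_by_blend`, and the `scores` field are hypothetical.

```python
def blend_scores(criteria_scores, weights):
    """Weighted average of per-criterion scores, each assumed to lie in [0, 1]."""
    total_weight = sum(weights.get(name, 0.0) for name in criteria_scores)
    if total_weight == 0:
        return 0.0
    weighted = sum(score * weights.get(name, 0.0)
                   for name, score in criteria_scores.items())
    return weighted / total_weight


def rank_by_blend(results, weights):
    """Sort results (dicts carrying a hypothetical 'scores' mapping) by blend."""
    return sorted(results, key=lambda r: blend_scores(r["scores"], weights),
                  reverse=True)
```

For example, a user who cares three times as much about relevance as about credibility would pass `{"relevance": 3.0, "credibility": 1.0}` as the weights.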

Clustering

A Grand Challenge for Federated Search

Source: Walter Warnick, Ph.D., DOE OSTI. Global Discovery: Increasing the Pace of Knowledge Diffusion to Increase the Pace of Science. Presented at the Annual Meeting of the American Association for the Advancement of Science, February 16-20, 2006.

Mathematician’s Scientific Discovery

Biology Researcher’s Scientific Discovery

Physics Scientific Discovery

Math Databases: Research Papers, Correspondence, Conferences

Biology Databases: Research Papers, Correspondence, Conferences

Physics Databases: Research Papers, Correspondence, Conferences

Global Discovery

Search Portal

Math Community

Biology Community

Physics Community

Knowledge Diffusion in Action

Grid of Grids

Each circle = a portal with 10-100 sources

End result is thousands of sources in 2 hops
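
The arithmetic behind the two-hop claim can be checked quickly. The sketch assumes that a top-level portal federates other portals with the same fan-out of 10 to 100 that each portal has over its sources. The slide does not state the top-level fan-out, so `fan_out` and `reachable` are illustrative.

```python
def reachable(fan_out, hops=2):
    """Sources reachable when every hop multiplies the reach by `fan_out`."""
    return fan_out ** hops


for fan_out in (10, 50, 100):
    print(fan_out, "per hop ->", reachable(fan_out), "sources in 2 hops")
```

Under that assumption the reach runs from 100 to 10,000 sources, and a fan-out of about 32 or more per hop is enough to pass one thousand.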

Scaling to the Next Level

Abe Lederman

122 Longview Drive

Los Alamos, NM 87544

abe@deepwebtech.com

www.deepwebtech.com

Thank You!
