
DELIS Kickoff, March 18-19, 2004 1

Outline
1 Overriding Goals & Structure
2 Background on P2P
3 Background on Search Engines
4 P2P SE Architecture
5 Research Challenges & Opportunities

DELIS SP6: Data Management, Search, and Mining on Internet-scale Dynamically Evolving Peer-to-Peer Networks

Gerhard Weikum (MPII)

1) Overriding Goals and Structure of SP6

Vision: Decentralized „Ultimate Google" + collaborative data mining on evolving large-scale data

Challenges:
• Self-organizing P2P system with unlimited scalability
• Leveraging intellectual input from millions of peers (bookmarks, link evolution, click streams, etc.) for better search quality

Layered approach:
• P2P Web Search Engine
• Enhanced DHT
• Dissemination / Incentives
• Mining / Collaboration
• (connections to SP1, SP2, SP3, SP4, SP5, ...)

2) Peer-to-Peer (P2P) Architectures

Decentralized, self-organizing, highly dynamic loose coupling of many autonomous computers

Applications:
• Large-scale distributed computation (SETI, prime-number search, etc.)
• File sharing (Napster, Gnutella, KaZaA, etc.)
• Publish-subscribe information sharing (marketplaces, etc.)
• Collaborative work (games, etc.)
• Collaborative data mining
• (Collaborative) Web search

Goals:
• make systems ultra-scalable and completely self-organizing
• make complex systems manageable and less susceptible to attacks
• break information monopolies, exploit the small-world phenomenon

Unstructured P2P: Example Gnutella

1) contact neighborhood and establish virtual topology (on-demand + periodically): Ping, Pong
2) search file: Query, QueryHit
3) download file: Get or Push (behind firewall)

[Figure: message flooding through the overlay; labels 1-3 mark the Ping/Pong, Query/QueryHit, and download steps]

all forwarded messages carry a TTL tag (time-to-live)
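The TTL-limited flooding of Query messages can be sketched as follows; the peer names, topology, and TTL value are illustrative assumptions, and duplicate suppression stands in for Gnutella's message-ID bookkeeping:

```python
# Minimal sketch of Gnutella-style query flooding with a TTL tag.
# Topology, peer names, and TTL are illustrative assumptions.

def flood(peers, start, query, ttl):
    """Forward `query` from `start` to neighbors, decrementing TTL per hop."""
    seen = set()                      # duplicate suppression per peer
    frontier = [(start, ttl)]
    reached = []
    while frontier:
        node, t = frontier.pop()
        if node in seen or t < 0:
            continue
        seen.add(node)
        reached.append(node)
        if t == 0:
            continue                  # TTL exhausted: do not forward further
        for nb in peers[node]:
            frontier.append((nb, t - 1))
    return reached

topology = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"],
            "D": ["B", "E"], "E": ["D"]}
print(sorted(flood(topology, "A", "song.mp3", ttl=2)))  # → ['A', 'B', 'C', 'D']
```

Peer E is not reached with TTL 2, since it lies three hops from A; the TTL tag is exactly what bounds the flooding horizon.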

Structured P2P: Example Chord

Properties & claims:
• Unlimited scalability (> 10^6 nodes)
• O(log n) hops to target, O(log n) state per node
• Self-stabilization (many failures, high dynamics)

Distributed Hash Table (DHT): map strings (file names, keywords) and numbers (IP addresses) onto a very large „cyclic" key space 0..2^m - 1, the so-called Chord ring

Key k (e.g., hash(file name)) is assigned to the node with key n (e.g., hash(IP address)) such that k ≤ n and there is no node n' with k ≤ n' and n' < n

[Figure: Chord ring with nodes N1, N8, N14, N21, N32, N38, N42, N48, N51, N56 and keys K10, K24, K30, K38, K54 assigned to their successor nodes]

Every node knows its pred/succ and has a finger table with log(n) pointers: finger[i] = successor(node number + 2^(i-1)) for i = 1..m

Request Routing in Chord

For finding key k perform repeatedly: determine the current node's largest finger[i] (modulo 2^m) with finger[i] ≤ k

pred/succ ring and finger tables require dynamic maintenance → stabilization protocol

Example: lookup(K54) starting at N8 hops via fingers +32, +8, +1: N8 → N42 → N51 → N56

Finger tables:
N8:  +1 → N14, +2 → N14, +4 → N14, +8 → N21, +16 → N32, +32 → N42
N42: +1 → N48, +2 → N48, +4 → N48, +8 → N51, +16 → N1, +32 → N14
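The routing rule can be sketched on the slide's example ring (m = 6, the node IDs above); the fallback to the immediate successor when no finger qualifies is a simplifying assumption of this sketch, not part of the slide:

```python
# Sketch of Chord finger-table routing on the example ring (m = 6).
# finger[i] = successor(n + 2^(i-1)); routing jumps to the largest
# finger that does not overshoot the key.

M = 6
RING = 2 ** M
NODES = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])

def successor(k):
    """Node responsible for key k: first node clockwise from k."""
    for n in NODES:
        if n >= k % RING:
            return n
    return NODES[0]                      # wrap around the ring

def fingers(n):
    """finger[i] = successor(n + 2^(i-1)) for i = 1..m."""
    return [successor((n + 2 ** i) % RING) for i in range(M)]

def lookup(start, k, max_hops=M + 1):
    """Route towards key k, counting hops, as in the slide's lookup(K54)."""
    node, hops = start, 0
    while successor(k) != node and hops < max_hops:
        best = node
        for f in fingers(node):          # largest finger preceding k on the ring
            if (f - node) % RING <= (k - node) % RING and f != node:
                best = f
        if best == node:                 # assumption: fall back to successor
            best = successor((node + 1) % RING)
        node, hops = best, hops + 1
    return node, hops

print(lookup(8, 54))   # → (56, 3): N8 → N42 → N51 → N56, as on the slide
```

Each hop at least halves the remaining clockwise distance to the key, which is where the O(log n) hop bound comes from.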

3) Background on Search Engines

[Figure: indexing pipeline]
• Crawling the WWW yields documents, e.g.: „Web Surfing: In Internet cafes with or without Web Suit ..."
• Extraction of relevant words: Surfing, Internet, Cafes, ...
• Linguistic methods (stemming): Surf, Internet, Cafe, ...
• Construction of weighted features (terms), consulting a thesaurus (ontology) for synonyms and sub-/super-concepts: Surf, Wave, Internet, WWW, eService, Cafe, Bistro, ...
• Indexing: index (B+-tree) mapping terms (Bistro, Cafe, ...) to URLs

Vector Space Model for Content Relevance

Documents are feature vectors d_i ∈ [0,1]^|F|; a query is a set of weighted features q ∈ [0,1]^|F|. The search engine ranks by descending relevance using the similarity metric

sim(d_i, q) := Σ_{j=1..|F|} d_ij · q_j / ( √(Σ_{j=1..|F|} d_ij²) · √(Σ_{j=1..|F|} q_j²) )

The term weights d_ij are computed, e.g., with the tf*idf formula:

w_ij := ( freq(f_i, d_j) / max_k freq(f_k, d_j) ) · log( #docs / #docs with f_i )

d_ij := w_ij / √( Σ_k w_ik² )
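The tf*idf weighting and cosine ranking can be sketched end to end; the three-document toy corpus and the query are illustrative assumptions:

```python
# Sketch of tf*idf document vectors and cosine ranking as on the slides.
# The toy corpus and query are illustrative assumptions.
import math
from collections import Counter

docs = [["surf", "internet", "cafe"], ["surf", "wave"],
        ["internet", "cafe", "bistro"]]
N = len(docs)
vocab = sorted({t for d in docs for t in d})

def tfidf(doc):
    """w_ij = (freq(f_i,d_j)/max_k freq(f_k,d_j)) * log(#docs / #docs with f_i),
    then normalized: d_ij = w_ij / sqrt(sum_k w_ik^2)."""
    tf = Counter(doc)
    max_tf = max(tf.values())
    w = [(tf[t] / max_tf) * math.log(N / sum(1 for d in docs if t in d))
         for t in vocab]
    norm = math.sqrt(sum(x * x for x in w)) or 1.0
    return [x / norm for x in w]

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v))   # vectors already normalized

vecs = [tfidf(d) for d in docs]
q = tfidf(["internet", "cafe"])
ranked = sorted(range(N), key=lambda i: cosine(vecs[i], q), reverse=True)
print(ranked)   # → [0, 2, 1]
```

Document 0 matches both query terms and nothing else, document 2 matches both but is diluted by the rare term „bistro", and document 1 matches neither, which yields the ranking [0, 2, 1].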

Link Analysis for Content Authority

+ Consider in-degree and out-degree of Web nodes:
Authority Rank(d_i) := stationary visit probability [d_i] in a random walk on the Web

Reconciliation of relevance and authority is based on a weighted sum (with ad hoc weights); the engine ranks by descending relevance & authority.

PageRank Authority Computation

Basic model: a random surfer follows outgoing links with uniform probability and occasionally makes random jumps:

r(i) = ε/n + (1 − ε) · Σ_{(j,i) ∈ G} r(j) / out(j)    with 0 < ε ≤ 0.25

In matrix form: r = ε p + (1 − ε) A'^T r    with A'_ij = 1/out(i) or 0, and p_i = 1/n

r is the vector of stationary probabilities of an ergodic Markov chain; compute it by power iteration for the principal eigenvector.
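The power iteration can be sketched on a tiny link graph; the graph itself and ε = 0.15 (within the slide's 0 < ε ≤ 0.25 range) are illustrative assumptions, and dangling nodes are not handled:

```python
# Sketch of the PageRank power iteration r = eps*p + (1-eps)*A'^T r
# on a tiny illustrative link graph (no dangling-node handling).

def pagerank(out_links, eps=0.15, iters=100):
    nodes = sorted(out_links)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}              # start from uniform distribution
    for _ in range(iters):
        nxt = {v: eps / n for v in nodes}        # random-jump term eps * p, p_i = 1/n
        for j in nodes:
            share = (1 - eps) * r[j] / len(out_links[j])
            for i in out_links[j]:               # follow outgoing links uniformly
                nxt[i] += share
        r = nxt
    return r

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
r = pagerank(graph)
print({k: round(v, 3) for k, v in r.items()})
```

Node c, pointed to by both a and b, ends up with the highest stationary probability, and the probabilities sum to 1 in every iteration.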

Research issues:
• improve efficiency (recent work at Stanford and INRIA)
• biased random jumps for personalized or topic-sensitive PageRank
• biased PageRank based on query logs & click streams
• combine PageRank authority with TS-based freshness measures

LSI: Unsupervised Learning in IR

Latent Semantic Indexing based on SVD:

The m×n term-document matrix A (rows: terms i, columns: docs j) is decomposed as A = U Σ V^T, where U (m×r) maps terms to latent topics t, Σ (r×r) is the diagonal matrix of singular values, and V^T (r×n) maps latent topics to docs.

Truncating to the k largest singular values gives the rank-k approximation A ≈ A_k = U_k Σ_k V_k^T with U_k: m×k, Σ_k: k×k, V_k^T: k×n.

Docs and queries are folded into the latent topic space:
d' = U_k^T d,   q' = U_k^T q,   sim'(d_j, q) = d_j'^T q'

Research issues:
• improve efficiency
• understand LSI alternatives and tuning options
• combine latent-space similarities with explicit ontology
• combine multiple, distributed LSI spaces
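The truncated SVD and the fold-in of documents and queries can be sketched with NumPy; the 5×4 toy term-document matrix and k = 2 are illustrative assumptions:

```python
# Sketch of LSI via truncated SVD: A ≈ A_k = U_k Σ_k V_k^T,
# folding vectors into the latent space by d' = U_k^T d.
# The toy term-document matrix is an illustrative assumption.
import numpy as np

A = np.array([[1, 1, 0, 0],      # term "surf"
              [1, 0, 0, 0],      # term "wave"
              [0, 1, 1, 0],      # term "internet"
              [0, 1, 1, 1],      # term "cafe"
              [0, 0, 1, 1]],     # term "bistro"
             dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk = U[:, :k]                    # m x k latent term space

def fold_in(v):
    """Map a term vector into the k-dimensional latent topic space."""
    return Uk.T @ v

q = np.array([0, 0, 1, 1, 0], dtype=float)   # query: internet, cafe
scores = [float(fold_in(A[:, j]) @ fold_in(q)) for j in range(A.shape[1])]
print(np.argsort(scores)[::-1])              # docs ranked in latent space
```

Because similar terms load on the same latent topics, a document can score well even without containing the query terms literally, which is the point of LSI over plain term matching.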

Top-k Query Processing with Scoring

Given: query q = t1 t2 ... tz with z (conjunctive) keywords, and a similarity scoring function score(q, d) for docs d ∈ D (e.g., the vector-space similarity of d and q)
Find: the top-k results with regard to score(q, d)

Naive QP algorithm:
  candidate-docs := ∅;
  for i = 1 to z do
    { candidate-docs := candidate-docs ∪ index-lookup(t_i) };
  for each d_j ∈ candidate-docs do { compute score(q, d_j) };
  sort candidate-docs by score(q, d_j) descending;

[Figure: B+ tree on terms (e.g., „performance", „z-transform") pointing to index lists of (DocId, tf*idf) entries sorted by DocId]

Google: > 10 million terms, > 4 billion docs, > 2 TB index

Real Web search engines use heuristic pruning for score(q, d) = auth(d) + csim(q, d), with various tricks.

TA-Sorted (Fagin, Güntzer et al., ...)

scan index lists in parallel: consider d_j at position pos_i in L_i;
  E(d_j) := E(d_j) ∪ {i};
  high_i := s_i(q, d_j);
  bestscore(d_j) := aggr(x_1, ..., x_m) with x_i := s_i(q, d_j) for i ∈ E(d_j), high_i for i ∉ E(d_j);
  worstscore(d_j) := aggr(x_1, ..., x_m) with x_i := s_i(q, d_j) for i ∈ E(d_j), 0 for i ∉ E(d_j);
  top-k := k docs with largest worstscore;
  if min worstscore among top-k ≥ max bestscore{d | d not in top-k} then exit;

Example (m = 3, aggr = sum, k = 2) with index lists
L1: f: 0.5, b: 0.4, c: 0.35, a: 0.3, h: 0.1, d: 0.1
L2: a: 0.55, b: 0.2, f: 0.2, g: 0.2, c: 0.1
L3: h: 0.35, d: 0.35, b: 0.2, a: 0.1, c: 0.05, f: 0.05
yields top-k { a: 0.95, b: 0.8 }, with candidate bounds such as f: 0.7 + ? ≤ 0.7 + 0.1, h: 0.35 + ? ≤ 0.35 + 0.5, d: 0.35 + ? ≤ 0.35 + 0.3, c: 0.35 + ? ≤ 0.35 + 0.3, g: 0.2 + ? ≤ 0.2 + 0.4.
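The sorted-access-only scan with worstscore/bestscore bookkeeping can be sketched directly on the three example lists (m = 3, aggr = sum, k = 2); the round-robin scan order is an assumption of this sketch:

```python
# Sketch of the TA-sorted scan (sorted accesses only): maintain worstscore
# and bestscore per seen doc, stop when no candidate can beat the top-k.

def ta_sorted(lists, k):
    """Round-robin scan of score-sorted lists; returns [(doc, worstscore)]."""
    m = len(lists)
    seen = {}                                   # doc -> {list index: score}
    high = [lst[0][1] for lst in lists]         # per-list upper bound
    max_len = max(len(lst) for lst in lists)
    for pos in range(max_len):
        for i, lst in enumerate(lists):         # one round of sorted accesses
            if pos < len(lst):
                d, s = lst[pos]
                high[i] = s                     # deeper scores are <= s
                seen.setdefault(d, {})[i] = s
            else:
                high[i] = 0.0                   # list exhausted
        worst = {d: sum(e.values()) for d, e in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in e)
                for d, e in seen.items()}
        topk = sorted(worst, key=worst.get, reverse=True)[:k]
        cand = [d for d in seen if d not in topk]
        if len(topk) == k and (not cand or
                min(worst[d] for d in topk) >= max(best[d] for d in cand)):
            return [(d, worst[d]) for d in topk]
    return [(d, worst[d]) for d in topk]        # all lists fully scanned

L1 = [("f", .5), ("b", .4), ("c", .35), ("a", .3), ("h", .1), ("d", .1)]
L2 = [("a", .55), ("b", .2), ("f", .2), ("g", .2), ("c", .1)]
L3 = [("h", .35), ("d", .35), ("b", .2), ("a", .1), ("c", .05), ("f", .05)]
print(ta_sorted([L1, L2, L3], k=2))             # top-2: a (0.95) and b (0.8)
```

The scan terminates before exhausting the lists, once every unseen or partially seen candidate's bestscore falls below the smallest worstscore in the top-k.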

Top-k Queries with Probabilistic Guarantees

The TA family of algorithms is based on the invariant (with sum as aggr):

  s(d) ≤ Σ_{i ∈ E(d)} s_i(d) + Σ_{i ∉ E(d)} high_i

This is relaxed into a probabilistic invariant: drop candidate d if

  p(d) := P[ Σ_{i ∈ E(d)} s_i(d) + Σ_{i ∉ E(d)} S_i > worst_k ]
        = P[ Σ_{i ∉ E(d)} S_i > worst_k − Σ_{i ∈ E(d)} s_i(d) ] ≤ ε

where the random variable S_i has some (postulated and/or estimated) distribution in the interval (0, high_i].

[Figure: the three example index lists annotated with score distributions S1, S2, S3 over their unscanned tails]

speedup > 10 on the TREC-12 .GOV benchmark with 80 percent precision & recall relative to TA-sorted

4) P2P Search Engine Architecture

Close relationships with architectures for meta search engines, but also major differences.

Architectural approach:
• every peer is autonomous and has its own local SE (local index)
• every peer posts (statistical) summary info about its contents
• query routing is driven by query-summary similarities
• summaries are organized into a distributed registry
  • maintained at selected super-peers
  • mapped onto a DHT
  • lazily replicated at all peers (via „gossiping")

P2P SE Model

Data space: m terms T = {t_1, ..., t_m}, n docs D = {d_1, ..., d_n}

Peer space: p peers P = {1, ..., p}; each peer k has
• index lists for terms T_k ⊆ T (usually |T_k| << |T|),
• bookmarks B_k ⊆ D (|B_k| << |D|) or other profile info,
• cached docs D_k ⊆ D,
• QoS parameters (e.g., index list lengths, score & authority distributions)

P2P system: each peer k globally posts subsets T_k' ⊆ T_k, B_k' ⊆ B_k, D_k' ⊆ D_k, plus QoS, inducing global mappings (directories):
• systerms: T → 2^P with systerms(t) = {k ∈ P | t ∈ T_k'}
• sysbm: D → 2^P with sysbm(d) = {k ∈ P | d ∈ B_k'}
• syscd: D → 2^P with syscd(d) = {k ∈ P | d ∈ D_k'}

systerms, sysbm, syscd could be organized as separate DHTs using hash1(t), hash2(d), hash3(d)
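The three directories can be sketched as hash-partitioned maps with distinct hash functions; the peer count, salts, and posted data are illustrative assumptions, and the modulo placement stands in for real DHT routing (e.g., Chord, as earlier):

```python
# Sketch of the systerms/sysbm/syscd directories as hash-partitioned maps,
# one logical DHT per directory, each addressed via its own hash function.
# Peer count, salts, and postings are illustrative assumptions.
import hashlib

P = 8                                    # number of peers

def owner(key, salt):
    """Peer responsible for `key` in the DHT addressed via hash `salt`."""
    h = hashlib.sha1((salt + key).encode()).hexdigest()
    return int(h, 16) % P

class Directory:
    """One logical DHT directory: key -> set of peers that posted it."""
    def __init__(self, salt):
        self.salt = salt
        self.buckets = {p: {} for p in range(P)}   # per-peer local stores
    def post(self, key, peer):
        self.buckets[owner(key, self.salt)].setdefault(key, set()).add(peer)
    def lookup(self, key):
        return self.buckets[owner(key, self.salt)].get(key, set())

systerms = Directory("t:")               # term -> peers with index lists for it
sysbm = Directory("b:")                  # doc  -> peers bookmarking it
syscd = Directory("c:")                  # doc  -> peers caching it

systerms.post("internet", peer=3)
systerms.post("internet", peer=5)
sysbm.post("http://example.org", peer=5)
print(sorted(systerms.lookup("internet")))   # → [3, 5]
```

Using a different salt per directory mirrors the slide's hash1(t), hash2(d), hash3(d): the same document key lands on different peers in sysbm and syscd, spreading directory load.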

P2P Web Search

Example query: a b c

The querying peer needs to
1. determine interesting peers
2. plan, run, monitor, and adapt a distributed top-k algorithm
3. reconcile results from different peers

Objective: maximize result quality / execution cost

[Figure: per-peer index lists for terms a, b, c with DocIds, e.g., a → 95, 19, 14, 22, 73, 44, ...]

5) SP6 Research Challenges and Opportunities

Vision: Decentralized „Ultimate Google" + collaborative data mining on evolving large-scale data

Challenges:
• Self-organizing P2P system with unlimited scalability
• Leveraging intellectual input from millions of peers (bookmarks, link evolution, click streams, etc.) for better search quality

Layered approach:
• P2P Web Search Engine
• Enhanced DHT
• Dissemination / Incentives
• Mining / Collaboration
• (connections to SP1, SP2, SP3, SP4, SP5, ...)

SP6 Work Packages and Partners

WP6.0: SP Management
WP6.1: Collaborative Web Information Search
WP6.2: Enhanced Distributed Hash Tables for Keyword Search
WP6.3: Self-Organizing Info. Dissemination & Load Sharing
WP6.4: Mining Episodes and Data Streams
WP6.5: Incentives for Collaborative Behaviour & Fairness Metrics
WP6.6: P2P System Architecture & Testbed

Partners: UPB (D), CTI (GR), MPII (D), UniBo (I), UPC (E), TU W (PL), UniKa (D), Telenor (NO), UDRLS (I)

[Table: lead (*) and participant (+) assignments of partners to WP6.0-WP6.6]

Research Issues (1)

WP6.1: Collaborative Web Information Search
• Exploit large-scale collective intellectual input (bookmarks, query logs, click streams, etc.)
• Benefit/cost-aware query routing
• Efficient distributed processing of keyword-based top-k queries
• Enhanced, distributed forms of LSI, PageRank, etc.

WP6.2: Enhanced Distributed Hash Tables for Keyword Search
• DHT-style decentralized registry for (variable-length) keyword queries
• Replication strategies and their properties
• Stability guarantees for DHTs

Research Issues (2)

WP6.3: Self-Organizing Info. Dissemination & Load Sharing
• Dynamic, self-organizing load balancing
• Replication strategies
• Structured queries (e.g., range queries, string-attribute queries) on the decentralized directory (for metadata & QoS info)

WP6.4: Mining Episodes and Data Streams
• Sampling on evolving Web data to learn "drifting concepts"
• Temporally enabled authority analysis

WP6.5: Incentives for Collaborative Behaviour & Fairness Metrics
• Statistical rewards and penalties for altruistic & egoistic peers
• Incentives-enabled request/reply routing

SP6 Next Steps

Cross-SP Collaboration & Synergies: up for discussion! (to be discussed tomorrow)

Some Backup Slides


Dimensions of a Large-Scale Search Engine

• > 4 billion (10^9) Web docs + 1 billion news docs → > 10 Terabytes raw data
• > 10 million terms → > 2 Terabytes index
• > 150 million queries per day → < 1 sec. average response time
• < 30 days index freshness → > 1000 Web pages per second crawled

High-end server farm: > 10,000 Intel servers, each with > 1 GB memory & 2 disks, with partitioned & mirrored data distributed across all servers, plus load balancing of queries, remote administration, etc.

Differences between Meta and P2P Search Engines

Meta Search Engine                        | P2P Search Engine
small # sites (e.g., digital libraries)   | huge # sites
rich statistics about site contents       | poor/limited/stale summaries
static federation of servers              | highly dynamic system
each query fully executed at each site    | single query may need content from multiple peers
interconnection topology largely          | highly dependent on overlay
irrelevant                                | network structure

Fagin’s TA (PODS 01, JCSS 03)

scan all lists L_i (i = 1..m) in parallel: consider d_j at position pos_i in L_i;
  high_i := s_i(d_j);
  if d_j ∉ top-k then {
    look up s_ν(d_j) in all lists L_ν with ν ≠ i;   // random access
    compute s(d_j) := aggr{ s_ν(d_j) | ν = 1..m };
    if s(d_j) > min score among top-k then
      add d_j to top-k and remove the min-score d from top-k;
  };
  threshold := aggr{ high_ν | ν = 1..m };
  if min score among top-k ≥ threshold then exit;

Example (m = 3, aggr = sum, k = 2) with index lists
L1: f: 0.5, b: 0.4, c: 0.35, a: 0.3, h: 0.1, d: 0.1
L2: a: 0.55, b: 0.2, f: 0.2, g: 0.2, c: 0.1
L3: h: 0.35, d: 0.35, b: 0.2, a: 0.1, c: 0.05, f: 0.05
yields top-k: a: 0.95, b: 0.8 (f: 0.75 is evicted)

but random accesses are expensive!

Prob-sorted Algorithm (Conservative Variant)

Prob-sorted (RebuildPeriod r, QueueBound b):
scan all lists L_i (i = 1..m) in parallel:
  ...same code as TA-sorted...
  // queue management
  for all priority queues q for which d is relevant do
    insert d into q with priority bestscore(d);
  // periodic clean-up
  if step-number mod r = 0 then
    // dropping of queues; multiple unbounded queues
    if strategy = Conservative then
      for all priority queues q do
        if prob[top(q) can qualify for top-k] < ε then drop all elements of q;
  if all queues are empty then exit;

Prob-sorted Algorithm (Smart Variant)

Prob-sorted (RebuildPeriod r, QueueBound b):
scan all lists L_i (i = 1..m) in parallel:
  ...same code as TA-sorted...
  // queue management
  for all priority queues q for which d is relevant do
    insert d into q with priority bestscore(d);
  // periodic clean-up
  if step-number mod r = 0 then
    // rebuild; single bounded queue
    if strategy = Smart then
      for all queue elements e in q do
        update bestscore(e) with current high_i values;
      rebuild bounded queue with best b elements;
      if prob[top(q) can qualify for top-k] < ε then exit;
  if all queues are empty then exit;

Performance Results for .GOV Queries

            # sorted accesses | elapsed time [s] | max queue size | precision | rank distance | score error
TA-sorted   2263652           | 148.7            | 10849          | 1         | 0             | 0
Prob-con    993414            | 25.6             | 29207          | 0.87      | 16.9          | 0.007
Prob-agg    20435             | 0.6              | 0              | 0.42      | 75.1          | 0.089
Prob-pro    1659706           | 44.2             | 6551           | 0.87      | 16.8          | 0.006
Prob-smart  527980            | 15.9             | 400            | 0.69      | 39.5          | 0.031

on the .GOV corpus from the TREC-12 Web track: 1.25 million docs (html, pdf, etc.), 50 keyword queries, e.g.: „Lewis Clark expedition", „juvenile delinquency", „legalization Marihuana", „air bag safety reducing injuries death facts"

Performance Results for IMDB Queries

            # sorted accesses | elapsed time [s] | max queue size | precision | rank distance | score error
TA-sorted   1003650           | 201.9            | 12628          | 1         | 0             | 0
Prob-con    463562            | 17.8             | 14990          | 0.71      | 119.9         | 0.18
Prob-agg    41821             | 0.7              | 0              | 0.18      | 171.5         | 0.39
Prob-pro    490041            | 69.0             | 9173           | 0.75      | 122.5         | 0.14
Prob-smart  403981            | 12.7             | 400            | 0.54      | 126.7         | 0.25

on the IMDB corpus (Web site: Internet Movie Database): 375,000 movies, 1.2 million persons (html/xml), 20 structured/text queries with Dice-coefficient-based similarities of the categorical attributes Genre and Actor, e.g.:
Genre: {Western}, Actor: {John Wayne, Katherine Hepburn}, Description: {sheriff, marshall};
Genre: {Thriller}, Actor: {Arnold Schwarzenegger}, Description: {robot}

Performance Results: Sensitivity of ε

[Figure: macro-avg. precision and # sorted accesses of Prob-con, Prob-pro, Prob-smart, and Prob-agg as ε varies over 0.0, 0.02, 0.04, 0.1, 0.2, ..., 0.9]

Exploiting Collective Human Input for Collaborative Web Search

- Beyond Relevance Feedback and Beyond Google -

• href links are human endorsements → PageRank, etc.
• Opportunity: online analysis of human input & behavior may compensate for deficiencies of the search engine

Typical scenario for a 3-keyword user query a & b & c:
• top 10 results: user clicks on ranks 2, 5, 7
• top 10 results: user modifies the query into a & b & c & d, into a & b & e, or into a & b & NOT c
• top 10 results: user selects a URL from bookmarks, jumps to a portal, or asks a friend for tips

Challenge: How can we use knowledge about the collective input of all users in a large community?

Query logs, bookmarks, etc. provide
• human assessments & endorsements
• correlations among words & concepts and among documents