Peer-to-Peer Information Search Sebastian Michel Ecole Polytechnique Fédérale Lausanne Lausanne - Switzerland Josiane Xavier Parreira Max-Planck Institute for Informatics Saarbrücken - Germany


Peer-to-Peer Information Search

Sebastian Michel

Ecole Polytechnique Fédérale Lausanne, Lausanne - Switzerland

Josiane Xavier Parreira

Max-Planck Institute for Informatics, Saarbrücken - Germany


Peer-to-Peer Information Search - SBBD 2007 Tutorial, 10/10/2007

P2P Systems

• Known from Napster and others
• Sharing of mostly illegal content (mp3, movies)
  - P2P = Pirate-to-Pirate?
• New kind of network organization; no client/server anymore
• Basic ideas:
  - Each peer connects to a few other peers
  - All peers together form powerful networks
• Potential benefits:
  - No single point of failure
  - Load is spread across multiple peers
  - (Resilient to failures and dynamics)

Peer: “one that is of equal standing with another” (source: Merriam-Webster Online Dictionary)


Napster

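[Figure: Napster architecture with a central server and clients. Edge labels: "Publish file statistics" (client to server) and "File Download" (between clients).]

• Central server (index)
• Client software sends information about users' contents to the server
• Users send queries to the server
• Server responds with the IPs of users that store matching files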

Peer-to-Peer file sharing!

• Developed in 1998
• First P2P file-sharing system


Gnutella
• Protocol for distributed file sharing
• Started in 2000
• In 2005: 1.81 million computers connected*
• Unstructured network
• Truly decentralized
• Uses message flooding during query execution
• Later: version with super nodes and query routing

* http://www.slyck.com/news.php?story=814


Gnutella Style

[Figure: the query "Paris Hilton?" is flooded through the network. Each forwarded message carries a TTL that is decremented per hop, from TTL 3 down to TTL 0.]


Gnutella Style
• Pros:
  - no complex statistical bookkeeping
• Cons:
  - a lot of network traffic (NUMBERS?)
  - some peers might not be reachable (TTL)
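The flooding behaviour and both drawbacks can be illustrated with a small simulation. This is a minimal sketch, not the Gnutella protocol itself: the graph, the function name `flood`, and the convention that `ttl` counts the hops a query may still travel are assumptions made for illustration.

```python
from collections import deque

def flood(graph, start, ttl):
    """Flood a query from `start`; return (reached peers, messages sent).

    graph: dict mapping each peer to a list of its neighbours.
    ttl:   number of hops the query may still travel (assumed convention).
    """
    reached = {start}
    messages = 0
    queue = deque([(start, ttl, None)])      # (peer, remaining hops, sender)
    while queue:
        peer, hops_left, sender = queue.popleft()
        if hops_left == 0:
            continue                          # TTL expired: do not forward
        for neighbour in graph[peer]:
            if neighbour == sender:
                continue                      # never send back to the sender
            messages += 1
            if neighbour not in reached:      # duplicates are dropped
                reached.add(neighbour)
                queue.append((neighbour, hops_left - 1, peer))
    return reached, messages

# Hypothetical chain of five peers: a - b - c - d - e
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
         "d": ["c", "e"], "e": ["d"]}
reached, messages = flood(chain, "a", ttl=3)
```

With `ttl=3` the query stops at peer `d`, so peer `e` is never asked even if it stores a matching file; in denser graphs the message count grows quickly with the TTL.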


BitTorrent
• Idea: load sharing through file splitting
• A lot of (legal) software distributors offer software through BitTorrent
• Download information in small .torrent file
• One tracker node per file (specified in the .torrent file)

[Figure: a file is split into segments 1 to 5. The client requests a random peer list from the tracker node, then requests segments from peers that each hold some of the segments.]

• Incentives: “tit-for-tat”
• Each peer remembers collaborative peers: different priorities


Literature
• Book: Peer-to-Peer: Harnessing the Power of Disruptive Technologies by Andy Oram. O'Reilly Media, Inc.


Overlay Networks
• On top of existing networks
• Different ways to build an overlay network:
  - structured
  - unstructured
  - hybrid


Hierarchical Overlay Networks

Super nodes (peers) and leaf nodes.


Self* Properties (Promises)
• Self-Organizing: evolves, grows, ... without being guided/managed
• Self-Healing: resilient to failures and dynamics
• Self-...


Distributed Hash Tables
• Hash table: given a key, return the bucket id. Based on a hash function (like SHA-1).
• Now: distributed. For a given key, return the id of the peer currently responsible for the key.
• Challenge: purely distributed protocols that cope with node failures, departures, arrivals. No central manager.


Chord
• uses an m-bit identifier space ordered in a mod-2^m circle, the Chord ring;
• maps peers and objects to identifiers in the Chord ring, using the hash function SHA-1 (base hash function, bhash());
• uses consistent hashing: an object with identifier id is placed on the successor peer, succ(id), which is the first node whose identifier is equal to, or follows, id on the Chord ring.

Key k (e.g., hash(file name)) is assigned to the node with key p (e.g., hash(IP address)) such that k ≤ p and there is no node p' with k ≤ p' and p' < p.

[Figure: Chord ring with peers p1, p8, p14, p21, p32, p38, p42, p48, p51, p56 and keys k10, k24, k30, k38, k54.]

Ion Stoica, Robert Morris, David R. Karger, M. Frans Kaashoek, Hari Balakrishnan: Chord: A scalable peer-to-peer lookup service for internet applications. SIGCOMM 2001: 149-160
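The successor rule can be sketched in a few lines. This is an illustration under assumptions: m = 6 (a 64-position ring, which fits the peer ids of the example ring), identifiers are used directly instead of SHA-1 hashes, and the name `successor` is made up for this sketch.

```python
from bisect import bisect_left

M = 6                                  # assumed identifier length in bits
PEERS = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]

def successor(peers, key, m=M):
    """Return the first peer id that is equal to or follows `key` on the ring."""
    ring = sorted(p % 2 ** m for p in peers)
    index = bisect_left(ring, key % 2 ** m)
    return ring[index % len(ring)]     # wrap around past the largest id

# Keys of the example ring and the peers they are placed on.
placement = {k: successor(PEERS, k) for k in (10, 24, 30, 38, 54)}
```

Keys beyond the largest peer id wrap around, so under these assumptions a key 60 would be stored on p1.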


Chord
• peer n maintains routing information about peers that lie on the Chord ring at logarithmically increasing distance (finger tables)

[Figure: Chord ring with peers p1, p8, p14, p21, p32, p38, p42, p48, p51, p56 and key k54; Lookup(54) is shown together with the finger tables of p8, p42, and p51.]

Finger table of p8:
  p8 + 1   → p14
  p8 + 2   → p14
  p8 + 4   → p14
  p8 + 8   → p21
  p8 + 16  → p32
  p8 + 32  → p42

Finger table of p42:
  p42 + 1   → p48
  p42 + 2   → p48
  p42 + 4   → p48
  p42 + 8   → p51
  p42 + 16  → p1
  p42 + 32  → p14

Finger table of p51:
  p51 + 1   → p56
  p51 + 2   → p56
  p51 + 4   → p56
  p51 + 8   → p1
  p51 + 16  → p8
  p51 + 32  → p21

Lookup(54)
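The finger tables and the routing of Lookup(54) can be reproduced with a short sketch. It assumes m = 6 and global knowledge of the peer list (a real Chord node only knows its own fingers); the helper names `succ`, `finger_table`, `between`, and `lookup` are illustrative.

```python
M = 6                                   # assumed identifier length in bits
RING = 2 ** M
PEERS = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]

def succ(ident):
    """First peer whose id is equal to or follows ident on the ring."""
    ident %= RING
    return next((p for p in PEERS if p >= ident), PEERS[0])

def finger_table(n):
    """Finger i of peer n points to succ(n + 2^i), for i = 0 .. M-1."""
    return [succ(n + 2 ** i) for i in range(M)]

def between(x, a, b):
    """True if x lies in the open ring interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b               # interval wraps around zero

def lookup(start, key):
    """Route a lookup for `key` from peer `start`; return the visited peers."""
    path, n = [start], start
    while True:
        nxt = succ(n + 1)               # direct successor of n
        if key == nxt or between(key, n, nxt):
            path.append(nxt)            # successor of n is responsible
            return path
        # otherwise jump to the closest finger that precedes the key
        n = next((f for f in reversed(finger_table(n))
                  if between(f, n, key)), nxt)
        path.append(n)

route = lookup(8, 54)
```

Each hop at least halves the remaining ring distance to the key, which is where the logarithmic lookup cost comes from.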


Node Joins in Chord

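[Figure: peer p42 joins the ring between p38 and p48; p48 initially stores keys k39, k40, and k43. Steps shown: p42 performs lookup(42); p42 sets its succ pointer; keys k39 and k40 are moved to p42; p38 updates its succ pointer.]

Pseudocode for the join:

  init_finger_tables()
  successor = node.find_successor()
  predecessor = successor.predecessor
  predecessor.successor = new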


And others ...
• P-Grid: Karl Aberer: P-Grid: A Self-Organizing Access Structure for P2P Information Systems. CoopIS 2001: 179-194
• CAN: Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard M. Karp, Scott Shenker: A scalable content-addressable network. SIGCOMM 2001: 161-172
• Pastry: Antony I. T. Rowstron, Peter Druschel: Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems. Middleware 2001: 329-350
• Bamboo: Sean Rhea, Dennis Geels, Timothy Roscoe, and John Kubiatowicz: Handling Churn in a DHT. Proceedings of the USENIX Annual Technical Conference, June 2004.


Range queries – Load Balancing

Range queries
• Objects are database tuples of a k-attribute relation R(A1, A2, …, Ak), where attribute Ai has domain D_Ai
• A range query [v1(A), v2(A)] on attribute A searches for those peers which store tuples t whose attribute A has value v ∈ [v1(A), v2(A)]

Load balancing
• There are two main solutions to cope with load imbalances, i.e., to perform load balancing:
  - transferring load, or
  - replicating data


DHTs and Range Queries

DHTs efficiently support only equality (exact-match) queries.

The naïve approach to processing range queries in DHTs is to query each value of the range individually.

It is HIGHLY EXPENSIVE!


DHT and Range Queries (2)

Existing approaches to deal with range queries:
• Locality-preserving hashing
  - OP-Chord: Triantafillou et al. (2003)
  - Skip Graphs: Aspnes et al. (2004)
• Hashing ranges of values instead of each value individually
  - CAN-based: Andrzejak et al. (2002), Sahin et al. (2004)
• Approximation methods
  - Min-wise independent permutation hashing: Gupta et al. (2003)

Another problem in that context: access load imbalances

One possible solution: transferring “hot data” to deal with those load imbalances

However, data transfer does not solve access load imbalances in skewed access (query) distributions


Building a P2P Search Engine (Peer-to-Peer Information Retrieval)

“Distributed Google”

P2P approach best suited:
• large number of peers
• exploit mostly idle resources
• intellectual input of user community
• scalable and self-organizing


Information Retrieval Basics

[Figure: a document is mapped to its terms, each annotated with its number of occurrences (term frequency), e.g., 5 x, 7 x, 4 x.]


Information Retrieval Basics (2)

• Index lists with (DocId: tf*idf), sorted by score
• B+ tree on terms

Top-k query processing: find the k documents with the highest total score.

Query execution: usually using some kind of threshold algorithm* (e.g., Fagin's algorithm TA, or a variant without random accesses):
  - sequential scans over the index lists (round-robin)
  - (random accesses to fetch missing scores)
  - aggregate scores
  - stop when the threshold is reached

[Figure: example index lists, one per term, holding (DocId: score) entries such as d53: 0.8, d28: 0.7, d51: 0.6, d12: 0.5, ..., d17: 0.1.]

* Part III of this tutorial!
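The query execution steps can be sketched as follows. This is a simplified version of the threshold algorithm with random accesses, assuming summation as the aggregation function; the function name, the two index lists, and their scores are made up for illustration.

```python
import heapq

def threshold_algorithm(index_lists, k):
    """Return the top-k (doc, total score) pairs.

    index_lists: one list per query term, each holding (doc, score)
                 pairs sorted by descending score.
    """
    random_access = [dict(lst) for lst in index_lists]
    totals = {}                                   # doc -> aggregated score
    for depth in range(max(len(lst) for lst in index_lists)):
        last_seen = []
        for lst in index_lists:                   # round-robin sorted access
            if depth >= len(lst):
                last_seen.append(0.0)
                continue
            doc, score = lst[depth]
            last_seen.append(score)
            if doc not in totals:                 # random accesses for the rest
                totals[doc] = sum(t.get(doc, 0.0) for t in random_access)
        threshold = sum(last_seen)                # best score any unseen doc can reach
        top = heapq.nlargest(k, totals.items(), key=lambda item: item[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top                            # stop early
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])

# Hypothetical index lists for two query terms.
list_a = [("d1", 0.9), ("d2", 0.8), ("d3", 0.1)]
list_b = [("d2", 0.7), ("d3", 0.6), ("d1", 0.2)]
top1 = threshold_algorithm([list_a, list_b], k=1)
top2 = threshold_algorithm([list_a, list_b], k=2)
```

For k = 1 the scan stops after two rounds: d2 is already known with its full score, and no unseen document can exceed the threshold 0.8 + 0.6.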


Going distributed: Index Organization

• peer index
  - every peer has its own collection (full documents)
  - distributed index = index of peer descriptions
• document index

[Figure: the example index lists of (DocId: score) entries, partitioned across Peer 1, Peer 2, and Peer 3.]


(Full) Document Index
• Straightforward from centralized document index
• Each peer is responsible for storing the index list for a subset of terms.
• Query routing: DHT lookups
• Query execution: distributed top-k [TPUT '04, KLEE '05] (cf. Part III of this tutorial)

[Figure: Chord ring with peers p1, p8, p14, p21, p32, p38, p42, p48, p51, p56.]


Peer Index
• Each peer has its own local index (e.g., created by web crawls)
• Peers publish compact per-term descriptions about their index
• Query routing:
  1. DHT lookups
  2. Retrieve metadata
  3. Find most promising peers
• Query execution: send the complete query and merge the incoming results

Distributed directory (term: list of peers):
  a: P1 P6 P4
  b: P5 P3 P1 P6 ...

[Figure: network of peers P1, P2, P3, P4, P5, P6.]


P2P Search with Minerva

• Query routing aims to optimize benefit/cost, driven by distributed statistics on peers' content quality, content overlap, freshness, authority, trust, etc.
• Based on a scalable, churn-resilient DHT with O(log n) key lookup
• Maintain semantic/social/statistical overlay network (SON)
• Exploit community behavior (bookmarks, links, tags, clicks, etc.)

[Figure: query peer P0 with local index X0 and bookmarks B0; peer ranking & statistics; peer lists (directory):
  term g: 13, 11, 45, ...
  term a: 17, 11, 92, ...
  term f: 43, 65, 92, ...
  term c: 13, 92, 45, ...
  url x: 37, 44, 12, ...
  url y: 75, 43, 12, ...
  url z: 54, 128, 7, ...]


Two Major Problems
• Task of merging the obtained results into a final ranking: Result Merging
• Task of finding “high quality” peers: Query Routing (aka database/collection/peer selection)

Overview articles:
• J. Callan. (2000). "Distributed information retrieval." In W. B. Croft, editor, Advances in Information Retrieval. Kluwer Academic Publishers. (pp. 127-150).
• Weiyi Meng, Clement T. Yu, King-Lup Liu: Building efficient and effective metasearch engines. ACM Comput. Surv. 34(1): 48-89 (2002)


Query Routing

Given a query Q = {term1, term2, ..., termN}: select the most promising peers.

Based on:
• per-term, per-peer statistics
  - document frequency
  - vocabulary size
• + normalization issues like
  - collection frequency
  - avg vocabulary size

Most popular: CORI, GlOSS, Decision Theoretic Framework (DTF)


CORI

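Apply document ranking to resource ranking.

$$T = \frac{df}{df + 50 + 150 \cdot cw / avg\_cw}$$

$$I = \frac{\log\left(\frac{C + 0.5}{cf}\right)}{\log\left(C + 1.0\right)}$$

$$p(r_k \mid c_i) = b + (1 - b) \cdot T \cdot I$$

$$S(R_i, Q) = \frac{1}{n} \sum_{r_j \in Q} p(r_j \mid R_i)$$

[Figure: inference network with the layers Resources (R1, R2, ..., Rj-1, Rj), Terms (r1, r2, r3, ..., rk), Query concepts (c1, c2, c3, ..., cm), and Query (q).]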
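A small sketch of how these CORI quantities combine into a resource score. The reading of the symbols follows the usual CORI description and is an assumption here: df is the number of documents in the resource that contain the term, cf the number of resources that contain the term, cw the number of words in the resource, avg_cw the average cw over all resources, C the number of resources, and b a default belief (0.4 is a commonly used value). Function names and sample statistics are made up.

```python
import math

def cori_belief(df, cf, cw, avg_cw, num_resources, b=0.4):
    """Belief p(r | R) that resource R is relevant for a single term r."""
    t = df / (df + 50 + 150 * cw / avg_cw)
    i = math.log((num_resources + 0.5) / cf) / math.log(num_resources + 1.0)
    return b + (1 - b) * t * i

def cori_score(query_terms, df_of_resource, cf, cw, avg_cw, num_resources):
    """S(R, Q): average belief over all query terms."""
    beliefs = [
        cori_belief(df_of_resource.get(term, 0), cf[term], cw, avg_cw, num_resources)
        for term in query_terms
    ]
    return sum(beliefs) / len(beliefs)

# Hypothetical statistics: 10 resources, two query terms.
cf = {"chord": 2, "pagerank": 5}
peer_a = cori_score(["chord", "pagerank"], {"chord": 120, "pagerank": 80},
                    cf, cw=10_000, avg_cw=10_000, num_resources=10)
peer_b = cori_score(["chord", "pagerank"], {"pagerank": 3},
                    cf, cw=10_000, avg_cw=10_000, num_resources=10)
```

A resource without any document for a term still receives the default belief b for that term, so scores never drop to zero.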


Literature
• J. Callan. (2000). "Distributed information retrieval." In W. B. Croft, editor, Advances in Information Retrieval. Kluwer Academic Publishers. (pp. 127-150).
• Weiyi Meng, Clement T. Yu, King-Lup Liu: Building efficient and effective metasearch engines. ACM Comput. Surv. 34(1): 48-89 (2002)
• CORI: James P. Callan, Zhihong Lu, W. Bruce Croft: Searching Distributed Collections with Inference Networks. SIGIR 1995: 21-28
• GlOSS: Luis Gravano, Hector Garcia-Molina, Anthony Tomasic: GlOSS: Text-Source Discovery over the Internet. ACM Trans. Database Syst. 24(2): 229-264 (1999)
• Decision Theoretic Framework: Norbert Fuhr: A Decision-Theoretic Approach to Database Selection in Networked IR. ACM Trans. Inf. Syst. 17(3): 229-249 (1999)


Result Merging

Problem: incomparable scores
• Different corpus statistics
  - df component used in tf*idf scoring functions is not globally known
  - user with a lot of high-quality documents for term a: high df
  - non-expert user with some bad documents for term a: low df
• Different scoring functions
  - completely different functions
  - different parameters in the same function


Result Merging Approaches

Score normalization by
• using global statistics
  - computation of global statistics difficult (not obvious)
  - solution using gossip
• score re-computation with query initiator's local statistics
  - requires re-ranking and knowledge about document contents
• score re-computation using query routing scores
  - routing score available anyway

$$D' = \frac{D - D^{i}_{min}}{D^{i}_{max} - D^{i}_{min}} \qquad R'_i = \frac{R_i - R_{min}}{R_{max} - R_{min}}$$

$$D'' = \frac{D' + 0.4 \cdot D' \cdot R'_i}{1.4}$$
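The re-computation using routing scores can be sketched as below, assuming that D is a document score returned by peer i, R_i is that peer's routing score, and the minima and maxima are known to the query initiator. Function and parameter names are illustrative.

```python
def min_max(value, lowest, highest):
    """Scale value into [0, 1]; a degenerate range maps to 0."""
    if highest <= lowest:
        return 0.0
    return (value - lowest) / (highest - lowest)

def merged_score(doc_score, doc_min, doc_max, peer_score, peer_min, peer_max):
    """D'' = (D' + 0.4 * D' * R') / 1.4 with min-max normalised D' and R'."""
    d = min_max(doc_score, doc_min, doc_max)
    r = min_max(peer_score, peer_min, peer_max)
    return (d + 0.4 * d * r) / 1.4

# Best document of the best peer versus best document of the weakest peer.
best_of_best = merged_score(0.9, 0.1, 0.9, peer_score=0.8, peer_min=0.2, peer_max=0.8)
best_of_worst = merged_score(0.9, 0.1, 0.9, peer_score=0.2, peer_min=0.2, peer_max=0.8)
```

The division by 1.4 keeps merged scores in [0, 1]; a document from the weakest peer can reach at most 1/1.4 ≈ 0.71.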


Global DF Estimation

• gdf (global document frequency) of a term is an interesting key measure, but overlap among peers makes simple distributed counting infeasible
• hash sketches [Flajolet/Martin 1985]: duplicate-insensitive cardinality estimator for multisets
  - hash each multiset element x onto an m-bit bitvector and remember the least-significant 1 bit ρ(h(x))
  - $\max_{x \in S} \rho(h(x))$ estimates ≈ $\log_2(0.77351 \cdot |S|)$, with std.dev. / |S| =
  - rough intuition: average multiple iid sketches


Global DF Estimation
• Hash sketches of different peers collected at directory peer
• distributivity is free! $\bigcup_i \{\rho(h(x)) \mid x \in S_i\} = \{\rho(h(x)) \mid x \in \bigcup_i S_i\}$

gdf estimation algorithm:
• each peer p posts hash sketch for each (discriminative) term t to directory
• directory peer for term t forms union of incoming hash sketches
• when a peer needs to know gdf(t), simply ask directory peer for t
• sliding-window techniques for dynamic adjustment

Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum: Global Document Frequency Estimation in Peer-to-Peer Web Search. WebDB 2006
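A minimal hash-sketch illustration of the gdf algorithm. Assumptions: a sketch is stored as an integer bitmap of the observed ρ values, SHA-1 (truncated to 32 bits and salted with a seed) stands in for the hash function, and the estimator averages the lowest unset bit position over several independent sketches. All names are made up for this sketch, and the estimate is only rough for small sets.

```python
import hashlib

PHI = 0.77351            # Flajolet-Martin correction constant
NUM_BITS = 32            # assumed sketch width

def rho(value):
    """0-based position of the least-significant 1 bit (NUM_BITS if none)."""
    return (value & -value).bit_length() - 1 if value else NUM_BITS

def h(item, seed):
    digest = hashlib.sha1(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

def sketch(items, seed=0):
    """Bitmap with bit rho(h(x)) set for every element x."""
    bits = 0
    for item in items:
        position = rho(h(item, seed))
        if position < NUM_BITS:
            bits |= 1 << position
    return bits

def union(sketch_a, sketch_b):
    """Sketch of the union of two collections: bitwise OR."""
    return sketch_a | sketch_b

def estimate(bitmaps):
    """Estimate the distinct count from several independent sketches."""
    def lowest_zero(bits):
        position = 0
        while bits & (1 << position):
            position += 1
        return position
    mean = sum(lowest_zero(b) for b in bitmaps) / len(bitmaps)
    return 2 ** mean / PHI

# Two peers with overlapping document sets for the same term.
peer_1 = [f"doc{i}" for i in range(0, 600)]
peer_2 = [f"doc{i}" for i in range(400, 1000)]
directory = [union(sketch(peer_1, s), sketch(peer_2, s)) for s in range(16)]
gdf_estimate = estimate(directory)
```

Because the union is a bitwise OR, documents held by both peers are counted once, which is exactly what makes the directory-side aggregation cheap.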


Autonomous Peers Overlapping Sources

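[Figure: a querying peer choosing among peers A, B, C, D, E with overlapping collections. Two plots of recall vs. #peers: the selection {A}, {A,B}, {A,B,C}, {A,..,D} over four peers, and an overlap-aware routing strategy selecting {A}, {A,E} over two peers.]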


How?
• Enrich published statistics with overlap estimators.
• Interested in NOVELTY and QUALITY
• Iterative greedy selection process
  - select first peer based on quality
  - select next peer by quality*novelty

Suitable synopses for overlap estimation:
• Bloom filter [Bloom 1970]
• hash sketches [Flajolet & Martin 1985]
• min-wise independent permutations [Broder 1997]
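The greedy selection can be sketched as follows. To keep the example short it uses exact document-id sets where a real system would only hold synopses, and it defines novelty as the fraction of a peer's documents not yet covered; both choices, the function name, and the sample data are assumptions.

```python
def select_peers(collections, quality, budget):
    """Greedy overlap-aware peer selection.

    collections: peer -> set of document ids (stand-in for a synopsis)
    quality:     peer -> quality score (e.g., from per-term statistics)
    budget:      maximum number of peers to contact
    """
    chosen, covered = [], set()
    candidates = sorted(collections)
    while candidates and len(chosen) < budget:
        def score(peer):
            if not chosen:                      # first peer: quality only
                return quality[peer]
            docs = collections[peer]
            novelty = len(docs - covered) / len(docs) if docs else 0.0
            return quality[peer] * novelty      # later peers: quality * novelty
        best = max(candidates, key=score)
        chosen.append(best)
        covered |= collections[best]
        candidates.remove(best)
    return chosen

# Hypothetical peers: B, C, D mostly duplicate A; E is small but disjoint.
collections = {
    "A": {1, 2, 3, 4, 5, 6},
    "B": {1, 2, 3, 4, 5},
    "C": {2, 3, 4, 6},
    "D": {1, 6},
    "E": {7, 8, 9},
}
quality = {"A": 1.0, "B": 0.9, "C": 0.8, "D": 0.7, "E": 0.5}
selected = select_peers(collections, quality, budget=2)
```

Ranking by quality alone would pick A and B, which adds no new documents; the novelty factor steers the second choice to E.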


Min-Wise Independent Permutations [Broder 97]

Compute N random permutations of the set of ids (here with linear hash functions) and keep the minimum of each permutation. The MIPs vector holds these N minima.

  set of ids:               17  21   3  12  24   8
  h1(x) = 7x + 3 mod 51:    20  48  24  36  18   8   → min 8
  h2(x) = 5x + 6 mod 51:    40   9  21  15  24  46   → min 9
  ...
  hN(x) = 3x + 9 mod 51:     9  21  18  45  30  33   → min 9

P[min{π(x) | x ∈ S} = π(x)] = 1/|S|

MIPs are an unbiased estimator of overlap:
P[min{h(x) | x ∈ A} = min{h(y) | y ∈ B}] = |A ∩ B| / |A ∪ B|

Example:
  MIPs(set1) = (8, 9, 33, 24, 36, 9)
  MIPs(set2) = (8, 24, 45, 24, 48, 13)
  estimated overlap = 2/6
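The example can be recomputed with a few lines. The linear hash parameters and id set are the ones from the slide; the function names are illustrative, and linear hashes modulo 51 are only a toy stand-in for truly min-wise independent permutations.

```python
def mips_vector(ids, hash_params, modulus=51):
    """Minimum of h(x) = (a*x + b) mod modulus over all ids, per hash function."""
    return [min((a * x + b) % modulus for x in ids) for a, b in hash_params]

def estimated_overlap(mips_a, mips_b):
    """Fraction of positions where both MIPs vectors agree."""
    matches = sum(1 for p, q in zip(mips_a, mips_b) if p == q)
    return matches / len(mips_a)

ids = [17, 21, 3, 12, 24, 8]
params = [(7, 3), (5, 6), (3, 9)]            # h1, h2, hN from the slide
minima = mips_vector(ids, params)

overlap = estimated_overlap([8, 9, 33, 24, 36, 9], [8, 24, 45, 24, 48, 13])
```

Here `minima` is `[8, 9, 9]`, and the two example vectors agree in positions 1 and 4, giving the estimate 2/6 for |A ∩ B| / |A ∪ B|.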


Bloom Filter [Bloom 1970]
• bit array of size m
• k hash functions h_i: docId_space → {1,..,m}
• insert n docs by hashing the ids and setting the corresponding bits
• document is in the Bloom filter if the corresponding bits are set
• probability of false positives (pfp)
• tradeoff accuracy vs. efficiency

[Figure: bit array with bits 1 to 16 and two hash functions. Two documents are inserted with (h1, h2) = (9, 2) and (14, 6), setting bits 2, 6, 9, and 14. A lookup with (h1, h2) = (15, 9) fails (X); a lookup with (h1, h2) = (6, 2) finds both bits set.]

Andrei Broder and Michael Mitzenmacher: Network Applications of Bloom Filters: A Survey. Internet Mathematics 1(4). 2005.
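A compact Bloom filter sketch. The k hash functions are simulated by salting SHA-1 with the function index, which is an implementation choice of this example, and `false_positive_rate` uses the standard approximation (1 - e^{-kn/m})^k, which is not taken from the slide.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, m, k):
        self.m = m                      # number of bits
        self.k = k                      # number of hash functions
        self.bits = [False] * m

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = True

    def __contains__(self, item):
        # True for every inserted item; may also be True for others
        return all(self.bits[position] for position in self._positions(item))

def false_positive_rate(m, k, n):
    """Approximate false-positive probability after inserting n items."""
    return (1 - math.exp(-k * n / m)) ** k

bloom = BloomFilter(m=1024, k=3)
for doc_id in ("d17", "d44", "d52"):
    bloom.add(doc_id)
pfp = false_positive_rate(m=1024, k=3, n=3)
```

A larger m lowers the false-positive probability at the cost of a bigger synopsis to publish, which is the accuracy vs. efficiency tradeoff named above.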


Multi-Key Statistics

solves interesting problem:
• a peer with lots of documents on american football and lots of documents about pop music, but not a single document about american music
• cannot be predicted using per-term statistics

$$\text{quality}(a \text{ and } b) \neq \text{quality}(a) + \text{quality}(b)$$

Obvious: recall that $\text{quality}(a) \approx df(a)$:

$$T = \frac{df}{df + 50 + 150 \cdot cw / avg\_cw} \; \ldots$$


Multi-Key Statistics in P2P

Motivation: quality(a and b) = quality(a) + quality(b) = df_a + df_b ≠ df_(a and b)

• Impossible (infeasible) to consider all term pairs, triplets, quadruples, ...
• Query driven: analyze query logs @ directory peers.
• + Data-driven verification:
  - P[Anna|Kournikova] = ......
  - P[Andy|Roddick] =
  - P[Berlin|Marathon] =
• No additional messages + shorter lists + highly accurate
• Additional statistics often not needed
• Whole process can be easily integrated into peer-level P2P IR

Sebastian Michel, Matthias Bender, Nikos Ntarmos, Peter Triantafillou, Gerhard Weikum, Christian Zimmer: Discovering and exploiting keyword and attribute-value co-occurrences to improve P2P routing indices. CIKM 2006: 172-181


Single-term vs. multi-term P2P document indexing

Single-term indexing (small vocabulary, long posting lists), distributed over PEER 1 ... PEER N:
  term 1    → posting list 1
  term 2    → posting list 2
  ...
  term M-1  → posting list M-1
  term M    → posting list M

Multi-term indexing with multi-term keys (large vocabulary, short posting lists):
  PEER 1:
    key 11  → posting list 11
    key 12  → posting list 12
    ...
    key 1i  → posting list 1i
  ...
  PEER N:
    key N1  → posting list N1
    key N2  → posting list N2
    ...
    key Nj  → posting list Nj

• make use of highly discriminative keys
• limit influence of overly long index lists
• consider term pairs (triplets, ...) for shorter lists
• efficient query processing

Gleb Skobeltsyn, Toan Luu, Ivana Podnar Zarko, Martin Rajman, Karl Aberer: Web text retrieval with a P2P query-driven index. SIGIR 2007: 679-686


Literature

Overlap awareness:
• Ronak Desai, Qi Yang, Zonghuan Wu, Weiyi Meng, Clement T. Yu: Identifying redundant search engines in a very large scale metasearch engine context. WIDM 2006: 51-58
• Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum, Christian Zimmer: Improving collection selection with overlap awareness in P2P search engines. SIGIR 2005: 67-74
• Thomas Hernandez, Subbarao Kambhampati: Improving text collection selection with coverage and overlap statistics. WWW (Special interest tracks and posters) 2005: 1128-1129

Sketches:
• Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael Mitzenmacher: Min-Wise Independent Permutations. J. Comput. Syst. Sci. 60(3): 630-659 (2000)
• Philippe Flajolet, G. Nigel Martin: Probabilistic Counting Algorithms for Data Base Applications. J. Comput. Syst. Sci. 31(2): 182-209 (1985)
• Andrei Broder and Michael Mitzenmacher: Network Applications of Bloom Filters: A Survey. Internet Mathematics 1(4). 2005.


Literature

Multi-key statistics:
• Sebastian Michel, Matthias Bender, Nikos Ntarmos, Peter Triantafillou, Gerhard Weikum, Christian Zimmer: Discovering and exploiting keyword and attribute-value co-occurrences to improve P2P routing indices. CIKM 2006: 172-181
• Ivana Podnar, Martin Rajman, Toan Luu, Fabius Klemm, Karl Aberer: Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys. ICDE 2007: 1096-1105
• Gleb Skobeltsyn, Toan Luu, Ivana Podnar Zarko, Martin Rajman, Karl Aberer: Web text retrieval with a P2P query-driven index. SIGIR 2007: 679-686


Part II – Social Search


Motivation
• People connected through a network
  - People create links to other people
  - Links can express friendship, recommendations, etc.
  - Different graph structures appear
• Sharing interests
  - Enables users to find others who share common interests
  - Similar users can provide relevant content
• Users and content spread at different sites
  - Distributed nature and continuously increasing size call for peer-to-peer approaches


Outline of the Second Part
• Link Analysis: The Web as a Graph (30 min)
  - PageRank
  - Distributed approaches: BlockRank, Local PageRank + ServerRank, Adaptive OPIC, JXP
• Identifying common interests – Semantic Overlay Networks (30 min)
  - Crespo and Garcia-Molina
  - pSearch
  - p2pDating
• Social Networks – A new paradigm (30 min)
  - What people share
  - Social graphs
  - Links, tags, users analysis


Links are everywhere…
…connecting Web pages

www.openp2p.com/...

www.searchtools.com

www.searchengines.com

www.searchengineguide.com

www.searchengineshowdown.com

searchenginewatch.com


Links are everywhere…
…connecting people

Example of a Flickr friends network


Links are everywhere…
…connecting products


Link Analysis
The set of nodes (e.g., Web pages, people, products) and the links connecting them define a graph

[Figure: graph over the pages www.openp2p.com/..., www.searchtools.com, www.searchengines.com, www.searchengineguide.com, www.searchengineshowdown.com, searchenginewatch.com]


Link Analysis
At the end we have something like this…

Lots of useful information can be obtained from the analysis of such graphs


Adjacency Matrix
• Matrix representation of graphs
• Given a graph G with n nodes, its adjacency matrix A is n × n, with

$a_{ij} = 1$, if there is a link from node $i$ to node $j$
$a_{ij} = 0$, otherwise
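The definition can be illustrated with a minimal sketch. The three-page graph and the function name `adjacency_matrix` are hypothetical and only serve the example.

```python
def adjacency_matrix(nodes, links):
    """Return the n x n adjacency matrix (a list of rows) of a directed graph."""
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    a = [[0] * n for _ in range(n)]
    for src, dst in links:
        a[index[src]][index[dst]] = 1  # a_ij = 1 if there is a link i -> j
    return a


# Hypothetical graph: A -> B, A -> C, B -> C, C -> A
nodes = ["A", "B", "C"]
links = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
A = adjacency_matrix(nodes, links)

# Row sums give outdegrees, column sums give indegrees.
out_degree = [sum(row) for row in A]
in_degree = [sum(A[i][j] for i in range(len(nodes))) for j in range(len(nodes))]
```

Row sums being outdegrees is what the PageRank transition probabilities 1/out(i) are later built from.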


PageRank – Exploring the Wisdom of Crowds

• Measures the relative importance of pages in the graph
• The importance of a page depends on the importance of the pages that point to it
• Random Surfer Model: once at a page, the surfer follows one of its outlinks with prob. α, or jumps to a random page with prob. (1 − α)
• PR: probability of being at a certain page after a sufficiently large number of steps

S. Brin & L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW Conf. 1998.


PageRank – Formal Definition:

$$PR(q) = \varepsilon \times \sum_{p \mid p \rightarrow q} \frac{PR(p)}{out(p)} + (1 - \varepsilon) \times \frac{1}{N}$$

N → Total number of pages;
PR(p) → PageRank of page p;
out(p) → Outdegree of p;
ε → Probability of following a link; (1 − ε) is the random jump probability

• Can be computed using the power iteration method
• In practice, more efficient versions can be used

Google is believed to use it on the Web graph, combined with other metrics, to rank its search results
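A minimal power-iteration sketch of the formula, assuming ε multiplies the link-following term. The value ε = 0.85, the uniform handling of pages without outlinks, and the example graph are assumptions of this sketch, not part of the slides.

```python
def pagerank(links, eps=0.85, tol=1e-10, max_iter=1000):
    """links: dict page -> list of successor pages. Returns dict page -> score."""
    pages = sorted(set(links) | {q for succ in links.values() for q in succ})
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Random-jump term: (1 - eps) * 1/N for every page.
        new = {p: (1.0 - eps) / n for p in pages}
        for p in pages:
            succ = links.get(p, [])
            if succ:
                # Link-following term: eps * PR(p) / out(p) to each successor.
                share = eps * pr[p] / len(succ)
                for q in succ:
                    new[q] += share
            else:
                # Assumption: a page without outlinks spreads its score uniformly.
                for q in pages:
                    new[q] += eps * pr[p] / n
        delta = sum(abs(new[p] - pr[p]) for p in pages)
        pr = new
        if delta < tol:
            break
    return pr


# Hypothetical graph: A -> B, A -> C, B -> C, C -> A
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(graph)
```

In this example C ends up with the highest score, since it is pointed to by both other pages.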


PageRank – Matrix Notation

A → Matrix containing the transition probabilities:

$$A = \varepsilon P^{T} + (1 - \varepsilon) E$$

where $P_{ij} = 1/out(i)$ if there is a link from i to j, 0 otherwise; E is the random jumps matrix

Probability distribution vector at time k:

$$\vec{x}^{(k)} = A^{k} \vec{x}^{(0)}$$

where $\vec{x}^{(0)}$ is the starting vector

PageRank → Stationary distribution of the Markov chain described by A, i.e., the principal eigenvector of A:

$$PageRank = \lim_{k \to \infty} \vec{x}^{(k)}$$


Going Distributed
• PageRank in principle needs the whole graph in one place
• Shortcomings:
  • Not scalable for huge graphs, like the Web
  • Slow updates: computing PageRank on such a huge graph can take weeks
  • Not suitable for different network architectures (e.g., P2P)
• Distributed approaches, where the graph is partitioned, are clearly needed
• Some distributed approaches (more details on the next slides):
  • Local PageRank + ServerRank (Wang et al.)
  • BlockRank (Kamvar et al.)
  • JXP (Parreira et al.)


The “Block Structure”
Most links are among Web pages inside the same host

[Figure: adjacency matrix whose 1-entries cluster in dense diagonal blocks, one block for the pages from Host A and one for the pages from Host B]

Block structure can be exploited for speeding up and/or distributing the PR computation


BlockRank
PageRank in three steps:
1. Compute “local PageRanks” of pages for each host, by considering only intra-host links
2. Compute the importance of each host, using the local PR values and the inter-host links
3. Combine the previous values to create the starting vector for the standard PR algorithm

• Speeds up computation
• Step 1 can be parallelized
• Still needs the whole matrix for step 3

S. Kamvar, T. Haveliwala, C. Manning & G. Golub. Exploiting the block structure of the web for computing pagerank. Technical report, Stanford University, 2003.
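The three steps can be sketched roughly as follows. This is a simplified illustration, not the exact procedure of the cited report: the host graph here weights each inter-host link by the local PR of its source page, and the two-host example, the helper names, ε = 0.85 and the dangling-node rule are all assumptions.

```python
def pagerank(nodes, weights, eps=0.85, iters=100, start=None):
    """Power iteration on a weighted directed graph.

    weights maps (src, dst) to a positive weight; each node's out-weights
    are normalised into transition probabilities.
    """
    n = len(nodes)
    out = {u: 0.0 for u in nodes}
    for (u, _), w in weights.items():
        out[u] += w
    x = dict(start) if start else {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # Random-jump share, plus the mass of dangling nodes spread uniformly
        # (the dangling-node rule is an assumption of this sketch).
        dangling = sum(x[u] for u in nodes if out[u] == 0.0)
        new = {u: (1.0 - eps) / n + eps * dangling / n for u in nodes}
        for (u, v), w in weights.items():
            new[v] += eps * x[u] * w / out[u]
        x = new
    return x


def blockrank_start(host_of, links):
    """Steps 1 and 2; returns the starting vector used in step 3."""
    hosts = sorted(set(host_of.values()))
    local = {}
    for h in hosts:  # Step 1: independent per host, hence parallelisable
        pages = sorted(p for p in host_of if host_of[p] == h)
        intra = {(u, v): 1.0 for u, v in links
                 if host_of[u] == h and host_of[v] == h}
        local.update(pagerank(pages, intra))
    inter = {}  # Step 2: host graph, edges weighted by the source's local PR
    for u, v in links:
        hu, hv = host_of[u], host_of[v]
        if hu != hv:
            inter[(hu, hv)] = inter.get((hu, hv), 0.0) + local[u]
    host_rank = pagerank(hosts, inter)
    return {p: local[p] * host_rank[host_of[p]] for p in host_of}


# Hypothetical example: three pages on host A, two on host B.
host_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
links = [("a1", "a2"), ("a2", "a3"), ("a3", "a1"), ("a1", "a3"),
         ("b1", "b2"), ("b2", "b1"), ("a3", "b1"), ("b2", "a1")]
all_pages = sorted(host_of)
all_links = {(u, v): 1.0 for u, v in links}

start = blockrank_start(host_of, links)
warm = pagerank(all_pages, all_links, start=start)  # Step 3
cold = pagerank(all_pages, all_links)               # uniform start, for comparison
```

Both runs reach the same scores; the BlockRank starting vector is only meant to get there in fewer iterations, and step 3 still operates on the full link set.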


Going Distributed…
Local PR + ServerRank

• Similar to BlockRank
• Local PR: PR computed inside each server using intra-server links
• ServerRank: PR computed on the server graph using inter-server links
  • The server graph does not need to be materialized; the computation is done by exchanging messages among servers
• Local PR and ServerRank are combined to approximate the true PR of a page
• Values can be further refined by using Local PR information in the ServerRank computation and vice versa
• Partitioning by server can be a limitation…

Y. Wang & D. J. DeWitt. Computing pagerank in a distributed internet search system. In VLDB, 2004.


Partition at “peer level”
In P2P networks, server partition is not suitable

[Figure: the global graph split into fragments held by Peer A, Peer B and Peer C]


Partition at “peer level”
Every peer crawls Web fragments at its discretion

• Peers have only local (incomplete) information
• Pages might link to, or be linked from, pages at other peers
• Overlaps between peers’ graphs may occur
• Peers are a priori unaware of other peers’ contents

[Figure: graph fragments held by Peer A, Peer B and Peer C]


Adaptive OPIC
• OPIC: Online Page Importance Computation
• Computes the importance of a page online, with few resources
• Algorithm:
  • Pages initially receive some cash
  • Pages are randomly visited
  • When a page is visited, its cash is distributed among the pages it points to
  • The importance of a page is computed from the cash history of that page

Serge Abiteboul, Mihai Preda, and Gregory Cobena. Adaptive on-line page importance computation. In WWW, 2003.
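The cash game can be sketched as below. This is a toy, single-machine illustration: the uniform random visiting order, the equal initial cash, the importance estimate (history plus current cash, normalised) and the example graph are assumptions, and every page is assumed to have at least one outlink.

```python
import random


def opic(links, visits, seed=0):
    """links: dict page -> non-empty list of successors. Returns importance estimates."""
    rng = random.Random(seed)
    pages = sorted(links)
    cash = {p: 1.0 / len(pages) for p in pages}  # pages initially receive some cash
    history = {p: 0.0 for p in pages}
    for _ in range(visits):
        p = rng.choice(pages)                    # pages are randomly visited
        history[p] += cash[p]                    # record the cash in the history
        share = cash[p] / len(links[p])
        cash[p] = 0.0
        for q in links[p]:                       # distribute cash to successors
            cash[q] += share
    total = sum(history.values()) + sum(cash.values())
    return {p: (history[p] + cash[p]) / total for p in pages}


# Hypothetical graph: A -> B, A -> C, B -> C, C -> A
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
importance = opic(graph, visits=20000)
```

For this graph the estimates approach 0.4, 0.2 and 0.4 for A, B and C, the stationary distribution of a random walk without random jumps. No link matrix is stored; only the per-page cash and history are kept.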


Adaptive OPIC
Example:
• Small Web of 3 pages
• Alice has all the cash to start (importance is independent of the initial state)

[Figure: the three pages Alice, Bob and George, annotated with the sequence A B A G B]

Cash-Game History:
• Alice received 600 (200+400): 40%
• Bob received 600 (200+100+300): 40%
• George received 300 (200+100): 20%


Adaptive OPIC
• No particular graph partition
• No need to store the link matrix
• Adapts to changes in the Web graph by considering only the recent part of the cash history for each page
  • Time window: [now − T, now]
• High number of messages exchanged
• Does not handle the case where the same page is stored at more than one place


The JXP Algorithm
• Decentralized algorithm for computing global authority scores of pages in a P2P network
• Runs locally at every peer
• No coordinator, asynchronous
• Combines local PageRank computations + meetings between peers
• JXP scores converge to the true global PageRank scores

Josiane Xavier Parreira, Carlos Castillo, Debora Donato, Sebastian Michel and Gerhard Weikum: The JXP Method for Robust PageRank Approximation in a Peer-to-Peer Web Search Network. The VLDB Journal, 2007.


The JXP Algorithm
“World Node”:
• Special node attached to the local graph at every peer
• Compact representation of all other pages in the network

“Special features”:
• All links from local pages to external pages point to the World Node
• Links from external pages that point to local pages (discovered during meetings) are represented at the World Node
  • Score and outdegree of these external pages are stored; World Node outgoing links are weighted to reflect the score mass given by the original link
• Self-loop link to represent transitions among external pages

[Figure: local graph extended with the World Node W]


The JXP Algorithm
Initialization step:
• The local graph is extended by adding the world node
• PageRank is computed on the extended graph → JXP scores

Main algorithm (for every peer Pi in the network):
• Select a peer Pj to meet
• Update the world node
  • Add edges for pages in Pj that point to pages in Pi
  • If an edge already exists at the world node, the score of the source page is updated by taking the higher of both scores
• Compute PageRank → JXP scores
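The world-node update of a meeting can be sketched as follows. The data layout, the function name and all scores and outdegrees are hypothetical; page names follow the Peer X / Peer Y illustration of this tutorial. Passing on the other peer's own world-node entries is how that illustration appears to work and is treated here as an assumption. The PageRank recomputation on the extended graph is left out.

```python
def update_world_node(local_pages, world, other_links, other_world):
    """Return peer Pi's world node after a meeting with peer Pj.

    local_pages: set of pages stored at Pi
    world:       Pi's world node, {(src, dst): (score, outdegree)}
    other_links: links leaving Pj's local pages, same layout
    other_world: Pj's own world-node entries, same layout
    """
    updated = dict(world)
    for edges in (other_links, other_world):
        for (src, dst), (score, outdeg) in edges.items():
            # Skip edges that do not end at a local page, or that start
            # at a page Pi stores itself.
            if dst not in local_pages or src in local_pages:
                continue
            if (src, dst) in updated:
                # Edge already known: keep the higher score of the source page.
                score = max(score, updated[(src, dst)][0])
            updated[(src, dst)] = (score, outdeg)
    return updated


# Hypothetical scores and outdegrees.
peer_x_pages = {"A", "B", "C", "D", "E"}
peer_x_world = {("G", "C"): (0.10, 2), ("J", "E"): (0.05, 1)}
peer_y_links = {("G", "C"): (0.12, 2), ("F", "A"): (0.08, 3),
                ("F", "E"): (0.08, 3), ("F", "G"): (0.08, 3),
                ("E", "B"): (0.20, 2)}
peer_y_world = {("K", "E"): (0.03, 1), ("L", "G"): (0.04, 1)}

new_world = update_world_node(peer_x_pages, peer_x_world,
                              peer_y_links, peer_y_world)
```

After the meeting, Peer X's world node holds G → C, J → E, F → A, F → E and K → E, with the score stored for G → C raised to the higher of the two values.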


The JXP Algorithm

[Figure: a meeting between Peer X and Peer Y.
Peer X before the meeting: local pages A, B, C, D, E and world node W; links to external pages A → F, E → G; W node entries G → C, J → E.
Peer Y: local pages E, F, G and world node W; links G → C, F → A, E → B; W node entries K → E, L → G.
Subgraph relevant to Peer X: F → A, F → E, F → G; W node entry K → E.
Peer X after the meeting: W node entries G → C, J → E, F → A, F → E, K → E.]

Theorem: “In a fair series of JXP meetings, the JXP scores of all nodes converge to the true global PR scores”


Locating parts of the Graph
“Finding peers that share common interests”
• Many applications can benefit from it
• Distributed PR
  • In principle, peers need to send content only to the peers that contain their successors
  • Random messages guarantee that those peers will eventually be reached, but part of the messages will be “wasted”


[Figure: Peer X (local pages A, B, C, D, E; links A → F, E → G; W node entries G → C, J → E) meets Peer Z (local pages L, M, N; links N → S, M → R; W node entry P → M). The subgraph of Peer Z relevant to Peer X is empty, and Peer X is unchanged after the meeting.]

WASTED MEETING!

We want to avoid it!


Locating parts of the Graph
Query answering
• Ideal: forward the query only to the peers that are most likely to provide good answers to it
• Query flooding is very expensive
• Hash-based queries are not suitable for approximate queries


Locating parts of the Graph
Locating “relevant” peers
• Increase performance
• Reduce traffic load

Idea: Group peers according to the semantics of their content and place them into different overlay networks


Semantic Overlay Networks
• Partition the P2P network into several thematic networks
• Peers with similar or beneficial/complementary content are “clustered” together
  • Queries for a given content will be forwarded only to peers with such content
  • Flooding in smaller networks with a smaller TTL (or more results with the same TTL)


Overlay Networks: Random vs. Semantic

Random
• Peers connect to a small set of random peers
• Queries are flooded through the network
• Peers with unrelated content receive the query
• Low performance: high number of messages
• Maximum recall (if all peers receive the query)

Semantic
• Peers connect to peers with related content → cluster of peers
• Peers identify the query’s topic and forward it only to the set of peers on that topic
• Messages to peers with unrelated content are avoided
• Better performance: smaller number of messages
• High recall expected but not maximum, since some peers do not receive the query


When creating SONs…
Two main things to consider:
• Node partitioning
• Clustering criteria

Node partitioning: when does a peer belong to SON A?
• When it contains a doc of type A
• When it contains more than x docs of type A
  • Fewer peers per SON → more results sooner
  • Fewer SONs per peer → fewer connections

Clustering criteria: clustering must provide
• Load balance
  • Each category has a similar number of nodes
  • Each node belongs to a small number of categories
• An easy and accurate way to classify a document
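The effect of the node-partitioning threshold x can be illustrated with a small sketch. The peers, categories and function names are hypothetical, and documents are assumed to be already classified.

```python
from collections import Counter


def build_sons(peer_docs, x=0):
    """Assign a peer to SON `cat` when it holds more than x docs of that category.

    peer_docs: dict peer -> list of document categories (one entry per doc).
    Returns dict category -> set of peers.
    """
    sons = {}
    for peer, categories in peer_docs.items():
        for cat, count in Counter(categories).items():
            if count > x:
                sons.setdefault(cat, set()).add(peer)
    return sons


def recall(peer_docs, peers, category):
    """Fraction of all docs of `category` stored at the contacted peers."""
    found = sum(peer_docs[p].count(category) for p in peers)
    total = sum(docs.count(category) for docs in peer_docs.values())
    return found / total


peer_docs = {
    "p1": ["rock", "rock", "rock", "jazz"],
    "p2": ["jazz", "jazz", "jazz"],
    "p3": ["rock", "pop", "pop", "pop"],
}
loose = build_sons(peer_docs, x=0)   # one doc of a type is enough
strict = build_sons(peer_docs, x=2)  # more than two docs of a type required

loose_recall = recall(peer_docs, loose["rock"], "rock")
strict_recall = recall(peer_docs, strict["rock"], "rock")
```

With x = 0 a “rock” query reaches p1 and p3 and finds every rock document; with x = 2 it reaches only p1, halving the messages at the cost of recall dropping to 0.75.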


Crespo and Garcia-Molina

• Uses a classification hierarchy to form the overlay networks
• Documents and queries are classified into one or more concepts
• Queries are forwarded to peers in the super/sub concepts

A. Crespo and H. Garcia-Molina. Semantic Overlay Networks for P2P Systems. Technical report, Stanford University, January 2003.


Crespo and Garcia-Molina
• Reported results show a significant reduction in the number of messages
  • Music file sharing scenario: to get half the documents that match a query:
    • SONs: 461 msgs
    • Gnutella: 1731 msgs
• SON links are “logical”: two peers that are connected in a SON can actually be many hops away from each other
• The requirement that the hierarchy and the classification algorithm are shared among all nodes might be a problem


pSearch
• Semantic overlay on top of Content Addressable Networks (CANs)
• Latent Semantic Indexing (LSI) is used to generate a semantic vector for each document
  • Semantic vectors are used as keys to store document indices in the CAN
  • Indices close in semantics are stored close in the overlay
• Two types of operations
  • Publish document indices
  • Process queries

Chunqiang Tang, Zhichen Xu, and Sandhya Dwarkadas. Peer-to-peer Information Retrieval Using Self-Organizing Semantic Overlay Networks. In SIGCOMM, 2003.


pSearch Key Idea

[Figure: a document and a query placed as points in the semantic space]


pSearch Key Idea

[Figure: the semantic space partitioned into zones A to I (three rows of three), with a document and a query placed as points]


Background: Vector Space Model
• Term vectors represent documents and queries
  • Elements correspond to the importance of a term in the document or query
• Statistical computation of vector elements
  • Term frequency * inverse document frequency
• Ranking of retrieved documents
  • Similarity between document vector and query vector


Background: Vector Space Model

A: “books on computer networks”
B: “network routing in P2P networks”
Q: “P2P network”

vocabulary   Va     Vb     Vq
computer     0.5    0      0
network      0.5    0.5    0.5
P2P          0      0.25   0.5
routing      0      0.25   0

Similarity to the query: Va · Vq = 0.25, Vb · Vq = 0.375
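The two values 0.25 and 0.375 are consistent with a plain dot product between each document vector and the query vector, which the following sketch reproduces. Reading them as dot products is an inference from the numbers; the slide does not name the similarity function.

```python
def similarity(u, v):
    """Dot product of two term vectors over the same vocabulary."""
    return sum(a * b for a, b in zip(u, v))


vocabulary = ["computer", "network", "P2P", "routing"]
v_a = [0.5, 0.5, 0.0, 0.0]    # A: "books on computer networks"
v_b = [0.0, 0.5, 0.25, 0.25]  # B: "network routing in P2P networks"
v_q = [0.0, 0.5, 0.5, 0.0]    # Q: "P2P network"

sim_a = similarity(v_a, v_q)  # 0.25
sim_b = similarity(v_b, v_q)  # 0.375

# Rank documents by decreasing similarity to the query.
ranking = sorted([("A", sim_a), ("B", sim_b)], key=lambda item: -item[1])
```

Document B ranks first because it shares both query terms with Q.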


Background: Latent Semantic Indexing
• The dimension of the document vectors has to match the dimension of the CAN
• Latent Semantic Indexing uses Singular Value Decomposition (SVD)
  • Maps a high-dimensional term vector to a low-dimensional semantic vector
  • Elements correspond to the importance of an abstract concept in the document/query
• Also helps to overcome the synonym problem (e.g., a user looks for “car” and does not find a document about “automobile”)


Background: Latent Semantic Indexing

[Figure: the terms × documents matrix with columns Va, Vb, … is mapped by SVD to the semantic vectors V’a, V’b, …]

SVD: singular value decomposition
• Reduce dimensionality
• Suppress noise
• Discover word semantics
  • Car <-> Automobile
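For reference, the mapping is commonly written as follows. The notation is the standard textbook one and is an assumption here, not taken from the slides: X is the terms × documents matrix with t terms and d documents, and k is the number of retained singular values, which in pSearch would be chosen to match the dimension of the CAN.

```latex
% Rank-k truncated SVD of the term-document matrix
X \approx X_k = U_k \Sigma_k V_k^{T},
\qquad U_k \in \mathbb{R}^{t \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{d \times k}

% Projection of a term vector (document or query) into the k-dimensional semantic space
V'_a = \Sigma_k^{-1} U_k^{T} V_a,
\qquad
V'_q = \Sigma_k^{-1} U_k^{T} V_q
```

Documents and queries are projected with the same operator, so they can be compared directly in the semantic space.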


pSearch Basic Algorithm: Steps
1. Receive a new document A: generate a semantic vector Va, store the key in the index
2. Receive a new query Q: generate a semantic vector Vq, route the query in the overlay
3. The query is flooded to nodes within a radius r
   • r is determined by a similarity threshold or the number of wanted documents
4. All receiving nodes do a local search and report references to the best matching documents
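The four steps can be sketched on a toy overlay. This is a heavily simplified stand-in for a CAN, not pSearch itself: the unit square is cut into a fixed grid with one node per zone, semantic vectors are assumed to be given as 2-dimensional points, closeness is measured by Euclidean distance, and the class and method names are hypothetical.

```python
import math


class GridOverlay:
    """Toy stand-in for a CAN: side x side zones over the unit square."""

    def __init__(self, side):
        self.side = side
        self.index = {(i, j): [] for i in range(side) for j in range(side)}

    def zone_of(self, vec):
        """Zone (node) responsible for a semantic vector."""
        return tuple(min(int(c * self.side), self.side - 1) for c in vec)

    def publish(self, doc_id, vec):
        # Step 1: store the document's key at the node owning its zone.
        self.index[self.zone_of(vec)].append((doc_id, vec))

    def _distance_to_zone(self, vec, zone):
        total = 0.0
        for c, z in zip(vec, zone):
            lo, hi = z / self.side, (z + 1) / self.side
            gap = max(lo - c, 0.0, c - hi)
            total += gap * gap
        return math.sqrt(total)

    def search(self, query_vec, radius, k):
        # Steps 2 and 3: reach every zone within `radius` of the query point.
        contacted = [zone for zone in self.index
                     if self._distance_to_zone(query_vec, zone) <= radius]
        # Step 4: local search at each contacted node, best matches first.
        hits = []
        for zone in contacted:
            for doc_id, vec in self.index[zone]:
                dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(query_vec, vec)))
                hits.append((dist, doc_id))
        hits.sort()
        return [doc_id for _, doc_id in hits[:k]], len(contacted)


overlay = GridOverlay(side=3)
overlay.publish("doc1", (0.40, 0.45))
overlay.publish("doc2", (0.60, 0.60))
overlay.publish("doc3", (0.95, 0.05))
results, nodes_contacted = overlay.search((0.5, 0.5), radius=0.2, k=2)
```

Only 5 of the 9 nodes are contacted for this query, and the distant doc3 is never examined. A larger radius trades more messages for a lower chance of missing relevant documents.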


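pSearch Illustration

[Figure: the query and the doc in the semantic space, with the search region for the query; the numbers 1 to 4 mark the steps of the basic algorithm]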


p2pDating
• Start with a randomly connected network
• Peers meet other peers they do not know (“blind dates”)
  • If a peer “likes” another, it will remember it as a “friend”: A remembers B → abstract link A → B
  • Directed links preserve peers’ autonomy
• SONs dynamically evolve from the meeting process

J. X. Parreira et al. p2pDating: Real Life Inspired Semantic Overlay Networks for Web Search. Information Processing & Management [43], 643-664

[Figure: the overlay when randomly connected, at iteration N, and at iteration N+1]


p2pDating
Finding new friends
• Random meetings (blind dates)
• Meet friends of friends

If A and B are friends, it is very likely that B’s friends are friends of A as well.

[Figure: peer A, its friend B, and B’s friends]
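The meeting process can be sketched as below. The “liking” rule (Jaccard similarity of the peers' collections, keeping the best `max_friends`), the probability of a blind date, and the six example peers are assumptions of this sketch; the cited work discusses several alternative friendship criteria.

```python
import random


def jaccard(a, b):
    """Overlap of two collections, as |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def run_dating(collections, rounds, max_friends=2, p_blind=0.5, min_sim=0.1, seed=0):
    """Return dict peer -> list of friends (directed links) after `rounds` rounds."""
    rng = random.Random(seed)
    peers = sorted(collections)
    friends = {p: [] for p in peers}
    for _ in range(rounds):
        for p in peers:
            # Friends of friends that p does not know yet.
            fof = sorted({g for f in friends[p] for g in friends[f]}
                         - {p} - set(friends[p]))
            if fof and rng.random() >= p_blind:
                candidate = rng.choice(fof)                          # friend of a friend
            else:
                candidate = rng.choice([q for q in peers if q != p])  # blind date
            pool = [q for q in set(friends[p]) | {candidate}
                    if jaccard(collections[p], collections[q]) >= min_sim]
            pool.sort(key=lambda q: (-jaccard(collections[p], collections[q]), q))
            friends[p] = pool[:max_friends]  # p remembers only its best friends
    return friends


collections = {
    "p1": {"rock", "jazz", "pop"},
    "p2": {"rock", "jazz", "blues"},
    "p3": {"jazz", "pop", "blues"},
    "p4": {"soccer", "tennis", "golf"},
    "p5": {"soccer", "tennis", "rugby"},
    "p6": {"tennis", "golf", "rugby"},
}
friends = run_dating(collections, rounds=300)
```

In this example the friend links settle into two clusters, p1 to p3 and p4 to p6, which play the role of the SONs. Each peer updates only its own list, so the links stay directed.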


Defining Good Friends
• Criteria for defining a good friend: a combination of different measures
  • History: credits for good behavior in the past
    • Response time, query result precision, etc.
  • Collection similarity
  • Collection overlap
    • Different ways of estimating the overlap between two collections
  • Number of links between peers
  • Etc.
• Peers might have more than one list of friends
  • E.g., according to different criteria


Going Social…
Before:
• Only a few content producers (e.g., companies, universities)
• Analysis was done using the content itself plus a few implicit recommendations (links)
• Very little information about the content consumers (mainly through query logs)

Nowadays:
• New technologies to facilitate content sharing
• Content consumers are now also content producers and content describers (e.g., explicit recommendations, tags, etc.)
• More and more crowd wisdom that can be harvested


Social Networks
• A social structure made of nodes (which are generally individuals or organizations) that are tied by one or more specific types of relations, such as
  • values
  • visions
  • ideas
  • friends
  • conflict
  • web links
  • etc.
• Social networks have been studied for over a century


Social Network Services
• Enable the creation of online social networks for communities of people who share interests and activities, or who are interested in exploring the interests and activities of others
• Online communities offer an easy way for users to publish and share their content.


Social Networking Growth
Several social networking sites have experienced dramatic growth during the past year.

Worldwide Growth of Selected Social Networking Sites
June 2007 vs. June 2006, Users Age 15+, Source: comScore

Social Networking Site    Total Unique Visitors (Mio.)    % Change
                          Jun-06        Jun-07
MySpace                    66.41        114.15               72
Facebook                   14.08         52.17              270
Hi5                        18.10         28.17               56
Friendster                 14.92         24.68               65
Orkut                      13.59         24.12               78
Bebo                        6.69         18.20              172
Tagged                      1.51         13.17              774


What people share…


Social Networks
Besides sharing content, a user can…
• …describe documents using tags
• …maintain a list of friends
• …comment on other users’ content, exchange opinions, discover users with similar profiles

In contrast to the Web graph, in social graphs users are part of the model


Social Content Graph

Sihem Amer-Yahia, Michael Benedikt, Philip Bohannon: Challenges in Searching Online Communities. IEEE Data Eng. Bull. 30(2): 23-31 (2007)


Social Graphs
• Other models are also possible
  • Directed vs. undirected edges
  • Etc.

[Figure: graph connecting users, tags and docs]

Standard IR techniques for Web retrieval are not effective on social networks; a lot of current research is dedicated to this area


Social Networks
The Wisdom of Crowds: Beyond PR
• Spectral analysis of various graphs
  • E.g., SocialPageRank, FolkRank
• Tag semantic analysis
  • Discovering semantics from tag co-occurrence
  • E.g., SocialSimRank
• Distributed view
  • Exploiting social relations to enhance search
  • E.g., PeerSpective


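Link Analysis in Social Networks
SocialPageRank
• Let $M_{UT}$, $M_{TD}$, $M_{DU}$ be the matrices corresponding to the relations Users-Tags, Tags-Docs, Docs-Users
• Compute iteratively:

$$\vec{r}_U = M_{DU}^{T} \times \vec{r}_D$$
$$\vec{r}_T = M_{UT}^{T} \times \vec{r}_U$$
$$\vec{r}_D = M_{TD}^{T} \times \vec{r}_T$$

[Figure: cycle between Documents, Users and Tags, with the steps labeled a, b, c]

S. Bao, X. Wu, B. Fei, G. Xue, Z. Su, Y. Yu: Optimizing Web Search Using Social Annotation. WWW 2007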
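The iteration can be sketched in plain Python as below, assuming the primes in the slide's formulas denote matrix transposes. Normalising each vector to sum 1 after every step is an assumption of this sketch, added to keep the values bounded, and the small annotation matrices are hypothetical.

```python
def mat_t_vec(m, v):
    """Return M^T v for a matrix given as a list of rows."""
    return [sum(m[i][j] * v[i] for i in range(len(m))) for j in range(len(m[0]))]


def normalize(v):
    total = sum(v)
    return [x / total for x in v]


def social_pagerank(m_ut, m_td, m_du, iters=50):
    """m_ut: users x tags, m_td: tags x docs, m_du: docs x users."""
    r_d = normalize([1.0] * len(m_du))
    for _ in range(iters):
        r_u = normalize(mat_t_vec(m_du, r_d))  # r_U = M_DU^T r_D
        r_t = normalize(mat_t_vec(m_ut, r_u))  # r_T = M_UT^T r_U
        r_d = normalize(mat_t_vec(m_td, r_t))  # r_D = M_TD^T r_T
    return r_u, r_t, r_d


# Hypothetical data: 3 docs, 2 users, 2 tags.
m_du = [[1, 1],      # doc0 annotated by both users
        [1, 0],      # doc1 annotated by user0
        [0, 1]]      # doc2 annotated by user1
m_ut = [[1, 1],      # user0 uses tag0 and tag1
        [1, 0]]      # user1 uses tag0
m_td = [[1, 1, 1],   # tag0 assigned to all docs
        [1, 0, 0]]   # tag1 assigned to doc0
r_u, r_t, r_d = social_pagerank(m_ut, m_td, m_du)
```

doc0, annotated by both users and carrying both tags, receives the highest score, while doc1 and doc2 tie.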


Link Analysis in Social Networks
FolkRank
• Define graph G as the union of the graphs Users-Tags, Tags-Docs, Docs-Users
• Assume each user has a personal preference vector $\vec{p}$
• Compute iteratively:

$$\vec{r}_D = \alpha \vec{r}_D + \beta M_G \times \vec{r}_D + \gamma \vec{p}$$

• FolkRank vector of docs is:

$$\vec{r}_D^{\,\gamma > 0} - \vec{r}_D^{\,\gamma = 0}$$

Andreas Hotho, Robert Jäschke, Christoph Schmitz, Gerd Stumme: Information Retrieval in Folksonomies: Search and Ranking. ESWC 2006: 411-426


Tag Similarity
SocialSimRank

Idea: Similar annotations (tags) are usually assigned to similar web pages by users with common interests.

sim(t1, t2) ~ aggr { sim(d1, d2) | (t1, d1), (t2, d2) ∈ Tagging }
sim(d1, d2) ~ aggr { sim(t1, t2) | (t1, d1), (t2, d2) ∈ Tagging }

S. Bao, X. Wu, B. Fei, G. Xue, Z. Su, Y. Yu: Optimizing Web Search Using Social Annotation. WWW 2007


Exploring friendship connections
PeerSpective: users can query their friends’ viewed pages
• HTTP proxies on users’ computers index all browsed content
• When a Google search is performed, the query is also sent to the other proxies in parallel

Alan Mislove, Krishna P. Gummadi, and Peter Druschel. Exploiting Social Networks for Internet Search. HotNets, 2006.


Social Networks
• New paradigm of publishing and searching content
• Rich data
  • Different link structures
  • User input for free!
• Relatively recent topic: lots of research opportunities
• The works mentioned are by no means a complete list; there is still a lot to do


Part III – Query Processing


For the IR people ....
Why top-k?

Cannot take a look at all matching documents
E.g., Google provides millions of documents about Britney Spears

Requires ranking (scoring):
In text retrieval for instance

+ of course PageRank if you wish

Remember Part one: Local Query Execution at each peer (peer-index-model) AND truly distributed top-k processing in the full document-index.


For the DB guys ...
Table with schema (id, attribute, value)

    SELECT id, aggr(value)
    FROM table
    GROUP BY id
    ORDER BY aggr(value) DESC
    LIMIT k


For the networking guys ...

Network Monitoring
Find clients that cause high network traffic.

IP             Bytes in kB
192.168.1.7    31
192.168.1.3    23
192.168.1.4    12

IP             Bytes in kB
192.168.1.8    81
192.168.1.3    33
192.168.1.1    12

IP             Bytes in kB
192.168.1.4    53
192.168.1.3    21
192.168.1.1    9

IP             Bytes in kB
192.168.1.1    29
192.168.1.4    28
192.168.1.5    12
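As a baseline for what a top-k query has to compute here, the following sketch simply ships all four traffic tables of this slide to one place, sums the bytes per IP, and returns the k heaviest clients. The names `logs` and `top_k_clients` are made up for illustration; the numbers are the ones shown on the slide.

```python
from collections import Counter

# One dictionary per monitoring point: IP -> traffic in kB (values from the slide).
logs = [
    {"192.168.1.7": 31, "192.168.1.3": 23, "192.168.1.4": 12},
    {"192.168.1.8": 81, "192.168.1.3": 33, "192.168.1.1": 12},
    {"192.168.1.4": 53, "192.168.1.3": 21, "192.168.1.1": 9},
    {"192.168.1.1": 29, "192.168.1.4": 28, "192.168.1.5": 12},
]


def top_k_clients(logs, k):
    """Sum the traffic per IP over all logs and return the k largest totals."""
    totals = Counter()
    for log in logs:
        totals.update(log)  # Counter.update adds the per-IP values
    return totals.most_common(k)


heaviest = top_k_clients(logs, 2)
# -> [("192.168.1.4", 93), ("192.168.1.8", 81)]
```

Note that 192.168.1.4 wins although it tops only one of the four tables, and 192.168.1.3 (77 kB in total) is never ranked first anywhere. This is why a top-k algorithm cannot just look at the head of each list.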


Computational Model
m lists with (itemId, score)-pairs sorted by score descending.
One list per attribute (e.g. term)
Aggregation function aggr()
Monotonicity is important: for all items a, b:

$$\forall i:\ \mathrm{score}_i(a) \le \mathrm{score}_i(b) \;\Rightarrow\; \mathrm{aggr}(a) \le \mathrm{aggr}(b)$$

with $\mathrm{score}_i(x)$ denoting the score of item x in list i

Goal: return the top-k items w.r.t. their aggregated (overall) scores


How to process this?
Most popular: Family of threshold algorithms

Fagin, 1999
Nepal/Ramakrishna, 1999
Güntzer/Balke/Kießling, 2001

Basic ideas:
keep upper and lower score bound for each document

lowerbound = sum of scores we have seen so far, assuming 0 for unseen dimensions
upperbound = lowerbound + highest possible value for unseen dimensions

know what we’ve got already; know what to expect
stop if no further step can improve the current (i.e. final) ranking


Fagin’s NRA

NRA(q,L):
  top-k := ∅; candidates := ∅; min-k := 0;
  scan all lists Li (i = 1..m) in parallel:
    consider item d at position posi in Li;
    E(d) := E(d) ∪ {i};
    highi := si(qi,d);
    worstscore(d) := aggr{sν(qν,d) | ν∈E(d)};
    bestscore(d) := aggr{aggr{sν(qν,d) | ν∈E(d)}, aggr{highν | ν∉E(d)}};
    if worstscore(d) > min-k then
      remove argmin_d’{worstscore(d’) | d’∈top-k} from top-k;
      add d to top-k;
      min-k := min{worstscore(d’) | d’∈top-k};
    else if bestscore(d) > min-k then
      candidates := candidates ∪ {d};
    threshold := max{bestscore(d’) | d’∈candidates};
    if threshold ≤ min-k then exit;
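A runnable variant of the pseudocode may help to see the bookkeeping. The sketch below is an assumption-laden simplification rather than the slide's exact procedure: aggregation is fixed to summation, all bounds are recomputed after every round of sorted accesses instead of being maintained incrementally, and the stopping test also covers documents not seen in any list yet (their best score is the sum of the current `high` values). The example lists use the first three entries per list from the tutorial's top-k search example (k = 1); the tails are shortened for illustration.

```python
from typing import Dict, List, Tuple

IndexList = List[Tuple[str, float]]  # (doc id, score), sorted by score descending


def nra(lists: List[IndexList], k: int) -> Tuple[List[Tuple[str, float]], int]:
    """Return (top-k docs with their worst scores, scan depth at which NRA stopped)."""
    m = len(lists)
    seen: Dict[str, Dict[int, float]] = {}  # doc -> {list index: score seen there}
    high = [0.0] * m                        # last score read from each list
    worst: Dict[str, float] = {}
    top: List[str] = []
    max_depth = max((len(lst) for lst in lists), default=0)
    for depth in range(max_depth):
        for i, lst in enumerate(lists):     # one sorted access per list
            if depth < len(lst):
                doc, score = lst[depth]
                seen.setdefault(doc, {})[i] = score
                high[i] = score
            else:
                high[i] = 0.0               # exhausted list contributes nothing more
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in s)
                for d, s in seen.items()}
        ranked = sorted(worst, key=lambda d: (-worst[d], d))
        top = ranked[:k]
        if len(top) == k:
            min_k = worst[top[-1]]
            # best score of any candidate outside the top-k, or of an unseen doc
            threshold = max([best[d] for d in ranked[k:]] + [sum(high)])
            if threshold <= min_k:
                return [(d, worst[d]) for d in top], depth + 1
    return [(d, worst[d]) for d in top], max_depth


t1 = [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)]
t2 = [("d64", 0.8), ("d23", 0.6), ("d10", 0.6), ("d78", 0.1)]
t3 = [("d10", 0.7), ("d78", 0.5), ("d64", 0.4), ("d99", 0.2), ("d34", 0.1)]
result, stop_depth = nra([t1, t2, t3], k=1)
```

On these lists the scan stops at depth 3 with d10 (score 2.1) as the top-1 result: at that point no other document can still reach 2.1.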


Top-k Search

Data items: d1, …, dn
s(t1,d1) = 0.7 … s(tm,d1) = 0.2

Query: q = (t1, t2, t3)
k = 1

Index lists:
t1: d78 0.9, d23 0.8, d10 0.8, d1 0.7, d88 0.2, …
t2: d64 0.8, d23 0.6, d10 0.6, d10 0.2, d78 0.1, …
t3: d10 0.7, d78 0.5, d64 0.4, d99 0.2, d34 0.1, …

Scan depth 1:
Rank  Doc  Worst-score  Best-score
1     d78  0.9          2.4
2     d64  0.8          2.4
3     d10  0.7          2.4

Scan depth 2:
Rank  Doc  Worst-score  Best-score
1     d78  1.4          2.0
2     d23  1.4          1.9
3     d64  0.8          2.1
4     d10  0.7          2.1

Scan depth 3:
Rank  Doc  Worst-score  Best-score
1     d10  2.1          2.1
2     d78  1.4          2.0
3     d23  1.4          1.8
4     d64  1.2          2.0

STOP!


Evolution of a Candidate’s Score

Observation: pruning often overly conservative (deep scans, high memory for priority queue)

Approximate top-k
“What is the probability that d qualifies for the top-k?”

[Figure: bestscore_d, worstscore_d and min-k plotted as score over scan depth, with the point marked where d is dropped from the candidate queue.]


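Safe Thresholding vs. Probabilistic Guarantees

NRA based on invariant

$$\underbrace{\sum_{i \in E(d)} s_i(d)}_{worstscore_d} \;\le\; s(d) \;\le\; \underbrace{\sum_{i \in E(d)} s_i(d) + \sum_{i \notin E(d)} high_i}_{bestscore_d}$$

Relaxed into probabilistic threshold test

$$p(d) := P\Big[ \sum_{i \in E(d)} s_i(d) + \sum_{i \notin E(d)} s_i(d) > \text{min-}k \Big] \le \varepsilon$$

Or equivalently, with $\delta(d) := \text{min-}k - \sum_{i \in E(d)} s_i(d)$:

$$p(d) = P\Big[ \sum_{i \notin E(d)} s_i(d) > \delta(d) \Big] \le \varepsilon$$

[Figure: score axis showing worstscore_d, min-k and bestscore_d; δ(d) is the distance from worstscore_d to min-k.]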
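To make the probabilistic test concrete, the sketch below estimates p(d) by Monte Carlo sampling. The slide does not say how the probability is obtained, so the score model is purely an assumption of this example: each unseen score s_i(d) is drawn independently and uniformly from [0, high_i]. The function names and the sample count are likewise illustrative.

```python
import random
from typing import Sequence


def prob_exceeds(delta: float, highs: Sequence[float],
                 samples: int = 20000, seed: int = 0) -> float:
    """Estimate P[sum of unseen scores > delta], with score i ~ Uniform(0, highs[i])."""
    if delta < 0:
        return 1.0   # worstscore already above min-k
    if delta >= sum(highs):
        return 0.0   # even bestscore cannot exceed min-k (the safe NRA test)
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(samples)
        if sum(rng.uniform(0.0, h) for h in highs) > delta
    )
    return hits / samples


def should_drop(worstscore: float, min_k: float,
                highs: Sequence[float], epsilon: float) -> bool:
    """Probabilistic threshold test: drop d if p(d) <= epsilon."""
    delta = min_k - worstscore
    return prob_exceeds(delta, highs) <= epsilon


# Candidate with worstscore 0.8, min-k 1.4, and two unseen lists with high values 0.8 and 0.5.
p_candidate = prob_exceeds(1.4 - 0.8, [0.8, 0.5])
```

With epsilon = 0 the test degenerates to the safe threshold test; a larger epsilon prunes candidates earlier at the cost of possibly missing a true top-k item.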


Expected Result Quality

Missing relevant items
Probability p_miss of missing a true top-k object equals the probability of erroneously dropping a candidate from the queue
For each candidate p_miss ≤ ε

$$P[recall = r/k] = P[precision = r/k] = \binom{k}{r} \, (1 - p_{miss})^{r} \, p_{miss}^{(k-r)}$$

$$E[precision] = E[recall] = \sum_{r=0..k} P[precision = r/k] \cdot r/k = (1 - \varepsilon)$$
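The closed form can be checked numerically. The sketch below evaluates the binomial distribution from this slide for the worst case p_miss = ε and sums up the expectation; k = 10 and ε = 0.1 are arbitrary example values.

```python
from math import comb


def precision_distribution(k: int, p_miss: float) -> list:
    """P[precision = r/k] for r = 0..k: r of the k true top-k items survive."""
    return [comb(k, r) * (1 - p_miss) ** r * p_miss ** (k - r) for r in range(k + 1)]


def expected_precision(k: int, p_miss: float) -> float:
    """Sum over r of P[precision = r/k] * r/k."""
    return sum(p * r / k for r, p in enumerate(precision_distribution(k, p_miss)))


k, epsilon = 10, 0.1
dist = precision_distribution(k, epsilon)
expected = expected_precision(k, epsilon)  # equals 1 - epsilon up to rounding
```

Since each candidate is missed with probability at most ε, the value 1 - ε is a lower bound on the expected precision in general.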


Going distributed
Key Observations:

Network traffic is crucial
Number of round trips is crucial

Straightforward application of TA/NRA?
expensive: huge number of round trips
even with batching: unpredictable performance


Where is the data?

Consider
network consumption
per peer load
latency (query response time)

network
I/O
processing

[Figure: the index lists for t1, t2 and t3 from the top-k search example are stored on different peers of a network of peers P0 … P5.]


Three Phase Uniform Threshold Algorithm (TPUT)
[Cao and Wang, PODC 2004]

Exactly 3 phases:
1. fetch k best entries (d, sj) from each of P1 ... Pm and aggregate (∑j=1..m sj(d)) at query initiator
2. ask each of P1 ... Pm for all entries with sj > min-k / m and aggregate results at query initiator. min-k is score of item currently at rank k.
3. fetch missing scores for all candidates by random lookups at P1 ... Pm

First distributed top-k algorithm with fixed number of phases!
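The three phases can be simulated on in-memory lists as follows. This is a sketch, not the authors' implementation: the peers are plain Python lists, summation is the aggregation function, candidates are pruned by their upper bound (partial sum plus the threshold for every list where the document is still unseen), and phase 2 fetches entries with score ≥ min-k / m, which is slightly more conservative than the strict inequality on the slide. The example lists reuse the head entries of the top-k search example, with shortened tails.

```python
from typing import Dict, List, Tuple

IndexList = List[Tuple[str, float]]  # (doc id, score), sorted by score descending


def kth_largest(values: List[float], k: int) -> float:
    ordered = sorted(values, reverse=True)
    return ordered[k - 1] if len(ordered) >= k else 0.0


def tput(lists: List[IndexList], k: int) -> List[Tuple[str, float]]:
    m = len(lists)
    seen: Dict[str, Dict[int, float]] = {}  # doc -> {peer index: score}

    # Phase 1: every peer sends its k best entries; the initiator aggregates.
    for j, lst in enumerate(lists):
        for doc, score in lst[:k]:
            seen.setdefault(doc, {})[j] = score
    min_k = kth_largest([sum(s.values()) for s in seen.values()], k)
    threshold = min_k / m

    # Phase 2: every peer sends all entries scoring at least min-k / m.
    for j, lst in enumerate(lists):
        for doc, score in lst:
            if score < threshold:
                break  # lists are sorted, nothing further qualifies
            seen.setdefault(doc, {})[j] = score
    min_k = kth_largest([sum(s.values()) for s in seen.values()], k)
    # Any score still unseen at a peer is below the threshold, giving an upper bound.
    candidates = [d for d, s in seen.items()
                  if sum(s.values()) + threshold * (m - len(s)) >= min_k]

    # Phase 3: random lookups for the candidates (simulated with dictionaries;
    # for simplicity all scores are re-read, not only the missing ones).
    lookup = [dict(lst) for lst in lists]
    totals = {d: sum(lookup[j].get(d, 0.0) for j in range(m)) for d in candidates}
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))[:k]


example_lists = [
    [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)],
    [("d64", 0.8), ("d23", 0.6), ("d10", 0.6), ("d78", 0.1)],
    [("d10", 0.7), ("d78", 0.5), ("d64", 0.4), ("d99", 0.2), ("d34", 0.1)],
]
top2 = tput(example_lists, 2)
```

For k = 2 the result is d10 (2.1) followed by d78 (1.5). After phase 2, d78 and d23 are tied at a partial sum of 1.4; only the lookup in phase 3 reveals d78's missing score of 0.1 and settles rank 2.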


[Figure: TPUT message flow. Cohort peers Pi and Pj each hold an index list sorted by score and send their top-k entries to the coordinator peer P0, which maintains the current top-k candidate set. The coordinator sends the threshold min-k / m to the cohorts, receives their candidates, and finally retrieves the missing scores.]


Analysis of TPUT
Theorem: TPUT is an exact algorithm, i.e. identifies the true top-k items

Proof (sketch): TPUT cannot miss a true top-k item.
Assume it misses one, i.e. item is below min-k / m in all lists.
⇒ overall score < m · (min-k / m) = min-k
⇒ not a true top-k item!

[Figure: state after phase 2 for list 1, list 2 and list 3: an item below min-k / m in every list has score < min-k.]


Analysis of TPUT

If min-k / m is small, TPUT retrieves a lot of data in phase 2 ⇒ high network traffic
Random accesses ⇒ high per-peer load

KLEE [VLDB ’05]
Different philosophy: approximate answers
Efficiency:
Reduces (docId, score)-pair transfers
No random accesses at each peer

Two pillars:
The HistogramBlooms structure
The Candidate List Filter structure


Additional Data Structures

Equi-width histogram
+ Bloom filter for each cell
+ average score per cell
+ upper/lower score

[Figure: histogram of #docs over score (score axis from 0 to 1); each cell is annotated with its Bloom filter bit vector.]

Usage:
During Phase 1:
+ fetch top-k from each list
+ top-c cells

“increase” the min-k / m threshold
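A minimal sketch of such a histogram-plus-Bloom-filter summary is given below. All concrete choices are assumptions made for illustration and are not taken from KLEE itself: five cells over the score range [0, 1], 64-bit Bloom filters with three SHA-256-derived hash positions, and an estimate that returns the average score of the highest cell whose filter reports the document.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Tuple


class BloomFilter:
    def __init__(self, num_bits: int = 64, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit vector stored as an integer

    def _positions(self, key: str):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # No false negatives; false positives are possible.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


@dataclass
class Cell:
    lower: float
    upper: float
    bloom: BloomFilter = field(default_factory=BloomFilter)
    total: float = 0.0
    count: int = 0

    @property
    def average(self) -> float:
        return self.total / self.count if self.count else 0.0


def cell_index(score: float, num_cells: int, max_score: float = 1.0) -> int:
    return min(int(score / max_score * num_cells), num_cells - 1)


def build_histogram(index_list: List[Tuple[str, float]],
                    num_cells: int = 5, max_score: float = 1.0) -> List[Cell]:
    width = max_score / num_cells
    cells = [Cell(i * width, (i + 1) * width) for i in range(num_cells)]
    for doc, score in index_list:
        cell = cells[cell_index(score, num_cells, max_score)]
        cell.bloom.add(doc)
        cell.total += score
        cell.count += 1
    return cells


def estimate_score(cells: List[Cell], doc: str) -> float:
    """Average score of the highest non-empty cell whose Bloom filter reports doc."""
    for cell in reversed(cells):
        if cell.count and cell.bloom.might_contain(doc):
            return cell.average
    return 0.0


t1 = [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)]
cells = build_histogram(t1)
estimate = estimate_score(cells, "d10")
```

A coordinator that receives the top-c cells of every list could presumably use such estimates to fill in the unknown scores of its phase-1 candidates. That raises its estimate of min-k and therefore the min-k / m threshold, so fewer entries qualify in the next phase. A Bloom filter false positive can only overestimate a score, since cells are probed from the highest score range downwards.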


KLEE

[Figure: KLEE message flow. Cohort peers Pi and Pj each hold an index list sorted by score. Together with their top-k entries they send the coordinator peer P0 a histogram of c cells, each cell carrying a Bloom filter of b bits. The coordinator maintains the current top-k candidate set and sends the threshold min-k / m to the cohorts, which determine their candidates.]


KLEE – Candidate Set Reduction

[Figure: cohort peers Pi and Pj each summarize the candidates in their index lists (entries down to min-k / m) as a bit vector and send it to the coordinator peer P0, which holds the current top-k candidate set. The coordinator collects these vectors in the candidate filter matrix and derives a single filter bit vector from it, which is passed on to the cohorts.]


KLEE – Candidate Retrieval

[Figure: same setting as for the candidate set reduction. Each cohort peer (Pi, Pj) uses the filter bit vector derived from the candidate filter matrix to return its candidates to the coordinator peer P0; the scan of the index list ends at an early stopping point.]


Literature

Ronald Fagin: Combining Fuzzy Information from Multiple Systems. J. Comput. Syst. Sci. 58(1): 83-99 (1999)
Ronald Fagin, Amnon Lotem, Moni Naor: Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci. 66(4): 614-656 (2003)
Surya Nepal, M. V. Ramakrishna: Query Processing Issues in Image (Multimedia) Databases. ICDE 1999: 22-29
Ulrich Güntzer, Wolf-Tilo Balke, Werner Kießling: Towards Efficient Multi-Feature Queries in Heterogeneous Environments. ITCC 2001: 622-628
Martin Theobald, Gerhard Weikum, Ralf Schenkel: Top-k Query Evaluation with Probabilistic Guarantees. VLDB 2004: 648-659
Holger Bast, Debapriyo Majumdar, Ralf Schenkel, Martin Theobald, Gerhard Weikum: IO-Top-k: Index-access Optimized Top-k Query Processing. VLDB 2006: 475-486
Amélie Marian, Nicolas Bruno, Luis Gravano: Evaluating top-k queries over web-accessible databases. ACM Trans. Database Syst. 29(2): 319-362 (2004)
Pei Cao, Zhe Wang: Efficient top-K query calculation in distributed networks. PODC 2004: 206-215
Sebastian Michel, Peter Triantafillou, Gerhard Weikum: KLEE: A Framework for Distributed Top-k Query Algorithms. VLDB 2005: 637-648