
Range and kNN Searching in P2P


Page 1: Range and kNN  Searching in P2P

Range and kNN Searching in P2P

Manesh Subhash

Ni Yuan

Sun Chong

Page 2: Range and kNN  Searching in P2P

Outline

Range query searching in P2P
  one-dimensional range query
  multi-dimensional range query
  comparison of range query searching in P2P

kNN searching in P2P
  scalable nearest neighbor searching
  PIERSearch

Conclusion

Page 3: Range and kNN  Searching in P2P

Motivation

Most P2P systems support only simple lookup queries

DHT-based approaches such as Chord and CAN are not suitable for range queries

More complicated queries, such as range queries and kNN searches, are needed

Page 4: Range and kNN  Searching in P2P

P-Tree [APJ+04]

The B+-tree is widely used for efficiently evaluating range queries in centralized databases

A distributed B+-tree is not directly applicable in a P2P environment

Two alternatives: fully independent B+-trees, and semi-independent B+-trees, i.e. the P-tree

Page 5: Range and kNN  Searching in P2P

Fully independent B+-tree

[Figure: eight peers, P1 to P8, holding the values 4, 8, 12, 20, 24, 25, 26 and 35; each peer keeps its own B+-tree over all eight values, shown for the trees rooted at 4, 24 and 26.]

Page 6: Range and kNN  Searching in P2P

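Semi-independent B+-tree

[Figure: the same eight peers; each peer stores only two tree nodes, listed below.]

Peer (value)   Upper-level node   Lower-level node
P1 (4)         4 24 26            4 8 12 20
P2 (8)         8 24 35            8 12 20
P3 (12)        12 25 35           12 20 24
P4 (20)        20 25 4            20 24
P5 (24)        24 35 8            24 25 26
P6 (25)        25 8 20            25 26 35 4
P7 (26)        26 8 20            26 35 4
P8 (35)        35 8 24            35 4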

Page 7: Range and kNN  Searching in P2P

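Coverage & Separation

[Figure: tree nodes over the values 4, 8, 12, 20, 24, 25, 26, 35, with one example labelled "anti-coverage" (a gap between adjacent sub-trees) and one labelled "overlap" (adjacent sub-trees covering the same values).]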

Page 8: Range and kNN  Searching in P2P

Properties of P-tree

Each peer stores $O(\log_d N)$ tree nodes

Total storage per peer is $O(d \cdot \log_d N)$

Requires no global coordination among peers

The search cost for a range query that returns m results is $O(m + \log_d N)$

Page 9: Range and kNN  Searching in P2P

Search Algorithm

Example: the range query 21 < value < 29 issued at P1

[Figure: the P-tree of peers P1 to P8 (values 4, 8, 12, 20, 24, 25, 26, 35) with its two levels marked l0 and l1, showing the path the query follows from P1.]

Page 10: Range and kNN  Searching in P2P

Multi-dimension range query

Routing in a one-dimensional routing space
  ZNet: Z-ordering + skip graph [STZ04]
  Hilbert space-filling curve + Chord [SP03]
  SCRAP [GYG04]

Routing in a multi-dimensional routing space
  MURK [GYG04]

Page 11: Range and kNN  Searching in P2P

Desiderata

Locality: data elements that are nearby in the data space should be stored on the same node or on close nodes

Load balance: the amount of data stored by each node should be roughly the same

Efficient routing: the number of messages exchanged between nodes to route a query should be small

Page 12: Range and kNN  Searching in P2P

Hilbert SFC + Chord

A space-filling curve (SFC) maps a d-dimensional cube onto a line

The line passes once through each point in the volume of the cube

[Figure: first- and second-order Hilbert curves on a 2-D grid; cells are labelled with 2-bit (00, 01, 10, 11) and 4-bit (0000 to 1111) curve indices.]
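The mapping from a grid cell to its position on the curve can be sketched as follows. This is the widely used iterative conversion for a 2-D Hilbert curve, not code from the presentation; the function name `hilbert_index` and the curve orientation are assumptions, so the cell labels may be mirrored or rotated relative to the slide's figure.

```python
def hilbert_index(n, x, y):
    """Position along the Hilbert curve of cell (x, y) in an n x n grid.

    n must be a power of two. Hypothetical helper for illustration only.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Second-order curve: 4 x 4 grid, 16 cells, 4-bit indices.
cells = {hilbert_index(4, x, y): (x, y) for x in range(4) for y in range(4)}
for idx in sorted(cells):
    print(format(idx, "04b"), cells[idx])
```

Consecutive indices always fall in neighboring cells, which is the locality property the curve is chosen for.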

Page 13: Range and kNN  Searching in P2P

Hilbert SFC + Chord

Mapping the one-dimensional index space onto the Chord overlay network topology

[Figure: Chord ring with nodes 0, 4, 8, 11 and 14; data elements with keys 5, 6, 7, 8]
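A minimal sketch of how keys are placed on the ring, assuming the usual Chord rule that a key is stored at its successor (the first node whose identifier is greater than or equal to the key, wrapping around). The 4-bit identifier space (`ring_size=16`) and the function name `successor` are assumptions made for illustration.

```python
from bisect import bisect_left


def successor(node_ids, key, ring_size=16):
    """Return the first node id >= key on the ring, wrapping to the smallest id."""
    nodes = sorted(node_ids)
    i = bisect_left(nodes, key % ring_size)
    return nodes[i % len(nodes)]


nodes = [0, 4, 8, 11, 14]
placement = {k: successor(nodes, k) for k in (5, 6, 7, 8)}
print(placement)
```

Under this rule the data elements with keys 5, 6, 7 and 8 all land on node 8, and a key such as 15 wraps around to node 0.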

Page 14: Range and kNN  Searching in P2P

Query Processing

Translate the keyword query into the relevant clusters of the SFC-based index space

Query the appropriate nodes in the overlay network for the data elements

[Figure: on the second-order Hilbert curve (axes labelled 00, 01, 10, 11), the query (1*, 0*) selects the cluster 1100, 1101, 1110, 1111; these keys are then looked up on the Chord ring with nodes 0, 4, 8, 11 and 14.]

Page 15: Range and kNN  Searching in P2P

Query Optimization

Query: (010, *)

[Figure: an 8 x 8 grid with both axes labelled 000 to 111. The query (010, *) maps to the clusters (000100), (000111, 001000), (001011), (011000, 011001) and (011101, 011110). The ring shows the node identifiers 000000, 000100, 001001, 011110 and 111000.]

Page 16: Range and kNN  Searching in P2P

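Query Optimization (cont.)

Query: (010, *)

[Figure: tree of successively refined clusters for the query: 0, then 00 and 01, then 0001, 0010, 0110 and 0111, then the leaf clusters 000100, 000111, 001000, 001011, 011000, 011001, 011101 and 011110. The grid axes are labelled 00, 01, 10, 11.]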

Page 17: Range and kNN  Searching in P2P

Query Optimization (cont.)

Pruning nodes from the tree

[Figure: the cluster refinement tree for the query (010, *), from 0 down to the leaf clusters 000100 to 011110, shown next to the ring of nodes 000000, 000100, 001001, 011110 and 111000.]

Page 18: Range and kNN  Searching in P2P

SCRAP [GYG04]

Use a z-order or Hilbert space-filling curve to map multi-dimensional data down to a single dimension

Range-partition the one-dimensional data across the available S nodes

Use a skip graph to route the queries
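The first two steps can be sketched as below, assuming 2-D data, the z-order variant of the mapping, and fixed range boundaries. The names `z_order` and `owner`, the 4 bits per coordinate, and the evenly spaced boundaries are illustrative assumptions; a real deployment would pick boundaries that balance load.

```python
from bisect import bisect_right


def z_order(x, y, bits=4):
    """Interleave the bits of x and y; x supplies the higher bit of each pair."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z


def owner(z, boundaries):
    """Index of the node owning key z; boundaries[j] is the first key of node j + 1."""
    return bisect_right(boundaries, z)


# S = 4 nodes over the 8-bit key space 0..255 (assumed even split).
boundaries = [64, 128, 192]
key = z_order(3, 5)
print(key, owner(key, boundaries))
```

A multi-dimensional range query then becomes a set of one-dimensional key ranges, each of which is routed to the nodes whose ranges it overlaps.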

Page 19: Range and kNN  Searching in P2P

MURK: Multi-dimensional Rectangulation with KD-trees

Basic concept:

Partition the high-dimensional data space into "rectangles", each managed by one node.

Partitioning is based on a KD-tree: the space is split cyclically along the dimensions, and each leaf of the KD-tree corresponds to one rectangle.

Page 20: Range and kNN  Searching in P2P

Partitioning

Each time a node joins, the space is split along one dimension into two parts of equal load, which preserves load balance.

Each node manages the data in one rectangle, which preserves data locality.
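A minimal sketch of one such split, assuming load is measured by the number of stored points and the split dimension is chosen cyclically by tree depth. The function name `split_equal_load` and the tuple-based point representation are hypothetical.

```python
def split_equal_load(points, depth, dims=2):
    """Split a rectangle's points into two halves of (nearly) equal load.

    The split axis cycles with the depth of the KD-tree, and the split value
    is the median along that axis, so each half keeps about half the points.
    """
    axis = depth % dims
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    split_value = ordered[mid][axis]
    return axis, split_value, ordered[:mid], ordered[mid:]


points = [(1, 9), (2, 3), (7, 4), (8, 8), (9, 1), (3, 5)]
axis, value, old_node, new_node = split_equal_load(points, depth=0)
print(axis, value, old_node, new_node)
```

Splitting at the median rather than at the geometric midpoint is what distinguishes this scheme from an equal-space split.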

Page 21: Range and kNN  Searching in P2P

Comparison with CAN

The KD-tree partitioning is similar to that of CAN. Both hash data into a multi-dimensional space and try to keep the load balanced.

The major difference is that in CAN a new node splits the existing node's data space equally, rather than splitting the load equally.

Page 22: Range and kNN  Searching in P2P

Routing in MURK

Each node keeps a link to every node that manages a neighboring rectangle ("grid" links).

Routing is greedy over these grid links; the distance between two nodes is the minimum Manhattan distance.

Page 23: Range and kNN  Searching in P2P

Optimization for the routing

"Grid" links alone are not efficient for routing.

Each node maintains skip pointers to speed up routing.

Two methods to choose the skip pointers:
  Random: choose a node at random from the node set.
  Space-filling skip graph: place the skip pointers at exponentially increasing distances.
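The effect of exponentially spaced skip pointers can be illustrated with a simplified model in which the n nodes are ordered along a space-filling curve and treated as a ring. This is an assumption made for the sketch, not the structure of a real skip graph, and both function names are hypothetical.

```python
def skip_pointers(i, n):
    """Positions reachable from node i at distances 1, 2, 4, ... on a ring of n nodes."""
    out = []
    step = 1
    while step < n:
        out.append((i + step) % n)
        step *= 2
    return out


def greedy_hops(src, dst, n):
    """Hops needed when each step takes the longest pointer that does not overshoot."""
    hops = 0
    cur = src
    while cur != dst:
        gap = (dst - cur) % n
        step = 1
        while step * 2 <= gap:
            step *= 2
        cur = (cur + step) % n
        hops += 1
    return hops


print(skip_pointers(0, 16))
print(greedy_hops(0, 13, 16))
```

With these pointers the remaining gap at least halves on every hop, so the hop count grows logarithmically with the number of nodes instead of linearly.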

Page 24: Range and kNN  Searching in P2P

Discussion

The routing neighbors are non-uniform, as a result of balancing load across nodes.

A dynamic data distribution can leave the data unbalanced across nodes.

Page 25: Range and kNN  Searching in P2P

Performance

Page 26: Range and kNN  Searching in P2P

performance

Page 27: Range and kNN  Searching in P2P

Conclusion

For locality, MURK far outperforms SCRAP. For routing cost, SCRAP is efficient enough, and skip pointers, such as space-filling curve skip pointers, are effective.

SCRAP, using a space-filling curve with range partitioning, is efficient in low dimensions. MURK with a space-filling skip graph performs much better, especially in high dimensions.

Page 28: Range and kNN  Searching in P2P

pSearch

Motivation
  Numerous documents are available on the Internet.
  How can we efficiently find the most closely related documents without returning too many of little interest?

Problem
  Semantically, documents are randomly distributed.
  Exhaustive search brings overhead.
  No deterministic guarantees.

Page 29: Range and kNN  Searching in P2P

P2P & IR techniques

Unstructured P2P search
  A centralized index suffers from the bottleneck problem.
  Flooding-based techniques result in too much overhead.
  Heuristic-based algorithms may miss some important documents.

Structured P2P search
  DHT-based systems such as CAN and Chord are suitable for keyword matching.

Traditional IR techniques
  Advanced IR ranking algorithms could be adopted into P2P search.
  Two IR techniques: the vector space model (VSM) and latent semantic indexing (LSI).

Page 30: Range and kNN  Searching in P2P

pSearch

An IR system built on P2P networks
  Efficient and scalable as a DHT
  Accurate as advanced IR algorithms

Map the semantic space to nodes and conduct nearest neighbor search
  Use VSM and LSI to generate the semantic space
  Use CAN to organize the nodes

Page 31: Range and kNN  Searching in P2P

VSM & LSI

VSM
  Documents and queries are expressed as term vectors.
  Weight of a term: term frequency × inverse document frequency.
  Ranking is based on the similarity of the document and the query, cos(X, Y), where X and Y are two term vectors.

LSI
  Based on singular value decomposition, transforms term vectors from a high-dimensional space into low-dimensional (L) semantic vectors.
  This statistical approach reduces the effect of synonyms and noise in documents.
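The VSM ranking step can be sketched as follows, assuming the common weighting tf × log(N / df); the presentation does not state which TF-IDF variant is used, so this formula and the helper names `tfidf_vectors` and `cosine` are assumptions. The LSI step (singular value decomposition) is not shown.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse {term: weight} vectors."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(d).items()}
        for d in docs
    ]


def cosine(x, y):
    """Cosine similarity of two sparse term vectors."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0


docs = [
    ["p2p", "range", "query"],
    ["p2p", "knn", "search"],
    ["semantic", "search", "p2p"],
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[1], vecs[2]), cosine(vecs[0], vecs[1]))
```

A term that occurs in every document, such as "p2p" here, receives weight zero and therefore does not contribute to the similarity.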

Page 32: Range and kNN  Searching in P2P

pSearch system

[Figure: overview of the pSearch system, showing how a document (DOC) and a query (QUERY) are processed.]

Page 33: Range and kNN  Searching in P2P

Advantage of pSearch

Exhaustive search within a bounded area, which could ideally be fully accurate.

Communication overhead is limited to transferring the query and the references to the top documents, independent of the corpus size.

A good approximation of the global statistics is sufficient for pSearch.

Page 34: Range and kNN  Searching in P2P

Challenges

Dimensionality mismatch between CAN and LSI.

Uneven distribution of indices.

Large search region.

Page 35: Range and kNN  Searching in P2P

Dimensionality mismatch

There are not enough nodes (N) in the CAN to partition all the dimensions (L) of the LSI semantic space.

N nodes in a CAN can partition only about log(N) low dimensions (the effective dimensionality), leaving the others unpartitioned.

Page 36: Range and kNN  Searching in P2P

Rolling index

Motivation
  A small part of the dimensions contributes most of the similarity.
  The low dimensions are of high importance.

Partition more dimensions of the semantic space by rotating the semantic vectors.
  A semantic vector is $V = (v_0, v_1, \ldots, v_l)$.
  Each rotation shifts the vector by m dimensions. Rotated space i holds the vectors after the i-th rotation:
  $V^i = (v_{i \cdot m}, \ldots, v_0, v_1, \ldots, v_{i \cdot m - 1})$, with $m = 2.3 \ln(n)$.

Use the rotated vector to route the query and guide the search.
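The rotation itself is a cyclic shift, sketched below. Rounding m to an integer is an assumption (the slide gives only m = 2.3 ln(n)), and the function names are hypothetical.

```python
import math


def rotation_width(n_nodes):
    """Number of dimensions shifted per rotation, m = 2.3 * ln(n), rounded (assumed)."""
    return max(1, round(2.3 * math.log(n_nodes)))


def rotate(vector, i, m):
    """Return the i-th rotation: (v_{i*m}, ..., v_l, v_0, ..., v_{i*m-1})."""
    k = (i * m) % len(vector)
    return vector[k:] + vector[:k]


v = [0, 1, 2, 3, 4, 5, 6, 7]
print(rotate(v, 1, 3))
print(rotation_width(1000))
```

Each rotated copy brings a different block of m dimensions to the front, where the CAN partitioning is effective, at the cost of storing the index once per rotation.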

Page 37: Range and kNN  Searching in P2P

Rolling index

Uses more storage (p times) to keep the search in a local space.

Selective rotation is expected to handle the important high dimensions efficiently.

Page 38: Range and kNN  Searching in P2P

Balance index distribution

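Content-aware node bootstrapping
  A joining node randomly selects a document it will publish, is routed to the node responsible for that document, and takes over part of that node's load.

Regions with more indices are served by more nodes. Even with random selection, the load stays balanced for a large corpus.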

Page 39: Range and kNN  Searching in P2P

Reducing search space

Curse of dimensionality
  High-dimensional data is sparsely populated.
  In high dimensions, the distance to the nearest neighbor becomes large.

Based on data locality, use the indices stored on nodes and recently processed queries to guide a new search.

Page 40: Range and kNN  Searching in P2P

Content-directed search

[Figure: content-directed search over a grid of nodes numbered 1 to 24, with documents a to g, a query q and a point p.]

Page 41: Range and kNN  Searching in P2P

Performance

Page 42: Range and kNN  Searching in P2P

Conclusion

pSearch is a P2P IR system that organizes contents around semantics and achieves good accuracy with respect to system size, corpus size and the number of returned documents.

The rolling index resolves the dimensionality mismatch and can limit the space overhead and the number of visited nodes.

Content-aware node bootstrapping balances node load and achieves index and query locality.

Content-directed search reduces the number of nodes searched.

Page 43: Range and kNN  Searching in P2P

kNN searching in P2P Networks

Manesh Subhash

Ni Yuan

Sun Chong

Page 44: Range and kNN  Searching in P2P

Outline

Introduction to searching in P2P

Nearest neighbor queries

Presentation of the ideas in the papers
  1. "A Scalable Nearest Neighbor Search in P2P Systems"
  2. "Enhancing P2P File-Sharing with an Internet-Scale Query Processor"

Page 45: Range and kNN  Searching in P2P

Introduction to searching in P2P

Exact-match queries
  Single key retrieval
  Linear hash
  CAN, Chord, Pastry, Tapestry

Similarity-based queries
  Metric space based

What do we search for?
  Rare items, popular items, or both.

Page 46: Range and kNN  Searching in P2P

Nearest neighbor queries

The notion of a metric space
  How similar are two objects, given a set of objects
  Extensible for exact, range and nearest neighbor queries
  Computationally expensive
  The distance function satisfies positiveness, reflexivity, symmetry and the triangle inequality

Page 47: Range and kNN  Searching in P2P

Nearest neighbor queries (Cont)

A metric space is a pair (D, d)
  D: the domain of objects
  d: the distance function

Similarity queries

Range
  For $F \subseteq D$, a range query retrieves all objects in F which have a distance < ρ to the query object $q \in F$.

Nearest neighbor
  Returns the object closest to q, or the k nearest objects for kNN: a set $K \subseteq F$ such that
  $|K| = k \;\wedge\; \forall x \in K,\ y \in F \setminus K : d(q, x) \le d(q, y)$
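Both query types can be written directly from these definitions as a brute-force scan. This is a centralized sketch for reference only, not the distributed algorithm; the function names and the 1-D example with absolute difference as the distance are assumptions.

```python
import heapq


def range_query(F, q, rho, d):
    """All objects of F whose distance to q is strictly below rho."""
    return [x for x in F if d(q, x) < rho]


def knn_query(F, q, k, d):
    """A set K of k objects such that no object outside K is closer to q."""
    return heapq.nsmallest(k, F, key=lambda x: d(q, x))


def abs_diff(a, b):
    """Example metric on numbers: absolute difference."""
    return abs(a - b)


F = [4, 8, 12, 20, 24, 25, 26, 35]
print(range_query(F, 23, 3, abs_diff))
print(knn_query(F, 23, 3, abs_diff))
```

Every call evaluates the distance function once per object, which is the cost that metric index structures try to avoid.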

Page 48: Range and kNN  Searching in P2P

Scalable NN search

Uses the GHT* structure
  Distributed metric index
  Supports range and k-NN queries

The GHT* architecture is composed of nodes: peers that can insert, store and retrieve objects using similarity queries.

Assumptions: message passing, unique network identifiers, local buckets to store data and, lastly, only one bucket per object.

Page 49: Range and kNN  Searching in P2P

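Example of the GHT* Network

[Figure: two peers, Peer 1 and Peer 2, each holding a tree of inner nodes and buckets; leaf pointers carry either a network node ID (NNID) or a bucket ID (BID), and further links lead to other peers.]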

Page 50: Range and kNN  Searching in P2P

Scalable NN search (3)

Address Search Tree (AST)
  Is a binary search tree
  Inner nodes hold routing information: two pivots and pointers to the left and right sub-trees
  Leaf nodes are pointers to data
    Local data is stored in buckets and is accessed using the BID
    Non-local data is identified using the NNID
  (Every AST leaf node is one of these two kinds of pointer)

Page 51: Range and kNN  Searching in P2P

Scalable NN search (4)

Searching the AST: the BPATH

A BPATH represents a path in the tree as a string of n binary elements {0,1}: p = (b1, b2, …, bn)

Use the traversing operator Ψ with a radius ρ for a query q. Ψ returns a BPATH.

Ψ examines every inner node using the two pivot values and decides which sub-tree to follow.

A radius of zero is used for exact matches and during inserts.
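A sketch of such a traversal, assuming the standard generalized-hyperplane rule: follow the left sub-tree when d(q, p1) − ρ ≤ d(q, p2) + ρ and the right sub-tree when d(q, p1) + ρ > d(q, p2) − ρ. The tuple-based tree encoding, the function name `traverse` and the bucket labels are hypothetical; in GHT* a leaf would hold a BID or an NNID.

```python
def traverse(node, q, rho, d, path=()):
    """Return (BPATH, leaf) pairs for every leaf a query ball (q, rho) may reach.

    An inner node is a tuple (pivot1, pivot2, left, right); anything else is a leaf.
    """
    if not isinstance(node, tuple):
        return [(path, node)]
    p1, p2, left, right = node
    d1, d2 = d(q, p1), d(q, p2)
    found = []
    if d1 - rho <= d2 + rho:  # the ball may contain objects closer to pivot 1
        found += traverse(left, q, rho, d, path + (0,))
    if d1 + rho > d2 - rho:  # the ball may contain objects closer to pivot 2
        found += traverse(right, q, rho, d, path + (1,))
    return found


def abs_diff(a, b):
    """Example metric on numbers: absolute difference."""
    return abs(a - b)


tree = (10, 30, (5, 15, "B0", "B1"), "B2")
print(traverse(tree, 12, 0, abs_diff))
print(traverse(tree, 19, 2, abs_diff))
```

With ρ = 0 exactly one branch is taken at each inner node, which is why a zero radius suits exact matches and inserts; a positive radius can return several BPATHs.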

Page 52: Range and kNN  Searching in P2P

Scalable NN search (5)

k-NN searching in GHT*

Range searching is not suitable without intrinsic knowledge of the data and the metric space used.

Begin the search at the bucket with a high probability of containing the k objects.

If k objects are found, use the k-th object to define a similarity search whose radius is the distance of the k-th object from q. Sort the result and pick the first k.

If fewer than k objects are found, the upper bound on the search for the k-th neighbor cannot be determined.

Variation on the range radius

Page 53: Range and kNN  Searching in P2P

Scalable NN search (6)

Finding the k objects using range searches

Optimistic
  Minimizes distance computation costs and bucket accesses.
  Uses as the bounding distance that of the last candidate available at the first accessed bucket.
  Iteratively expands the radius if fewer than k objects are found.

Pessimistic
  The probability of a next iteration is minimized.
  Uses the distance between the pivot values at a level of the AST as the range radius, starting from the parent of the leaf, and executes the range query.
  If fewer than k objects are found, moves up to the next level.
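The optimistic strategy can be sketched in a centralized form as below. Buckets are plain lists, the range search scans all of them, and the doubling factor `grow` is an assumption; the slide says only that the radius is expanded iteratively. All names are hypothetical.

```python
def range_search(buckets, q, rho, d):
    """All objects within distance rho of q, scanning every bucket (sketch only)."""
    return [x for b in buckets for x in b if d(q, x) <= rho]


def knn_optimistic(buckets, first, q, k, d, grow=2.0):
    """k nearest neighbors of q, starting from the bucket with index `first`."""
    k = min(k, sum(len(b) for b in buckets))
    if k == 0:
        return []
    local = sorted(d(q, x) for x in buckets[first])
    # Bound: distance of the k-th local candidate, or of the last one available.
    rho = local[min(k, len(local)) - 1] if local else 1.0
    while True:
        hits = range_search(buckets, q, rho, d)
        if len(hits) >= k:
            return sorted(hits, key=lambda x: d(q, x))[:k]
        rho = rho * grow if rho > 0 else 1.0  # assumed expansion rule


def abs_diff(a, b):
    """Example metric on numbers: absolute difference."""
    return abs(a - b)


buckets = [[4, 8], [12, 20], [24, 25, 26, 35]]
print(knn_optimistic(buckets, 2, 23, 3, abs_diff))
```

When the first bucket already holds k candidates, a single range query suffices; otherwise each expansion costs another round of bucket accesses, which is the trade-off against the pessimistic variant.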

Page 54: Range and kNN  Searching in P2P

Scalable NN search (7)

Performance evaluation

With increasing k
  The number of parallel distance computations remains stable
  The number of bucket accesses and the number of messages increase rapidly

Effect of a growing dataset
  The maximum hop count increases slowly
  Nearly constant parallel distance computation costs

Comparison with range search
  Slightly slower because of the overhead of locating the first bucket

Page 55: Range and kNN  Searching in P2P

Scalable NN search (8)

Performance of the scheme on the TXT dataset.

Page 56: Range and kNN  Searching in P2P

Scalable NN search (9)

Conclusion

First effort in distributed index structures supporting k-NN searching.

GHT* is a scalable solution.

Scope for future work includes handling updates of the dataset and other metric space partitioning schemes.

Page 57: Range and kNN  Searching in P2P

Enhanced P2P - PIERSearch (1)

Internet-scale query processor

Queried data has a Zipfian distribution
  Popular data in the head
  Long tail of rare items

PIERSearch is DHT based

It is a hybrid system: it uses Gnutella for popular items and PIERSearch for rare items

Integrated with the PIER system

Page 58: Range and kNN  Searching in P2P

PIERSearch (2)

Gnutella query processing
  Flooding based
  Simple for popular files
  Optimized using
    ultrapeers: nodes that perform the query processing on behalf of the leaf nodes
    dynamic querying: larger TTL

The team studied the characteristics of the Gnutella network.

Page 59: Range and kNN  Searching in P2P

PIERSearch (3)

Effectiveness of Gnutella

Query recall: percentage of the results available in the network that are returned

Query distinct recall: percentage of distinct results, which nullifies the effect of replicas

Experiments show that Gnutella is efficient for highly replicated content and for queries with large result sets.

It was found ineffective for rare content.

Increasing the TTL does not reduce latency but can improve recall.
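The two recall measures can be made concrete with a small sketch, assuming each result is a (file, host) pair so that replicas of a file differ only in the host. The data and function names are invented for illustration.

```python
def query_recall(returned, available):
    """Fraction of all available result replicas that the query returned."""
    avail = set(available)
    return len(set(returned) & avail) / len(avail) if avail else 0.0


def distinct_recall(returned, available):
    """Same ratio after collapsing replicas of the same file into one result."""
    avail_files = {file_id for file_id, _host in available}
    got_files = {file_id for file_id, _host in returned} & avail_files
    return len(got_files) / len(avail_files) if avail_files else 0.0


# File "a" has three replicas, file "b" is rare (one copy).
available = [("a", "h1"), ("a", "h2"), ("a", "h3"), ("b", "h4")]
returned = [("a", "h1"), ("a", "h2"), ("a", "h3")]
print(query_recall(returned, available), distinct_recall(returned, available))
```

In this example the query finds every replica of the popular file but misses the rare one, so plain recall looks high while distinct recall exposes the missing item.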

Page 60: Range and kNN  Searching in P2P

PIERSearch (4)

Searching using PIERSearch
  Keyword based.
  The publisher maintains an inverted file, indexed using the DHT.
  Generates two kinds of tuple for each item:
    Item(fileId, filename, filesize, ipAddress, port)
    Inverted(keyword, fileId)

Uses the underlying PIER system, a DHT-based Internet-scale relational query processor.
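How the two tuple types support keyword search can be sketched locally. Here ordinary dictionaries stand in for the DHT, and the filename tokenizer is an assumption; nothing below is PIER's actual API, and the names `publish` and `search` are hypothetical.

```python
import re
from collections import defaultdict, namedtuple

Item = namedtuple("Item", "fileId filename filesize ipAddress port")


def publish(items):
    """Build the Item table and the Inverted(keyword, fileId) index."""
    item_table = {it.fileId: it for it in items}
    inverted = defaultdict(set)
    for it in items:
        for kw in re.findall(r"[a-z0-9]+", it.filename.lower()):
            inverted[kw].add(it.fileId)
    return item_table, inverted


def search(item_table, inverted, keywords):
    """Intersect the posting sets of all keywords, then fetch the Item tuples."""
    if not keywords:
        return []
    ids = set.intersection(*(inverted.get(k.lower(), set()) for k in keywords))
    return [item_table[i] for i in sorted(ids)]


items = [
    Item("f1", "Rare_Live_Demo.mp3", 4096, "10.0.0.1", 6346),
    Item("f2", "popular_song.mp3", 8192, "10.0.0.2", 6346),
]
item_table, inverted = publish(items)
print(search(item_table, inverted, ["rare", "demo"]))
```

In the distributed setting each keyword's posting set would live on the DHT node responsible for that keyword, and the intersection becomes a join across those nodes.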

Page 61: Range and kNN  Searching in P2P

PIERSearch (5)

Hybrid system

Identification of rare items
  Query result size: a result set smaller than a fixed threshold is considered rare.
  Term frequency: items with at least one term below a threshold are considered rare.
  Term pair frequency: less prone to skew if filenames contain popular words.
  Sampling: samples neighboring nodes and computes a lower-bound estimate of the number of replicas.

Page 62: Range and kNN  Searching in P2P

PIERSearch (6)

Performance summary

Page 63: Range and kNN  Searching in P2P

PIERSearch (7)

Conclusion

We have found that Gnutella is highly effective for querying popular content, but ineffective for rare items.

We have found that building a partial index over the least replicated content can improve query recall.

Page 64: Range and kNN  Searching in P2P

References

[APJ+04] A. Crainiceanu, P. Linga, J. Gehrke and J. Shanmugasundaram. Querying Peer-to-Peer Networks Using P-Trees. In WebDB, 2004.

[GYG04] P. Ganesan, B. Yang and H. Garcia-Molina. One Torus to Rule them all: Multi-dimensional Queries in P2P Systems. In WebDB, 2004.

[SP03] C. Schmidt and M. Parashar. Flexible Information Discovery in Decentralized Distributed Systems. In HPDC, 2003.

[STZ04] Y. Shu, K-L. Tan and A. Zhou. Adapting the Content Native Space for Load Balanced Indexing. In Database, Information Systems and Peer-to-Peer Computing, 2004.

Page 65: Range and kNN  Searching in P2P

References (cont.)

[LHH+04] B. Loo, J. Hellerstein, R. Huebsch, S. Shenker and I. Stoica. Enhancing P2P File-Sharing with an Internet-Scale Query Processor. In VLDB, 2004.

[TXD03] C. Tang, Z. Xu and S. Dwarkadas. Peer-to-Peer Information Retrieval Using Self-Organizing Semantic Overlay Networks. In SIGCOMM, 2003.

[ZBG04] P. Zezula, M. Batko and C. Gennaro. A Scalable Nearest Neighbor Search in P2P Systems. In Database, Information Systems and Peer-to-Peer Computing, 2004.

Page 66: Range and kNN  Searching in P2P

Thank you!