Range and kNN Searching in P2P

Manesh Subhash

Ni Yuan

Sun Chong

Outline

Range query searching in P2P
  one-dimension range query
  multi-dimension range query
  comparison of range query searching in P2P

kNN searching in P2P
  scalable nearest neighbor searching
  PierSearch

Conclusion

Motivation

Most P2P systems support only simple lookup queries

DHT-based approaches such as Chord and CAN are not suitable for range queries

More complicated queries such as range queries and kNN searching are needed

P-Tree [APJ+04]

The B+-tree is widely used for efficiently evaluating range queries in centralized databases

A distributed B+-tree is not directly applicable in a P2P environment

fully independent B+-tree

semi-independent B+-tree, i.e. P-tree

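Fully independent B+-tree

[Figure: peers P1 to P8 hold the values 4, 8, 12, 20, 24, 25, 26 and 35. Each peer maintains its own complete B+-tree over all values, ordered starting from its own value. For example, P1 has root 4 24 26 over leaves 4 8 12 20 24 25 26 35, P7 has root 26 8 20 over leaves 26 35 4 8 12 20 24 25, and P5 has root 24 35 8 over leaves 24 25 26 35 4 8 12 20.]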

Semi-independent B+-tree

Peer  Value  Level 1   Level 0
P1    4      4 24 26   4 8 12 20
P2    8      8 24 35   8 12 20
P3    12     12 25 35  12 20 24
P4    20     20 25 4   20 24
P5    24     24 35 8   24 25 26
P6    25     25 8 20   25 26 35 4
P7    26     26 8 20   26 35 4
P8    35     35 8 24   35 4

Coverage & Separation

[Figure: the values 4, 8, 12, 20, 24, 25, 26 and 35 with example tree nodes built over them, annotated with two violations labeled "anti-coverage" and "overlap".]

Properties of P-tree

Each peer stores O(log_d N) tree nodes

Total storage per peer is O(d * log_d N)

Requires no global coordination among all peers

The search cost for a range query that returns m results is O(m + log_d N)

Search Algorithm

Query issued at P1: 21 < value < 29

[Figure: the eight-peer P-tree (P1 to P8 holding 4, 8, 12, 20, 24, 25, 26 and 35) with levels l0 and l1 marked, showing the path of the query.]
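The search can be illustrated with a small sketch. It is a simplification, not the algorithm from [APJ+04]: each peer greedily forwards the query to the farthest peer it knows that does not pass the lower bound, and the results are then collected by following successor pointers. The node contents are copied from the slide's figure; `RING`, `clockwise` and `range_search` are hypothetical names, and the ring size of 64 is an assumption.

```python
RING = 64  # assumed size of the value domain (larger than any stored value)

# value held by a peer -> [level-0 entries, level-1 entries], as on the slide
PTREE = {
    4:  [[4, 8, 12, 20],   [4, 24, 26]],
    8:  [[8, 12, 20],      [8, 24, 35]],
    12: [[12, 20, 24],     [12, 25, 35]],
    20: [[20, 24],         [20, 25, 4]],
    24: [[24, 25, 26],     [24, 35, 8]],
    25: [[25, 26, 35, 4],  [25, 8, 20]],
    26: [[26, 35, 4],      [26, 8, 20]],
    35: [[35, 4],          [35, 8, 24]],
}

def clockwise(a, b):
    """Distance from a to b when walking the ring in increasing order."""
    return (b - a) % RING

def range_search(start, lo, hi):
    """Return (results, route) for the query lo < value < hi issued at peer `start`."""
    cur, route = start, [start]
    # Phase 1: jump as far as possible without passing the lower bound.
    while True:
        budget = clockwise(cur, lo)
        known = [p for level in PTREE[cur] for p in level
                 if 0 < clockwise(cur, p) <= budget]
        if not known:
            break
        cur = max(known, key=lambda p: clockwise(cur, p))
        route.append(cur)
    # Phase 2: follow successor pointers (second entry of level 0) through the range.
    results = []
    nxt = PTREE[cur][0][1]
    while lo < nxt < hi and nxt not in results:
        results.append(nxt)
        nxt = PTREE[nxt][0][1]
    return results, route

results, route = range_search(4, 21, 29)
```

For the slide's query, the routing phase visits the peers holding 4 and 20, and the scan returns 24, 25 and 26.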

Multi-dimension range query

Routing in one-dimensional routing space
  ZNet: Z-ordering + Skip graph [STZ04]
  Hilbert space filling curve + Chord [SP03]
  SCRAP [GYG04]

Routing in multi-dimensional routing space
  MURK [GYG04]

Desiderata
  Locality: data elements that are nearby in the data space should be stored on the same node or on nearby nodes
  Load balance: the amount of data stored by each node should be roughly the same
  Efficient routing: the number of messages exchanged between nodes for routing a query should be small

Hilbert SFC + Chord

SFC: maps a d-dimensional cube onto a line; the line passes once through each point in the volume of the cube

[Figure: first-order and second-order Hilbert curves on a 2-D grid, with cells labeled 00 to 11 and 0000 to 1111.]
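As an illustration of the mapping, here is a minimal sketch of the widely used iterative conversion from 2-D grid coordinates to a position on the Hilbert curve. The function name `hilbert_index` and the curve orientation are assumptions; [SP03] may orient the curve differently.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve of an n x n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)      # which quadrant, in curve order
        if ry == 0:                       # rotate/flip so the sub-curve lines up
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Cells of an 8 x 8 grid, listed in the order in which the curve visits them.
cells = sorted(((x, y) for x in range(8) for y in range(8)),
               key=lambda c: hilbert_index(8, *c))
```

Consecutive cells in `cells` are always grid neighbors, which is the locality property that makes the curve useful for range queries.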

Hilbert SFC + Chord

Map the 1-dimensional index space onto the Chord overlay network topology

[Figure: Chord ring with nodes 0, 4, 8, 11 and 14, annotated "data elements with keys 5, 6, 7, 8".]
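In Chord, a key is stored at its successor, the first node whose identifier is equal to or follows the key on the ring. A minimal sketch using the node identifiers from the slide; the function name `successor` and the 4-bit identifier space (`RING_SIZE = 16`) are assumptions.

```python
import bisect

NODES = [0, 4, 8, 11, 14]   # node identifiers from the slide
RING_SIZE = 16              # assumption: 4-bit identifier space

def successor(key, nodes=NODES, ring_size=RING_SIZE):
    """Node responsible for `key`: first identifier >= key, wrapping around the ring."""
    nodes = sorted(nodes)
    i = bisect.bisect_left(nodes, key % ring_size)
    return nodes[i % len(nodes)]

owners = {k: successor(k) for k in (5, 6, 7, 8)}
```

Under these assumptions the keys 5, 6, 7 and 8 all map to node 8, so a contiguous segment of the curve ends up on a single node.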

Query Processing

translate the keyword query to relevant clusters of the SFC-based index space

query the appropriate nodes in the overlay network for data-elements

[Figure: the query (1*, 0*) on a 4x4 grid with both axes labeled 00 to 11 and cells indexed 0000 to 1111 along the Hilbert curve. The matching cluster 1100, 1101, 1110, 1111 is looked up on the Chord ring with nodes 0, 4, 8, 11 and 14.]

Query Optimization

Example query: (010, *)

[Figure: 8x8 grid with both axes labeled 000 to 111 and cells indexed along the Hilbert curve. The query (010, *) selects the clusters (000100), (000111, 001000), (001011), (011000, 011001) and (011101, 011110).]
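The translation of a query into clusters can be sketched by brute force: enumerate the matching cells, compute their curve positions, and merge consecutive positions into runs. This is only an illustration (the optimization on these slides avoids enumerating every cell). `hilbert_index` and `clusters` are hypothetical names, and the curve orientation is the one of the common iterative algorithm, which is an assumption.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve of an n x n grid (n a power of two)."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def clusters(n, xs, ys):
    """Group the curve positions of all cells in xs x ys into runs of consecutive values."""
    runs = []
    for d in sorted(hilbert_index(n, x, y) for x in xs for y in ys):
        if runs and d == runs[-1][-1] + 1:
            runs[-1].append(d)
        else:
            runs.append([d])
    return runs

# Query (010, *): first coordinate fixed to 2, second coordinate unconstrained.
query_010_any = clusters(8, [0b010], range(8))
as_bits = [[format(d, "06b") for d in run] for run in query_010_any]
# Query (1*, 0*) on the 4 x 4 grid.
query_1x_0x = clusters(4, [2, 3], [0, 1])
```

With this orientation, `as_bits` contains the five clusters listed on the slide, and the query (1*, 0*) yields the single cluster 1100 to 1111.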

Query Optimization (cont.)

Query: (010, *)

[Figure: the query is refined recursively as a tree of clusters. The root 0 splits into 00 and 01, these into 0001, 0010, 0110 and 0111, and these into the leaf clusters 000100, 000111, 001000, 001011, 011000, 011001, 011101 and 011110.]

Query Optimization (cont.)

Pruning nodes from the tree

SCRAP [GYG04]

Use z-order or a Hilbert space filling curve to map multi-dimensional data down to a single dimension

Range-partition the one-dimensional data across the available S nodes

Use a Skip graph to route the queries
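The first two steps can be sketched as follows, using z-order for the mapping. All names (`z_order`, `owner`, `nodes_for_range`) and the partition boundaries are assumptions, and the rectangle is resolved by enumerating its cells, which a real system would avoid.

```python
import bisect

def z_order(x, y, bits):
    """Interleave the bits of x and y; x takes the more significant bit of each pair."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z

def owner(z, boundaries):
    """Index of the node whose contiguous key range contains z.

    Node j holds the keys in [boundaries[j-1], boundaries[j]).
    """
    return bisect.bisect_right(boundaries, z)

def nodes_for_range(x_lo, x_hi, y_lo, y_hi, bits, boundaries):
    """Nodes that hold data for a rectangular query (brute-force over the cells)."""
    return sorted({owner(z_order(x, y, bits), boundaries)
                   for x in range(x_lo, x_hi + 1)
                   for y in range(y_lo, y_hi + 1)})

BOUNDARIES = [4, 8, 12]   # assumed split points: four nodes over the keys 0..15
```

For example, the cell (x=2, y=1) with 2 bits per dimension has z-value 1001 (9) and falls on node 2 under the assumed boundaries.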

MURK: Multi-dimensional Rectangulation with KD-tree

Basic concept:

Partition the high-dimensional data space into “rectangles”, each managed by one node.

Partitioning is based on a KD-tree: the space is split cyclically along the dimensions, and each leaf of the KD-tree corresponds to one rectangle.

Partitioning

When a node joins, the space is split along one dimension into two parts of equal load, keeping the load balanced.

Each node manages the data in one rectangle, thus preserving data locality.
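A minimal sketch of the join step, assuming the data points are known and that the joining node splits the most loaded rectangle (the choice of which rectangle to split is an assumption, as is the name `join`). The split position is the median along the current dimension, so the two halves carry equal load rather than equal volume.

```python
def join(leaves, dims):
    """A node joins: split one rectangle into two halves of equal load.

    `leaves` is a list of (depth, points); each entry is the rectangle of one node
    and is assumed to hold at least one point.
    """
    i = max(range(len(leaves)), key=lambda j: len(leaves[j][1]))
    depth, points = leaves.pop(i)
    axis = depth % dims                        # dimensions are used cyclically
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                     # median split: equal load
    leaves.append((depth + 1, points[:mid]))
    leaves.append((depth + 1, points[mid:]))
    return points[mid][axis]                   # coordinate of the splitting plane

pts = [(1, 1), (2, 5), (3, 2), (4, 8), (6, 3), (7, 7), (8, 4), (9, 6)]
leaves = [(0, pts)]                            # a single node owns the whole space
for _ in range(3):                             # three more nodes join
    join(leaves, dims=2)
```

After the three joins, each of the four nodes manages two of the eight points.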

Comparison with CAN

The KD-tree-based partitioning is similar to that in CAN. Both map data into a multi-dimensional space and try to keep the load balanced.

The major difference is that in CAN a new node splits the existing node's data space equally, rather than splitting its load equally.

Routing in MURK

Each node keeps links to all of its neighboring nodes, i.e. the nodes that manage adjacent rectangles.

Routing is greedy over these “grid” links; the distance between two nodes is the minimum Manhattan distance.

Optimization for the routing

“Grid” links alone are not efficient for routing.
Maintain skip pointers at each node to speed up routing.
Two methods to choose the skip pointers:
  Random: choose a node at random from the node set.
  Space-filling skip graph: place the skip pointers at exponentially increasing distances.

Discussion

The number of routing neighbors is non-uniform across nodes, a result of load-balanced partitioning.

A dynamic data distribution can unbalance the data held by the nodes.

Performance

Conclusion

For locality, MURK far outperforms SCRAP. For routing cost, SCRAP is efficient enough, and skip pointers such as the space-filling-curve skip pointers are efficient.

SCRAP, using a space filling curve with range partitioning, is efficient in low dimensions. MURK with the space-filling skip graph performs much better, especially in high dimensions.

pSearch

Motivation
  Numerous documents are available over the Internet.
  How can the most closely related documents be found efficiently, without returning too many of little interest?

Problem:
  Semantically, documents are randomly distributed.
  Exhaustive search brings overhead.
  No deterministic guarantees.

P2P & IR techniques

Unstructured P2P search
  A centralized index is a bottleneck.
  Flooding-based techniques result in too much overhead.
  Heuristic-based algorithms may miss some important documents.

Structured P2P search
  DHT-based systems such as CAN and Chord are suitable for keyword matching.

Traditional IR techniques
  Advanced IR ranking algorithms could be adopted for P2P search.
  Two IR techniques:
    Vector space model (VSM).
    Latent semantic indexing (LSI).

pSearch

An IR system built on P2P networks.
  Efficient and scalable as a DHT.
  Accurate as advanced IR algorithms.

Map the semantic space to nodes and conduct nearest neighbor search.
  Use VSM and LSI to generate the semantic space.
  Use CAN to organize the nodes.

VSM & LSI

VSM
  Documents and queries are expressed as term vectors.
  Weight of a term: term frequency * inverse document frequency.
  Rank based on the similarity of the document and the query, cos(X, Y), where X and Y are two term vectors.

LSI
  Based on singular value decomposition; transforms a term vector from a high-dimensional space into a low-dimensional (L) semantic vector.
  The statistically derived concepts help overcome synonymy and noise in documents.
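The VSM ranking can be sketched in a few lines. This uses one common TF-IDF variant (raw term frequency times the logarithm of the inverse document frequency), which is an assumption, as are the names `tfidf_vectors` and `cosine`; the LSI step is not shown because it needs a singular value decomposition.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: tf * idf} dict per document."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))      # document frequency per term
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(d).items()}
            for d in docs]

def cosine(x, y):
    """cos(X, Y) for two sparse term vectors; 0.0 if either vector is all zeros."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

docs = [["p2p", "range", "query"],
        ["p2p", "knn", "search"],
        ["p2p", "range", "search"]]
vecs = tfidf_vectors(docs)
```

The term "p2p" occurs in every document, so its weight is zero and it does not influence the ranking; the first document is therefore similar to the third (shared term "range") but not to the second.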

pSearch system

[Figure: pSearch system overview, showing how a document (DOC) and a query (QUERY) are handled.]

Advantage of pSearch

Search is exhaustive within a bounded area and can ideally be fully accurate.

Communication overhead is limited to transferring the query and references to the top documents, independent of the corpus size.

A good approximation of the global statistics is sufficient for pSearch.

Challenges

Dimensionality mismatch between CAN and LSI.

Uneven distribution of indices.

Large search region.

Dimensionality mismatch

Not enough nodes (N) in the CAN to partition all the dimensions (L) in the LSI semantic space.

N nodes in a CAN can partition only about log(N) of the low dimensions (the effective dimensionality), leaving the others unpartitioned.

Rolling index

Motivation
  A small part of the dimensions contributes most of the similarity.
  Low dimensions are of high importance.

Partition more dimensions of the semantic space by rotating the semantic vectors.
  A semantic vector V = (v_0, v_1, …, v_l).
  Each rotation shifts the vector by m dimensions; rotated space i holds the vectors after the i-th rotation:
  V^i = (v_{i*m}, …, v_l, v_0, v_1, …, v_{i*m-1}), with m = 2.3 * ln(n).
  Use the rotated vector to route the query and guide the search.
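The rotation itself is a cyclic shift. A minimal sketch, in which the names `rotation_step` and `rotate` are hypothetical and rounding m to a whole number of dimensions is an assumption:

```python
import math

def rotation_step(n):
    """m = 2.3 * ln(n), rounded to a whole number of dimensions (rounding is assumed)."""
    return max(1, round(2.3 * math.log(n)))

def rotate(v, i, m):
    """Rotated vector V^i: the semantic vector shifted left by i * m positions."""
    k = (i * m) % len(v)
    return v[k:] + v[:k]

v = [0, 1, 2, 3, 4, 5]          # stand-in for a semantic vector (v_0, ..., v_5)
rotated_once = rotate(v, 1, 2)  # the first m = 2 dimensions move to the end
```

Each rotated space brings a different group of m dimensions to the front, where the CAN partitioning is effective.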

Rolling index

Uses more storage (p times) to keep the search in a local space.

Selective rotation is expected to process the important high dimensions efficiently.

Balance index distribution

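Content-aware node bootstrapping: a joining node randomly selects a document to publish, is routed to the node responsible for that document, and takes over part of that node's load.

Regions holding more indices are served by more nodes. Even though the selection is random, the load stays balanced for a large corpus.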

Reducing search space

Curse of dimensionality
  High-dimensional data is sparsely populated.
  In high dimensions, the distance between nearest neighbors becomes large.

Based on data locality, use the indices stored on nodes and recently processed queries to guide a new search.

Content-directed search

[Figure: content-directed search example on a grid of nodes numbered 1 to 24, with points labeled a to g, a query q and a point p.]

Performance

Conclusion

pSearch is a P2P IR system that organizes content around semantics and achieves good accuracy with respect to system size, corpus size and the number of returned documents.

The rolling index resolves the dimensionality mismatch while limiting the space overhead and the number of visited nodes.

Content-aware node bootstrapping balances node load and achieves index and query locality.

Content-directed search reduces the number of nodes searched.

kNN searching in P2P Networks

Outline

Introduction to searching in P2P
Nearest neighbor queries
Presentation of the ideas in the papers
  1. “A Scalable Nearest Neighbor Search in P2P Systems”
  2. “Enhancing P2P File-Sharing with an Internet-Scale Query Processor”

Introduction to searching in P2P

Exact match queries
  Single key retrieval
  Linear Hash
  CAN, CHORD, PASTRY, TAPESTRY

Similarity based queries
  Metric space based

What do we search for?
  Rare items, popular items, or both.

Nearest neighbor queries

The notion of a metric space
  How similar are two objects, given a set of objects?
  Extensible to exact, range and nearest neighbor queries.
  Computationally expensive.
  The distance function satisfies positiveness, reflexivity, symmetry and the triangle inequality.

Nearest neighbor queries (Cont)

A metric space is a pair (D, d)
  D: domain of objects
  d: the distance function

Similarity queries
  Range: for F ⊆ D, a range query retrieves all objects in F whose distance to the query object q is less than ρ
  Nearest neighbor: returns the object closest to q, or the k nearest objects for kNN, i.e. a set K ⊆ F such that

  $|K| = k \wedge \forall x \in K,\ y \in F \setminus K : d(q, x) \le d(q, y)$
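Both query types can be written down directly as a centralized, brute-force sketch that evaluates the distance function against every object. The names `range_query` and `knn_query` are hypothetical, and the absolute difference on integers stands in for an arbitrary metric.

```python
import heapq

def range_query(F, q, rho, d):
    """All objects of F whose distance to q is less than rho."""
    return [x for x in F if d(q, x) < rho]

def knn_query(F, q, k, d):
    """A set K of k objects of F such that no object outside K is closer to q."""
    return heapq.nsmallest(k, F, key=lambda x: d(q, x))

dist = lambda a, b: abs(a - b)      # example metric on integers
F = [1, 4, 6, 10, 15]
in_range = range_query(F, 5, 2, dist)
nearest3 = knn_query(F, 5, 3, dist)
```

The cost of this sketch is one distance computation per stored object, which is what a distributed index tries to avoid.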

Scalable NN search

Uses the GHT* structure.
  Distributed metric index
  Supports range and k-NN queries

The GHT* architecture is composed of nodes (peers) that can insert, store and retrieve objects using similarity queries.

Assumptions: message passing, unique network identifiers, local buckets to store data, and only one bucket per object.

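Example of the GHT* Network

[Figure: two peers, Peer 1 and Peer 2, with links to other peers. Each holds a tree of inner nodes whose leaves are either a Bucket ID (BID) for a local bucket or a Network Node ID (NNID).]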

Scalable NN search (3)

Address Search Tree (AST)
  Is a binary search tree
  Inner nodes hold routing information
    Two pivots and pointers to the left and right sub-trees
  Leaf nodes are pointers to data
    Local data is stored in buckets and accessed using the BID
    Non-local data is identified using the NNID
    (Every AST leaf node is one of these two pointer types)


Searching the AST: the BPATH

A BPATH represents a path in the tree as a string of n binary elements {0,1}: p = (b1, b2, …, bn)

For a query q with radius ρ, use the traversing operator Ψ. Ψ returns a BPATH.

Ψ examines every inner node using the two pivot values and decides which sub-tree to follow.

A radius of zero is used for exact matches and during inserts.
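The traversal can be sketched as follows. The slide does not state the decision rule, so the sketch assumes the usual generalized-hyperplane test: the left sub-tree is followed when d(q, p1) - ρ <= d(q, p2) + ρ and the right one when d(q, p1) + ρ >= d(q, p2) - ρ, so a range query may follow both. `Inner` and `traverse` are hypothetical names.

```python
from collections import namedtuple

# Inner node: two pivots and two sub-trees. Leaves are plain BID / NNID labels.
Inner = namedtuple("Inner", "p1 p2 left right")

def traverse(node, q, rho, d, path=()):
    """Return [(BPATH, leaf)] for every leaf that may hold objects within rho of q."""
    if not isinstance(node, Inner):
        return [(path, node)]
    d1, d2 = d(q, node.p1), d(q, node.p2)
    found = []
    if d1 - rho <= d2 + rho:              # the query ball reaches the p1 side
        found += traverse(node.left, q, rho, d, path + (0,))
    if d1 + rho >= d2 - rho:              # the query ball reaches the p2 side
        found += traverse(node.right, q, rho, d, path + (1,))
    return found

dist = lambda a, b: abs(a - b)            # example metric on integers
tree = Inner(2, 8, "BID 1", Inner(6, 10, "BID 2", "NNID 2"))
exact = traverse(tree, 3, 0, dist)        # radius zero: exact match or insert
ranged = traverse(tree, 5, 1, dist)       # radius 1: may reach several leaves
```

With radius zero the example query follows a single path to one bucket, while the range query reaches two leaves.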


k-NN searching in GHT*

Range searching is not suitable without intrinsic knowledge of the data and the metric space used.
Begin the search at the bucket with a high probability of containing the k objects.
If k objects are found, use the k-th object to define a similarity search whose radius is the distance of the k-th object from q. Sort the result and pick the first k.
If fewer than k objects are found, the upper bound on the search for the k-th neighbor cannot be determined.
  Variation on the range radius


Finding the k objects using range searches

Optimistic
  Minimizes distance computation costs and bucket accesses.
  Uses as bounding distance that of the last candidate available at the first accessed bucket.
  Iteratively expands the radius if fewer than k objects are found.

Pessimistic
  Minimizes the probability of a next iteration.
  Uses the distance between the pivot values at a level of the AST as the range radius, starting from the parent of the leaf, and executes the range query.
  If fewer than k objects are found, moves up to the next level.
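The optimistic strategy can be sketched on a toy setup in which the buckets are plain lists and a range query simply scans all of them. The name `knn_optimistic`, the growth factor of 2 and the way the first bucket is supplied are all assumptions; [ZBG04] does not necessarily expand the radius this way.

```python
def knn_optimistic(buckets, first, q, k, d, grow=2.0):
    """k nearest objects to q, found by range queries with an expanding radius."""
    total = sum(len(b) for b in buckets.values())
    # Initial radius: distance of the k-th (or last available) candidate
    # in the first accessed bucket.
    local = sorted(d(q, o) for o in buckets[first])
    rho = local[min(k, len(local)) - 1] if local else 1.0
    while True:
        # Stand-in for a distributed range query with radius rho.
        hits = sorted((o for b in buckets.values() for o in b if d(q, o) <= rho),
                      key=lambda o: d(q, o))
        if len(hits) >= k or len(hits) == total:
            return hits[:k]
        rho = rho * grow if rho > 0 else 1.0   # too few objects: expand and retry

dist = lambda a, b: abs(a - b)
buckets = {"B1": [1, 2], "B2": [6, 7, 9], "B3": [12]}
three = knn_optimistic(buckets, "B2", 6, 3, dist)
four = knn_optimistic(buckets, "B2", 6, 4, dist)
```

For k = 3 the first bucket already supplies the radius and a single range query suffices; for k = 4 the first query returns only three objects, so the radius is expanded once.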


Performance evaluation

With increasing k
  The number of parallel distance computations remains stable
  The number of bucket accesses and the number of messages increase rapidly

Effect of a growing dataset
  Max hop count increases slowly
  Nearly constant parallel distance computation costs

Comparison with range search
  Slightly slower because of the overhead to locate the first bucket


Performance of the scheme on the TXT dataset.


Conclusion

First effort in distributed index structures supporting k-NN searching.
GHT* is a scalable solution.
Scope for future work includes handling updates of the dataset and other metric space partitioning schemes.

Enhanced P2P - PIERSearch (1)

Internet-scale query processor
Queried data has a Zipfian distribution
  Popular data in the head
  Long tail of rare items
PIERSearch is DHT based
It is a hybrid system: it uses Gnutella for popular items and PIERSearch for rare items
Integrated with the PIER system

PIERSearch (2)

Gnutella query processing
  Flooding based
  Simple for popular files
  Optimized using
    ultra peers: nodes that perform the query processing on behalf of the leaf nodes
    dynamic querying: larger TTL

The team studied the characteristics of the Gnutella network.

PIERSearch (3)

Effectiveness of Gnutella
  Query recall: percentage of the results available in the network that are returned.
  Query distinct recall: percentage of distinct results; nullifies the effect of having replicas.
Experiments show that Gnutella is efficient for highly replicated content and for queries with large result sets.
Found ineffective for rare content.
Increasing the TTL does not reduce latency but can improve recall.

PIERSearch (4)

Searching using PIERSearch
  Keyword based.
  The publisher maintains an inverted file indexed using the DHT.
  Generates two kinds of tuples for each item:
    Item(fileId, filename, filesize, ipAddress, port)
    Inverted(keyword, fileId)
  Uses the underlying PIER system, a DHT-based Internet-scale relational query processor.
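The two tuple types and a keyword lookup can be sketched with a dictionary standing in for the DHT. This is not the PIER API: `dht`, `publish` and `search` are hypothetical names, and splitting filenames on spaces and dots is an assumed tokenization.

```python
from collections import defaultdict

dht = defaultdict(list)    # stand-in for the DHT: key -> tuples stored under it

def publish(file_id, filename, filesize, ip_address, port):
    """Store one Item tuple, plus one Inverted tuple per keyword of the filename."""
    dht[("Item", file_id)].append((file_id, filename, filesize, ip_address, port))
    for keyword in set(filename.lower().replace(".", " ").split()):
        dht[("Inverted", keyword)].append((keyword, file_id))

def search(*keywords):
    """Item tuples of the files whose names contain all of the given keywords."""
    ids = None
    for keyword in keywords:
        matches = {fid for _, fid in dht.get(("Inverted", keyword.lower()), [])}
        ids = matches if ids is None else ids & matches
    return [item for fid in sorted(ids or ()) for item in dht[("Item", fid)]]

publish(1, "rare live set.mp3", 100, "10.0.0.1", 6346)
publish(2, "popular live hit.mp3", 200, "10.0.0.2", 6346)
both = search("live")
only_rare = search("rare", "live")
```

A multi-keyword query intersects the fileId sets of the matching Inverted tuples and then fetches the corresponding Item tuples, which mirrors a join on fileId.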

PIERSearch (5)

Hybrid system

Identification of rare items
  Query result size: results of queries whose result set is smaller than a fixed threshold are considered rare.
  Term frequency: items with at least one term below a frequency threshold are considered rare.
  Term pair frequency: less prone to skew when filenames contain popular words.
  Sampling: samples neighboring nodes and computes a lower-bound estimate of the number of replicas.
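The term frequency scheme, for instance, can be sketched as follows. The function name `rare_by_term_frequency`, the whitespace tokenization and the counting of each term once per file are assumptions, and the threshold is an arbitrary example value.

```python
from collections import Counter

def rare_by_term_frequency(files, threshold):
    """files: {fileId: filename}. Ids of files with at least one term seen
    in fewer than `threshold` filenames."""
    terms = {fid: set(name.lower().split()) for fid, name in files.items()}
    freq = Counter(t for ts in terms.values() for t in ts)
    return sorted(fid for fid, ts in terms.items()
                  if any(freq[t] < threshold for t in ts))

files = {1: "top hit", 2: "top hit", 3: "top hit demo"}
rare = rare_by_term_frequency(files, threshold=2)
```

Only file 3 is flagged, because "demo" occurs in a single filename; this also shows the skew mentioned for term pairs, since a file made up only of popular words is never flagged.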

PIERSearch (6)

Performance summary

PIERSearch (7)

Conclusion

We have found that Gnutella is highly effective for querying popular content, but ineffective for rare items.

We have found that building a partial index over the least replicated content can improve query recall.

References

[APJ+04] A. Crainiceanu, P. Linga, J. Gehrke and J. Shanmugasundaram. Querying Peer-to-Peer Networks Using P-Trees. In WebDB, 2004.

[GYG04] P. Ganesan, B. Yang and H. Garcia-Molina. One Torus to Rule them all: Multi-dimensional Queries in P2P Systems. In WebDB, 2004.

[SP03] C. Schmidt and M. Parashar. Flexible Information Discovery in Decentralized Distributed Systems. In HPDC, 2003.

[STZ04] Y. Shu, K-L. Tan and A. Zhou. Adapting the Content Native Space for Load Balanced Indexing. In Database, Information Systems and Peer-to-Peer Computing, 2004.

References (cont.)

[LHH+04] B. Loo, J. Hellerstein, R. Huebsch, S. Shenker and I. Stoica. Enhancing P2P File-sharing with an Internet-Scale Query Processor. In VLDB, 2004.

[TXD03] C. Tang, Z. Xu and S. Dwarkadas. Peer-to-Peer Information Retrieval Using Self-Organizing Semantic Overlay Networks. In SIGCOMM, 2003.

[ZBG04] P. Zezula, M. Batko and C. Gennaro. A Scalable Nearest Neighbor Search in P2P Systems. In Database, Information Systems and Peer-to-Peer Computing, 2004.

Thank you!
