Querying the Internet with PIER CS294-4 Paul Burstein 11/10/2003



Outline
- Motivation
- Architecture
- Join Algorithms
- Evaluation
- Discussion


Motivation
- Inject a degree of distribution into databases
- Internet-scale systems vs. hundred-node systems
- Large-scale applications requiring database functionality


Applications
- P2P databases
  - Highly distributed and available data
- Network monitoring
  - Intrusion detection
  - Fingerprint queries


Design Principles
- Relaxed consistency
  - Sacrifice consistency in the face of availability and partition tolerance
- Organic scaling
  - Growth with deployment
- Natural habitats for data
  - Data remains in its original format with a DB interface
- Standard schemas
  - Achieved through common software


PIER Architecture


DHT Design
- Implemented with CAN and Chord
- Routing layer
  - Mapping for keys
- Storage manager
  - Node data storage
- Provider
  - Storage access interface for higher levels
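
The routing layer's key mapping can be illustrated with a Chord-style consistent-hashing ring. This is a minimal single-process sketch, not PIER's routing code: the `Ring` class, `ring_id`, `owner_of`, the 32-bit identifier space, and the node names are all assumptions made for illustration.

```python
import bisect
import hashlib

RING_BITS = 32  # assumed identifier-space size for this sketch


def ring_id(value: str) -> int:
    """Hash a string onto the identifier ring."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << RING_BITS)


class Ring:
    """Hypothetical ring of nodes; each key belongs to its successor node."""

    def __init__(self, node_names):
        self._points = sorted((ring_id(name), name) for name in node_names)
        self._ids = [point[0] for point in self._points]

    def owner_of(self, key: str) -> str:
        """Return the first node at or after the key's position, wrapping around."""
        index = bisect.bisect_left(self._ids, ring_id(key))
        return self._points[index % len(self._points)][1]


ring = Ring([f"node{i}" for i in range(8)])
owner = ring.owner_of("R:42")  # name of the node responsible for this key
```

When a node joins or leaves such a ring, only the keys between it and its neighbor change owner, which is the kind of local key set change a routing layer would report upward.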


Routing & Storage
- Routing layer
  - DHT-based API
  - locationMapChange – local key set change
- Storage manager
  - Easy to realize API
  - Efficient performance relative to network
  - Main-memory storage manager used


Provider
- Couples the routing and storage layers
- namespace – relation
- resourceId – primary key
- namespace + resourceId together form the key
- instanceId – distinguishes objects with the same namespace and resourceId
- lifetime – item storage duration
- multicast – contacts namespace’s nodes
- lscan – iterates over a node’s local data
- newData – application callback on data arrival
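
To make the provider terms concrete, here is a toy single-node model. It is a sketch under stated assumptions, not PIER's actual interface: the `ToyProvider` class, its method signatures, the SHA-1 key derivation, and the injectable clock are hypothetical, and `multicast` is left out because only one node is modeled.

```python
import hashlib
import time
from collections import defaultdict


class ToyProvider:
    """Hypothetical single-process stand-in for one node's provider."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = defaultdict(dict)          # key -> {instance_id: (item, expiry)}
        self._namespace_keys = defaultdict(set)  # namespace -> keys stored locally
        self._callbacks = defaultdict(list)      # namespace -> newData callbacks

    @staticmethod
    def key_for(namespace, resource_id):
        """namespace + resourceId determine the key (the hash choice is assumed)."""
        raw = f"{namespace}/{resource_id}".encode("utf-8")
        return hashlib.sha1(raw).hexdigest()

    def new_data(self, namespace, callback):
        """Register an application callback fired when data arrives."""
        self._callbacks[namespace].append(callback)

    def put(self, namespace, resource_id, instance_id, item, lifetime):
        """Store an item for `lifetime` seconds and notify listeners."""
        key = self.key_for(namespace, resource_id)
        self._items[key][instance_id] = (item, self._clock() + lifetime)
        self._namespace_keys[namespace].add(key)
        for callback in self._callbacks[namespace]:
            callback(item)

    def get(self, namespace, resource_id):
        """Return all unexpired items stored under the key."""
        key = self.key_for(namespace, resource_id)
        now = self._clock()
        stored = self._items.get(key, {})
        return [item for item, expiry in stored.values() if expiry > now]

    def lscan(self, namespace):
        """Iterate over this node's unexpired items in a namespace."""
        now = self._clock()
        for key in self._namespace_keys.get(namespace, ()):
            for item, expiry in self._items.get(key, {}).values():
                if expiry > now:
                    yield item
```

Two items with the same namespace and resourceId but different instanceId values coexist under one key, and each disappears from `get` and `lscan` once its lifetime runs out.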


PIER Query Processor
- Query dataflow engine
- Operators: selection, projection, joins, grouping, aggregation
- Operators push and pull data
- Currently, data modification is through the DHT interface
- Relaxed consistency and reachable snapshot
  - Working only with nodes reachable at the time a query is issued


Join Algorithms
- Symmetric hash join
  - Rehashes the relations
  - Scan and copy
- Fetch matches
  - One relation already hashed on join attribute
- R, S – relations
- Nr, Ns – relation namespaces
- Nq – DHT-based temporary table
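
The probe logic of a symmetric hash join, as it might run at one node holding part of Nq, can be sketched as follows. This is an illustrative sketch, not PIER's implementation: the `stream` of tagged tuples stands in for tuples rehashed into Nq, and the function name, tuple layout, and sample data are assumptions.

```python
from collections import defaultdict


def symmetric_hash_join(stream, key_r, key_s):
    """Join R and S tuples that arrive interleaved in any order.

    `stream` yields ("R", tuple) or ("S", tuple) pairs. Each arriving tuple
    is inserted into its own side's hash table and then probes the other
    side's table, so matches are produced as soon as both halves are present.
    """
    table_r = defaultdict(list)
    table_s = defaultdict(list)
    for source, tup in stream:
        if source == "R":
            key = key_r(tup)
            table_r[key].append(tup)
            for match in table_s.get(key, []):
                yield (tup, match)
        else:
            key = key_s(tup)
            table_s[key].append(tup)
            for match in table_r.get(key, []):
                yield (match, tup)


# Hypothetical arrivals at one Nq node, in network order.
arrivals = [
    ("R", {"id": 1, "pkey": 10}),
    ("S", {"pkey": 10, "num": 7}),
    ("S", {"pkey": 11, "num": 8}),
    ("R", {"id": 2, "pkey": 11}),
    ("R", {"id": 3, "pkey": 12}),
]
results = list(
    symmetric_hash_join(arrivals, lambda r: r["pkey"], lambda s: s["pkey"])
)
```

Neither relation has to arrive completely before results start flowing, which suits a setting where tuples trickle in from many nodes.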


Join Rewriting
- Aimed at lowering bandwidth utilization
- Symmetric semi-join
  - Local projections to join keys
  - Global fetch matches join
- Bloom joins
  - Local Bloom filters are published into temporary namespaces
  - Filters multicast to opposite relation’s nodes
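
The filtering step behind a Bloom join can be sketched with a small Bloom filter. This is a minimal illustration under assumed parameters (1024 bits, 3 hash functions, SHA-256 based hashing); the class and the sample tuples are hypothetical and not taken from PIER.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        # Derive num_hashes bit positions by salting the value with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def __contains__(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))

    def union(self, other):
        """OR two filters built with the same parameters."""
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


# Two nodes holding parts of R each summarize their local join keys.
r_node_a, r_node_b = BloomFilter(), BloomFilter()
for join_key in (10, 11):
    r_node_a.add(join_key)
r_node_b.add(12)
r_filter = r_node_a.union(r_node_b)

# A node holding S tuples rehashes only those that might match R.
s_tuples = [{"pkey": 10}, {"pkey": 99}, {"pkey": 12}]
to_rehash = [t for t in s_tuples if t["pkey"] in r_filter]
```

A Bloom filter never yields a false negative, so no matching tuple is dropped; an occasional false positive only costs some wasted bandwidth.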


How does this scale?


Workload Parameters
- CAN configuration: d = 4
- R 10 times larger than S
- Constants provide 50% selectivity
- f(x,y) evaluated after the join
- 90% of R tuples match a tuple in S
- Result tuples are 1KB each
- Symmetric hash join used


Simulation Setup
- Up to 10,000 nodes
- Network cross-traffic, CPU and memory utilizations ignored

1. 100ms and 10Mbps fully connected links
2. GT-ITM transit-stub topology


Scalability
- 1MB data per node
- Fully-connected topology
- Variable number of computation nodes
- Network congestion is an issue with few computation nodes
- How is the computation workload distributed?


Join Algorithms (1/2)
- Infinite bandwidth
- 1024 data and computation nodes
- Core join algorithms
  - Perform faster
- Rewrites
  - Bloom filter: two multicasts
  - Semi-join: two CAN lookups


Join Algorithms (2/2)
- Limited bandwidth
  - 10Mbps inbound capacity
  - 25GB relations, 1024 nodes
- Symmetric hash join
  - Rehashes both tables
- Semi-join
  - Transfers only matching tuples
- At 40% selectivity, bottleneck switches from computation nodes to query sites
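
A back-of-envelope calculation shows why inbound capacity dominates in this setting. The sketch below is only a rough lower bound: it assumes 25GB in total must be rehashed, tuples spread uniformly over the computation nodes, and the inbound link is the sole bottleneck. The function name and this reading of the slide's numbers are assumptions, not results from the evaluation.

```python
def min_rehash_seconds(total_bytes, computation_nodes, inbound_bits_per_sec):
    """Lower bound on transfer time when each node receives an equal share."""
    bytes_per_node = total_bytes / computation_nodes
    return bytes_per_node * 8 / inbound_bits_per_sec


GB = 10**9
MBPS = 10**6

# Assumed inputs: 25GB moved in total, 1024 receivers, 10Mbps inbound each.
estimate = min_rehash_seconds(25 * GB, 1024, 10 * MBPS)
```

Under these assumptions the bound is about 19.5 seconds, and it grows in proportion to the data moved, which is why a rewrite that transfers only matching tuples can pay off at low selectivity.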


Soft State
- Failure detection and recovery
  - 15 second failure detection
  - 4096 nodes
- Refresh period
  - Time to reinsert lost tuples
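
The interplay between lifetimes and the refresh period can be sketched as follows. This is a toy model with hypothetical names (`SoftStateStore`, `publisher_tick`) and explicit timestamps instead of a real clock; a node failure is modeled simply as the store coming back empty.

```python
class SoftStateStore:
    """Items expire unless their publisher renews them in time."""

    def __init__(self):
        self._expiry = {}  # item_id -> time at which the item expires

    def put(self, item_id, now, lifetime):
        self._expiry[item_id] = now + lifetime

    def renew(self, item_id, now, lifetime):
        """Extend an item's lifetime; return False if it is missing or expired."""
        if self._expiry.get(item_id, float("-inf")) <= now:
            self._expiry.pop(item_id, None)
            return False
        self._expiry[item_id] = now + lifetime
        return True

    def live_items(self, now):
        return {item for item, expiry in self._expiry.items() if expiry > now}


def publisher_tick(store, item_id, now, lifetime):
    """Run once per refresh period: renew the item, or reinsert it if lost."""
    if store.renew(item_id, now, lifetime):
        return "renewed"
    store.put(item_id, now, lifetime)
    return "reinserted"


store = SoftStateStore()
store.put("t1", now=0, lifetime=30)
first = publisher_tick(store, "t1", now=20, lifetime=30)   # item still live
store = SoftStateStore()                                   # storing node fails
second = publisher_tick(store, "t1", now=40, lifetime=30)  # next refresh
```

In this model a lost tuple stays missing until the publisher's next tick, so a shorter refresh period shortens the time to reinsert it at the cost of more renewal traffic.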


Transit Stub Topology
- GT-ITM
  - 4 domains, 10 nodes per domain, 3 stubs per node
  - 50ms, 10ms, 2ms latency
  - 10Mbps inbound links
- Similar trends as fully connected topology
  - A bit longer end-to-end delays


Experimental Results
- 64 PCs on 1Gbps network
- All nodes are computation nodes


Discussion
- PIER presents a distributed query engine
- What remains to be done?
  - DB issues
  - Networking issues