
CamCube - Rethinking the Data Center Cluster

Paolo Costa
[email protected]

joint work with Ant Rowstron, Austin Donnelly, and Greg O’Shea (MSR Cambridge); Hussam Abu-Libdeh and Simon Schubert (interns)


• Data centers
  – Large group of networked servers
  – Typically built from commodity hardware

• Scale-out vs. scale-up
  – Many low-cost PCs rather than few expensive high-end servers
  – Economies of scale

• Custom software stack
  – Microsoft: Autopilot, Dryad
  – Google: GFS, BigTable, MapReduce
  – Yahoo: Hadoop, HDFS
  – Amazon: Dynamo, Astrolabe
  – Facebook: Cassandra, Scribe, HBase

Focus of this talk


• Mismatch between abstraction & reality

• How programmers see the network:
  – Key-based
  – Overlay networks
  – Explicit structure
  – Symmetry of role
  – …


[Figure: typical data center topology — ~20-40 servers per rack, ~12 racks per pod, with 10 Gbps links between the switch tiers]

Data center services are distributed, but they have no control over the network:
– they must infer network properties (e.g., locality)
– they must use inefficient overlay networks
– they get no bandwidth guarantees

• Top-down (applications)
  – Custom routing?
    • Overlays
  – Priority traffic?
    • Use multiple TCP flows
  – Locality?
    • Use traceroute and ping
  – TCP Incast?
    • Add random jitter

• Bottom-up (network)
  – More bandwidth?
    • Use richer topologies (e.g., fat-trees)
  – Fate-sharing?
    • Use prediction models to estimate flow length / traffic patterns
  – TCP Incast?
    • Use ECN in routers

• The network is still a black box
  – good for a multi-owned, heterogeneous Internet
  – limiting for a single-owned, homogeneous data center

• DCs are not mini-Internets
  – single owner / administration domain
  – we know (and define) the topology
  – low hardware/software diversity
  – no malicious or selfish nodes

• Is TCP/IP still appropriate?

• “A data center from the perspective of a distributed systems builder”

• What happens if you integrate network & compute?

[Figure: Venn diagram — CamCube at the intersection of HPC, networking, and distributed systems]

[Figure: CamCube design overview — Topology; API: link- and key-based (in place of TCP/IP); Architecture: service composition]

• No distinction between overlay and underlying network

• Nodes have an (x,y,z) coordinate
  – defines the key-space

• Simple one-hop API to send/receive packets
  – all other functionality implemented as services

• Key coordinates are re-mapped in case of failures
  – enables a KBR-like API (see the sketch below)

[Figure: 3D torus with x, y, z axes; after a server failure, keys (1,2,1) and (1,2,2) are re-mapped to a live neighbour]
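The one-hop API is concrete enough to sketch. Below is a minimal C# sketch (C# being the prototype's language) of how key-based routing can be built as just another service on top of the one-hop primitive. All type and member names here (ICamCubeRuntime, SendToNeighbour, OnPacket) are hypothetical, not the actual CamCube interfaces.

// Minimal sketch, not the real CamCube runtime API: every name below is
// hypothetical. It shows how multi-hop, key-based routing is itself just
// a service layered on the one-hop send/receive primitive.
using System;
using System.Collections.Generic;

struct Coord
{
    public readonly int X, Y, Z;
    public Coord(int x, int y, int z) { X = x; Y = y; Z = z; }
}

interface ICamCubeRuntime
{
    Coord LocalCoord { get; }
    IReadOnlyList<Coord> Neighbours { get; }           // the six one-hop torus links
    void SendToNeighbour(Coord neighbour, byte[] pkt); // the only send primitive
}

class GreedyKeyRouting
{
    const int Side = 3; // 3x3x3 prototype; each axis wraps around
    readonly ICamCubeRuntime rt;
    public GreedyKeyRouting(ICamCubeRuntime rt) { this.rt = rt; }

    // Per-axis distance on a ring of length Side (accounts for wraparound).
    static int AxisDist(int a, int b) =>
        Math.Min(Math.Abs(a - b), Side - Math.Abs(a - b));

    static int Dist(Coord a, Coord b) =>
        AxisDist(a.X, b.X) + AxisDist(a.Y, b.Y) + AxisDist(a.Z, b.Z);

    // Up-call from the runtime when a packet addressed to a key arrives.
    public void OnPacket(Coord key, byte[] pkt)
    {
        if (Dist(rt.LocalCoord, key) == 0) { /* deliver to local services */ return; }
        Coord next = rt.Neighbours[0];
        foreach (var n in rt.Neighbours)        // forward to the neighbour
            if (Dist(n, key) < Dist(next, key)) // closest to the key
                next = n;
        rt.SendToNeighbour(next, pkt);
    }
}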


• Packets move from service to service and server to server
• Services can intercept and process packets along the way
• Keys are re-mapped in case of failures
• Server resources (RAM, disks, CPUs) are accessible to all services

[Figure: animation — a “Fetch mail costa:1234” request is routed through servers (0,0,0), (1,0,0), (2,0,0); after a failure the destination key is re-mapped, here to (1,1,0)]
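To make “intercept and process along the way” concrete, here is a hedged sketch of an on-path service: every intermediate server sees the packet and may answer a fetch from a local replica instead of forwarding it. The types below are illustrative only, not the real runtime's interfaces.

// Hypothetical sketch of on-path processing; these types are illustrative,
// not the CamCube runtime's. A service can consume a packet mid-path.
using System.Collections.Generic;

class Packet
{
    public string Key;     // e.g. "costa:1234" for a mailbox fetch
    public byte[] Payload;
}

class OnPathMailService
{
    // Mail items this server happens to hold (primary or replica).
    readonly Dictionary<string, byte[]> localStore = new Dictionary<string, byte[]>();

    // Called by the routing service at every hop. Returning null tells the
    // runtime the packet was consumed here; returning it keeps forwarding.
    public Packet Intercept(Packet p)
    {
        byte[] mail;
        if (localStore.TryGetValue(p.Key, out mail))
        {
            Reply(p, mail);  // answer from the on-path copy
            return null;     // stop forwarding toward the key's home server
        }
        return p;
    }

    void Reply(Packet request, byte[] mail)
    {
        // send the response back toward the requester (omitted)
    }
}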


• 27-server (3×3×3) prototype
  – quad-core Intel Xeon 5520, 2.27 GHz
  – 12 GB RAM
  – 6 Intel PRO/1000 PT 1 Gbps ports
  – runtime & services implemented in C#
    • multicast, key-value store, MapReduce, graph processing engine, …

• Experiment: all servers saturating all 6 links
  – 11.94 Gbps per server (9 KB packets) using < 22% CPU, close to the 12 Gbps aggregate of six full-duplex 1 Gbps links

• MapReduce
  – originally developed by Google
  – open-source version (Hadoop) used by several companies
    • Yahoo, Facebook, Twitter, Amazon, …
  – many applications
    • data mining, search, indexing, …

• Simple abstraction (sketched below)
  – map: apply a function that generates (key, value) pairs
  – reduce: combine intermediate results
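To pin down what those two functions look like, here is a toy word count in C#. It is illustrative only, not taken from Camdoop or Hadoop: map emits (key, value) pairs, and reduce combines all values for a key.

// Toy word count, just to pin down the map/reduce contract.
// Illustrative only; not taken from Camdoop or Hadoop.
using System;
using System.Collections.Generic;

static class WordCount
{
    // map: one input record -> zero or more (key, value) pairs
    static IEnumerable<KeyValuePair<string, int>> Map(string line)
    {
        foreach (var w in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            yield return new KeyValuePair<string, int>(w, 1);
    }

    // reduce: all intermediate values for one key -> the combined result
    static int Reduce(string key, IEnumerable<int> values)
    {
        int sum = 0;
        foreach (var v in values) sum += v;
        return sum;
    }

    static void Main()
    {
        // Group the map output by key, then reduce each group.
        var groups = new Dictionary<string, List<int>>();
        foreach (var kv in Map("the quick the lazy the"))
        {
            if (!groups.TryGetValue(kv.Key, out var list))
                groups[kv.Key] = list = new List<int>();
            list.Add(kv.Value);
        }
        foreach (var g in groups)
            Console.WriteLine(g.Key + ": " + Reduce(g.Key, g.Value)); // the: 3, quick: 1, lazy: 1
    }
}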

[Figure: all map tasks send their intermediate data to a single reduce task]

[Figure: all-to-one shuffle with a single reduce task]
– Results stored on a single node
– Can handle all reduce functions
– Bottleneck (network and CPU)

[Figure: all-to-all shuffle with R reduce tasks]
– CPU & network load distributed
– N² flows: needs full bisection bandwidth
– Results are split across R servers
– Not always applicable (e.g., max)

[Figure: back to the single-reduce configuration]

Use hierarchical aggregation!

[Figure: aggregation tree — Aggregate steps between the map tasks and the reduce task]

Intermediate results can be aggregated at each hop.

In-network aggregation:
– traffic reduced at each hop
– no need for full bisection bandwidth
– final results stored on a single node
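The mechanism is easy to sketch: because the reduce function here is commutative and associative, every vertex of the tree can merge its children's partial results before sending one smaller stream to its parent. A minimal sketch under assumed (hypothetical) interfaces, not Camdoop's actual code:

// Hop-by-hop combining (hypothetical sketch): merge all (key, value)
// pairs arriving from children, then forward one pair per key.
using System;
using System.Collections.Generic;

class HopAggregator
{
    // Must be commutative & associative, e.g. (a, b) => a + b, or Math.Max.
    readonly Func<long, long, long> combine;
    readonly Dictionary<string, long> partial = new Dictionary<string, long>();

    public HopAggregator(Func<long, long, long> combine) { this.combine = combine; }

    // Called for every pair received from a child in the aggregation tree.
    public void OnPair(string key, long value)
    {
        long acc;
        partial[key] = partial.TryGetValue(key, out acc) ? combine(acc, value) : value;
    }

    // Once all children have finished: the combined pairs to send to the parent.
    public IEnumerable<KeyValuePair<string, long>> Flush() => partial;
}

With S = output size / intermediate size well below 1, each hop forwards far less than it receives, which is why the tree does not need full bisection bandwidth.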

• Camdoop builds not 1 tree but 6 disjoint trees (one per 1 Gbps link)
  – failure resilient
  – (up to) 6 Gbps throughput / server
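One plausible way to exploit all six trees, sketched under assumptions (the slides do not show the striping policy): partition the intermediate keys across the trees so each link carries roughly one sixth of the traffic.

// Hypothetical striping sketch: assign each key to one of the six disjoint
// trees. A real system needs a hash that is deterministic across servers
// (string.GetHashCode is randomized per process on modern .NET).
static class TreeStriping
{
    public static int TreeFor(string key, int numTrees = 6)
    {
        int h = 17;
        foreach (char c in key) h = h * 31 + c;  // deterministic across servers
        return (h & int.MaxValue) % numTrees;    // non-negative tree index
    }
}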

• 27-server testbed

• 27 GB input size
  – 1 GB / server

• Output size / intermediate size (S)
  – S = 1/N ≈ 0 (full aggregation)
  – S = 1 (no aggregation)

• Two configurations:
  – all-to-one
  – all-to-all

[Figure: all-to-one configuration — completion time (s, log scale) vs. output/intermediate size ratio S, from full aggregation (S ≈ 0) to no aggregation (S = 1), for switch, Camdoop (no agg), Camdoop, and the 1 Gbps switch bound. Annotations: Camdoop (no agg) performance is independent of S; its gap to the switch shows the impact of higher bandwidth; the gap to Camdoop shows the impact of aggregation; a marker indicates the aggregation ratio reported by Facebook]

[Figure: the same plot for the all-to-all configuration, again annotated with the impact of aggregation and the impact of higher bandwidth]

[Figure: bar chart — completion time (ms) for 20 KB and 200 KB of input data per server, comparing switch, Camdoop (no agg), and Camdoop]

In-network aggregation is beneficial also for low-latency services.

• Building large-scale services is hard
  – mismatch between physical and logical topology
  – the network is a black box

• CamCube
  – designed from the perspective of developers
  – explicit topology and key-based abstraction

• Benefits
  – simplified development
  – more efficient applications and better fault-tolerance

H. Abu-Libdeh, P. Costa, A. Rowstron, G. O'Shea, and A. Donnelly. “Symbiotic Routing for Future Data Centres.” In ACM SIGCOMM, 2010.

P. Costa, A. Donnelly, A. Rowstron, and G. O'Shea. “Camdoop: Exploiting In-network Aggregation for Big Data Applications.” In USENIX NSDI, 2012.

Paolo Costa http://www.doc.ic.ac.uk/~costa