116
Condor Better Topologies through Declarative Design Brandon Schlinker 1,2 Radhika Niranjan Mysore 1 , Sean Smith 1 , Jeffrey C. Mogul 1 , Amin Vahdat 1 , Minlan Yu 2 , Ethan Katz-Bassett 2 , and Michael Rubin 1 1 Google Inc., 2 University of Southern California SIGCOMM 2015 1

Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Embed Size (px)

Citation preview

Page 1: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

CondorBetter Topologies through Declarative Design

Brandon Schlinker1,2

Radhika Niranjan Mysore1, Sean Smith1, Jeffrey C. Mogul1, Amin Vahdat1, Minlan Yu2,Ethan Katz-Bassett2, and Michael Rubin1

1 Google Inc., 2 University of Southern California

SIGCOMM 2015

1

Page 2: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Our Focus:

Designing Topologies for Large Datacenters

2

Page 3: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Our Focus:

Designing Topologies for Large Datacenters

(10k - 100k+ servers)

3

Page 4: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Designing Topologies for Large Datacenters

Why exploring the topology design space is hardHow today’s ad-hoc approach leads to suboptimal deployments

How Condor’s systematic approach helps architectsEnabling efficient exploration and evaluation of the design space

4

Page 5: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Designing Topologies for Large Datacenters

Why exploring the topology design space is hardHow today’s ad-hoc approach leads to suboptimal deployments

How Condor’s systematic approach helps architectsEnabling efficient exploration and evaluation of the design space

5

Page 6: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

6

NetworkArchitect

“We need a datacenter for 50k servers”

Page 7: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

7

“We need a datacenter for 50k servers”

NetworkArchitect

Let’s use a Fat-tree topology!(SIGCOMM 2008)

Page 8: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

The Fat-tree Topology

8

Each fat-tree contains 2+ pods

Pod 2 Pod 3 Pod 4Pod 1

Pod PodPod Pod

Page 9: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

The Fat-tree Topology

9

Pod 2 Pod 3 Pod 4Pod 1

Each pod contains:Servers

CPU

Page 10: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

The Fat-tree Topology

10

Pod 2 Pod 3 Pod 4Pod 1

ToRs

Each pod contains:Servers

Top of Rack Switches

Page 11: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

The Fat-tree Topology

11

Pod 2 Pod 3 Pod 4Pod 1

ToRs

Aggs

Each pod contains:Servers

Top of Rack SwitchesAggregation Switches

Page 12: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

The Fat-tree Topology

12

Pod 2 Pod 3 Pod 4Pod 1

ToRs

Aggs

Pods are connected by Spine Switches

Spines

Page 13: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

The Fat-tree Topology

13

Pod 2 Pod 3 Pod 4Pod 1

ToRs

Aggs

Done.The topology is deployed.

Spines

Page 14: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Up and Running...

14

Months go by…The network runs fine most of the time…

Page 15: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Up and Running...

15

Months go by…The network runs fine most of the time…

But when a failure happensrecovery takes longer than expected

Page 16: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Pod NPod N-1Pod 2Pod 1

Hindsight is 20/20

16

...

Architect goes back and reviews the design…

Fat-tree topology | SIGCOMM 2008

Page 17: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Pod NPod N-1Pod 2Pod 1

Hindsight is 20/20

17

...

Pod NPod N-1Pod 2Pod 1

...

Fat-tree topology | SIGCOMM 2008 F10 topology | NSDI 2013

And discovers a small wiring change (F10)that allows faster recovery from failures

Page 18: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Pod NPod N-1Pod 2Pod 1

18

...

Pod NPod N-1Pod 2Pod 1

...

Fat-tree topology | SIGCOMM 2008 F10 topology | NSDI 2013

Why is F10’s wiring better?

Page 19: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Pod 2

43

F10’s Hidden Advantage

19

F10 topology | NSDI 2013

Pod 1

21

...

Pod 2

43

Pod 1

2

...

Fat-tree topology | SIGCOMM 2008

1

F10’s small change improves the connectivity between pods

Page 20: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

F10’s Hidden Advantage

20

Pod 2

43

Pod 1

21

...

Fat-tree topology | SIGCOMM 2008

Page 21: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

F10’s Hidden Advantage

21

Pod 2

43

Pod 1

21

...

Pod 2

34

Pod 1

1

Resulting Inter-Pod Connectivity

Fat-tree topology | SIGCOMM 2008

Page 22: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

F10’s Hidden Advantage

22

Pod 2

43

Pod 1

21

...

Pod 2

43

Pod 1

2

...

1

Pod 2

34

Pod 1

1

Resulting Inter-Pod Connectivity

F10 topology | NSDI 2013Fat-tree topology | SIGCOMM 2008

Page 23: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

F10’s Hidden Advantage

23

Pod 2

43

Pod 1

21

...

Pod 2

43

Pod 1

2

...

1

Pod 2

34

Pod 1

1Pod 2

34

Pod 1

Resulting Inter-Pod Connectivity Resulting Inter-Pod Connectivity

1

F10 topology | NSDI 2013Fat-tree topology | SIGCOMM 2008

Page 24: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

F10’s Hidden Advantage

24

F10’s wiring = two paths(via agg3 and agg4)

Standard wiring = one path(via agg3)

Pod 2

34

Pod 1

1Pod 2

34

Pod 1

Resulting Inter-Pod Connectivity Resulting Inter-Pod Connectivity

1

F10’s simple wiring change providesgreater path diversity, faster recovery from failure

Page 25: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

25

Oops.

Page 26: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

26

Oops.

Page 27: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

27

These late realizations occur all too often

Why?

Page 28: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

28

Architects make lots of decisions during design,every decision can have substantial impact

Page 29: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

29

But it’s difficult to evaluate if a decision is goodso architects are often blind to design defects

Architects make lots of decisions during design,every decision can have substantial impact

Page 30: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

30

examples:≥ 100 Mbps throughput 99.99% of the time

≤ 10 ms latency 99.99% of the time

One of architect’s primary concerns isService Level Objective (SLO) Compliance

Page 31: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

31

examples:≥ 100 Mbps throughput 99.99% of the time

≤ 10 ms latency 99.99% of the time

One of architect’s primary concerns isService Level Objective (SLO) Compliance

Tough to estimate and compare SLO compliance

Page 32: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Challenge: Estimating SLO compliance of two designs

32

Rack

Switch

...Rack Rack Rack

LC LC LC LC

LC LC

Switch

LC LC LC LC

LC LC

Switch

LC LC LC LC

LC LC

Switch

LC LC LC LC

LC LC

Example Topology:Bunch of racks interconnected with big switches

Page 33: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Switch

LC LC LC LC

LC LC

Rack

Challenge: Estimating SLO compliance of two designs

33

A

Option A:Connect each ToR to two linecards

in each fabric switch

Page 34: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Switch

LC LC LC LC

LC LC

Rack

Challenge: Estimating SLO compliance of two designs

34

A

Option A:Connect each ToR to two linecards

in each fabric switch

LC LC LC

LC LC

RackB

LC

Option B:Connect ToR to same linecard twice

in each fabric switch

Switch

Page 35: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Challenge: Estimating SLO compliance of two designs

35

Rack loses 1/8th of inter-rack throughput(1 out of 8 links lost)

Switch

LC LC LC LC

LC LC

RackA

Switch

LC LC LC

LC LC

RackB

LC

Rack loses 1/4th of inter-rack throughput(2 out of 8 links lost)

Linecard Failure Linecard Failure

Page 36: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Challenge: Estimating SLO compliance of two designs

36

Rack loses 1/8th of inter-rack throughput(1 out of 8 links lost)

Switch

LC LC LC LC

LC LC

RackA

LC LC LC

LC LC

RackB

LC

Rack loses 1/4th of inter-rack throughput(2 out of 8 links lost)

Linecard Failure Linecard Failure

Switch

Page 37: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Challenge: Estimating SLO compliance of two designs

37

Routing reconvergenceand packet loss

Switch

LC LC LC LC

LC LC

RackA

LC LC LC

LC LC

RackB

LC

Local failure handlingand no packet loss

Link Failure Link Failure

Switch

Page 38: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Challenge: Estimating SLO compliance of two designs

38

Routing reconvergenceand packet loss

Switch

LC LC LC LC

LC LC

RackA

LC LC LC

LC LC

RackB

LC

Local failure handlingand no packet loss

Link Failure Link Failure

Switch

Page 39: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Design A Design B

Line Card Failure

Link Failure

39

Is either design SLO compliant?Which design is better?

Page 40: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Design A Design B

Line Card Failure

Link Failure

40

Is either design SLO compliant?Which design is better?

Depends on SLO, failure / recovery rates, routing protocolscannot evaluate via simple calculations

Page 41: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

41

Depends on SLOs, failure and recovery rates, protocols

cannot evaluate via simple calculations

architects are often blind to design defects

Page 42: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

42

Page 43: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Challenge: Designing Expandable Topologies

43

10k serversToday

+10k serversDec. 2015

Large datacenters are expanded incrementallyResources added as needed to reduce cost, depreciation

+10k serversJune 2016

Page 44: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Challenge: Designing Expandable Topologies

44

Challenges of Incremental Expansions

Expansions often require rewiring existing connectionsRewiring operations must be performed on live network

Page 45: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Example: Expanding a Fat-tree Topology

45

Pod 1 Pod 2 Pod 3 Pod 4

1) Add new pods and spines

Redistributed link Added linkOriginal link

2) Redistribute links from existing pods across spines3) Connect new pods to spines

Page 46: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Example: Expanding a Fat-tree Topology

46

Is it possible to rewire this topologywithout impacting production traffic?

Page 47: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Example: Expanding a Fat-tree Topology

47

Just rewire one cable at a time!little risk, easy to plan, but slow

Page 48: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Example: Expanding a Fat-tree Topology

48

Rewire multiple cables at a time!faster, but more risk, difficult to plan

Page 49: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

49

Rewire multiple cables at a time!faster, but more risk, difficult to plan

Which design is better for SLO compliance?cannot evaluate via simple calculations

Architects need good tools to evaluate designs

Page 50: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

50

But today, evaluating a design’s utilityis a human-intensive process

Page 51: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

implement, debugcomplex algorithms

Today’s Design Cycle: Abstract to Concrete Model

AbstractDesign

51

ManualSynthesis

TopologyModel

Generating a concrete model of wiring requires

implementing algorithms

Page 52: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

eyeballingad-hoc calculations

Today’s Design Cycle: Concrete Model to Utility

52

ManualAnalysis

TopologyModel

Evaluating utility of a design is often done by

eyeballing and ad-hoc calculations

UtilityEvaluation

CostExpandability

SLO Compliance

Page 53: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Today’s Human-Intensive Design Cycle

53

Utility

This human-intensive design cycle

ManualAnalysis

Design

leads to slow, cursory evaluations

ModelManual

Synthesis

Page 54: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Slow, cursory evaluation = big problem

54

Page 55: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Slow, cursory evaluation = big problem

Because to find the best design architects need:- deep insight into design’s utility

- to be able to quickly explore variants...

55

Page 56: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

56

But if we had better tools

We could explore the design space more efficiently

Page 57: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

57

CondorEnables rapid exploration of the design space

Page 58: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Condor’s Design Cycle: Abstract to Concrete Model

Design

58

Model

write algorithms

ManualSynthesis

Page 59: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Condor’s Design Cycle: Abstract to Concrete Model

Design

59

Model

Topology Description Language

architects describe designs declaratively

write algorithms

ManualSynthesis

Page 60: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Condor’s Design Cycle: Abstract to Concrete Model

Design

60

Model

Topology Description Language

architects describe designs declaratively

Synthesis Engine

converts TDL into constraint satisfaction problem, builds model

write algorithms

ManualSynthesis

Page 61: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

ManualAnalysis

Condor’s Design Cycle: Concrete Model to Utility

61

Model Utility

CostExpandability

SLO Complianceeyeballing / ad-hoc

Page 62: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

ManualAnalysis

Condor’s Design Cycle: Concrete Model to Utility

62

Model Utility

CostExpandability

SLO Complianceeyeballing / ad-hoc

Analysis Engine

automated utility analysisagainst architect workload

Page 63: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Condor Enables Aggressive Exploration

63

Utility

Condor pipeline enables aggressive exploration of the design space

CondorDesign

> TDL declarative design

automated synthesis

automated analysis

Page 64: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

64

Designing a Fat-tree with Condor’s TDL

Page 65: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Describing Fat-tree Topology with Condor’s TDL

65

Step 1: Define hierarchical building blocksswitches, linecards, racks, and parent-child relationships between them

Spine

PodAgg

ToR

Fat-tree

Pod Pod

Hierarchical Building BlocksAbstract Design

Page 66: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

class FatTree extends TDLBuildingBlock: agg = new Switch10GbE(num_ports, "agg") tor = new Switch10GbE(num_ports, "tor")

pod = new TDLBuildingBlock("pod") pod.Contains(agg, num_sw_per_tier) pod.Contains(tor, num_sw_per_tier) ...

66

Fat-treebuilding

blockpodbuilding

block

Describing Fat-tree Building Blocks with Condor’s TDL(switches, linecards, racks, and the relationships between them)

agg/ToRbuildingblocks

PodToRs

Aggs

Page 67: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

spinebuildingblocks

class FatTree extends TDLBuildingBlock: ... spine = new Switch10GbE(num_ports, "spine")

Contains(spine, num_spines) Contains(pod, num_pods)

67

Fat-treebuilding

block

Describing Fat-tree Building Blocks with Condor’s TDL(switches, linecards, racks, and the relationships between them)

Fat-tree containsspines and pods

Fat-treeSpines

Pod Pod

Page 68: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Describing Fat-tree Topology with Condor’s TDL

68

Pod 1

Each pod contains:Servers

Top of Rack SwitchesAggregation Switches

Building blocks are naturally defined

Pod 2 Pod 3 Pod 4

agg = new Switch10GbE(...)tor = new Switch10GbE(...)

pod = new Block("pod")pod.Contains(agg, ...)pod.Contains(tor, ...)

Page 69: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Describing FatTree Connectivity with Condor’s TDL(constraints on connectivity between components)

69

Pod

Abstract Designin every pod, every agg connects to every ToR

Step 2: Define constraints on connectivity between building blocks

pod.ConnectPairsWithXLinks(agg, tor, 1) TDL Constraint

Page 70: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Describing FatTree Connectivity with Condor’s TDL(constraints on connectivity between components)

70

Abstract Design

Step 2: Define constraints on connectivity between building blocks

ConnectPairsWithXLinks(spine, pod, 1) TDL Constraint

Pod

in every fat-tree, every spine connects to every pod

Page 71: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

71

Done.

pod.ConnectPairsWithXLinks(agg, tor, 1)ConnectPairsWithXLinks(spine, pod, 1)

Page 72: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

72

Done.

pod.ConnectPairsWithXLinks(agg, tor, 1)ConnectPairsWithXLinks(spine, pod, 1)

Just two TDL constraints gives us a

fat-tree’s connectivity

Page 73: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Describing Fat-tree Topology with Condor’s TDL

73

Fat-tree in < 15 lines of Condor’s TDL> TDL

Condor’s Automated Synthesizer

Concrete fat-tree topologyModel

Page 74: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Benefit of Condor’s TDL Constraints

Single Design

74

Code for Manual Synthesis

If we had written code to build a fat-tree,it could only generate a single solution

Page 75: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Benefit of Condor’s TDL Constraints

Solution Space

75

Fat-tree inCondor TDL

But the constraints we defined with TDL canbe satisfied by multiple solutions

> TDL

Page 76: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Pod 4Pod 3Pod 2

Benefit of Condor’s TDL Constraints

76

Pod 1

The canonical wiring meets the constraints…

Page 77: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Pod 4Pod 3Pod 2

Benefit of Condor’s TDL Constraints

77

Pod 1

This wiring also meets the constraints…

Page 78: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

78

And F10 meets these constraints

To describe F10 with Condor’s TDL,we add constraints to reduce the solution space

Page 79: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Describing F10 with Condor’s TDL

79

Recall: F10 increases inter-pod path diversitye.g.: # of paths from agg1 to pod2

F10’s wiring = two paths(via agg3 and agg4)

Standard wiring = one path(via agg3)

Pod 2

34

Pod 1

1Pod 2

34

Pod 1

1

Page 80: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Describing F10 with Condor’s TDL

80

Let’s try using the following constraint:Connect every agg sw to as many other agg sw as possible

Recall: F10 increases inter-pod path diversitye.g.: # of paths from agg1 to pod2

F10’s wiring = two paths(via agg3 and agg4)

Standard wiring = one path(via agg3)

Pod 2

34

Pod 1

1Pod 2

34

Pod 1

1

Page 81: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

81

We used this constraint to try to build F10

Page 82: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

82

We used this constraint to try to build F10

But instead we ended up improving F10Condor found better solutions than F10’s algorithm

Page 83: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Limitations of F10’s Imperative Algorithm

83

Pod 2

43

Pod 1

21

Pod 3

65

Pattern A Pattern B Pattern A

F10 generates and assigns two patterns of wiring: A and BOnly improves connectivity between pods with different patterns

Page 84: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Limitations of F10’s Imperative Algorithm

84

Pod 2

43

Pod 1

21

Pod 3

65

Pattern A Pattern B Pattern A

Different wiring pattern improved path diversity

Page 85: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Limitations of F10’s Imperative Algorithm

85

Pod 2

43

Pod 1

21

Pod 3

65

Pattern A Pattern B Pattern A

Same wiring pattern no path diversity

Page 86: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

86

Path diversity is limited by the number of wiring patternsand F10’s algorithm only produces two patterns

Page 87: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

87

Path diversity is limited by the number of wiring patternsand F10’s algorithm only produces two patterns

Probability that a pair of pods has better path diversity is

limited to 50%

Page 88: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Improving F10 with Condor

88

Pod 2

43

Pod 1

21

Pod 3

65

Pattern A Pattern B Pattern C

Condor gives three wiring patterns = maximum path diversity for 3 pod fat-tree

Page 89: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

F10 Algorithm

Improving F10 with Condor

89

# of pods

# of patterns

probability ofpath diversity

2 2 50%

Condor’s Solution

# of pods

# of patterns

probability ofpath diversity

2 2 50%

And with our TDL constraint + Condor’s synthesizerpath diversity increases as the topology grows

Page 90: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

F10 Algorithm

Improving F10 with Condor

90

# of pods

# of patterns

probability ofpath diversity

2 2 50%

4 2 50%

Condor’s Solution

# of pods

# of patterns

probability ofpath diversity

2 2 50%

4 3 66%

And with our TDL constraint + Condor’s synthesizerpath diversity increases as the topology grows

Page 91: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

F10 Algorithm

Improving F10 with Condor

91

# of pods

# of patterns

probability ofpath diversity

2 2 50%

4 2 50%

16 2 50%

Condor’s Solution

# of pods

# of patterns

probability ofpath diversity

2 2 50%

4 3 66%

16 9 88%

And with our TDL constraint + Condor’s synthesizerpath diversity increases as the topology grows

Page 92: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

92

With Condor, we improved F10

by systematically exploring the design space

Page 93: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

93

Could we have found these wirings by writing an algorithm?

Page 94: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

94

Could we have found these wirings by writing an algorithm?

Very unlikely.These patterns are instances of Balanced Incomplete Block Design (BIBD)

Why does BIBD matter?1) Synthesis of BIBDs is known to be difficult

2) We didn’t know about BIBD until after finding these solutions3) Unlikely that architect alone could come up with these designs

Page 95: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

95

What Topologies Does Condor’s TDL Support?

TDL supports any topology that you can describe withbuilding blocks and connectivity constraints

Page 96: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

96

Condor’s TDL supports flattened-butterfly topologies

Page 97: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

97

BCubeDCell

TDL supports recursive and random-graph topologies

Jellyfish

Page 98: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

98

And Condor synthesizes topologies quickly…

Fat-tree with 128k hosts in < 2 minutesDCell with 360k hosts in < 6 minutes

Page 99: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

After we’ve described and built a designCondor can help us evaluate its utility

99

UtilityCondorautomated analysis

Model

Page 100: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

100

Design’s utility ≠ how it performs in worst-caseIn large networks, worst-case scenarios are unlikely

Page 101: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

101

Design’s utility ≠ how it performs in worst-caseIn large networks, worst-case scenarios are unlikely

Traditional worst-case metrics less usefulbisection bandwidth

potential for partitions

Page 102: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

102

Instead, to evaluate a design’s utility, we focus onSLO Compliance

Particularly if it can carry an application’s traffic(e.g.: ≥ 100 Mbps throughput 99.99% of the time)

Page 103: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

103

SLO Compliance is Evaluated over aTopology’s Lifecycle

Page 104: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

104

SLO Compliance is Evaluated over aTopology’s Lifecycle

During failures and expansionswill this topology continue to meet its SLO?

Page 105: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

105

Estimating SLO Compliance: Architect Inputs

Architect characterizes workload as traffic matrixes

Workload(Traffic Matrixes)

Page 106: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

106

Architect characterizes reliability as failure and recovery rates

Estimating SLO Compliance: Architect Inputs

Workload(Traffic Matrixes)

Reliability Info

+ MTBF + MTTR> TDL

Page 107: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

107

Architect characterizes expansions as rewiring / recabling plan

Estimating SLO Compliance: Architect Inputs

Workload(Traffic Matrixes)

Reliability Info

+ MTBF + MTTR> TDL

+Procedure for

adding / changing cabling

Rewiring Plan

Page 108: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Estimating SLO Compliance: Failures

108

TopologyStates

MTBF + MTTRTopologyModel

Simulation of Failures + Recoveries

TopologyStates

Reliability Info

During failureswill this topology continue to meet its SLO?

Page 109: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Estimating SLO Compliance: Failures

109

TopologyStates

Reliability InfoTopology

Model

Simulation of Failures + Recoveries

% of time traffic demand is satisfied

TopologyStates

Traffic MatrixesThroughput Simulation

Page 110: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

110

During expansionswill this topology continue to meet its SLO?

Almost identical process

Page 111: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Estimating SLO Compliance: Expansions

111

TopologyStates

Traffic Matrixes

TopologyModel

Simulation of Expansion Operations

% of time traffic demand is satisfied

TopologyStates

Throughput Simulation

Rewiring Plan

Page 112: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Evaluating Online Expandability with Condor

112

Use Condor to evaluate tree expansion Max size = ~50k servers

Each increment = +6k serversRewiring = 32 operations per increment

> TDL

Check if design remained SLO compliant

Simulate 225 stage expansion operation(difficult without Condor)

Page 113: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

113

In the Paper Example of ExpansionsExpandability of two designs with a minor variation in TDL

Option A> TDL

Not SLO Compliant Remains SLO Compliant

Option B> TDL

Page 114: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

114

Condor helped us evaluate SLO compliance over the lifecycle of these topologies

Page 115: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Conclusion: The Condor Approach

Condor replaces a human-intensive design processwith a systematic approach to topology design

Condor’s TDL makes it easier to design topologiesCondor’s synthesizer quickly models topologies from TDL

Condor’s analysis engine provides insight into a design’s utility

115

Page 116: Condor - Computer Science · Manual Analysis Condor’s Design Cycle: Concrete Model to Utility 62 Model Utility ... Describing Fat-tree Topology with Condor’s TDL 68 Pod 1 Each

Conclusion: The Condor Approach

Condor replaces a human-intensive design processwith a systematic approach to topology design

More Examples in Paper:Evaluating SLO compliance of different designs

Other trade-offs in expansions (e.g.: imbalance in routing protocols)

TDL Source Code:http://nsl.cs.usc.edu/condor

116