T-110.5116 Computer Networks II
Data center networks
1.12.2012
Matti Siekkinen

(Sources: S. Kandula et al.: “The Nature of Datacenter: measurements & analysis”, A. Greenberg: “Networking The Cloud”, M. Alizadeh et al.: “Data Center TCP (DCTCP)”, C. Kim: “VL2: A Scalable and Flexible Data Center Network”)


Outline

•  What are data center networks?
•  Layer 2 vs. Layer 3 in data center networks
•  Data center network architectures
•  TCP in data center networks
    –  Problems of basic TCP
    –  Data Center TCP (DCTCP)
•  Conclusions


What is a data center?

•  Contains servers and data
•  Has a network
    –  Connects servers together
•  Runs applications and services
    –  Internal and external
•  Centrally managed
•  Operated in a controlled environment
•  Can have very different sizes
    –  SME data center vs. Google


Applications and services

•  External facing
    –  Search, Mail, Shopping Carts, …
•  Internal to the company/institution
    –  E.g. ERP (Financial, HR, …)
•  Services internal to the data center
    –  Those necessary for the data center to work
        •  E.g. network operations (DNS, NFS, DHCP), backup
    –  Building blocks for external facing apps
        •  MapReduce, GFS, BigTable (Google), Dynamo (Amazon), Hadoop (Yahoo!), Dryad (Microsoft)
•  Often distributed


Multi-tier architecture

•  E.g. 3 tiers
    –  Front end servers: handle static requests
    –  Application servers: handle dynamic content
    –  Backend database servers: handle database transactions
•  Advantages
    –  Performance & scalability
    –  Security


What does it look like?

•  Servers in racks
    –  Racks contain commodity servers (blades)
    –  Servers connected to a Top-Of-Rack switch
    –  Traffic aggregated to the next level
•  Modular data centers
    –  Shipping containers full of racks

[Photo: inside a container, from Microsoft Chicago data center]


Large data center requires a lot of...

•  Some statistics
    –  Google: 450,000 servers in 2006, estimated over a million by now
    –  Microsoft is doubling the number of servers every 14 months

[Photos from Microsoft Chicago data center: Power, Cooling]


Cloud computing

•  Cloud computing
    –  Abstracts underlying resources from the service provided
    –  Abstraction on different levels: IaaS, PaaS, SaaS
•  Virtualization enables many of the cloud’s properties
    –  Elastic resource allocation
        •  Of course limited by the number of physical servers
        •  One user’s resources are limited by SLA, not by a single piece of hardware
    –  Efficient use of resources
        •  No need to run all servers at full speed all the time
        •  Client’s VMs can run on any physical server


Data center vs. Cloud

•  Data center is physical
    –  Physical infrastructure that runs services
•  Cloud is not physical
    –  Offers some service(s)
    –  Physical infrastructure is virtualized away
•  Cloud usually needs to be hosted in a data center
    –  Depends on scale
•  Data center does not need to host cloud services
•  Private cloud vs. own data center
    –  Not the same thing


Cloud DC vs. Enterprise DC

•  Traditional enterprise DC: IT staff cost dominates
    –  Human to server ratio: 1:100
    –  Less automation in management
    –  Scale up: a few high priced servers
    –  Cost borne by the enterprise
        •  Utilization is not critical
•  Cloud service DC: other costs dominate
    –  Human to server ratio: 1:1000
    –  Automation is more crucial
    –  Distributed workload, spread out on lots of commodity servers
    –  High upfront cost amortized over time and use
    –  Pay per use for customers
        •  Utilization is critical


What is a data center network (DCN)?

•  Enables communication within DC
    –  Among the different servers
•  In practice
    –  HW: switches, routers, and cabling
    –  SW: communication protocols (layers 2-4)
•  Principles evolved from enterprise networks


What is a data center network (DCN)?

•  Both layers 2 (link) and 3 (network) present
    –  Not only L3 routers but also L2 switches
    –  Layer 2 subnets connected with layer 3
•  Layer 4 (transport) needed, as in any packet network
•  Note: does not have to be TCP/IP!
    –  Not part of the routed Internet
        •  A DC server’s address cannot be resolved directly from the Internet; only the front end servers are reachable
    –  But often is TCP/IP…

[Figure: Internet protocol stack, top to bottom: email, WWW, phone, ... / SMTP, HTTP, SIP, ... / TCP, UDP, … / IP / Eth, PPP, WiFi, 3GPP, … / copper, fiber, radio, OFDM, FHSS, ...]


What makes DCNs special?

•  Just plug all servers into an edge router and be done with it?
    –  Several issues with this approach
•  Scaling up capacity
    –  Lots of servers need lots of switch ports
    –  E.g.: State of the art Cisco Nexus 7000 modular data center switch (L2 and L3) supports max. 768 1/10GE ports
•  Switch capacity and price
    –  Price goes up with the number of ports
    –  E.g.: List price for 768 ports with 10GE modules somewhere beyond $1M
    –  Buying lots of commodity switches is an attractive option
•  Potentially the majority of traffic stays within DC
    –  Server to server


What makes DCNs special? (cont.)

•  Requirements different from Internet applications
    –  Large amounts of bandwidth
    –  Very, very short delays
    –  Still, often Internet protocols (TCP/IP) used
•  Management requirements
    –  Incremental expansion
    –  Should be able to withstand server failures, link outages, server rack failures
        •  Under failures, performance should degrade gracefully
•  Requirements due to expenses
    –  Cost-effectiveness; high throughput per dollar
    –  Power efficiency

⇒ DCN topology and equipment matter a lot


Data Center Costs

Amortized Cost* | Component            | Sub-Components
~45%            | Servers              | CPU, memory, disk
~25%            | Power infrastructure | UPS, cooling, power distribution
~15%            | Power draw           | Electrical utility costs
~15%            | Network              | Switches, links, transit

*3 yr amortization for servers, 15 yr for infrastructure; 5% cost of money

•  Total cost varies
    –  Upwards of $1/4 B for mega data center
•  Server costs dominate
    –  Network costs also significant

⇒ Network should allow high utilization of servers

Source: Greenberg et al. The Cost of a Cloud: Research Problems in Data Center Networks. Sigcomm CCR 2009.


Switch vs. router: What’s the difference?

•  Switch is a layer 2 device
    –  Does not understand IP protocol
    –  Does not run any routing protocol
•  Router is a layer 3 device
    –  “Speaks” IP protocol
    –  Runs routing protocols to determine shortest paths
        •  OSPF, RIP, etc.
•  Terminology not so clear
    –  L2/3, i.e. multi-layer switches


Switch vs. router: Difference in basic functioning

•  Router
    –  Forwards packets based on destination IP address
        •  Prefix lookup against routing tables
    –  Routing tables built and maintained by routing algorithms and protocols
        •  Protocols exchange information about paths to known destinations
        •  Algorithms compute shortest paths based on this information
    –  Broadcast sending usually not allowed
•  Switch
    –  Forwards frames (packets) based on destination MAC address
    –  Uses switch table
        •  Equivalent to routing table in router
    –  Broadcast sending is common
    –  How is switch table built and maintained since there is no routing protocol?
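
The router's prefix lookup can be illustrated with a minimal longest-prefix-match sketch. The routing table entries and next-hop names below are hypothetical, and a real router would use a trie or TCAM rather than a linear scan.

```python
import ipaddress

# Hypothetical routing table: (prefix, next hop). Values are made up for illustration.
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "border"),
    (ipaddress.ip_network("10.0.0.0/8"), "core-1"),
    (ipaddress.ip_network("10.1.0.0/16"), "agg-3"),
    (ipaddress.ip_network("10.1.2.0/24"), "tor-7"),
]


def lookup(dst):
    """Return the next hop of the longest prefix that contains dst."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, hop) for net, hop in ROUTES if addr in net]
    # The most specific (longest) matching prefix wins.
    _, best_hop = max(matches, key=lambda m: m[0].prefixlen)
    return best_hop
```

With this table, `lookup("10.1.2.5")` selects the /24 entry even though the /16, /8, and default routes also match.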


Switch is self learning

•  When frame is received from one port
    –  Switch learns that sender is behind that port
    –  Switch adds that information to switch table
    –  Soft state: forget after a while
•  If destination not (yet) known
    –  Flood to all other ports
•  Flooding can lead to forwarding loops
    –  Switches connected in cyclic manner
    –  These loops can create broadcast storms
•  Spanning tree protocol (STP) used to avoid loops
    –  Generates loop-free topology
    –  Avoid using some ports when flooding
    –  Rapid Spanning Tree Protocol (RSTP)
        •  Faster convergence after a topology change
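
The self-learning behaviour above can be sketched in a few lines. This is an illustrative model only: the class name, the `ttl` default, and the explicit `now` parameter are assumptions made for the example, and STP is not modelled.

```python
import time


class LearningSwitch:
    def __init__(self, ports, ttl=300.0):
        self.ports = set(ports)
        self.ttl = ttl      # soft-state lifetime in seconds (assumed value)
        self.table = {}     # MAC address -> (port, time learned)

    def receive(self, src, dst, in_port, now=None):
        """Learn the sender's port, then return the set of output ports."""
        now = time.monotonic() if now is None else now
        self.table[src] = (in_port, now)          # sender is behind in_port
        entry = self.table.get(dst)
        if entry is not None and now - entry[1] <= self.ttl:
            out_port = entry[0]
            # Never send a frame back out of the port it arrived on.
            return set() if out_port == in_port else {out_port}
        self.table.pop(dst, None)                 # unknown or expired entry
        return self.ports - {in_port}             # flood to all other ports
```

A frame to an unknown destination is flooded; once the destination has sent a frame itself, later frames go to a single port until the entry ages out.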

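[Figure: forwarding loop example with hosts AA, BB, CC, DD, two hubs, and switch ports 1 and 2. A frame <Src=AA, Dest=DD> is flooded around the cycle, and the switch table entries for AA alternate between port 1 and port 2. And so on… No TTL in L2 headers!]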


Layer 2 vs. Layer 3 in DCN

•  Management
    –  L2 close to plug-and-play
    –  L3 usually requires some manual configuration (subnet mask, DHCP)
•  Scalability and performance
    –  L2 broadcasting and STP scale poorly
    –  L2 forwarding less scalable than L3 forwarding
        •  L2 based on flat MAC addresses
        •  L3 based on hierarchical IP addresses (prefix lookup)
    –  L2 has no load balancing over multiple paths, unlike L3
    –  L2 loops may still happen in practice, even with STP


Layer 2 vs. Layer 3 in DCN

•  Flexibility
    –  VM migration may require change of IP address in L3 network
        •  Need to conform to subnet address
    –  L2 network allows any IP address for any server
•  Some reasons may prevent using pure L3 design
    –  Some servers may need L2 adjacency
        •  Servers performing the same functions (load balancing, redundancy)
        •  Heartbeat or application packets may not be routable
    –  Dual homed servers may need to be on same L2 domain
        •  Connected to two different access switches
        •  Some configurations require both primary and secondary to be in same L2 domain
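
The subnet constraint on VM migration can be made concrete with a small check. The function name and the addresses are hypothetical; the sketch only tests whether a VM's current IP address already conforms to the subnet at its destination.

```python
import ipaddress


def can_keep_address(vm_ip, destination_subnet):
    """True if the VM's IP address already lies inside the destination subnet."""
    return ipaddress.ip_address(vm_ip) in ipaddress.ip_network(destination_subnet)


# Migrating within the same routed subnet: the address can stay.
same_subnet = can_keep_address("10.1.2.17", "10.1.2.0/24")
# Migrating to a rack in a different routed subnet: the address must change.
other_subnet = can_keep_address("10.1.2.17", "10.1.3.0/24")
```

In a flat L2 domain the whole network behaves like one large subnet, so the check always passes and the VM keeps its address.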


VLAN

•  VLAN = Virtual Local Area Network
•  Some servers may need to belong to the same L2 broadcast domain
    –  See previous slide…
•  VLANs overcome limitations of physical topology
    –  E.g. running out of switch ports
•  VLAN allows flexible growth while maintaining layer 2 adjacency
    –  L2 domain across routers
•  VLAN can be port-based or MAC address-based


Port-based VLAN

•  Traffic isolation
    –  Frames to/from ports 1-8 can only reach ports 1-8
    –  Can also define VLAN based on MAC addresses of endpoints, rather than switch port
•  Dynamic membership
    –  Ports can be dynamically assigned among VLANs
•  Forwarding between VLANs done via routing

[Figure: switch with ports 1-16 partitioned into VLAN1 (ports 1-8) and VLAN2 (ports 9-15), with a router attached]
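
Port-based isolation can be sketched as a simple port-to-VLAN map. The assignment follows the figure (ports 1-8 in VLAN1, ports 9-15 in VLAN2); the function names are illustrative and not taken from any switch API.

```python
# Port-to-VLAN assignment from the figure: ports 1-8 in VLAN1, ports 9-15 in VLAN2.
PORT_VLAN = {port: 1 for port in range(1, 9)}
PORT_VLAN.update({port: 2 for port in range(9, 16)})


def flood_ports(in_port):
    """Ports that may receive a broadcast frame arriving on in_port."""
    vlan = PORT_VLAN[in_port]
    return sorted(p for p, v in PORT_VLAN.items() if v == vlan and p != in_port)


def needs_router(port_a, port_b):
    """Traffic between ports in different VLANs must go through a router."""
    return PORT_VLAN[port_a] != PORT_VLAN[port_b]
```

Dynamic membership then amounts to updating an entry in `PORT_VLAN`; the forwarding logic itself does not change.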


VLANs spanning multiple switches

•  VLANs can span over multiple switches
•  Also over different routed subnets
    –  Routers in between

[Figure: two interconnected switches. First switch: VLAN1 (ports 1-8), VLAN2 (ports 9-15). Second switch: ports 2, 3, 5 belong to VLAN1; ports 4, 6, 7, 8 belong to VLAN2]


Design Alternatives for DCN

Two high level choices for interconnections:

•  Specialized hardware and communication protocols
    –  E.g. Infiniband seems common
    –  Advantages
        •  Can provide high bandwidth & extremely low latency
        •  Custom hardware takes care of some reliability tasks
        •  Relatively low power physical layer
    –  Drawbacks
        •  Expensive
        •  Not natively compatible with TCP/IP applications
•  Commodity (1/10 Gb) Ethernet switches and routers
    –  Compatible
    –  Cheaper
    –  We focus on this


Conventional DCN architecture

•  Topology: Two- or three-level trees of switches or routers
    –  Multipath routing
    –  High bandwidth by appropriate interconnection of many commodity switches
    –  Redundancy

[Figure: tree topology, top to bottom: Internet, Layer-3 router, Layer-2/3 aggregation switches, Layer-2 Top-Of-Rack access switches, Servers]


Issues with conventional architecture

•  Bandwidth oversubscription
    –  Total bandwidth at core/aggregate level less than summed up bandwidth at access level
    –  Limited server to server capacity
    –  Application designers need to be aware of limitations
•  No performance isolation
    –  VLANs typically provide reachability isolation only
    –  One server (service) sending/receiving too much traffic hurts all servers sharing its subtree
•  There are more…
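
A short calculation shows what oversubscription means for a single rack. The rack size and link speeds are hypothetical numbers chosen for illustration, not measurements from the slides.

```python
def oversubscription(servers_per_rack, server_gbps, uplinks, uplink_gbps):
    """Ratio of bandwidth offered by the servers to the uplink bandwidth of a ToR switch."""
    return (servers_per_rack * server_gbps) / (uplinks * uplink_gbps)


# Hypothetical rack: 40 servers at 1 Gb/s behind 2 x 10 Gb/s uplinks.
ratio = oversubscription(40, 1, 2, 10)     # 2.0, i.e. 2:1 oversubscribed
# Per-server bandwidth (Gb/s) if every server sends out of the rack at once.
worst_case_per_server = 1 / ratio
```

Ratios multiply across levels of the tree, so server-to-server capacity between distant racks can be far lower than the access link speed.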


One solution to oversubscription

•  FAT Tree topology with special look-up scheme
    –  Add more commodity switches
        •  Carefully designed topology
        •  All ports have same capacity as servers
    –  Enables
        •  Full bisection bandwidth
        •  Lower cost because all switch ports have same capacity
    –  Drawbacks
        •  Need customized switches
            –  Special two level look-up scheme to distribute traffic
        •  Lot of cabling

[Figure: FAT Tree with core switches, aggregation switches, and edge switches]

M. Al-Fares et al. A Scalable, Commodity Data Center Network Architecture. In SIGCOMM 2008.
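
The size of such a topology follows from the switch port count. The sketch below uses the k-ary fat-tree construction from the Al-Fares et al. paper (k pods, each with k/2 edge and k/2 aggregation switches, and (k/2)² core switches); the function name and dictionary keys are illustrative.

```python
def fat_tree(k):
    """Element counts for a fat tree built from identical k-port switches."""
    if k % 2:
        raise ValueError("k must be even")
    half = k // 2
    return {
        "pods": k,
        "edge": k * half,           # k/2 edge switches per pod
        "aggregation": k * half,    # k/2 aggregation switches per pod
        "core": half * half,        # (k/2)^2 core switches
        "hosts": k * half * half,   # k/2 hosts per edge switch, k^3/4 in total
    }


small = fat_tree(4)     # the classic 16-host example
large = fat_tree(48)    # commodity 48-port switches
```

The large switch count for `fat_tree(48)` also illustrates the "lot of cabling" drawback.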


One solution to performance isolation: VLB

•  Random flow spreading with Valiant Load Balancing (VLB)
    –  Similar FAT Tree topology with commodity switches
    –  Every flow “bounced” off a random intermediate switch
    –  Provably hotspot free for any admissible traffic matrix
    –  No need to modify switches (std forwarding)
        •  Relies on ECMP and clever addressing
    –  Requires some changes to servers

[Figure: VLB topology with 10G links. D/2 intermediate node switches with D ports each; D aggregation switches with D/2 ports up and D/2 ports down; Top Of Rack switches with 20 server ports each; [D²/4] * 20 servers in total]

A. Greenberg et al. VL2: A Scalable and Flexible Data Center Network. In SIGCOMM, 2009.
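
A rough sketch of per-flow spreading and of the server count from the figure. The hash-based choice below is a stand-in for the random selection that ECMP performs in hardware, and the flow tuple, switch names, and function names are assumptions made for the example.

```python
import hashlib


def pick_intermediate(flow, intermediates):
    """Map a flow (e.g. its 5-tuple) to one intermediate switch, stable per flow."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return intermediates[int.from_bytes(digest[:4], "big") % len(intermediates)]


def server_count(d, servers_per_tor=20):
    """Servers supported with D-port switches, per the figure: (D^2 / 4) * 20."""
    return (d * d // 4) * servers_per_tor


switches = ["int-0", "int-1", "int-2"]           # hypothetical intermediate switches
flow = ("10.0.0.1", "10.0.1.2", 6, 40000, 80)    # src, dst, protocol, sport, dport
chosen = pick_intermediate(flow, switches)
```

Keeping the choice stable per flow avoids packet reordering within a flow, while different flows spread across all intermediate switches.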


DCN architectures in research

•  Lots of alternative architectures proposed in recent years
•  Goals
    –  Overcome limitations of typical architectures of today
    –  Use commodity standard equipment
•  VL2 & Monsoon & CamCube (MSR)
•  PortLand (UCSD)
•  DCell & BCube (MSR, Tsinghua, UCLA)
•  …


TCP in the Data Center

•  TCP rules as transport inside DC
    –  99.9% of traffic
•  DCNs are a different environment for TCP compared to normal Internet e2e transport
    –  Very short delays
    –  Specific application workloads
•  How well does TCP work in DCNs?
    –  Several problems…


Partition/Aggregate Application Structure

•  The foundation for many large-scale web applications
    –  Web search, Social network composition, Ad selection, etc.
•  Time is money -> strict deadlines
•  Missed deadline means lower quality result

[Figure: partition/aggregate tree from the Internet down to the worker nodes, with deadlines of 250ms, 50ms, and 10ms at successively lower levels]


Partition/Aggregate Application Structure (cont.)

•  Deadlines at lower levels of the hierarchy must fit within the overall (all-up) deadline
•  Iterative requests common
    –  1-4 iterations typical
    –  Workers have tight deadlines
•  99.9th percentiles of delay matter for companies
    –  1 out of 1000 responses
    –  Can potentially impact large number of customers


Workloads

•  Query-response traffic
    –  Partition/Aggregate
    –  Part of the “mice” flows
•  Background traffic
    –  Short messages [50KB-1MB]
        •  Coordination, control state
        •  Part of the “mice” flows
    –  Large flows [1MB-50MB]
        •  Updating data on each server
        •  The “elephant” flows
•  Problem:
    –  All this traffic goes through same switches
    –  Requirements are conflicting
        •  “Mice” flows require minimal delay
        •  “Elephant” flows require high throughput


Traffic patterns from one cluster of Microsoft’s DCN

•  Traffic exchanged between server pairs in 10s period
•  Servers within a rack are adjacent on axis
•  Work-Seeks-Bandwidth (W-S-B)
    –  Small squares around diagonal
•  Scatter-Gather (S-G)
    –  Horizontal and vertical lines

[Figure: heat map of ln(Bytes) exchanged per 10s between server pairs]


Traffic patterns from one cluster of Microsoft’s DCN (cont.)

•  Work-seeks-bandwidth
    –  Need to make efforts to place jobs under the same ToR
•  Scatter-gather patterns
    –  Server pushes/pulls data to/from many servers across the cluster
    –  Distributed query processing: map, reduce
        •  Data divided into small parts
        •  Each server works on a particular part
        •  Answers aggregated
    –  Need for inter-ToR communication
•  Computation constrained by the network


DCN characteristics

•  Network characteristics
    –  Large aggregate bandwidths
    –  Very short round trip times (<1ms)
•  Typical switches
    –  Use large numbers of commodity switches
    –  Typically commodity switch has shared memory
        •  Common memory pool for all ports
    –  Why not separated memory spaces?
        •  Cost issue for commodity switches


Resulting problems with TCP in DCN

•  Incast

•  Queue Buildup

•  Buffer Pressure


Problems: Incast

•  Synchronized mice collide
    –  Caused by Partition/Aggregate

[Figure: Workers 1-4 sending responses to the Aggregator at the same time]


Incast

•  What happens next?
    –  TCP timeout
    –  Default minimum values of timeout 200-400ms depending on OS
•  Why is that a major problem?
    –  Several orders of magnitude longer than RTT -> huge penalty
    –  Fail to meet deadlines at all levels
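
The size of the penalty is easy to quantify. In the sketch below the RTT value is an assumption (the slides only state that RTTs are below 1 ms); the timeout is taken from the 200-400 ms range above and the worker deadline from the Partition/Aggregate figure.

```python
rtt_ms = 0.25                # assumed intra-DC round trip time (slides say < 1 ms)
rto_min_ms = 300.0           # minimum retransmission timeout, within the 200-400 ms range
worker_deadline_ms = 10.0    # worker deadline from the Partition/Aggregate figure

# Number of round trips the sender sits idle while waiting for the timeout.
penalty_in_rtts = rto_min_ms / rtt_ms
# A single timeout already exceeds the deadline of the lowest level.
misses_deadline = rtt_ms + rto_min_ms > worker_deadline_ms
```

Under these assumptions one lost response costs about 1200 round trips and exceeds even the 250 ms top-level deadline.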


Problems: Incast

[Figure: Workers 1-4 and the Aggregator; a lost response causes a TCP timeout, RTOmin = 300 ms]


Problems: Queue Buildup

•  Remember the different workloads
    –  Small “mice” flows
    –  Large “elephant” flows
•  Large flows can eat up the shared buffer space
    –  Same outgoing port
•  Result is similar to that of incast


Problems: Queue Buildup

•  Big flows build up queues
    –  Increased latency for short flows
    –  Packet loss

[Figure: Sender 1 and Sender 2 sending to one Receiver]


Problems: Buffer pressure

•  Kind of generalization of the previous problem
•  Increased queuing delay and packet loss due to long flows traversing other ports
    –  Shared memory pool
    –  Packets arriving on and leaving through different ports still consume the common buffer space
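
Buffer pressure can be illustrated with a toy model of a shared-memory switch. The class, the capacity in packets, and the port names are hypothetical; real switches use more elaborate dynamic thresholds.

```python
class SharedBufferSwitch:
    def __init__(self, capacity):
        self.capacity = capacity    # total packets the shared pool can hold
        self.queues = {}            # output port -> number of queued packets

    def used(self):
        return sum(self.queues.values())

    def enqueue(self, port, packets):
        """Queue up to `packets` on `port`; return how many had to be dropped."""
        accepted = min(packets, self.capacity - self.used())
        self.queues[port] = self.queues.get(port, 0) + accepted
        return packets - accepted


sw = SharedBufferSwitch(capacity=100)
elephant_drops = sw.enqueue("port-1", 90)   # long flow fills most of the pool
mice_drops = sw.enqueue("port-2", 30)       # burst on a different port loses packets
```

The short burst on `port-2` loses packets although its own queue was empty, because the long flow on `port-1` had already consumed most of the common pool.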


Data Center Transport Requirements

1.  High Burst Tolerance
    –  Cope with the Incast problem
2.  Low Latency
    –  Short flows, queries
3.  High Throughput
    –  Continuous data updates, large file transfers

We want to achieve all three at the same time


Exploring the solution space

Proposal | Throughput | Burst tolerance (Incast) | Latency
Deep switch buffers | Can achieve high throughput | Tolerates large bursts | Queuing delays increase latency
Shallow buffers | Can hurt throughput of elephant flows | Cannot tolerate bursts well | Avoids long queuing delay
Jittering | No major impact | Prevents Incast | Increases median latency
Shorter RTOmin | No major impact | Helps recover faster | Doesn’t help queue buildup
Nw assisted congestion ctrl (ECN style) | High throughput with high utilization | Helps in most cases; problem if only 1 pkt is too much | Reacts early to queue buildup


Jittering

•  Add random delay before responding
    –  Desynchronize the responding sources to avoid buffer overflow
•  Jittering trades off median against high percentiles

[Figure: MLA Query Completion Time (ms), jittering off vs. jittering on. Requests are jittered over 10ms window]
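
Jittering itself is simple to sketch: each worker waits a random time within the window before responding. The function name, the uniform distribution, and the `seed` parameter are assumptions for the example; only the 10 ms window comes from the slide.

```python
import random


def jittered_send_times(n_workers, window_ms=10.0, seed=None):
    """Response start times (ms) when each worker adds a random delay within the window."""
    rng = random.Random(seed)
    return sorted(rng.uniform(0.0, window_ms) for _ in range(n_workers))


times = jittered_send_times(40, window_ms=10.0, seed=1)
mean_added_delay = sum(times) / len(times)
```

Without jitter all 40 responses would start at time 0 and collide at the switch; with jitter they are spread over the window, at the cost of an added delay of roughly half the window on average, which is why the median completion time increases.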


Exploring the solution space (cont.)

Proposal | Throughput | Burst tolerance (Incast) | Latency
Deep switch buffers | Can achieve high throughput | Tolerates large bursts | Queuing delays increase latency
Shallow buffers | Can hurt throughput of elephant flows | Cannot tolerate bursts well | Avoids long queuing delay
Jittering | No major impact | Prevents Incast | Increases median latency
Shorter RTOmin | Improves throughput | Helps recover faster | Doesn’t help queue buildup
Nw assisted congestion ctrl (ECN style) | High throughput with high utilization | Helps in most cases; problem if only 1 pkt is too much | Reacts early to queue buildup


Review: TCP with ECN

•  ECN = Explicit Congestion Notification
•  Q: How do TCP senders react?
    –  A: Cut sending rate by half

[Figure: Sender 1 and Sender 2 sending to the Receiver; packets carry an ECN mark (1 bit)]


DCTCP: Two key ideas

1.  React in proportion to the extent of congestion, not just its presence
–  Reduces variance in sending rates, lowering queuing requirements

2.  Mark based on instantaneous queue length
–  Fast feedback to better deal with bursts

Reaction to the ECN marks seen in one window, TCP vs. DCTCP:

•  ECN marks 1 0 1 1 1 1 0 1 1 1
–  TCP: cut window by 50%
–  DCTCP: cut window by 40%
•  ECN marks 0 0 0 0 0 0 0 0 0 1
–  TCP: cut window by 50%
–  DCTCP: cut window by 5%

Q: Why does normal TCP with ECN not behave like DCTCP?
A: Fairness…
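The two example mark sequences can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the DCTCP cut is simply half of the marked fraction in the window (no averaging over RTTs); the function names are illustrative and not taken from any implementation.

```python
def ecn_fraction(marks):
    """Fraction of packets in one window that carried an ECN mark."""
    return sum(marks) / len(marks)


def tcp_cut(marks):
    """TCP with ECN: halve the window if any packet was marked."""
    return 0.5 if any(marks) else 0.0


def dctcp_cut(marks):
    """DCTCP-style cut (simplified): half of the marked fraction."""
    return ecn_fraction(marks) / 2


heavy = [1, 0, 1, 1, 1, 1, 0, 1, 1, 1]  # 8 of 10 packets marked
light = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # 1 of 10 packets marked

for marks in (heavy, light):
    print(f"TCP cut {tcp_cut(marks):.0%}, DCTCP cut {dctcp_cut(marks):.0%}")
```

With 8 of 10 packets marked the DCTCP cut is 40%, with 1 of 10 it is 5%, while TCP cuts by 50% in both cases.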

Data Center TCP Algorithm

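•  Switch side:
–  Mark packets when queue length > K

•  Sender side:
–  Maintain a moving average of the fraction of packets marked (α)
–  In each RTT: F = (number of marked ACKs) / (total number of ACKs), then α ← (1 − g)·α + g·F, where 0 < g < 1 is the averaging weight

•  Adaptive window decrease: Cwnd ← (1 − α/2)·Cwnd
–  Note: the window is divided by a factor between 1 (α = 0) and 2 (α = 1)

[Figure: switch queue with threshold K; mark above K, don’t mark below]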
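The switch rule and the per-RTT sender update can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the weight g = 1/16, the additive increase of one packet per unmarked RTT, and the ACK trace are all assumed values chosen for the example.

```python
def switch_marks(queue_len_pkts, k_pkts):
    """Switch side: mark a packet when the instantaneous queue exceeds K."""
    return queue_len_pkts > k_pkts


def update_alpha(alpha, marked, total, g=1 / 16):
    """Sender side, once per RTT: moving average of the marked fraction.

    g is an assumed averaging weight; the slide does not give a value.
    """
    fraction = marked / total if total else 0.0
    return (1 - g) * alpha + g * fraction


def next_cwnd(cwnd, alpha, marked):
    """Cut the window by alpha/2 when marks arrived in this RTT.

    The +1 growth in unmarked RTTs is an assumption (plain additive increase).
    """
    if marked > 0:
        return max(1.0, cwnd * (1 - alpha / 2))
    return cwnd + 1.0


# Hypothetical trace: (marked ACKs, total ACKs) seen in successive RTTs.
alpha, cwnd = 0.0, 20.0
for marked, total in [(0, 20), (5, 20), (20, 20), (20, 20), (0, 20)]:
    alpha = update_alpha(alpha, marked, total)
    cwnd = next_cwnd(cwnd, alpha, marked)
    print(f"marked={marked:2d}/{total}  alpha={alpha:.3f}  cwnd={cwnd:.2f}")
```

Because α is averaged over several RTTs, a single heavily marked RTT produces only a small cut; the cut approaches 50% only when marking persists.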

DCTCP in Action

[Figure: plot with an axis in KBytes]

Why does DCTCP work?

•  High Burst Tolerance
–  Aggressive marking → sources react before packets are dropped
–  Large buffer headroom → bursts fit

•  Low Latency
–  Small buffer occupancies → low queuing delay

•  High Throughput
–  ECN averaging → smooth rate adjustments, low variance
–  Leads to high utilization
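The link between buffer occupancy and latency is simple arithmetic: a packet arriving at a queue waits for the bytes ahead of it to drain at the link rate. The sketch below assumes a 1 Gbps link and two hypothetical occupancies (30 KB and 300 KB); none of these numbers come from the slides.

```python
def queuing_delay_us(queue_bytes, link_bps):
    """Time to drain queue_bytes at link_bps, in microseconds."""
    return queue_bytes * 8 * 1_000_000 / link_bps


LINK_BPS = 1_000_000_000  # assumed 1 Gbps link

for occupancy_bytes in (30_000, 300_000):  # hypothetical queue occupancies
    delay = queuing_delay_us(occupancy_bytes, LINK_BPS)
    print(f"{occupancy_bytes} bytes queued -> {delay:.0f} us of queuing delay")
```

A tenfold smaller standing queue gives a tenfold smaller queuing delay, which is why keeping occupancy near the marking threshold matters for short flows.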

Completely solves the Incast problem?

•  Remember Incast: a large number of synchronized small flows hit the same queue

•  Depends on the number of small flows
–  Does not help if the number is so high that even 1 packet from each flow is sufficient to overwhelm the buffer on a synchronized burst
     •  No congestion control helps
     •  Only solution is to somehow schedule responses (e.g. jittering)

•  Helps if each flow has several packets to transmit
–  Windows build up over multiple RTTs
–  Bursts in subsequent RTTs would lead to packet drops
–  DCTCP sources receive enough ECN feedback to prevent buffer overflows
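The limiting case above can be made concrete: if every flow sends just one packet at the same instant, the burst is n packets regardless of any congestion control. The sketch below assumes 1500-byte packets and a hypothetical 100 KB port buffer; both values are illustrative only.

```python
def min_burst_bytes(n_flows, pkt_bytes):
    """Smallest possible synchronized burst: one packet from every flow."""
    return n_flows * pkt_bytes


def overflows_with_one_packet_each(n_flows, pkt_bytes, buffer_bytes):
    """True if the buffer overflows even at a window of one packet per flow."""
    return min_burst_bytes(n_flows, pkt_bytes) > buffer_bytes


def max_flows_without_loss(pkt_bytes, buffer_bytes):
    """Largest number of synchronized flows whose first packets all fit."""
    return buffer_bytes // pkt_bytes


PKT_BYTES = 1500        # assumed packet size
BUFFER_BYTES = 100_000  # hypothetical buffer available to the port

for n in (40, 100):
    lost = overflows_with_one_packet_each(n, PKT_BYTES, BUFFER_BYTES)
    print(f"{n} flows: overflow even at 1 packet each = {lost}")
print("flows that fit:", max_flows_without_loss(PKT_BYTES, BUFFER_BYTES))
```

Beyond that flow count, only spreading the responses in time (for example by jittering) avoids drops, since no sender can transmit less than one packet.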

Comparing TCP and DCTCP

•  Emulate traffic within 1 rack of a Bing cluster
–  45 servers with 1 Gbps links, one 10 Gbps server for external traffic

•  Generate query and background traffic
–  Flow sizes and arrival times follow distributions seen in Bing

•  Metric:
–  Flow completion time for queries and background flows

•  RTOmin = 10 ms for both TCP and DCTCP
–  More than fair comparison

Comparing TCP and DCTCP (cont.)

[Figure panels: Background Flows, Query Flows]

•  Low latency for short flows
•  High throughput for long flows
•  High burst tolerance for query flows

DCTCP summary

•  DCTCP
–  Handles bursts well
–  Keeps queuing delays low
–  Achieves high throughput

•  Features:
–  Simple change to TCP and a single switch parameter
–  Based on existing mechanisms

TCP for DCN research

•  Data transport in DCN has received attention recently
•  Several solutions proposed just this year
–  Deadline-Aware Datacenter TCP (D2TCP) (Purdue, Google)
–  DeTail (cross layer solution) (Berkeley, Facebook)
–  …

Wrapping up

•  Data center networks pose specific networking challenges
–  Potentially huge scale
–  Different requirements from those of traditional Internet applications

•  Recently a lot of research activity
–  New proposed architectures and protocols
–  Big deal to companies with mega-scale data centers: $$

•  Popularity of cloud computing accelerates this evolution
