15: Datacenter Design and Networking
Zubair Nabi
April 21, 2013
Outline
1 Datacenter Topologies
2 Transport Protocols
3 Network Sharing
4 Wrapping Up
Datacenter Topologies
Introduction
Datacenters are traditionally designed in the form of a two- or three-level tree
Switching elements become more specialized and faster as we go up the tree structure
A three-level tree has core switches at the root, aggregation switches in the middle, and edge switches at the leaves of the tree
Edge switches have a large number of 1 Gbps ports and a small number of 10 Gbps ports
  - The 1 Gbps ports connect end-hosts, while the 10 Gbps ports connect to aggregation switches
Aggregation and core switches have 10 Gbps ports
The network partitions if switches further up the tree go down
Oversubscription
Oversubscription is the ratio of the worst-case achievable aggregate bandwidth among end-hosts to the total bisection bandwidth of the topology
Ideal value of 1:1 – all hosts may potentially communicate with any other host at the full bandwidth of their interface (1 Gbps here)
5:1 – only 20% of the host bandwidth is available (200 Mbps)
Typical datacenter designs are oversubscribed by a factor of 2.5:1 (400 Mbps) to 8:1 (125 Mbps); see the sketch below
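The ratios above map directly to worst-case per-host bandwidth. A quick sanity check in Python, assuming 1 Gbps host interfaces as on the previous slide (the function name is illustrative):

```python
def per_host_mbps(interface_gbps: float, oversubscription: float) -> float:
    """Worst-case bandwidth per host under a given oversubscription ratio."""
    return interface_gbps * 1000 / oversubscription

for ratio in (1, 2.5, 5, 8):
    print(f"{ratio}:1 -> {per_host_mbps(1, ratio):.0f} Mbps")
# 1:1 -> 1000 Mbps, 2.5:1 -> 400 Mbps, 5:1 -> 200 Mbps, 8:1 -> 125 Mbps
```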
Fat-tree Topology
A k-ary fat-tree has k pods
Each pod contains two layers of k/2 switches
Each k-port switch in the lower layer is directly connected to k/2 hosts
Each of the remaining k/2 ports is connected to k/2 of the k ports of the aggregation switches
There are (k/2)^2 core switches
Each core switch has one port connected to each of the k pods
The ith port of any core switch is connected to pod i
A k-ary fat-tree supports k^3/4 hosts
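Every count above is a simple function of k, so the topology is easy to sanity-check in code. A small sketch (the function name is illustrative):

```python
def fat_tree_stats(k: int) -> dict:
    """Element counts for a k-ary fat-tree built entirely from k-port switches."""
    assert k % 2 == 0, "k must be even"
    return {
        "pods": k,
        "edge_switches": k * (k // 2),         # k/2 per pod, k pods
        "aggregation_switches": k * (k // 2),  # k/2 per pod, k pods
        "core_switches": (k // 2) ** 2,
        "hosts": k ** 3 // 4,                  # k/2 hosts per edge switch
    }

print(fat_tree_stats(4))   # 4 pods, 4 core switches, 16 hosts
print(fat_tree_stats(48))  # commodity 48-port switches already support 27,648 hosts
```

The appeal of the design is that every element is an identical commodity switch; scale comes from k, not from faster and costlier core hardware.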
DCell
Uses a recursively defined structure to interconnect servers
Each server connects to different levels of DCells through multiple links
High-level DCells are built recursively from many low-level ones
Fault tolerant as there is no single point of failure
Structure
Uses servers with multiple network ports and mini-switches to construct its recursive structure
DCell_0 is the building block used to construct larger DCells
  - Consists of n servers and a mini-switch
High-level DCells are built recursively from many low-level ones
A DCell_1 is constructed using n+1 DCell_0s
The same construction applies recursively up to DCell_k
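The recursion determines how fast DCell grows: if t_{k-1} is the number of servers in a DCell_{k-1}, a DCell_k combines t_{k-1}+1 of them, so t_k = t_{k-1}(t_{k-1}+1). A minimal sketch:

```python
def dcell_servers(n: int, k: int) -> int:
    """Number of servers in a DCell_k whose building block DCell_0 holds n servers."""
    t = n  # t_0: n servers behind one mini-switch
    for _ in range(k):
        t = t * (t + 1)  # t_k = t_{k-1} * (t_{k-1} + 1)
    return t

print(dcell_servers(4, 2))  # 420
print(dcell_servers(6, 3))  # 3263442: double-exponential growth in k
```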
Transport Protocols
TCP and UDP
TCP: Connection-oriented, with reliability, ordering, and congestion control
UDP: Connectionless, with no ordering, reliability, or congestion control
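Both transports are reached through the same sockets API, which the next slide picks up. A minimal sketch of how the difference looks to an application (the peer address and port are hypothetical):

```python
import socket

# TCP: set up a connection first; bytes then arrive reliably and in order
tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp.connect(("10.0.0.2", 9000))
tcp.sendall(b"reliable, ordered byte stream")
tcp.close()

# UDP: no connection; each datagram stands alone and may be lost or reordered
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"best-effort datagram", ("10.0.0.2", 9000))
udp.close()
```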
TCP and Datacenter Networks
Communication between different nodes is thought of as just opening a TCP connection between them
Common sockets API
But TCP was designed for a wide-area network
Clearly, a datacenter is not a wide-area network
Significantly different bandwidth-delay product, round-trip time (RTT), and retransmission timeout (RTO)
For example, due to the low RTT, the congestion window for each flow is very small
  - As a result, flow recovery through TCP fast retransmit is impossible, leading to poor net throughput
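To put numbers on the mismatch, compare the bandwidth-delay product of a datacenter path with a WAN path (the RTTs below are illustrative, not from the slides):

```python
def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product: the bytes in flight needed to keep the pipe full."""
    return bandwidth_bps * rtt_seconds / 8

print(bdp_bytes(1e9, 250e-6))  # datacenter: 1 Gbps, ~250 us RTT -> ~31 KB
print(bdp_bytes(1e9, 100e-3))  # WAN:        1 Gbps, ~100 ms RTT -> 12.5 MB
```

A window of a few dozen kilobytes is only a handful of packets, which is why a single loss often cannot trigger fast retransmit (three duplicate ACKs may simply never arrive) and instead stalls the flow until the RTO fires.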
More problems for TCP
In production datacenters, due to the widely-varying mix of applications, congestion in the network can last from tens to hundreds of seconds
In commodity switches the buffer pool is shared by all interfaces
  - If long flows hog the memory, queues can build up for the short flows
Many-to-one communication patterns can lead to TCP throughput collapse, or incast
  - This can cause overall application throughput to decrease by up to 90%
In virtualized environments, the time sharing of resources increases the latency faced by the VMs
  - This latency can be orders of magnitude higher than the RTT between hosts inside a datacenter, leading to slow progress of TCP connections
Reaction
Some large-scale deployments have abandoned TCP altogether
For instance, Facebook now uses a custom UDP transport
TCP might be a “kitchen-sink” solution, but it is sub-optimal in a datacenter environment
Over the years, a number of alternatives have been proposed
Datacenter TCP (DCTCP)
Uses Explicit Congestion Notification (ECN) from switches to perform active queue management-based congestion control
Switches set the congestion-experienced flag in packets whenever the buffer occupancy exceeds a small threshold
DCTCP uses this information to reduce the size of the window in proportion to the fraction of marked packets
Enables it to react quickly to queue buildup and avoid buffer pressure
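A minimal sketch of the sender-side update, assuming the parameters from Alizadeh et al. (SIGCOMM 2010): alpha is the running estimate of the marked fraction and g its EWMA gain. This is an illustration, not the kernel implementation:

```python
G = 1 / 16  # EWMA gain for the marked-fraction estimate (paper default)

def dctcp_update(cwnd: float, alpha: float, acked: int, marked: int):
    """Apply DCTCP's adjustment once per window of ACKs.

    f is the fraction of ACKed packets carrying an ECN congestion-experienced
    mark; alpha is its running estimate. Where standard TCP halves cwnd on any
    congestion signal, DCTCP cuts it in proportion to the extent of congestion.
    """
    f = marked / acked if acked else 0.0
    alpha = (1 - G) * alpha + G * f
    if marked:
        cwnd *= 1 - alpha / 2
    return cwnd, alpha

# Light congestion (1 marked packet out of 20) barely dents the window
print(dctcp_update(cwnd=100.0, alpha=0.05, acked=20, marked=1))
```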
Multipath TCP (MPTCP)
Establishes multiple subflows over different paths between a pair of end-hosts
These subflows operate under a single TCP connection
The fraction of the total congestion window given to each subflow is determined by its speed
Moves traffic away from the most congested paths
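The coupling across subflows is what shifts load: each subflow keeps its own window and losses halve it as in standard TCP, but the additive increase is scaled by the aggregate window, so slower, more congested subflows grow more slowly. A sketch of the linked-increases rule from RFC 6356 (the computation of the aggressiveness factor alpha is omitted for brevity):

```python
def on_ack(cwnds: list[float], i: int, alpha: float) -> None:
    """Per-ACK additive increase for subflow i under coupled congestion control."""
    total = sum(cwnds)
    # Increase is capped by what uncoupled TCP would do (1/cwnd_i), and
    # scaled by the total window so congested paths claim a smaller share.
    cwnds[i] += min(alpha / total, 1.0 / cwnds[i])

def on_loss(cwnds: list[float], i: int) -> None:
    """Loss handling stays per-subflow, exactly as in standard TCP."""
    cwnds[i] = max(cwnds[i] / 2, 1.0)
```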
tcpcrypt
Backwards-compatible enhancement to TCP that aims to efficiently and transparently provide encrypted communication to applications
Uses a custom key exchange protocol that leverages the TCP options field
Like SSL, to reduce the cost of connection setup for short-lived flows, it enables cryptographic state from one TCP connection to bootstrap subsequent ones
Applications can also be made aware of the presence of tcpcrypt to avoid redundant encryption
Deadline-Driven Delivery (D3)
Targets applications with distributed workflows and latency targets
Such applications associate a deadline with each network flow, and a flow is only useful if its deadline is met
Applications expose flow deadline and size information, which end-hosts exploit to request rates from routers along the data path
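The requested rate falls straight out of the deadline: following the D3 design, a sender asks for the remaining flow size divided by the time left, refreshing the request every RTT as both shrink. A sketch (names are illustrative):

```python
def d3_rate_request(bytes_remaining: int, seconds_to_deadline: float) -> float:
    """Rate (bytes/s) a sender requests from routers along its path, recomputed each RTT."""
    if seconds_to_deadline <= 0:
        return float("inf")  # deadline already missed; the flow has lost its value
    return bytes_remaining / seconds_to_deadline

# A 2 MB response with 25 ms to its deadline needs at least 80 MB/s end-to-end
print(d3_rate_request(2_000_000, 0.025))  # 80000000.0
```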
Network Sharing
Introduction
Network resources are shared amongst the tenants, which can lead to contention and other undesired behaviour
Network performance isolation between tenants can be an important tool for:
  - Minimizing disruption from legitimate tenants that run network-intensive workloads
  - Protecting against malicious tenants that launch DoS attacks
The standard methodology to ensure isolation is to use VLANs
Virtual LAN
Acts like an ordinary LAN, but end-hosts do not necessarily have to be physically connected to the same segment
Nodes are grouped together by the VLAN
Broadcasts can also be sent within the same VLAN
VLAN membership information is inserted into Ethernet frames
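Concretely, the membership information is an IEEE 802.1Q tag: four bytes (TPID 0x8100 plus priority, DEI, and a 12-bit VLAN ID) inserted between the source MAC and the original EtherType. A sketch of the frame manipulation (for illustration only; switches do this in hardware):

```python
import struct

def add_vlan_tag(frame: bytes, vid: int, priority: int = 0) -> bytes:
    """Insert an 802.1Q tag into a raw Ethernet frame after the source MAC."""
    assert 0 <= vid < 4096, "VLAN IDs are 12 bits"
    tci = (priority << 13) | vid  # tag control info: PCP (3 bits), DEI (0), VID
    return frame[:12] + struct.pack("!HH", 0x8100, tci) + frame[12:]
```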
Rate-limiting End-hosts
In Xen, the network bandwidth available to each domU can be rate-limited
Can be used to implement basic QoS
The virtual interface is simply rate-limited
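Under the hood, rate-limiting an interface usually amounts to a token-bucket style scheduler. A minimal sketch of the idea (a generic illustration, not Xen's actual implementation):

```python
import time

class TokenBucket:
    """Admit traffic at `rate` bytes/s with bursts of up to `burst` bytes."""

    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self, nbytes: int) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed interval, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True
        return False  # the limiter would delay or drop this packet

limiter = TokenBucket(rate=12.5e6, burst=64_000)  # ~100 Mbps cap, 64 KB bursts
print(limiter.allow(1500))  # True: a full-size frame fits in the initial burst
```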
Wrapping Up
The End
In reverse order:
1 Cloud stacks can be used to turn clusters and datacenters into private and public clouds
2 Virtualization of computation, storage, and networking can allow many tenants to co-exist
3 Most data does not fit the relational model and is more suited to NoSQL stores
4 Data-intensive, task-parallel frameworks abstract away the details of distribution, work allocation, synchronization, concurrency, and communication; a perfect match for the cloud
5 The future is Big Data and Cloud Computing!
References
1 Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat. 2008. A scalable, commodity data center network architecture. In Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication (SIGCOMM ’08). ACM, New York, NY, USA, 63-74.
2 Chuanxiong Guo, Haitao Wu, Kun Tan, Lei Shi, Yongguang Zhang, and Songwu Lu. 2008. DCell: A scalable and fault-tolerant network structure for data centers. In Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication (SIGCOMM ’08). ACM, New York, NY, USA, 75-86.