View
345
Download
3
Category
Preview:
Citation preview
Maelstrom Ricochet Conclusion
Reliable Communication for Datacenters
Mahesh Balakrishnan
Cornell University, Ithaca, NY
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Datacenters
I Internet Services (90s) — Websites, Search, Online StoresI Since then:
# of low-end volume servers
0
5
10
15
20
25
30
2000 2001 2002 2003 2004 2005
Mill
ions
Installed Server Base 00-05:I Commodity — up by 100%I High/Mid — down by 40%
2007: ≈ 7 million new units
I Today: Datacenters are ubiquitousI How have they evolved?
Data partially sourced from IDC press releases (www.idc.com)
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Networks of Datacenters
Why?Business Continuity,Client Locality,Distributed Datasetsor Operations ...Any modernenterprise!
N
SEW
100 ms
RTT: 110 ms
210 ms
220 ms
110 ms
100 ms 200 ms
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Networks of Real-Time Datacenters
I Finance, Aerospace, Military, Search and Rescue...I ... documents, chat, email, games, videos, photos, blogs,
social networksI The Datacenter is the Computer!I Not hard real-time: real fast, highly responsive, time-critical
Gartner Survey:I Real-Time Infrastructure (RTI): reaction time in secs/minsI 83%: RTI is important or very importantI 84%: Have no RTI capability
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Networks of Real-Time Datacenters
I Finance, Aerospace, Military, Search and Rescue...I ... documents, chat, email, games, videos, photos, blogs,
social networksI The Datacenter is the Computer!I Not hard real-time: real fast, highly responsive, time-critical
Gartner Survey:I Real-Time Infrastructure (RTI): reaction time in secs/minsI 83%: RTI is important or very importantI 84%: Have no RTI capability
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
The Real-Time Datacenter — Systems ChallengesHow do we recover from failures within seconds?
Real-World
,
Real-Tim
e
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
The Real-Time Datacenter — Systems ChallengesHow do we recover from failures within seconds?
Real-World
,
Real-Tim
e
Disk Failure,
Disasters
Crashes Overloads
Bugs Exploits
Sof
twar
e S
tack
Compilers, Databases, Distributed Systems, Machine Learning,
Filesystems, Operating Systems, Networking ...
Packet Loss
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
The Real-Time Datacenter — Systems ChallengesHow do we recover from failures within seconds?
Real-World
,
Real-Tim
e
Disk Failure,
Disasters
Crashes Overloads
Bugs Exploits
SMFSKyotoFS
Tempest
Maelstrom
Sof
twar
e S
tack
Compilers, Databases, Distributed Systems, Machine Learning,
Filesystems, Operating Systems, Networking ...
Ricochet
Plato
Packet Loss
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Reliable Communication
Goal: Recover lost packets fast !
I Existing protocols react to loss: too much, too lateI We want proactive recovery: stable overhead, low latencies
I Maelstrom: Reliability between datacentersI Ricochet: Reliability within datacenters
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Reliable Communication between Datacenters
Step 1: Open a TCP/IP socket from CA to NY. What happens?I Throughput ∝ BufSize
RTTI BufSize = Receiver Buffer Size
I Current Solution: Manually reset buffer sizes
I TCP/IP needs zero loss on high-speed long-distance links:
I Sensitive Congestion ControlI RTT dependence + Sequenced Delivery
Current State-of-the-Art Solution: $$$!
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Reliable Communication between Datacenters
Step 1: Open a TCP/IP socket from CA to NY. What happens?I Throughput ∝ BufSize
RTTI BufSize = Receiver Buffer Size
I Current Solution: Manually reset buffer sizes
I TCP/IP needs zero loss on high-speed long-distance links:I Sensitive Congestion ControlI RTT dependence + Sequenced Delivery
Current State-of-the-Art Solution: $$$!
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Reliable Communication between Datacenters
Step 1: Open a TCP/IP socket from CA to NY. What happens?I Throughput ∝ BufSize
RTTI BufSize = Receiver Buffer Size
I Current Solution: Manually reset buffer sizes
I TCP/IP needs zero loss on high-speed long-distance links:I Sensitive Congestion ControlI RTT dependence + Sequenced Delivery
Current State-of-the-Art Solution: $$$!
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
TeraGrid: Supercomputer NetworkSDSC, PSC, CTC, IU, NCSA ...
I End-to-End UDP Probes: ZeroCongestion, Non-Zero Loss!
I Possible Reasons:I transient congestionI degraded fiberI malfunctioning HWI misconfigured HWI switching contentionI low receiver powerI end-host overflowI ...
24
14
0
5
10
15
20
25
30
0.01 0.03 0.05 0.07 0.1 0.3 0.5 0.7 1
% of Lost Packets
% o
f Mea
sure
men
ts
Problem Statement: Run unmodified TCP/IP over lossyhigh-speed long-distance networks
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
TeraGrid: Supercomputer NetworkSDSC, PSC, CTC, IU, NCSA ...
I End-to-End UDP Probes: ZeroCongestion, Non-Zero Loss!
I Possible Reasons:I transient congestionI degraded fiberI malfunctioning HWI misconfigured HWI switching contentionI low receiver powerI end-host overflowI ...
14
24
3
14
0
5
10
15
20
25
30
0.01 0.03 0.05 0.07 0.1 0.3 0.5 0.7 1
% of Lost Packets
% o
f Mea
sure
men
ts
All MeasurementsWithout Indiana U. Site
Problem Statement: Run unmodified TCP/IP over lossyhigh-speed long-distance networks
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
TeraGrid: Supercomputer NetworkSDSC, PSC, CTC, IU, NCSA ...
I End-to-End UDP Probes: ZeroCongestion, Non-Zero Loss!
I Possible Reasons:I transient congestionI degraded fiberI malfunctioning HWI misconfigured HWI switching contentionI low receiver powerI end-host overflowI ...
14
24
3
14
0
5
10
15
20
25
30
0.01 0.03 0.05 0.07 0.1 0.3 0.5 0.7 1
% of Lost Packets
% o
f Mea
sure
men
ts
All MeasurementsWithout Indiana U. Site
Problem Statement: Run unmodified TCP/IP over lossyhigh-speed long-distance networks
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
The Maelstrom Network Appliance
Packet Loss
Sending End-hostsCommodity TCP
MaelstromSend-side Appliance
MaelstromReceive-side
Appliance
Receiving End-hostsCommodity TCP
Inserts FEC
FEC+Data
Recovers via FEC
FEC = Forward Error CorrectionTransparent: No modification to end-host or network
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
What is FEC?
A B C D E X X X C D E A B
3 repair packets from every 5 data packets
Receiver can recover from any 3 lost packets
Rate1: (r , c) — c repair packets for every r data packets.
I Pro: No Feedback Loop, Constant OverheadI Why not run FEC at end-hosts?I Con: Recovery Latency dependent on channel data rate
I FEC in the Network:
I Where and What: At the appliance, across multiplechannels
1Rateless codes are popular, but inapplicable to real-time streamsMahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
What is FEC?
A B C D E X X X C D E A B
3 repair packets from every 5 data packets
Receiver can recover from any 3 lost packets
Rate1: (r , c) — c repair packets for every r data packets.
I Pro: No Feedback Loop, Constant OverheadI Why not run FEC at end-hosts?I Con: Recovery Latency dependent on channel data rate
I FEC in the Network:I Where and What: At the appliance, across multiple
channels
1Rateless codes are popular, but inapplicable to real-time streamsMahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
The Maelstrom Network Appliance
Packet Loss
Sending End-hostsCommodity TCP
MaelstromSend-side Appliance
MaelstromReceive-side
Appliance
Receiving End-hostsCommodity TCP
Inserts FEC
FEC+Data
Recovers via FEC
FEC = Forward Error CorrectionTransparent: No modification to end-host or network
Where: at the appliance, What: aggregated data
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Maelstrom Mechanism
Send-Side Appliance:I Intercept IP packetsI Create repair packet =
XOR + ‘recipe’ of datapacket IDs
Receive-Side Appliance:I Lost packet recovered
using XOR and otherdata packets
I At receiver end-host: outof order, no loss
29 28 27 26 25
2526
2728
29
X
LOSSXOR
‘Recipe List’: 25,26,27,28,29
25 26 28 29
Lambda Jum
bo MTU
LAN M
TU
Appliance
Appliance
27
Recovered Packet
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Flow Control
I Two Flow Control Modes for TCP/IP Traffic:
A) End-to-End Flow Control
End-Host End-HostAppliance Appliance
B) Split Flow Control
End-Host End-HostAppliance Appliance
I Recall: Two problems with TCP/IP — loss and bufferingI End-to-end mode hides lossI Split mode buffers data (standard PeP)
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Layered Interleaving for Bursty LossRecovery Latency ∝ Actual Burst Size, not Max Burst Size
3 2 1
X1
1121
X2
101201
X3
Data Stream
XORs:
I XORs at different interleavesI Recovery latency degrades gracefully
with loss burstiness:X1 catches random singleton lossesX2 catches loss bursts of 10 or lessX3 catches bursts of 100 or less
Comparison of RecoveryProbability: r=7, c=2
2in2inMahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Layered Interleaving for Bursty LossRecovery Latency ∝ Actual Burst Size, not Max Burst Size
3 2 1
X1
1121
X2
101201
X3
Data Stream
XORs:
I XORs at different interleavesI Recovery latency degrades gracefully
with loss burstiness:X1 catches random singleton lossesX2 catches loss bursts of 10 or lessX3 catches bursts of 100 or less
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
Loss probabilityRe
cove
ry p
roba
bility
Reed−SolomonMaelstrom
Comparison of RecoveryProbability: r=7, c=2 2in2in
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Implementation Details
I In Kernel — Linux 2.6.20 ModuleI Commodity Box: 3 Ghz, 1 Gbps NIC (≈ 800$)I Max speed: 1 Gbps, Memory Footprint: 10 MBI 50-60% CPU→ NIC is the bottleneck (for c = 3)
I How do we efficiently store/access/clean a gigabit of dataevery second?
I Scaling to Multi-Gigabit: Partition IP space across proxies
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Evaluation: FEC mode and lossClaim: Maelstrom effectively hides loss from TCP/IP
0 100 200 300 400 500 600 700 800 900
1000
0 50 100 150 200 250
Thro
ughp
ut (M
bps)
One Way Link Latency (ms)
TCP No LossMaelstrom No Loss
Maelstrom (0.1%)Maelstrom (1.0%)
TCP/IP (0.1%)TCP (1% loss)
Mahesh Balakrishnan Reliable Communication for Datacenters
← Tput ∝ 1RTT
{→Data+ FEC=̃ 1 Gbps
↑TCP/IP with loss
TCP/IP without loss
↓
Maelstrom Ricochet Conclusion
Evaluation: FEC mode and lossClaim: Maelstrom effectively hides loss from TCP/IP
0 100 200 300 400 500 600 700 800 900
1000
0 50 100 150 200 250
Thro
ughp
ut (M
bps)
One Way Link Latency (ms)
TCP No LossMaelstrom No Loss
Maelstrom (0.1%)Maelstrom (1.0%)
TCP/IP (0.1%)TCP (1% loss)
Mahesh Balakrishnan Reliable Communication for Datacenters
← Tput ∝ 1RTT
{→Data+ FEC=̃ 1 Gbps
↑TCP/IP with loss
TCP/IP without loss
↓
Maelstrom Ricochet Conclusion
Evaluation: FEC mode and lossClaim: Maelstrom effectively hides loss from TCP/IP
0 100 200 300 400 500 600 700 800 900
1000
0 50 100 150 200 250
Thro
ughp
ut (M
bps)
One Way Link Latency (ms)
TCP No LossMaelstrom No Loss
Maelstrom (0.1%)Maelstrom (1.0%)
TCP/IP (0.1%)TCP (1% loss)
Mahesh Balakrishnan Reliable Communication for Datacenters
← Tput ∝ 1RTT
{→Data+ FEC=̃ 1 Gbps
↑TCP/IP with loss
TCP/IP without loss
↓
Maelstrom Ricochet Conclusion
Evaluation: FEC mode and lossClaim: Maelstrom effectively hides loss from TCP/IP
0 100 200 300 400 500 600 700 800 900
1000
0 50 100 150 200 250
Thro
ughp
ut (M
bps)
One Way Link Latency (ms)
TCP No LossMaelstrom No Loss
Maelstrom (0.1%)Maelstrom (1.0%)
TCP/IP (0.1%)TCP (1% loss)
Mahesh Balakrishnan Reliable Communication for Datacenters
← Tput ∝ 1RTT
{→Data+ FEC=̃ 1 Gbps
↑TCP/IP with loss
TCP/IP without loss
↓
Maelstrom Ricochet Conclusion
Evaluation: Split mode and bufferingClaim: Maelstrom eliminates the need for larger end-host buffers
0
100
200
300
400
500
600
client-buf M-ALL M-FEC M-BUF tcp
Thr
ough
put (
Mbi
ts/s
ec)
RTT 100ms
throughput 0.1% loss
Mahesh Balakrishnan Reliable Communication for Datacenters
← Maelstrom FEC
Maelstrom FEC + Split Mode
↓
↑Maelstrom FEC + Hand-Tuned Buffers
Maelstrom Ricochet Conclusion
Evaluation: Split mode and bufferingClaim: Maelstrom eliminates the need for larger end-host buffers
0
100
200
300
400
500
600
client-buf M-ALL M-FEC M-BUF tcp
Thr
ough
put (
Mbi
ts/s
ec)
RTT 100ms
throughput 0.1% loss
Mahesh Balakrishnan Reliable Communication for Datacenters
← Maelstrom FEC
Maelstrom FEC + Split Mode
↓
↑Maelstrom FEC + Hand-Tuned Buffers
Maelstrom Ricochet Conclusion
Evaluation: Split mode and bufferingClaim: Maelstrom eliminates the need for larger end-host buffers
0
100
200
300
400
500
600
client-buf M-ALL M-FEC M-BUF tcp
Thr
ough
put (
Mbi
ts/s
ec)
RTT 100ms
throughput 0.1% loss
Mahesh Balakrishnan Reliable Communication for Datacenters
← Maelstrom FEC
Maelstrom FEC + Split Mode
↓
↑Maelstrom FEC + Hand-Tuned Buffers
Maelstrom Ricochet Conclusion
Evaluation: Split mode and bufferingClaim: Maelstrom eliminates the need for larger end-host buffers
0
100
200
300
400
500
600
client-buf M-ALL M-FEC M-BUF tcp
Thr
ough
put (
Mbi
ts/s
ec)
RTT 100ms
throughput 0.1% loss
Mahesh Balakrishnan Reliable Communication for Datacenters
← Maelstrom FEC
Maelstrom FEC + Split Mode
↓
↑Maelstrom FEC + Hand-Tuned Buffers
Maelstrom Ricochet Conclusion
Evaluation: Delivery LatencyClaim: Maelstrom eliminates TCP/IP’s loss-related jitter
0
100
200
300
400
500
600
0 1000 2000 3000 4000 5000 6000
Del
iver
y L
aten
cy (
ms)
Packet #
TCP/IP: 0.1% Loss
0
100
200
300
400
500
600
0 1000 2000 3000 4000 5000 6000
Del
iver
y L
aten
cy (
ms)
Packet #
Maelstrom: 0.1% Loss
I Receive-side buffering due to sequencingI Send-side buffering due to congestion control
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
SMFS: The Smoke and Mirrors Filesystem
I Classic Mirroring Trade-off:I Fast — return to user after sending to mirrorI Safe — return to user after ACK from mirror
I Maelstrom: Lossy Network→ Lossless Network→ Disk!I SMFS — return to user after sending enough FECI Result: Fast, Safe Mirroring independent of link length!I General Principle: Gray-box Exposure of Protocol State
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
The Big Picture
Real-World
,
Real-Tim
e
Disk Failure,
Disasters
Crashes Overloads
Bugs Exploits
SMFSKyotoFS
Tempest
Maelstrom
Sof
twar
e S
tack
Compilers, Databases, Distributed Systems, Machine Learning,
Filesystems, Operating Systems, Networking ...
Ricochet
Plato
Packet Loss
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation
Ricochet: Multicast in the Datacenter
I Commodity Datacenters:Extreme Scale-Out
I Data Replication:I Fault ToleranceI High AvailabilityI Performance
I Updating data in multiplelocations: Multicast!
Data is Clustered / Replicated / Published / Cached...
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation
Ricochet: Multicast in the Datacenter
I Commodity Datacenters:Extreme Scale-Out
I Data Replication:I Fault ToleranceI High AvailabilityI Performance
I Updating data in multiplelocations: Multicast!
Upd
ates
Data is Clustered / Replicated / Published / Cached...
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation
How is Multicast Used?
Financial Pub-Sub Example:I Each equity is mapped to
a multicast groupI Each node is interested
in a different set ofequities ...
Each node in many groups=⇒ Low per-group data rate
High per-node data rate=⇒ Overload
Tracking S&P 500
Tracking Portfolio
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation
How is Multicast Used?
Financial Pub-Sub Example:I Each equity is mapped to
a multicast groupI Each node is interested
in a different set ofequities ...
Each node in many groups=⇒ Low per-group data rate
High per-node data rate=⇒ Overload
Tracking S&P 500
Tracking Portfolio
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation
How is Multicast Used?
Financial Pub-Sub Example:I Each equity is mapped to
a multicast groupI Each node is interested
in a different set ofequities ...
Each node in many groups=⇒ Low per-group data rate
High per-node data rate=⇒ Overload
Tracking S&P 500
Tracking Portfolio
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation
Where does loss occur in a Datacenter?
Packet Loss occurs at end-hosts: independent and bursty
0
20
40
60
80
100
1801601401201008060
burs
t len
gth
(pac
kets
) receiver r1: loss bursts in A and Breceiver r1: data bursts in A and B
0
20
40
60
80
100
1801601401201008060
burs
t len
gth
(pac
kets
)
time in seconds
receiver r2: loss bursts in Areceiver r2: data bursts in A
Mahesh Balakrishnan Reliable Communication for Datacenters
LossBursts
←DataRate
←
LessLoadedNode
OverloadedNode
Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation
Problem Statement
I Low-Latency ReliableMulticast
I Scalability:
I Number of ReceiversI Number of SendersI Number of Groups
Multicast Data
Lost Packet
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation
Problem Statement
I Low-Latency ReliableMulticast
I Scalability:I Number of Receivers
I Number of SendersI Number of Groups
Multicast Data
Lost Packet
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation
Problem Statement
I Low-Latency ReliableMulticast
I Scalability:I Number of ReceiversI Number of Senders
I Number of Groups
Multicast Data
Lost Packet
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation
Problem Statement
I Low-Latency ReliableMulticast
I Scalability:I Number of ReceiversI Number of SendersI Number of Groups
Multicast Data
Lost Packet
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation
Design Space for Reliable MulticastHow does latency scale?
Existing mechanisms: recovery latency ∝ 1datarate
I Example: SRM1
NAK/SequencingI data rate: at a
single sender, ina single group
What about FEC?
FEC in the Network:Where and What?
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
1 2 4 8 16 32 64 128
Mill
isec
onds
Groups per Node (Log Scale)
SRM Average Discovery Delay
1S.Floyd, V.Jacobson, C.-G.Liu, S.McCanne, and L. Zhang. A reliable multicast framework for light-weight
sessions and application level framing. IEEE/ACM ToN, 5(6):784-803, 1997
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation
Design Space for Reliable MulticastHow does latency scale?
Existing mechanisms: recovery latency ∝ 1datarate
I Example: SRM1
NAK/SequencingI data rate: at a
single sender, ina single group
What about FEC?FEC in the Network:Where and What? 0
1000
2000
3000
4000
5000
6000
7000
8000
9000
1 2 4 8 16 32 64 128
Mill
isec
onds
Groups per Node (Log Scale)
SRM Average Discovery Delay
1S.Floyd, V.Jacobson, C.-G.Liu, S.McCanne, and L. Zhang. A reliable multicast framework for light-weight
sessions and application level framing. IEEE/ACM ToN, 5(6):784-803, 1997
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation
Receiver-Based Forward Error Correction
I Receivers generate XORs ofincoming multicast packets andexchange with other receivers
I Each XOR sent to c otherreceivers
I latency ∝ 1Ps datarate
data rate: across all senders, ina single group
Multicast Data
SENDERS
GROUP
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation
Receiver-Based Forward Error Correction
I Receivers generate XORs ofincoming multicast packets andexchange with other receivers
I Each XOR sent to c otherreceivers
I latency ∝ 1Ps datarate
data rate: across all senders, ina single group
Multicast Data
SENDERS
GROUP
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation
Lateral Error Correction: Principle
I XORs can combine data from multiple groups
Single-Group RFEC
B2
INC
OM
ING
DA
TA P
ACK
ETS
A1
A2A3
B2
B1
A4
A5
B4
B3
A6
Repair Packet I:(A1, A2, A3, A4, A5)
Repair Packet II:(B1,B2,B3,B4,B5)
B5
Node n1
Node n2
A1
A2
A3
B1
A4
A5B4
B3
A6
B5
INC
OM
ING
DA
TA P
ACK
ETS
Recovery
Loss
Lateral Error Correction
Repair Packet I:(A1, A2, A3, B1, B2)
B2
INC
OM
ING
DA
TA P
ACK
ETS
A1
A2A3
B2
B1
A4
A5
B4
B3
A6
Repair Packet II:(A4, A5, B3, B4, A6)
B5
Node n1
Node n2
A1
A2
A3
B1
A4
A5B4
B3
A6
B5
INC
OM
ING
DA
TA P
ACK
ETS
Recovery
Loss
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation
Nodes and Disjoint Regions
2
1
4
3
I Receiver n1 belongs togroups A, B, and C
I Divides groups intodisjoint regions
I Is unaware of groups itdoes not belong to (D)
I Works with any conventional Group Membership ServiceI Can we build a scalable multi-group GMS?
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation
Nodes and Disjoint Regions
2
1
4
3
I Receiver n1 belongs togroups A, B, and C
I Divides groups intodisjoint regions
I Is unaware of groups itdoes not belong to (D)
I Works with any conventional Group Membership ServiceI Can we build a scalable multi-group GMS?
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation
Regional Selection
1
Aac
A a
A abc
1.0cAab
I Select targets for XORsfrom regions, not groups
I From each region, selectproportional fraction ofcA: cx
A = |x ||A| · cA
latency ∝ 1Ps
Pg datarate
data rate: across all senders, in intersections of groups
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation
Regional Selection
1
Aac
A a
A abc
1.0cAab
I Select targets for XORsfrom regions, not groups
I From each region, selectproportional fraction ofcA: cx
A = |x ||A| · cA
latency ∝ 1Ps
Pg datarate
data rate: across all senders, in intersections of groups
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation
Scalability in GroupsClaim: Ricochet scales to hundreds of groups.
90
92
94
96
98
100
2 4 8 16 32 64 128 256 512 1024
LE
C R
ecov
ery
Perc
enta
ge
Groups per Node (Log Scale)
Recovery %
0
10000
20000
30000
40000
50000
2 4 8 16 32 64 128 256 512 1024
Mic
rose
cond
s
Groups per Node (Log Scale)
Average Recovery Latency
At 128 groups, SRM latency was 8 seconds. 400 times slower!
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation
Distribution of Recovery LatencyClaim: Ricochet is reliable and time-critical
0
10
20
30
40
50
0 50 100 150 200 250
Rec
over
y Pe
rcen
tage
Milliseconds
96.8% LEC + 3.2% NAK
0
10
20
30
40
50
0 50 100 150 200 250
Rec
over
y Pe
rcen
tage
Milliseconds
92% LEC + 8% NAK
0
10
20
30
40
50
0 50 100 150 200 250
Rec
over
y Pe
rcen
tage
Milliseconds
84% LEC + 16% NAK
(a) 10% Loss Rate (b) 15% Loss Rate (c) 20% Loss Rate
Most lost packets recovered < 50ms by LEC.Remainder via reactive NAKs.
Bursty Loss: 100 packet burst→ 90% recovered at 50 ms avg
Mahesh Balakrishnan Reliable Communication for Datacenters
LEC
↓ NAK
↓
Maelstrom Ricochet Conclusion
The Big Picture: The Datacenter is the Computer
Real-World
,
Real-Tim
e
Disk Failure,
Disasters
Crashes Overloads
Bugs Exploits
SMFSKyotoFS
Tempest
Maelstrom
Sof
twar
e S
tack
Compilers, Databases, Distributed Systems, Machine Learning,
Filesystems, Operating Systems, Networking ...
Ricochet
Plato
Packet Loss
NSDI 2008NSDI 2007
DSN 2008
SRDS 2006
HotOS XISubmitted
MistralMobiHoc 2006
SequoiaPODC 2007
Mahesh Balakrishnan Reliable Communication for Datacenters
Bottomline: Performance versus Price
We need new abstractions!Power, Parallelism, Privacy...
Maelstrom Ricochet Conclusion
The Big Picture: The Datacenter is the Computer
Real-World
,
Real-Tim
e
Disk Failure,
Disasters
Crashes Overloads
Bugs Exploits
SMFSKyotoFS
Tempest
Maelstrom
Sof
twar
e S
tack
Compilers, Databases, Distributed Systems, Machine Learning,
Filesystems, Operating Systems, Networking ...
Ricochet
Plato
Packet Loss
NSDI 2008NSDI 2007
DSN 2008
SRDS 2006
HotOS XISubmitted
MistralMobiHoc 2006
SequoiaPODC 2007
Mahesh Balakrishnan Reliable Communication for Datacenters
Bottomline: Performance versus Price
We need new abstractions!Power, Parallelism, Privacy...
Maelstrom Ricochet Conclusion
Conclusion
I The Real-Time DatacenterI Recover from failures within seconds
I Reliable Communication: FEC in the NetworkI Recover lost packets in millisecondsI Maelstrom: between DatacentersI Ricochet: within Datacenters
Thank You!
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Conclusion
I The Real-Time DatacenterI Recover from failures within seconds
I Reliable Communication: FEC in the NetworkI Recover lost packets in millisecondsI Maelstrom: between DatacentersI Ricochet: within Datacenters
Thank You!
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Extra Slide: FEC and Bursty Loss
I Existing solution:interleaving
I Interleave i and rate(r , c) tolerates (c ∗ i)burst...
I ...with i times the latency
A B C D E
X X X
F G H I J
A C E G I
X X XB D F H J
Figure: Interleave of 2 — Evenand Odd packets encodedseparately
Wanted: Graceful degradation of recovery latency with actualburst size for constant overhead
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Extra Slide: FEC and Bursty Loss
I Existing solution:interleaving
I Interleave i and rate(r , c) tolerates (c ∗ i)burst...
I ...with i times the latency
A B C D E
X X X
F G H I J
A C E G I
X X XB D F H J
Figure: Interleave of 2 — Evenand Odd packets encodedseparately
Wanted: Graceful degradation of recovery latency with actualburst size for constant overhead
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Extra Slide: Maelstrom EvaluationMaelstrom goodput is near theoretical maximum
0
100
200
300
400
500
600
700
800
900
1000
0 1 2 3 4 5 6
Thr
ough
put (
Mbi
ts/s
ec)
FEC Layers (c)
theoretical goodputM-FEC goodput
Mahesh Balakrishnan Reliable Communication for Datacenters
Maelstrom Ricochet Conclusion
Extra Slide: Layered Interleaving
0
20
40
60
80
100
0 200 400 600 800 1000
Reco
very
Lat
ency
(Mill
iseco
nds)
Recovered Packet #
Reed SolomonLayered Interleaving
Mahesh Balakrishnan Reliable Communication for Datacenters
Recommended