Maelstrom Conclusion
Maelstrom: An Enterprise Continuity Protocol for Financial Datacenters
Mahesh Balakrishnan, Ken Birman, Tudor Marian, Hakim Weatherspoon
Cornell University, Ithaca, NY
Ken Birman Maelstrom: Enterprise Continuity for Financial Datacenters
Datacenters

- Internet Services (90s): Websites, Search, Online Stores
- Since then:

[Chart: installed base of low-end volume servers (millions), 2000 to 2005]

Installed Server Base 2000-05:
- Commodity: up by 100%
- High/Mid: down by 40%
- 2007: ≈ 7 million new units

- Today: Datacenters are ubiquitous
- How have they evolved?
Data partially sourced from IDC press releases (www.idc.com)
Networks of Datacenters

Why? Client Locality, Distributed Datasets or Operations, Enterprise Continuity

[Figure: map of datacenters connected by wide-area links, with RTTs ranging from 100 ms to 220 ms]
Real-Time Enterprise Continuity

- Financial Datacenters: Real-Time, Mission-Critical
- Current State-of-the-art: Mirror from NYC to NJ
- Wanted: Arbitrarily far mirrors!
- Step Zero: How do we send data reliably across high-speed long-distance pipes?
Reliable Communication between Datacenters

Open a TCP/IP socket from CA to NY. What happens?

- Throughput ∝ BufSize / RTT, where BufSize = Receiver Buffer Size
- Current Solution: Manually reset buffer sizes
- TCP/IP needs zero loss on high-speed long-distance links:
  - Sensitive Congestion Control
  - RTT dependence + Sequenced Delivery

Current State-of-the-Art Solution: $$$!
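The BufSize / RTT ceiling can be made concrete with a small sketch (the helper name and the 64 KB default buffer are illustrative assumptions, not from the slides): a sender can keep at most one receive-buffer's worth of unacknowledged data in flight per round trip, so a small default buffer over a long link caps throughput far below line rate.

```python
# Hypothetical illustration of the TCP throughput ceiling:
# at most BufSize bytes can be in flight per RTT.

def max_tcp_throughput_mbps(buf_size_bytes: float, rtt_ms: float) -> float:
    """Upper bound on TCP throughput: BufSize / RTT, in Mbps."""
    rtt_s = rtt_ms / 1000.0
    return buf_size_bytes * 8 / rtt_s / 1e6

# An assumed 64 KB receiver buffer over a CA-to-NY path (~100 ms RTT):
print(max_tcp_throughput_mbps(64 * 1024, 100))  # ~5 Mbps, far below 1 Gbps
```

This is why the slide's "manually reset buffer sizes" workaround matters: only a much larger buffer lets a single flow approach gigabit rates at 100 ms RTT.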
TeraGrid: Supercomputer Network (SDSC, PSC, CTC, IU, NCSA ...)

- End-to-End UDP Probes: Zero Congestion, Non-Zero Loss!
- Possible Reasons:
  - transient congestion
  - degraded fiber
  - malfunctioning HW
  - misconfigured HW
  - switching contention
  - low receiver power
  - end-host overflow
  - ...

[Histogram: % of measurements vs. % of lost packets (0.01% to 1%), shown for all measurements and without the Indiana U. site]

Problem Statement: Run unmodified TCP/IP over lossy high-speed long-distance networks
The Maelstrom Network Appliance

[Diagram: Sending End-hosts (Commodity TCP) → Maelstrom Send-side Appliance (Inserts FEC) → lossy link carrying FEC + Data → Maelstrom Receive-side Appliance (Recovers via FEC) → Receiving End-hosts (Commodity TCP)]

FEC = Forward Error Correction
Transparent: No modification to end-host or network
What is FEC?

[Diagram: 5 data packets A-E plus 3 repair packets X X X on the wire; 3 repair packets from every 5 data packets; receiver can recover from any 3 lost packets]

Rate*: (r, c), i.e. c repair packets for every r data packets.

- Pro: No Feedback Loop, Constant Overhead
- Why not run FEC at end-hosts?
  - Con: Recovery Latency dependent on channel data rate
- FEC in the Network:
  - Where and What: At the appliance, across multiple channels

*Rateless codes are popular, but inapplicable to real-time streams
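A minimal sketch of the rate (r, c) idea with c = 1 and XOR as the code (packet contents and helper names are illustrative assumptions): the repair packet is the XOR of r data packets, so any single lost data packet can be rebuilt from the repair packet and the survivors, with no feedback to the sender.

```python
# Toy rate (r, 1) FEC: one XOR repair packet per r data packets.
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode_repair(data_packets: list[bytes]) -> bytes:
    """Repair packet = XOR of all r data packets."""
    return reduce(xor_bytes, data_packets)

def recover(survivors: list[bytes], repair: bytes) -> bytes:
    """Rebuild the one missing packet from r-1 survivors + repair."""
    return reduce(xor_bytes, survivors, repair)

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD", b"EEEE"]  # r = 5
repair = encode_repair(data)
lost = data[2]                                         # packet C is lost
print(recover(data[:2] + data[3:], repair) == lost)    # True
```

XOR gives c = 1; the real appliance's c > 1 repairs (and the Reed-Solomon comparison later) need a stronger code, but the no-feedback, constant-overhead property is the same.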
The Maelstrom Network Appliance

Where: at the appliance; What: aggregated data
Maelstrom Mechanism

Send-Side Appliance:
- Intercept IP packets
- Create repair packet = XOR + 'recipe' of data packet IDs

Receive-Side Appliance:
- Lost packet recovered using XOR and other data packets
- At receiver end-host: out of order, no loss

[Diagram: packets 25-29 cross the lambda link (jumbo MTU vs. LAN MTU); packet 27 is lost; the repair packet carries the XOR plus recipe list 25,26,27,28,29; the receive-side appliance recovers packet 27]
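The receive-side step above can be sketched as follows (the buffer layout, packet IDs, and function name are illustrative assumptions): if exactly one packet named in the repair's recipe is missing from the receive buffer, XORing the repair payload with the packets that did arrive yields the missing payload.

```python
# Sketch of recipe-based recovery at the receive-side appliance.

def recover_from_recipe(buffer: dict[int, bytes],
                        recipe: list[int],
                        repair_xor: bytes):
    """Return (lost_id, payload) if exactly one recipe packet is missing."""
    missing = [pid for pid in recipe if pid not in buffer]
    if len(missing) != 1:
        return None  # nothing lost, or too many losses for this repair
    payload = bytearray(repair_xor)
    for pid in recipe:
        if pid in buffer:
            for i, byte in enumerate(buffer[pid]):
                payload[i] ^= byte
    return missing[0], bytes(payload)

buffer = {25: b"aa", 26: b"bb", 28: b"dd", 29: b"ee"}   # packet 27 lost
repair = bytes(a ^ b ^ c ^ d ^ e
               for a, b, c, d, e in zip(b"aa", b"bb", b"cc", b"dd", b"ee"))
print(recover_from_recipe(buffer, [25, 26, 27, 28, 29], repair))
# (27, b'cc')
```

Because recovery happens at the appliance, the end-host's TCP stack only ever sees reordering, never loss, which is what keeps its congestion control from backing off.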
Flow Control

- Two Flow Control Modes for TCP/IP Traffic:

A) End-to-End Flow Control: End-Host ... Appliance ... Appliance ... End-Host
B) Split Flow Control: End-Host ... Appliance | Appliance ... End-Host

- Recall: Two problems with TCP/IP: loss and buffering
- End-to-end mode hides loss
- Split mode buffers data (a standard Performance Enhancing Proxy, PEP)
Layered Interleaving for Bursty Loss

Recovery Latency ∝ Actual Burst Size, not Max Burst Size

[Diagram: XORs X1, X2, X3 computed over the data stream at successively wider interleaves]

- XORs at different interleaves
- Recovery latency degrades gracefully with loss burstiness:
  - X1 catches random singleton losses
  - X2 catches loss bursts of 10 or less
  - X3 catches bursts of 100 or less

[Plot: recovery probability vs. loss probability, Reed-Solomon vs. Maelstrom, at r=7, c=2]
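The layering can be sketched as follows (layer spacings, r, and the helper name are illustrative assumptions following the slide's 1/10/100 interleaves): each layer stripes the stream at interval I and XORs every r-th element of a stripe, so a loss burst shorter than I hits each XOR group at most once and stays recoverable by that layer.

```python
# Sketch of one interleaved FEC layer: which packet indices each XOR covers.

def repair_groups(n_packets: int, r: int, interleave: int) -> list[list[int]]:
    """Packet indices covered by each XOR repair in one layer."""
    groups = []
    for start in range(interleave):             # one stripe per offset
        stripe = list(range(start, n_packets, interleave))
        for i in range(0, len(stripe) - r + 1, r):
            groups.append(stripe[i:i + r])
    return groups

# With interleave 10, a burst of 10 consecutive losses touches each
# stripe (hence each XOR group) at most once, so all are recoverable.
burst = set(range(40, 50))                      # 10-packet loss burst
for group in repair_groups(200, 7, 10):
    assert len(burst & set(group)) <= 1
print("burst of 10 recoverable by the interleave-10 layer")
```

The tight layer (interleave 1) repairs singleton losses almost immediately, while the wide layers only add latency when a long burst actually occurs: recovery latency tracks the actual burst size.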
Implementation Details

- In Kernel: Linux 2.6.20 Module
- Commodity Box: 3 GHz, 1 Gbps NIC (≈ $800)
- Max speed: 1 Gbps; Memory Footprint: 10 MB
- 50-60% CPU → NIC is the bottleneck (for c = 3)
- How do we efficiently store/access/clean a gigabit of data every second?
- Scaling to Multi-Gigabit: Partition IP space across proxies
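One way to read the last bullet is consistent hashing of the address space (the proxy names and hash choice below are assumptions, not from the slides): pin each flow to one appliance by hashing its destination address, so per-flow FEC state never has to span boxes.

```python
# Hypothetical sketch of partitioning the IP space across proxies.
import hashlib
import ipaddress

PROXIES = ["proxy-0", "proxy-1", "proxy-2", "proxy-3"]  # assumed names

def proxy_for(dst_ip: str) -> str:
    """Deterministically map a destination IP to one proxy."""
    packed = ipaddress.ip_address(dst_ip).packed
    digest = hashlib.sha256(packed).digest()
    return PROXIES[digest[0] % len(PROXIES)]

# Every packet of a flow maps to the same proxy:
assert proxy_for("192.0.2.17") == proxy_for("192.0.2.17")
```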
Evaluation: FEC Mode and Loss

Claim: Maelstrom effectively hides loss from TCP/IP

[Plot: throughput (Mbps, 0-1000) vs. one-way link latency (ms, 0-250) for TCP No Loss, Maelstrom No Loss, Maelstrom (0.1%), Maelstrom (1.0%), TCP/IP (0.1%), TCP (1% loss). Annotations: TCP/IP with loss collapses as Tput ∝ 1/RTT; Maelstrom sustains Data + FEC ≈ 1 Gbps]
Evaluation: Split Mode and Buffering

Claim: Maelstrom performance is independent of link length

[Plot: throughput (Mbps, 0-600) vs. one-way link latency (20-100 ms) at loss rates 0.000, 0.001, 0.005, 0.010]
Evaluation: Split Mode and Buffering

Claim: Maelstrom split mode is as good as hand tuning of buffer sizes

[Bar chart: throughput (Mbits/sec, 0-600) at RTT 100 ms and 0.1% loss for client-buf, M-ALL, M-FEC, M-BUF, and tcp. Annotations mark Maelstrom FEC, Maelstrom FEC + Split Mode, and Maelstrom FEC + Hand-Tuned Buffers]
Evaluation: Delivery Latency

Claim: Maelstrom eliminates TCP/IP's loss-related jitter

[Plots: delivery latency (ms, 0-600) vs. packet # (0-6000) at 0.1% loss, for TCP/IP and for Maelstrom]

- Receive-side buffering due to sequencing
- Send-side buffering due to congestion control
SMFS: The Smoke and Mirrors Filesystem

- Classic Mirroring Trade-off:
  - Fast: return to user after sending to mirror
  - Safe: return to user after ACK from mirror
- Maelstrom: Lossy Network → Lossless Network → Disk!
- SMFS: return to user after sending enough FEC
- Result: Fast, Safe Mirroring independent of link length!
- General Principle: Gray-box Exposure of Protocol State
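The trade-off above can be sketched as three acknowledgment policies (the enum names and latency model are illustrative assumptions): "fast" acks after the local send, "safe" waits a full RTT for the mirror's ACK, and the SMFS-style policy acks once the data plus enough FEC has been handed to the link, paying local cost only.

```python
# Toy model of mirroring ack policies and their write latency.
from enum import Enum

class AckPolicy(Enum):
    FAST = "ack after send"                     # fast, unsafe under loss
    SAFE = "ack after remote ACK"               # safe, pays one RTT per write
    NETWORK_SYNC = "ack after data + FEC sent"  # fast and safe

def ack_latency_ms(policy: AckPolicy, local_send_ms: float,
                   rtt_ms: float) -> float:
    """Latency until the write returns to the user."""
    if policy is AckPolicy.SAFE:
        return local_send_ms + rtt_ms           # must wait for the mirror
    return local_send_ms                        # FAST / NETWORK_SYNC: local only

print(ack_latency_ms(AckPolicy.SAFE, 1.0, 200))          # 201.0
print(ack_latency_ms(AckPolicy.NETWORK_SYNC, 1.0, 200))  # 1.0
```

The point of the slide is that once Maelstrom makes the link effectively lossless, acking after "enough FEC sent" is as durable in practice as waiting for the remote ACK, so mirroring latency stops depending on link length.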
Future Work: User-Mode Packet Processors

- Common need: packet processing at gigabit speeds on commodity HW/OS
- Examples: overlay routing, deep packet inspection, application/protocol acceleration (Maelstrom) ...
- Current option: write in-kernel code
- Proposed solution: Lightweight User-Mode Processes
  - Narrow packet-specific interface to kernel
  - Developer writes C code against custom library
  - Pinned memory, circular buffers
Conclusion

- How do enterprises recover from failures in real-time?
- Enabling networks of remote datacenters
- Step 0: Reliable Communication: Maelstrom
- Step 1: Data Mirroring: SMFS
- What's next?