Page 1: Tail Latency: Networking

Tail Latency: Networking

Page 2: Tail Latency: Networking

The story thus far

• Tail latency is bad

• Causes:
  – Resource contention with background jobs
  – Device failure
  – Uneven split of data between tasks
  – Network congestion for reducers

Page 3: Tail Latency: Networking

Ways to address tail latency

• Clone all tasks
• Clone slow tasks
• Copy intermediate data
• Remove/replace frequently failing machines
• Spread out reducers

Page 4: Tail Latency: Networking

What is missing from this picture?

• Networking:
  – Spreading out reducers is not sufficient

• The network is extremely crucial
  – Studies on Facebook traces show that [Orchestra]:
    • in 26% of jobs, shuffle is 50% of the runtime
    • in 16% of jobs, shuffle is more than 70% of the runtime
    • 42% of tasks spend over 50% of their time writing to HDFS

Page 5: Tail Latency: Networking

Other Implication: the Network Limits Scalability

Scalability of Netflix-like recommendation system is bottlenecked by communication

[Figure: iteration time (s), split into communication and computation, vs. number of machines (10, 30, 60, 90)]

• Did not scale beyond 60 nodes
  » Communication time increased faster than computation time decreased

Page 6: Tail Latency: Networking

What is the Impact of the Network?

• Assume a 10 ms deadline for tasks [DCTCP]

• Simulate job completion times based on the distributions of task completion times (focus on the 99.9th percentile); see the sketch below

• For 40 tasks, about 4 (14%) fail; for 400 tasks, about 14 (3%) fail
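A minimal sketch of such a simulation (the task-latency distribution below is an illustrative assumption, not the DCTCP trace data); it shows how the chance that a job's slowest task misses the deadline grows with the number of tasks per job:

```python
import random

# Illustrative assumption: task latency is usually ~2 ms but has a heavy tail.
def sample_task_latency_ms():
    if random.random() < 0.99:
        return random.expovariate(1 / 2.0)   # common case
    return random.uniform(5.0, 20.0)         # occasional straggler

def job_misses_deadline(num_tasks, deadline_ms=10.0):
    # A job finishes only when its slowest task finishes.
    return max(sample_task_latency_ms() for _ in range(num_tasks)) > deadline_ms

TRIALS = 10_000
for n in (40, 400):
    misses = sum(job_misses_deadline(n) for _ in range(TRIALS))
    print(f"{n} tasks/job: {100 * misses / TRIALS:.1f}% of jobs miss the 10 ms deadline")
```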

Page 10: Tail Latency: Networking

What Causes this Variation in Network Transfer Times?

• First, let's look at the types of traffic in the network

• Background traffic
  – Latency-sensitive short control messages, e.g. heartbeats, job status
  – Large files, e.g. HDFS replication, loading of new data

• Map-reduce jobs
  – Small RPC request/response with tight deadlines
  – HDFS reads or writes with tight deadlines

Page 11: Tail Latency: Networking

What Causes this Variation in Network Transfer Times?

• No notion of priority
  – Latency-sensitive and non-latency-sensitive traffic share the network equally

• Uneven load balancing (see the sketch below)
  – ECMP doesn't schedule flows evenly across all paths
  – It assumes long and short flows are the same

• Bursts of traffic
  – Networks have buffers, which reduce loss but introduce latency (time spent waiting in a buffer is variable)
  – Kernel optimizations introduce burstiness
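A minimal sketch of why ECMP's per-flow hashing can leave load uneven (the path count, addresses, and flow sizes are illustrative assumptions):

```python
import random
from collections import defaultdict

NUM_PATHS = 4
random.seed(1)

# Illustrative mix: mostly small flows plus a few large ("elephant") flows.
flows = []
for _ in range(200):
    src = f"10.0.0.{random.randint(1, 50)}"
    dst = f"10.0.1.{random.randint(1, 50)}"
    size = random.choice([10_000] * 9 + [10_000_000])  # bytes
    flows.append((src, dst, size))

path_bytes = defaultdict(int)
for src, dst, size in flows:
    # ECMP hashes the flow identifier: every packet of a flow takes the same
    # path, and the hash never considers flow size or current path load.
    path = hash((src, dst)) % NUM_PATHS
    path_bytes[path] += size

for path in range(NUM_PATHS):
    print(f"path {path}: {path_bytes[path]:>12,} bytes")
```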

Page 12: Tail Latency: Networking

Ways to Eliminate Variation and Improve Tail Latency

• Make the network faster
  – HULL, DeTail, DCTCP
  – Faster networks == smaller tail

• Optimize how applications use the network
  – Orchestra, CoFlows
  – Specific big-data transfer patterns; optimize the patterns to reduce transfer time

• Make the network aware of deadlines
  – D3, PDQ
  – Tasks have deadlines; there is no point doing any work if the deadline won't be met
  – Prioritize flows and schedule them based on deadlines

Page 13: Tail Latency: Networking

Fair Sharing or Deadline-Based Sharing

• Fair sharing (status quo)
  – Everyone plays nice, but some deadlines can be missed

• Deadline-based
  – Deadlines are met, but may require a non-trivial implementation

• Two ways to do deadline-based sharing
  – Earliest deadline first (PDQ)
  – Make bandwidth reservations for each flow (see the sketch below)
    • Flow rate = flow size / flow deadline
    • Flow size and deadline are known a priori
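A minimal sketch of the reservation-based variant (the link capacity, flow sizes, and deadlines are illustrative assumptions):

```python
LINK_CAPACITY_GBPS = 10.0  # assumed bottleneck link capacity

# (flow name, size in gigabits, deadline in seconds) -- illustrative values
flows = [("rpc-a", 0.05, 0.010), ("rpc-b", 0.02, 0.005), ("hdfs-read", 4.0, 1.0)]

reserved = 0.0
for name, size_gbit, deadline_s in flows:
    rate = size_gbit / deadline_s  # flow rate = flow size / flow deadline
    if reserved + rate <= LINK_CAPACITY_GBPS:
        reserved += rate           # admit the flow and hold the bandwidth for it
        print(f"{name}: reserved {rate:.2f} Gbit/s (total reserved {reserved:.2f})")
    else:
        print(f"{name}: rejected, reservation would exceed link capacity")
```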

Page 15: Tail Latency: Networking

Issues with Deadline-Based Scheduling

• Implications for non-deadline-based jobs
  – Starvation? Poor completion times?

• Implementation issues
  – Assign deadlines to flows, not packets
  – Reservation approach
    • Requires a reservation for each flow
    • Big-data flows can be small and have small RTTs
    – The control loop must be extremely fast
  – Earliest deadline first
    • Requires coordination between switches and servers
    • Servers: specify flow deadlines
    • Switches: prioritize flows and determine rates
    – May require complex switch mechanisms

Page 16: Tail Latency: Networking

How do you make the Network Faster?

• Throw more hardware at the problem
  – Fat-Tree, VL2, B-Cube, Dragonfly
  – Increases bandwidth (throughput) but does not necessarily reduce latency

Page 17: Tail Latency: Networking

So, how do you reduce latency?

• Trade bandwidth for latency
  – Buffering adds variation (unpredictability)
  – Eliminate network buffering and bursts

• Optimize the network stack
  – Use link-level information to detect congestion
  – Inform the application to adapt by using a different path

Page 18: Tail Latency: Networking

HULL: Trading BW for Latency

• Buffering introduces latency
  – Buffers are used to accommodate bursts
  – And to allow congestion control to get good throughput

• Removing buffers means
  – Lower throughput for large flows
  – The network can't handle bursts
  – Predictable low latency

Page 19: Tail Latency: Networking

Why Do Bursts Exist?

• Systems review:
  – The NIC (network card) informs the OS of arriving packets via interrupts
    • Interrupts consume CPU
    • With one interrupt per packet, the CPU would be overwhelmed
  – Optimization: batch packets up before raising an interrupt (see the sketch below)
    • The size of the batch is the size of the burst
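A minimal sketch of how interrupt batching creates bursts (the batch size and arrival rate are illustrative assumptions):

```python
BATCH_SIZE = 32           # assumed coalescing threshold: one interrupt per 32 packets
PACKET_INTERVAL_US = 1.0  # assumed steady arrival: one packet every microsecond

pending = 0
for i in range(1, 101):             # 100 packets arrive one at a time
    pending += 1
    if pending == BATCH_SIZE:       # the NIC raises one interrupt for the whole batch
        t_us = i * PACKET_INTERVAL_US
        print(f"t={t_us:5.1f} us: interrupt delivers a burst of {pending} packets")
        pending = 0                 # the stack now handles 32 packets back-to-back
```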

Page 21: Tail Latency: Networking

Why Does Congestion Control Need Buffers?

• Congestion control (i.e., TCP)
  – Detects bottleneck link capacity through packet loss
  – When loss occurs, it halves its sending rate

• Buffers help keep the network busy
  – Important for when TCP halves its sending rate

• Essentially, the network must provide double the capacity for TCP to work well
  – Buffers allow for this doubling

Page 22: Tail Latency: Networking

TCP Review

• Bandwidth-delay product rule of thumb:
  – A single flow needs C×RTT of buffering for 100% throughput (a worked example follows the figure below)

[Figure: throughput vs. buffer size B — throughput reaches 100% when B ≥ C×RTT, and stays below 100% when B < C×RTT]
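A worked instance of the rule of thumb (the link speed and RTT are illustrative data-center values, not numbers from the slide):

```python
C_BITS_PER_S = 10e9   # assumed link capacity: 10 Gbit/s
RTT_S = 100e-6        # assumed round-trip time: 100 microseconds

bdp_bits = C_BITS_PER_S * RTT_S   # bandwidth-delay product C x RTT
buffer_kib = bdp_bits / 8 / 1024
print(f"B >= C x RTT = {buffer_kib:.0f} KiB of buffering for full throughput with one flow")
```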

Page 23: Tail Latency: Networking

Key Ideas Behind HULL

• Eliminate bursts
  – Add a token bucket (pacer) into the network (see the sketch below)
  – The pacer must be in the network so that pacing happens after the system optimizations that cause bursts

• Eliminate buffering
  – Send congestion notification messages before the link is fully utilized
    • Make applications believe the link is full while there is still capacity
  – TCP has a poor congestion control algorithm
    • Replace it with DCTCP
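A minimal software sketch of the token-bucket pacing idea (HULL paces below the host stack, in the network; the rate, bucket depth, and packet size here are illustrative assumptions, not HULL's actual implementation):

```python
import collections

class TokenBucketPacer:
    """Release queued packets at a fixed rate to smooth out bursts."""

    def __init__(self, rate_bytes_per_s, bucket_depth_bytes):
        self.rate = rate_bytes_per_s
        self.depth = bucket_depth_bytes     # small depth => little tolerated burstiness
        self.tokens = bucket_depth_bytes
        self.queue = collections.deque()
        self.last_t = 0.0

    def enqueue(self, packet_bytes):
        self.queue.append(packet_bytes)

    def release_ready(self, now):
        """Return the packets allowed to leave by time `now` (seconds)."""
        self.tokens = min(self.depth, self.tokens + (now - self.last_t) * self.rate)
        self.last_t = now
        sent = []
        while self.queue and self.queue[0] <= self.tokens:
            pkt = self.queue.popleft()
            self.tokens -= pkt
            sent.append(pkt)
        return sent

# A burst of 10 x 1500-byte packets arrives at t=0 but drains smoothly over time.
pacer = TokenBucketPacer(rate_bytes_per_s=1_250_000, bucket_depth_bytes=1500)
for _ in range(10):
    pacer.enqueue(1500)
for ms in range(0, 16, 2):
    released = pacer.release_ready(now=ms / 1000.0)
    print(f"t={ms:2d} ms: released {len(released)} packet(s)")
```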

Page 25: Tail Latency: Networking

Orchestra: Managing Data Transfers in Computer Clusters

• Group all flows belonging to a stage into a transfer

• Perform inter-transfer coordination

• Optimize at the level of transfers rather than individual flows

Page 26: Tail Latency: Networking

Transfer Patterns

Transfer: set of all flows transporting data between two stages of a job
– Acts as a barrier

Completion time: Time for the last receiver to finish

[Figure: transfer patterns between job stages — shuffle (map → reduce), broadcast, and incast*, with maps reading from and reduces writing to HDFS]

Page 27: Tail Latency: Networking

[Figure: Orchestra architecture — an Inter-Transfer Controller (ITC) sits above per-transfer controllers: a shuffle Transfer Controller (TC) choosing between Hadoop shuffle and WSS, and broadcast TCs choosing between HDFS, tree, and Cornet; the ITC arbitrates among transfers (shuffle, broadcast 1, broadcast 2) using fair sharing, FIFO, or priority]

Orchestra
• Cooperative broadcast (Cornet)
  – Infer and utilize topology information
• Weighted Shuffle Scheduling (WSS)
  – Assign flow rates to optimize shuffle completion time
• Inter-Transfer Controller (ITC)
  – Implement weighted fair sharing between transfers
• End-to-end performance

Page 28: Tail Latency: Networking

Cornet: Cooperative broadcast

Broadcast the same data to every receiver
» Fast, scalable, adaptive to bandwidth, and resilient

Peer-to-peer mechanism optimized for cooperative environments
– Uses BitTorrent-style distribution of data

Observations → Cornet design decisions:
1. High-bandwidth, low-latency network → Large block size (4-16 MB)
2. No selfish or malicious peers → No need for incentives (e.g., TFT), no (un)choking, everyone stays till the end
3. Topology matters → Topology-aware broadcast

Page 29: Tail Latency: Networking

Topology-aware Cornet

Many data center networks employ tree topologies

Each rack should receive exactly one copy of the broadcast
– Minimizes cross-rack communication

Topology information reduces cross-rack data transfer
– A mixture of spherical Gaussians is used to infer the network topology

Page 30: Tail Latency: Networking

Status quo in Shuffle

[Figure: shuffle example — senders s1-s5 sending to receivers r1 and r2 under plain fair sharing]

• Links to r1 and r2 are full: 3 time units
• Link from s3 is full: 2 time units
• Completion time: 5 time units

Page 31: Tail Latency: Networking

Weighted Shuffle Scheduling

Allocate rates to each flow using weighted fair sharing, where the weight of a flow between a sender-receiver pair is proportional to the total amount of data to be sent.

[Figure: same shuffle example — senders s1-s5, receivers r1 and r2, flow weights 1, 1, 2, 2, 1, 1]

• Completion time: 4 time units
  » Up to 1.5X improvement (see the sketch below)
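A minimal sketch of the weighted allocation for this example, with the receiver downlink capacity normalized to 1 data unit per time unit and flow sizes taken from the figure's weights (assuming s3 sends 2 units to each receiver and the other senders 1 unit each):

```python
# Flows: (sender, receiver, data units); WSS weights are proportional to data size.
# The sender/size mapping below is assumed from the figure's weights 1, 1, 2, 2, 1, 1.
flows = [("s1", "r1", 1), ("s2", "r1", 1), ("s3", "r1", 2),
         ("s3", "r2", 2), ("s4", "r2", 1), ("s5", "r2", 1)]
CAPACITY = 1.0  # each receiver downlink moves 1 data unit per time unit

for receiver in ("r1", "r2"):
    inbound = [(s, d) for s, r, d in flows if r == receiver]
    total = sum(d for _, d in inbound)
    for sender, data in inbound:
        rate = CAPACITY * data / total   # weighted fair share of the downlink
        print(f"{sender}->{receiver}: rate {rate:.2f}, finishes at t = {data / rate:.0f}")

# Every flow finishes at t = 4 (vs. 5 time units under plain per-flow fair
# sharing on the previous slide), and s3's uplink carries 0.5 + 0.5 = 1.0,
# so it never becomes the bottleneck.
```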

Page 32: Tail Latency: Networking

Faster spam classification

Communication reduced from 42% to 28% of the iteration time

Overall 22% reduction in iteration time

Page 33: Tail Latency: Networking

Summary

• Discussed tail latency in the network
  – Types of traffic in the network
  – Implications for jobs
  – Causes of tail latency

• Discussed HULL:
  – Trades bandwidth for latency
  – Penalizes huge flows
  – Eliminates bursts and buffering

• Discussed Orchestra:
  – Optimizes transfers instead of individual flows
  – Utilizes knowledge about application semantics

http://www.mosharaf.com/