
Hadoop’s Overload Tolerant Design Exacerbates Failure Detection and Recovery

Florin Dinu, T. S. Eugene Ng
Rice University

2

Hadoop is Widely Used

Image Processing

Protein Sequencing

Web Indexing

Machine Learning

Advertising Analytics

Log Storage and Analysis


* Source: http://wiki.apache.org/hadoop/PoweredBy

Recent research work (2010)

3

Compute-Node Failures Are Common

“ ... typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours”

Jeff Dean – Google I/O 2008

“5.0 average worker deaths per job”

Jeff Dean – Keynote I – PACT 2006

At stake: revenue, reputation, user experience

4

Compute-node failures are common and damaging

Hadoop is widely used

How does Hadoop behave under compute-node failures?


Inflated, variable and unpredictable job running times. Sluggish failure detection.

Which design decisions are responsible? This work answers that question.

5

Focus of This Work: Task Tracker Failures

• Loss of intermediate data
• Loss of running tasks
• Data Nodes not failed

Types of failures:
• Task Tracker process fail-stop failures
• Task Tracker node fail-stop failures

Single failures:
• Expose mechanisms and their interactions
• Findings also apply to multiple failures

[Figure: Hadoop architecture – JobTracker and NameNode masters; a worker node runs a Task Tracker (with Mapper and Reducer tasks) and a Data Node]

6

Declaring a Task Tracker Dead

Heartbeats from the Task Tracker to the Job Tracker, usually every 3s.

The Job Tracker checks, every 200s, whether heartbeats have not been sent for at least 600s.

Once a Task Tracker is declared dead: restart its running tasks and restart its completed maps.

Conservative design.

[Figure: timeline of heartbeats and expiry checks, binned at <200s, <400s, <600s, >600s after the last heartbeat]
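A minimal sketch of this expiry logic, assuming (as above) a 600s timeout and a 200s check period; the class and method names are illustrative, not Hadoop's actual JobTracker internals:

```java
// Illustrative sketch of the Task Tracker expiry check described above.
// Not the actual Hadoop JobTracker code; names are hypothetical.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class TrackerExpiryChecker {
    static final long EXPIRY_MS = 600_000;          // declare dead after >= 600s of silence
    static final long CHECK_INTERVAL_MS = 200_000;  // expiry check runs every 200s

    // Last heartbeat time per Task Tracker, updated on every heartbeat (~every 3s).
    final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    void heartbeat(String trackerId) {
        lastHeartbeat.put(trackerId, System.currentTimeMillis());
    }

    // Invoked every CHECK_INTERVAL_MS by a periodic thread.
    void expireDeadTrackers() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() >= EXPIRY_MS) {
                declareDead(e.getKey());  // restart running tasks and completed maps elsewhere
            }
        }
    }

    void declareDead(String trackerId) { /* re-schedule the tracker's tasks */ }
}
```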

7

Declaring a Task Tracker Dead

Variable failure detection time.

Detection time is ~600s in one case and ~800s in another, depending on where the failure falls relative to the periodic 200s checks.

[Figure: two timelines, binned at <200s, <400s, <600s, >600s, illustrating the ~600s and ~800s cases]
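The spread can be reproduced with a small calculation (a sketch, assuming the 200s check period and 600s timeout from the previous slide):

```java
// Illustrative arithmetic for the detection-time variability shown above.
public class DetectionTime {
    // Time from the last heartbeat until the first periodic check at which
    // the 600s timeout has already expired (checks occur every 200s).
    static double detectionDelay(double lastHeartbeat, double firstCheckTime) {
        double check = firstCheckTime;
        while (check - lastHeartbeat < 600) {
            check += 200;  // advance to the next expiry check
        }
        return check - lastHeartbeat;
    }

    public static void main(String[] args) {
        // A check happens shortly after the timeout expires: ~600s detection.
        System.out.println(detectionDelay(0, 10));   // 610.0
        // The timeout expires just after a check, so the tracker is only
        // caught at the following check: ~800s detection.
        System.out.println(detectionDelay(0, 190));  // 790.0
    }
}
```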

8

Declaring Map Output Lost

• Uses notifications from running reducers to the Job Tracker: a message that a specific map output is unavailable.
• Restart map M to re-compute its lost output when:
  #notif(M) > 0.5 × #running reducers and #notif(M) > 3

Conservative design. Static parameters.

[Figure: timeline of reducer notifications to the Job Tracker after the Task Tracker holding M's output fails]
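A sketch of this threshold test (an illustrative helper, not Hadoop's actual JobTracker API):

```java
// Illustrative check for declaring the output of map M lost, per the rule above.
class MapOutputLossTracker {
    /**
     * notificationsForM: "output of map M is unavailable" notifications received so far
     * runningReducers:   number of reducers currently running for the job
     */
    static boolean shouldRestartMap(int notificationsForM, int runningReducers) {
        return notificationsForM > 0.5 * runningReducers
            && notificationsForM > 3;
    }
}
```

Both conditions must hold before the map is re-executed, which is what makes the mechanism conservative.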

9

Reducer Notifications

A notification signals that a specific map output is unavailable.

On a connection error (reducer R1):
• re-attempt the connection
• send a notification when nr of attempts % 10 == 0
• exponential wait between attempts: wait = 10 × (1.3)^(nr_failed_attempts)
• usually 416s are needed for 10 attempts

On a read error (reducer R2):
• send a notification immediately

Conservative design. Static parameters.

[Figure: reducers R1 and R2 trying to fetch map output M5 from the failed Task Tracker and notifying the Job Tracker]
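The 416s figure follows from the backoff rule; a small sketch (assuming the 10 attempts before the first notification are separated by nine waits, with the wait after the k-th failed attempt equal to 10 × 1.3^k seconds):

```java
// Reproduces the ~416s delay to the first notification from the backoff rule above.
public class NotificationDelay {
    public static void main(String[] args) {
        double total = 0;
        // Nine waits separate the 10 connection attempts that precede
        // the first notification (k = 1 .. 9).
        for (int k = 1; k <= 9; k++) {
            total += 10 * Math.pow(1.3, k);
        }
        System.out.printf("time to first notification ~ %.0f s%n", total); // ~416 s
    }
}
```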

10

Declaring a Reducer Faulty

A reducer is considered faulty if (simplified version):
• #shuffles failed > 0.5 × #shuffles attempted, and
• #shuffles succeeded < 0.5 × #shuffles necessary, or the reducer has stalled for too long

Ignores the cause of failed shuffles. Static parameters.
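A sketch of this (simplified) health test; names are illustrative, not Hadoop's actual code:

```java
// Illustrative version of the simplified reducer-health rule above.
class ReducerHealth {
    static boolean isFaulty(int shufflesFailed, int shufflesAttempted,
                            int shufflesSucceeded, int shufflesNecessary,
                            boolean stalledTooLong) {
        boolean tooManyFailures   = shufflesFailed > 0.5 * shufflesAttempted;
        boolean tooLittleProgress = shufflesSucceeded < 0.5 * shufflesNecessary
                                    || stalledTooLong;
        return tooManyFailures && tooLittleProgress;
    }
}
```

Because the rule looks only at ratios and ignores why shuffles failed, a single failed Task Tracker can push otherwise healthy reducers over the threshold (see the induced reducer death slide).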

11

Experiment: Methodology

• 15-node, 4-rack testbed in the OpenCirrus* cluster
• 14 compute nodes, 1 node reserved for the Job Tracker and Name Node
• Sort job, 10GB input, 160 maps, 14 reducers, 200 runs per experiment
• The job takes 220s in the absence of failures
• Inject a single Task Tracker process failure at a random time between 0 and 220s

* https://opencirrus.org/ the HP/Intel/Yahoo! Open Cloud Computing Research Testbed

12

Experiment: Results

Large variability in job running times.

13

Experiment: Results

Large variability in job running times.

[Figure: distribution of job running times annotated with groups G1–G7]

14

Group G1 – few reducers impacted

Slow recovery when few reducers are impacted.

• Map output M1 was copied by all reducers before the failure.
• After the failure, R1_1 (the re-executed attempt of reducer R1) cannot access M1.
• R1_1 needs to send 3 notifications, taking ~1250s (3 × ~416s).
• The Task Tracker itself is declared dead only after 600–800s.

[Figure: maps M1–M3 and reducers R1–R3; after the failure, R1_1 sends Notif(M1) to the Job Tracker]

15

Group G2 – timing of failure

The timing of the failure relative to the Job Tracker's checks impacts the job running time.

• In both G1 and G2 the failure is injected at 170s, yet the jobs end 200s apart: the 200s difference between G1 and G2 comes from whether the 600s timeout expires just before or just after one of the periodic checks.

[Figure: timelines for G1 and G2 – failure at 170s, the 600s timeout, and the job ends 200s apart]

16

Group G3 – early notifications

Early notifications increase job running time variability.

• G1: notifications are sent after 416s.
• G3: early notifications cause map outputs to be declared lost.
• Causes: code-level race conditions; the timing of a reducer's shuffle attempts.

[Figure: shuffle-attempt timelines contrasting a regular notification (at 416s) with an early notification (<416s)]

17

Group G4 & G5 – many reducers impacted

The job running time under failure varies with the number of reducers impacted.

• G4: many reducers send notifications after 416s; the map output is declared lost before the Task Tracker is declared dead.
• G5: same as G4, but early notifications are sent.

[Figure: after the failure, several reducers send Notif(M1,M2,M3,M4,M5) to the Job Tracker]

18

Induced Reducer Death

Recall that a reducer is considered faulty if (simplified version):
• #shuffles failed / #shuffles attempted > 0.5, and
• #shuffles succeeded / #shuffles necessary < 0.5, or the reducer has stalled for too long

• If the failed Task Tracker is contacted among the first Task Trackers, the reducer dies (illustrative numbers below).
• If the failed Task Tracker is attempted too many times, the reducer dies.

A failure can induce other failures in healthy reducers. CPU time and network bandwidth are unnecessarily wasted.
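For instance, plugging hypothetical numbers into the simplified rule shows how a freshly started reducer whose first shuffle attempts happen to hit the failed Task Tracker is declared faulty (illustrative values only, not measurements from the paper):

```java
// Illustrative numbers: a young reducer early in its shuffle phase whose
// first attempts mostly target the failed Task Tracker.
public class InducedDeathExample {
    public static void main(String[] args) {
        int shufflesAttempted = 4;   // only a few attempts made so far
        int shufflesFailed    = 3;   // the attempts that went to the failed tracker
        int shufflesSucceeded = 1;
        int shufflesNecessary = 14;  // hypothetical: one per map output
        boolean faulty = shufflesFailed > 0.5 * shufflesAttempted
                      && shufflesSucceeded < 0.5 * shufflesNecessary;
        System.out.println("reducer declared faulty: " + faulty); // true
    }
}
```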

19

56 vs 14 Reducers

Job running times are spread out even more: there is an increased chance of induced reducer death or of early notifications.

[Figure: CDF of job running times with 56 reducers vs. 14 reducers]

20

Simulating Node Failure

Without RST packets, all affected tasks wait for the Task Tracker to be declared dead. (When only the Task Tracker process fails, the node's OS answers connection attempts with TCP RSTs, so reducers see errors quickly; when the entire node fails, no RSTs are sent and connections simply hang until they time out.)

[Figure: CDF of job running times when a node failure is simulated]

21

Lack of Adaptivity

Recall:
• A notification is sent after 10 attempts.

Inefficiency:
• A static, one-size-fits-all solution cannot handle all situations.
• Efficiency varies with the number of reducers.

A way forward:
• Use more detailed information about the current job state.

22

Conservative Design

Recall:
• Declare a Task Tracker dead after at least 600s.
• Send a notification after 10 attempts and 416 seconds.

Inefficiency:
• Assumes most problems are transient.
• Sluggish response to permanent compute-node failure.

A way forward:
• Leverage additional information:
  • Network state information
  • Historical information about compute-node behavior [OSDI ‘10]

23

Simplistic Failure Semantics

• Lack of TCP connectivity is treated as a problem with the tasks.

Inefficiency:
• Cannot distinguish between multiple causes for the lack of connectivity:
  • Transient congestion
  • Compute-node failure

A way forward:
• Decouple failure recovery from overload recovery.
• Use AQM/ECN to provide extra congestion information.
• Allow direct communication between the application and the infrastructure.

24

Thank you

Company and product logos are from the companies' websites. Conference logos are from the conference websites.

Links to images:
http://t0.gstatic.com/images?q=tbn:ANd9GcTQRDXdzM6pqTpcOil-k2d37JdHnU4HKue8AKqtKCVL5LpLPV-2
http://www.scanex.ru/imgs/data-processing-sample1.jpg
http://t3.gstatic.com/images?q=tbn:ANd9GcQSVkFAbm-scasUkz4lQ-XlPNkbDX9SVD-PXF4KlGwDBME4ugxc
http://criticalmas.com/wp-content/uploads/2009/07/the-borg.jpg
http://www.brightermindspublishing.com/wp-content/uploads/2010/02/advertising-billboard.jpg
http://www.planetware.com/i/photo/logs-stacked-in-port-st-lucie-fla513.jpg

25

Group G3 – early notifications (backup)

Early notifications increase job running time variability.

• G1: notifications are sent after 416s.
• G3: early notifications cause map outputs to be declared lost.
• Causes: code-level race conditions; the timing of a reducer's shuffle attempts.

[Figure: detailed shuffle-attempt timelines for reducer R2 fetching map outputs M5 and M6 from the failed Task Tracker; the interleaving of attempts (M5-1, M6-1, M5-2, ...) leads to a regular notification in one case and an early notification in the other]

26

Task Tracker Failure-Related Mechanisms

• Declaring a Task Tracker dead
• Declaring a map output lost
• Declaring a reducer faulty