Understanding the Effects and Implications of Compute Node Failures in

Florin Dinu, T. S. Eugene Ng

Page 1: Understanding the Effects and Implications of Compute Node Failures in

Understanding the Effects and Implications of Compute Node Failures in

Florin Dinu T. S. Eugene Ng

Page 2: Understanding the Effects and Implications of Compute Node Failures in

2

Computing in the Big Data Era

15 PB, 20 PB, 100 PB, 120 PB

• Big Data – Challenging for previous systems

• Big Data Frameworks
  – MapReduce @ Google
  – Dryad @ Microsoft
  – Hadoop @ Yahoo & Facebook

Page 3: Understanding the Effects and Implications of Compute Node Failures in

3

Image Processing

Protein Sequencing

Web Indexing

Machine Learning

Advertising Analytics

Log Storage and Analysis

Hadoop Is Widely Used

and many more …

Page 4: Understanding the Effects and Implications of Compute Node Failures in

4

SIGMOD 2010

Building Around Hadoop

Page 5: Understanding the Effects and Implications of Compute Node Failures in

5

Building On Top Of Hadoop

Building on core Hadoop functionality

Page 6: Understanding the Effects and Implications of Compute Node Failures in

6

The Danger of Compute-Node Failures

“In each cluster’s first year, it’s typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur.”

Jeff Dean – Google I/O 2008

Causes:
• large scale
• use of commodity components

“Average worker deaths per job: 5.0”

Jeff Dean – Keynote I – PACT 2006

Page 7: Understanding the Effects and Implications of Compute Node Failures in

7

The Danger of Compute-Node Failures

In the cloud, compute node failures are the norm, NOT the exception

Amazon, SOSP 2009

Page 8: Understanding the Effects and Implications of Compute Node Failures in

8

Failures From Hadoop’s Point of View

Important to understand effect of compute-node failures on Hadoop

Situations indistinguishable from compute node failures:
• Switch failures
• Longer-term disconnectivity
• Unplanned reboots
• Maintenance work (upgrades)
• Quota limits

Challenging environments:
• Spot markets (price-driven availability)
• Volunteer computing systems
• Virtualized environments

Page 9: Understanding the Effects and Implications of Compute Node Failures in

9

• Hadoop is widely used
• Compute node failures are common

Hadoop needs to be failure resilient

The Problem

Hadoop needs to be failure resilient in an efficient way

• Minimize impact on job running times
• Minimize resources needed

Page 10: Understanding the Effects and Implications of Compute Node Failures in

10

Contribution

• First in-depth analysis of the impact of failures on Hadoop

– Uncovers several inefficiencies
  • Potential for future work

– Immediate practical relevance

– Basis for realistic modeling of Hadoop

Page 11: Understanding the Effects and Implications of Compute Node Failures in

11

Quick Hadoop Background

Page 12: Understanding the Effects and Implications of Compute Node Failures in

12

Background – the Tasks

[Diagram: the master node runs the JobTracker and NameNode; each worker node runs a TaskTracker and a DataNode and hosts map (M) and reduce (R) tasks. TaskTrackers request work from the JobTracker ("Give me work!", "More work?"). The example job runs as 2 waves of maps and 2 waves of reducers.]

Page 13: Understanding the Effects and Implications of Compute Node Failures in

13

Background – Data Flow

HDFS → Map tasks → Shuffle → Reducer tasks → HDFS

Page 14: Understanding the Effects and Implications of Compute Node Failures in

14

Background – Speculative Execution

0 ≤ Progress Score ≤ 1
Progress Rate = Progress Score / time (e.g. 0.05/sec)

Ideal case: similar progress rates

Page 15: Understanding the Effects and Implications of Compute Node Failures in

15

Background – Speculative Execution (SE)

Reality: varying progress rates

Goal of SE:
• Detect underperforming nodes
• Duplicate the computation

Reasons for underperforming tasks: node overload, network congestion, etc.

Underperforming tasks (outliers) in Hadoop: > 1 STD slower than the mean progress rate
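The outlier rule above can be sketched as a simple threshold check. This is an illustrative reconstruction in Python, not Hadoop's actual (Java) code; the task names and rates are made up:

```python
# Sketch of Hadoop-style outlier detection: a task is an outlier
# (a candidate for speculative execution) if its progress rate is
# more than one standard deviation below the mean progress rate.
from statistics import mean, pstdev

def find_outliers(progress_rates):
    """Return task ids whose rate is below mean - 1*stddev."""
    rates = list(progress_rates.values())
    threshold = mean(rates) - pstdev(rates)
    return [task for task, pr in progress_rates.items() if pr < threshold]

# Hypothetical job: four healthy map tasks and one slow one.
rates = {"M1": 0.050, "M2": 0.048, "M3": 0.052, "M4": 0.049, "M5": 0.010}
print(find_outliers(rates))  # → ['M5']
```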

Page 16: Understanding the Effects and Implications of Compute Node Failures in

16

How does Hadoop detect failures?

Page 17: Understanding the Effects and Implications of Compute Node Failures in

17

Failures of the Distributed Processes

Timeouts, Heartbeats & Periodic Checks

[Diagram: the worker's TaskTracker and DataNode send heartbeats to the master; a failure interrupts the heartbeat stream.]

Page 18: Understanding the Effects and Implications of Compute Node Failures in

18

Timeouts, Heartbeats & Periodic Checks

Conservative approach – the last line of defense:
• A failure interrupts the heartbeat stream
• The master periodically checks for changes
• Failure is declared after a number of missed checks
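The periodic-check detector described above can be sketched as follows; the 200s check interval and the three-check limit are illustrative assumptions, not Hadoop's actual configuration values:

```python
# Sketch of a heartbeat-based failure detector in the spirit of the
# JobTracker's periodic checks. A node is declared dead only after
# several consecutive checks find no fresh heartbeat.
CHECK_INTERVAL = 200  # seconds between periodic checks (assumed)
MISSED_LIMIT = 3      # missed checks before declaring failure (assumed)

def is_declared_dead(last_heartbeat, now):
    """True once the node has missed MISSED_LIMIT consecutive checks."""
    missed_checks = int((now - last_heartbeat) // CHECK_INTERVAL)
    return missed_checks >= MISSED_LIMIT

print(is_declared_dead(last_heartbeat=0, now=400))  # False: 2 missed checks
print(is_declared_dead(last_heartbeat=0, now=650))  # True: 3 missed checks
```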

Page 19: Understanding the Effects and Implications of Compute Node Failures in

19

Failures of the Individual Tasks (Maps)

Infer map failures from notifications.
Conservative – does not react to temporary failures.

[Diagram: a reducer repeatedly asks a map for its output ("Give me data!"); after failed attempts spaced Δt apart, it notifies the master that "M does not answer".]
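A minimal sketch of the notification path: a reducer counts failed fetch attempts per map and, after every few consecutive failures, tells the master that the map does not answer. The threshold of three failed attempts per notification is an assumption for illustration, not Hadoop's actual value:

```python
# Sketch: reducers turn repeated failed shuffle fetches into
# notifications to the master ("M does not answer").
FAILED_FETCHES_PER_NOTIFICATION = 3  # assumed threshold

class Reducer:
    def __init__(self):
        self.failed_fetches = {}  # map task id -> consecutive failures
        self.notifications = []   # map task ids reported to the master

    def fetch_failed(self, map_task):
        count = self.failed_fetches.get(map_task, 0) + 1
        self.failed_fetches[map_task] = count
        if count % FAILED_FETCHES_PER_NOTIFICATION == 0:
            self.notifications.append(map_task)

r = Reducer()
for _ in range(3):
    r.fetch_failed("M1")  # three failed attempts, spaced Δt apart
print(r.notifications)  # → ['M1']
```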

Page 20: Understanding the Effects and Implications of Compute Node Failures in

20

Failures of the Individual Tasks (Reducers)

Notifications also help infer reducer failures:
• Does R complain too much? (failed / successful attempts)
• Has R stalled for too long? (no new successful attempts)

[Diagram: the reducer's "Give me data!" fetches fail; the master weighs the resulting "M does not answer" notifications.]
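The two conditions above can be sketched as one predicate; both thresholds below are illustrative assumptions, not Hadoop's actual defaults:

```python
# Sketch: inferring that a reducer itself is faulty, based on its
# own fetch history rather than on the maps it complains about.
MAX_FAILED_FRACTION = 0.5  # "complains too much" (assumed threshold)
MAX_STALL_SECONDS = 600    # "stalled for too long" (assumed threshold)

def reducer_looks_faulty(failed, succeeded, secs_since_last_success):
    total = failed + succeeded
    complains_too_much = total > 0 and failed / total > MAX_FAILED_FRACTION
    stalled_too_long = secs_since_last_success > MAX_STALL_SECONDS
    return complains_too_much or stalled_too_long

print(reducer_looks_faulty(3, 0, 10))   # True: every fetch attempt failed
print(reducer_looks_faulty(0, 5, 900))  # True: no recent successful attempt
print(reducer_looks_faulty(1, 9, 10))   # False: mostly healthy
```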

Page 21: Understanding the Effects and Implications of Compute Node Failures in

21

Do these mechanisms work well?

Page 22: Understanding the Effects and Implications of Compute Node Failures in

22

Methodology

• Focus on failures of distributed components (TaskTracker and DataNode)
• Inject these failures separately
• Single failures
  – Enough to catch many shortcomings
  – Identified the mechanisms responsible
  – Relevant to multiple failures too

Page 23: Understanding the Effects and Implications of Compute Node Failures in

23

Mechanisms Under TaskTracker Failure?

LARGE, VARIABLE, UNPREDICTABLE job running times.
Poor performance under failure.

Experiment setup:
• OpenCirrus
• Sort 10 GB
• 15 nodes
• 14 reducers
• Inject a failure at a random time
• 220s running time without failures

Findings also relevant to larger jobs.

Page 24: Understanding the Effects and Implications of Compute Node Failures in

24

Clustering Results Based on Cause

• Few reducers impacted: the notification mechanism is ineffective; timeouts fire.
• In 70% of cases the notification mechanism is ineffective.
• Failure has no impact: not due to notifications.

Page 25: Understanding the Effects and Implications of Compute Node Failures in

25

Clustering Results Based on Cause

• More reducers impacted: the notification mechanism detects the failure; timeouts do not fire.
• The notification mechanism detects the failure only in a few cases, at a specific moment in the job.

Page 26: Understanding the Effects and Implications of Compute Node Failures in

26

Side Effects: Induced Reducer Death

Failures propagate to healthy tasks.

• R complains too much? (failed / total attempts), e.g. 3 out of 3 failed
• Unlucky reducers die early

Negative effects:
• Time and resource waste for re-execution
• Job failure – a small number of runs fail completely

Page 27: Understanding the Effects and Implications of Compute Node Failures in

27

Side Effects: Induced Reducer Death

• R stalled for too long? (no new successful attempts)
• All reducers may eventually die

Fundamental problem:
• Task failures are inferred from connection failures
• Connection failures have many possible causes
• Hadoop has no way to distinguish the cause (source? destination?)

Page 28: Understanding the Effects and Implications of Compute Node Failures in

28

More Reducers: 4/Node = 56 Total

[CDF of job running times]

Job running time is spread out even more.
More reducers = more chances for the explained effects.

Page 29: Understanding the Effects and Implications of Compute Node Failures in

29

Effect of DataNode Failures

TaskTracker

M R

DataNode

Page 30: Understanding the Effects and Implications of Compute Node Failures in

30

Timeouts When Writing Data


Write Timeout (WTO)

Page 31: Understanding the Effects and Implications of Compute Node Failures in

31

Timeouts When Writing Data


Connect Timeout (CTO)

Page 32: Understanding the Effects and Implications of Compute Node Failures in

32

Effect on Speculative Execution

Outliers in Hadoop: > 1 STD slower than the mean progress rate.

[Diagram: tasks with low and high progress rates; the AVG – 1·STD threshold marks the outliers. A task with a very high PR shifts both AVG and AVG – 1·STD.]
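The threshold shift can be reproduced numerically: a single very fast speculative replica inflates both the mean and the standard deviation, pulling AVG – 1·STD below even a nearly stalled task's rate. The progress rates below are made up for illustration:

```python
# Sketch showing how one very fast speculative task hides a stuck one.
from statistics import mean, pstdev

def is_outlier(pr, all_rates):
    """Hadoop-style rule: outlier if pr < mean - 1*stddev."""
    return pr < mean(all_rates) - pstdev(all_rates)

normal = [0.05] * 10  # ten reducers progressing at a typical rate
stuck = 0.001         # one reducer stuck in a write timeout

# Without a restarted task, the stuck reducer is flagged as an outlier.
print(is_outlier(stuck, normal + [stuck]))             # → True

# A speculative re-execution finishes very fast (its input is already
# available), so its huge progress rate skews the statistics.
very_fast = 0.5
print(is_outlier(stuck, normal + [stuck, very_fast]))  # → False
```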

Page 33: Understanding the Effects and Implications of Compute Node Failures in

33

Delayed Speculative Execution

• ~50s: reducers R9 and R11 wait for mappers
• ~100s: map outputs are read
• ~150s: reducers write their output

(Outlier threshold: Avg(PR) – STD(PR))

Page 34: Understanding the Effects and Implications of Compute Node Failures in

34

Delayed Speculative Execution

• ~200s: failure occurs; reducers hit write timeouts (WTO); R9 is speculatively executed
• >200s: the new, very fast R9 skews the statistics
• ~400s: R11 is finally speculatively executed, once the threshold drops low enough

Page 35: Understanding the Effects and Implications of Compute Node Failures in

35

Delayed Speculative Execution

• Hadoop’s assumptions about progress rates are invalidated
• Stats are skewed by a very fast speculated task
• Significant impact on job running time

Page 36: Understanding the Effects and Implications of Compute Node Failures in

36

52 Reducers – 1 Wave

• Reducers stuck in a WTO → delayed speculative execution
• CTO after WTO: reducers reconnect to the failed DataNode

Page 37: Understanding the Effects and Implications of Compute Node Failures in

37

Delayed SE – A General Problem

• Failures and timeouts are not the only cause
• Delayed SE requires:
  – Slow tasks that would benefit from SE (shown: tasks stuck in a WTO; also: slow or heterogeneous nodes, slow transfers in heterogeneous networks)
  – Fast-advancing tasks (shown: varying input data availability; also: varying task input size, varying network speed)

Statistical SE algorithms need to be used carefully.

Page 38: Understanding the Effects and Implications of Compute Node Failures in

38

Conclusion – Inefficiencies Under Failures

• TaskTracker failures
  – Large, variable, and unpredictable job running times
  – Variable efficiency depending on reducer number
  – Failures propagate to healthy tasks
  – Success of TCP connections is not enough
• DataNode failures
  – Delayed speculative execution
  – No sharing of potential failure information (details in the paper)

Page 39: Understanding the Effects and Implications of Compute Node Failures in

39

Ways Forward

• Provide dynamic information about the infrastructure to applications (at least in private DCs)
• Make speculative execution cause-aware
  – Why is a task slow at runtime?
  – Move beyond statistical SE algorithms
  – Estimate task PRs (using environment and data characteristics)
• Share some information between tasks
  – In Hadoop, tasks rediscover failures individually
  – Lots of work exists on SE decisions (when and where to SE)
  – These decisions can be invalidated by such runtime inefficiencies

Page 40: Understanding the Effects and Implications of Compute Node Failures in

40

Thank you

Page 41: Understanding the Effects and Implications of Compute Node Failures in

41

Backup slides

Page 42: Understanding the Effects and Implications of Compute Node Failures in

Experiment: Results

Large variability in job running times.

[Plot residue: job running times clustered into groups G1–G7 by cause.]

Page 43: Understanding the Effects and Implications of Compute Node Failures in

43

Group G1 – few reducers impacted

Slow recovery when few reducers are impacted.

• M1 was copied by all reducers before the failure.
• After the failure, R1_1 cannot access M1.
• R1_1 needs to send 3 notifications (~1250s); the TaskTracker is declared dead after 600-800s.

Page 44: Understanding the Effects and Implications of Compute Node Failures in

44

Group G2 – timing of failure

The timing of the failure relative to the JobTracker checks impacts job running time: 200s difference between G1 and G2.

[Timeline residue: failure at ~170s, job end after ~600s in both groups.]

Page 45: Understanding the Effects and Implications of Compute Node Failures in

45

Group G3 – early notifications

Early notifications increase job running time variability.

• G1: notifications sent after 416s
• G3: early notifications ⇒ map outputs declared lost

Causes:
• Code-level race conditions
• Timing of a reducer's shuffle attempts

[Diagram residue: reducer R2's shuffle attempt schedules for map outputs M5 and M6 across attempts 0-6.]

Page 46: Understanding the Effects and Implications of Compute Node Failures in

46

Group G4 & G5 – many reducers impacted

Job running time under failure varies with the number of reducers impacted.

• G4: many reducers send notifications after 416s; the map output is declared lost before the TaskTracker is declared dead.
• G5: same as G4, but early notifications are sent.

Page 47: Understanding the Effects and Implications of Compute Node Failures in

47

Task Tracker Failures

• Few reducers impacted: not enough notifications; timeouts fire.
• Many reducers impacted: enough notifications sent; timeouts do not fire.

LARGE, VARIABLE, UNPREDICTABLE job running times.
Efficiency varies with the number of affected reducers.

Page 48: Understanding the Effects and Implications of Compute Node Failures in

48

Node Failures: No RST Packets

[CDF of job running times]

No RST → no notifications → timeouts always fire.

Page 49: Understanding the Effects and Implications of Compute Node Failures in

49

Not Sharing Failure Information

• Different SE algorithm (OSDI 08)
• Tasks are speculatively executed even before the failure, so delayed SE is not the cause.
• Both the initial and the speculative task connect to the failed node: no sharing of potential failure information.

Page 50: Understanding the Effects and Implications of Compute Node Failures in

50

Delayed Speculative Execution

Task t is an outlier if: avg(PR(all)) – std(PR(all)) > PR(t)

Stats are skewed by very fast speculative tasks.
Hadoop's assumptions about progress rates are invalidated.

[Diagram residue: reducers R9 and R11 stuck in WTOs.]

Page 51: Understanding the Effects and Implications of Compute Node Failures in

51

Delayed Speculative Execution

Timeline:
• ~50s: reducers wait for map outputs
• ~100s: reducers get map outputs
• ~200s: failure ⇒ reducers time out; R9 speculatively executed with a huge progress rate; statistics skewed
• ~400s: R11 finally speculatively executed

Stats are skewed by very fast speculative tasks.
Hadoop's assumptions about progress rates are invalidated.