A Data Center by Ulrike Talbiersky, Holger Wichert, Christian Lohrengel, André Augustyniak Case Study Source: D. Menasce, V.A. Almeida, L.W. Dowdy Performance by Design: Computer Capacity Planning by Example Prentice Hall, 2004

A Data Center

by Ulrike Talbiersky, Holger Wichert, Christian Lohrengel, André Augustyniak

Case Study

Source:

D. Menasce, V.A. Almeida, L.W. Dowdy

Performance by Design: Computer Capacity Planning by Example

Prentice Hall, 2004

Table of Contents:

• Introduction

• The Data Center

• First Model Attempt: Markov Chain

• Tasks

• Second Model Attempt: Two-Device QN

• Cost Analysis

Introduction

Data centers offer a variety of services.

Trend: service-based data centers

Problems:

• Compliance with SLAs: fault tolerance, privacy, security (...)

• Too expensive

How to choose the optimal size? (→ cost)

The Data Center

Machine-Repair-Model:

• M machines (functionally identical)

• N repair people

• Diagnostic system:

- Detect failures of the machines

- Maintain a queue of machines waiting to be repaired

- Log failure times

- Record repair times

GSPN-Model (Sharpe)

[Figure: GSPN of the machine-repair model, with the labels MiO (machines in operation), MWR (machines waiting to be repaired), MBR (machines being repaired), failure rate, and repair rate.]

Queueing Model

[Figure: queueing model with the labels machines in operation, machines waiting to be repaired, and machines being repaired.]

Parameters

• Failure rate: λ = 1 / MTTF (Mean Time to Failure)

• Repair rate: µ = 1 / (time to repair a machine)

• MTTR: Mean Time to Repair

• MTBF: Mean Time Between Failures

Building a Model ~1~

Example: Markov Chain

k: number of failed machines

k → k+1: transition when a machine fails

k → k-1: transition when a machine is repaired

Aggregate failure rate:

$$\lambda_k = (M-k)\,\lambda$$

Aggregate repair rate:

$$\mu_k = \begin{cases} k\,\mu, & k = 1, \ldots, N \\ N\,\mu, & k = N+1, \ldots, M \end{cases}$$

Building a Model ~2~

1-dim. Generalized Birth-Death (GBD)

$$p_k = p_0 \prod_{i=0}^{k-1} \frac{\lambda_i}{\mu_{i+1}}, \qquad k = 0, 1, 2, \ldots$$

M-k machines in operation
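
The product form above can be evaluated numerically. The following is a minimal sketch, not the authors' code: the function name `birth_death_steady_state` is made up for illustration, p_0 is obtained from the usual normalization (all p_k sum to 1), and N = 5 is just one of the team sizes examined later in the deck.

```python
from typing import Callable, List


def birth_death_steady_state(
    birth: Callable[[int], float],
    death: Callable[[int], float],
    n_states: int,
) -> List[float]:
    """Return p_0 .. p_{n_states-1} of a finite birth-death chain.

    birth(i) is the rate of the transition i -> i+1,
    death(i) is the rate of the transition i -> i-1.
    """
    weights = [1.0]  # unnormalized p_k / p_0, starting with k = 0
    for k in range(1, n_states):
        weights.append(weights[-1] * birth(k - 1) / death(k))
    total = sum(weights)  # equals 1 / p_0
    return [w / total for w in weights]


# Machine-repair chain: state k = number of failed machines.
M, N = 120, 5            # N = 5 is an illustrative choice
lam, mu = 0.002, 0.05    # failure and repair rates per minute
p = birth_death_steady_state(
    birth=lambda k: (M - k) * lam,      # aggregate failure rate
    death=lambda k: min(k, N) * mu,     # aggregate repair rate
    n_states=M + 1,
)
print(f"p_0 = {p[0]:.4f}, sum of p_k = {sum(p):.6f}")
```

Passing the rates as functions keeps the solver independent of the machine-repair specifics, so the same routine can be reused for other birth-death models.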

Building a Model ~3~

Average aggregate rate at which machines fail (which equals the average aggregate rate at which machines are repaired):

$$X_f = \sum_{k=0}^{M-1} \lambda_k\, p_k = \sum_{k=0}^{M-1} (M-k)\,\lambda\, p_k$$

Building a Model ~4~

Interactive Response Time Law:

$$R = \frac{M}{X_0} - Z$$

• Client workstations ↔ machines in operation

• Average think time Z ↔ MTTF

• Average response time R ↔ MTTR

• System throughput X0 ↔ Xf

$$\mathrm{MTTR} = \frac{M}{X_f} - \mathrm{MTTF} = \frac{M}{X_f} - \frac{1}{\lambda}$$

Building a Model ~5~

Little's Law (repair facility):

$$N = X \cdot R$$

• X ↔ Xf

• R ↔ MTTR

• Nf = average number of failed machines

$$N_f = X_f \cdot \mathrm{MTTR} = M - \frac{X_f}{\lambda}$$

Building a Model ~6~

Little's Law (operational machines):

$$N = X \cdot R$$

• X ↔ Xf

• R ↔ MTTF

• No = average number of operational machines

$$N_o = X_f \cdot \mathrm{MTTF} = \frac{X_f}{\lambda}$$

$$(M = N_o + N_f)$$
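
The three operational laws can be chained once the failure throughput is known. This is a small sketch under stated assumptions: the helper name `repair_metrics` is hypothetical, and the throughput value 0.223 per minute is only an illustrative input, not a figure taken from the slides.

```python
from typing import Dict


def repair_metrics(M: int, mttf: float, x_f: float) -> Dict[str, float]:
    """Derive MTTR, Nf and No from the failure throughput x_f."""
    mttr = M / x_f - mttf   # Interactive Response Time Law: R = M / X0 - Z
    n_f = x_f * mttr        # Little's Law at the repair facility
    n_o = x_f * mttf        # Little's Law for the machines in operation
    return {"MTTR": mttr, "Nf": n_f, "No": n_o}


# x_f = 0.223 failures per minute is an assumed value for illustration.
metrics = repair_metrics(M=120, mttf=500.0, x_f=0.223)
for name, value in metrics.items():
    print(f"{name} = {value:.2f}")
```

Because both applications of Little's Law use the same throughput, No and Nf always add up to M, which is a convenient sanity check.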

Values for the Example

• M = 120 machines

• MTTF = 500 min, so λ = 0.002 per min

• Time to repair a machine = 20 min, so µ = 0.05 per min

Task 1

Given:

• failure rate of machines λ = 0.002 per min

• number of machines M = 120

• repair rate of machines µ = 0.05 per min

What is the probability that exactly j machines are operational?

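Task 1

Use:

p(exactly j machines in operation) = p_{M-j}

$$p_k = \begin{cases} p_0 \dbinom{M}{k} \left(\dfrac{\lambda}{\mu}\right)^{k}, & k = 1, \ldots, N \\[2ex] p_0 \dbinom{M}{k} \left(\dfrac{\lambda}{\mu}\right)^{k} \dfrac{N^{N-k}\, k!}{N!}, & k = N+1, \ldots, M \end{cases}$$

$$p_0 = \left[ \sum_{k=0}^{N} \binom{M}{k} \left(\frac{\lambda}{\mu}\right)^{k} + \sum_{k=N+1}^{M} \binom{M}{k} \left(\frac{\lambda}{\mu}\right)^{k} \frac{N^{N-k}\, k!}{N!} \right]^{-1}$$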
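
A closed-form expression of this kind is easy to evaluate with exact integer binomials. The sketch below is one possible way to do it, assuming the standard machine-repair result with k failed machines, N repair people and ratio λ/µ; the function names are hypothetical, and N = 5 and j = 115 are arbitrary illustrative choices.

```python
from math import comb, factorial
from typing import List


def failed_probabilities(M: int, N: int, lam: float, mu: float) -> List[float]:
    """Return [p_0, ..., p_M], where p_k = P(k machines are failed)."""
    rho = lam / mu
    weights = []
    for k in range(M + 1):
        w = comb(M, k) * rho**k
        if k > N:
            # All N repair people are busy: factor k! / (N! * N^(k-N)).
            w *= factorial(k) / (factorial(N) * N ** (k - N))
        weights.append(w)
    p0 = 1.0 / sum(weights)  # normalization
    return [p0 * w for w in weights]


def prob_exactly_operational(p: List[float], j: int) -> float:
    """P(exactly j machines in operation) = p_{M-j}."""
    return p[len(p) - 1 - j]


p = failed_probabilities(M=120, N=5, lam=0.002, mu=0.05)
print(f"P(exactly 115 operational) = {prob_exactly_operational(p, 115):.4f}")
```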

Task 1 N = 2,5,10

Task 2

Given:

• failure rate of machines λ = 0.002 per min

• number of machines M = 120

• number of repair people N

• repair rate of machines µ = 0.05 per min

What is the probability Pj that at least j machines are operational?

Task 2

Use Task 1 and:

$$P_j = \sum_{i=j}^{M} p_{M-i}$$

• once the personnel become overloaded, the system tends towards failure

• if M >> N: having extra machines is pointless

Task 3

Given:

• failure rate of machines λ = 0.002 per min

• number of machines M = 120

• wanted probability: Pj = 0.9

• time to repair a machine = 20 min

How many repair people are necessary to guarantee that at least two thirds of the machines are operational with Pj = 0.9?
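
Tasks 2 and 3 can be approached by summing the state probabilities and searching over the team size. The following is a sketch rather than the deck's own procedure: the function names are hypothetical, the distribution is computed with the birth-death recurrence of the Markov chain model, and two thirds of 120 machines is taken as j = 80.

```python
from typing import List, Optional


def failed_distribution(M: int, N: int, lam: float, mu: float) -> List[float]:
    """p_k = P(k machines failed), from the birth-death recurrence."""
    w = [1.0]
    for k in range(1, M + 1):
        w.append(w[-1] * (M - k + 1) * lam / (min(k, N) * mu))
    total = sum(w)
    return [x / total for x in w]


def prob_at_least_operational(p: List[float], j: int) -> float:
    """P_j: at least j operational, i.e. at most M - j machines failed."""
    M = len(p) - 1
    return sum(p[: M - j + 1])


def min_repair_people(
    M: int, lam: float, mu: float, j: int, target: float
) -> Optional[int]:
    """Smallest N with P_j >= target, or None if no N <= M suffices."""
    for N in range(1, M + 1):
        p = failed_distribution(M, N, lam, mu)
        if prob_at_least_operational(p, j) >= target:
            return N
    return None


needed = min_repair_people(M=120, lam=0.002, mu=0.05, j=80, target=0.9)
print(f"repair people needed: {needed}")
```

A linear search is sufficient here because at most M team sizes have to be checked and each evaluation is cheap.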

Task 2,3 N = 2,3,4,5,10

Task 4

Given are the values:

• M = 120 machines

• MTTF = 500 min, so λ = 0.002 per min

• Time to repair a machine = 20 min, so µ = 0.05 per min

What is the effect of the size of the repair team, N, on the MTTR of a machine?

Task 4

Computation:

1. p0

$$p_0 = \left[ \sum_{k=0}^{N} \binom{M}{k} \left(\frac{\lambda}{\mu}\right)^{k} + \sum_{k=N+1}^{M} \binom{M}{k} \left(\frac{\lambda}{\mu}\right)^{k} \frac{N^{N-k}\, k!}{N!} \right]^{-1}$$

2. pk

$$p_k = \begin{cases} p_0 \dbinom{M}{k} \left(\dfrac{\lambda}{\mu}\right)^{k}, & k = 1, \ldots, N \\[2ex] p_0 \dbinom{M}{k} \left(\dfrac{\lambda}{\mu}\right)^{k} \dfrac{N^{N-k}\, k!}{N!}, & k = N+1, \ldots, M \end{cases}$$

3. Xf

$$X_f = \sum_{k=0}^{M-1} \lambda_k\, p_k = \sum_{k=0}^{M-1} (M-k)\,\lambda\, p_k$$

4. MTTR

$$\mathrm{MTTR} = \frac{M}{X_f} - \mathrm{MTTF} = \frac{M}{X_f} - \frac{1}{\lambda}$$

5. No

$$N_o = X_f \cdot \mathrm{MTTF} = \frac{X_f}{\lambda}$$

6. Nf

$$N_f = X_f \cdot \mathrm{MTTR} = M - \frac{X_f}{\lambda}$$

Task 4: Effect of Number of Repair People

N: number of repair people

No: average number of operational machines

Nf: average number of failed machines

MTTR: Mean Time to Repair

Task 4

• when the number of repair people is increased beyond 5, further decreases in the MTTR are minimal

• with 5 repair people:

- 111 machines operational

- down time of 38 minutes (MTTR = 38 min: 20 min repair, 18 min wait)
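
The six computation steps of Task 4 can be strung together to tabulate the effect of the team size. This is a hedged sketch, not the original calculation: `task4_row` is a hypothetical name, the probabilities come from the birth-death recurrence instead of the closed form, and the list of team sizes is chosen for illustration.

```python
from typing import Tuple


def task4_row(M: int, N: int, lam: float, mu: float) -> Tuple[int, float, float, float]:
    """Return (N, No, Nf, MTTR) for a team of N repair people."""
    # Steps 1 and 2: p_0 and p_k (k = number of failed machines).
    w = [1.0]
    for k in range(1, M + 1):
        w.append(w[-1] * (M - k + 1) * lam / (min(k, N) * mu))
    total = sum(w)
    p = [x / total for x in w]
    # Step 3: average aggregate failure rate.
    x_f = sum((M - k) * lam * p[k] for k in range(M))
    # Step 4: Interactive Response Time Law.
    mttr = M / x_f - 1.0 / lam
    # Steps 5 and 6: Little's Law.
    n_o = x_f / lam
    n_f = x_f * mttr
    return N, n_o, n_f, mttr


rows = [task4_row(120, N, 0.002, 0.05) for N in (1, 2, 3, 4, 5, 6, 8, 10, 120)]
print(f"{'N':>4} {'No':>8} {'Nf':>8} {'MTTR':>10}")
for N, n_o, n_f, mttr in rows:
    print(f"{N:>4} {n_o:>8.1f} {n_f:>8.1f} {mttr:>10.1f}")
```

The last row (N = M) serves as a check, since no machine ever waits there and the MTTR reduces to the bare repair time.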

Task 4

Case N = M = 120:

$$X_f = \frac{M}{\mathrm{MTTF} + \mathrm{MTTR}} = \frac{M}{\dfrac{1}{\lambda} + \dfrac{1}{\mu}}$$

$$N_o = X_f \cdot \mathrm{MTTF}$$
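
As a quick numerical check of this special case, the example values can be plugged in. This worked calculation assumes that with N = M no machine ever waits for a repair person, so that MTTR equals the bare repair time of 20 min:

```latex
\begin{align*}
\mathrm{MTTR} &= \frac{1}{\mu} = 20\ \text{min} \\
X_f &= \frac{M}{\mathrm{MTTF} + \mathrm{MTTR}} = \frac{120}{500 + 20} \approx 0.2308\ \text{per min} \\
N_o &= X_f \cdot \mathrm{MTTF} \approx 0.2308 \cdot 500 \approx 115.4 \\
N_f &= M - N_o \approx 4.6
\end{align*}
```

These values are the best case that adding repair people can reach for the given failure and repair rates.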

Task 5

Given are the values:

• M = 120 machines

• MTTF = 500 min, so λ = 0.002 per min

• N = 5

What is the effect of a repair person's skill level on the overall down time?

How does the skill level affect the percentage of operational machines?

Task 5: Effect of the Repair Rate

No: average number of operational machines

Nf: average number of failed machines

MTTR: Mean Time to Repair

Second Modeling Attempt ~1~

The failure-recovery model can also be modeled by a two-device QN:

• 1st device: delay server (machines in operation)

• 2nd device: load-dependent server (repair people)

Second Modeling Attempt ~2~

Delay server:

A repaired machine goes back into operation without queuing.

The time a machine stays in operation depends only on its MTTF.

Second Modeling Attempt ~3~

Load-dependent server:

The total rate at which machines are repaired (TRMR) depends on:

- number of failed machines k

- number of repair people N

Service rate:

$$\mu(k) = \begin{cases} k\,\mu, & k = 1, \ldots, N \\ N\,\mu, & k = N+1, \ldots, M \end{cases}$$

Second Modeling Attempt ~4~

Use the MVA method with load-dependent devices to solve this model.

Required: service rate multipliers (see Chapter 14)

$$\alpha(k) = \frac{\mu(k)}{\mu(1)} = \begin{cases} k, & k = 1, \ldots, N \\ N, & k = N+1, \ldots, M \end{cases} \qquad k = 1, \ldots, M$$
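
A single-class MVA recursion with one delay device and one load-dependent device might look as follows. This is a sketch of the textbook load-dependent MVA scheme under stated assumptions, not the implementation referenced in the slides: the function name is hypothetical, the delay device is given the think time MTTF, and the repair facility has service demand 1/µ with multipliers α(j) = min(j, N).

```python
from typing import Tuple


def mva_load_dependent(
    M: int, N: int, mttf: float, repair_time: float
) -> Tuple[float, float]:
    """Return (X, R_LD): throughput and residence time at the LD device."""
    # Service rate multipliers alpha(j); index 0 is unused.
    alpha = [0.0] + [float(min(j, N)) for j in range(1, M + 1)]
    prob = [1.0]  # prob[j] = P(j customers at the LD device | n), n = 0
    x = r_ld = 0.0
    for n in range(1, M + 1):
        # Residence time at the LD device with n customers in the network.
        r_ld = repair_time * sum(
            j / alpha[j] * prob[j - 1] for j in range(1, n + 1)
        )
        x = n / (mttf + r_ld)  # delay device contributes the think time
        new = [0.0] * (n + 1)
        for j in range(1, n + 1):
            new[j] = repair_time * x / alpha[j] * prob[j - 1]
        new[0] = 1.0 - sum(new[1:])
        prob = new
    return x, r_ld


x, mttr = mva_load_dependent(M=120, N=5, mttf=500.0, repair_time=20.0)
n_f = x * mttr      # Little's Law at the LD device
n_o = 120 - n_f
print(f"X = {x:.4f}/min, MTTR = {mttr:.1f} min, Nf = {n_f:.1f}, No = {n_o:.1f}")
```

One caveat of this recursion is that the probability of an empty device is obtained by subtraction, which can lose precision when the repair facility is heavily overloaded.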

Second Modeling Attempt ~5~

The solution of this MVA model gives us:

• average throughput: X

• average residence time at the LD device: R'_LD = MTTR

Little's Law applied to the LD device:

• average number of failed machines:

$$N_f = X \cdot R'_{LD}$$

• average number of machines in operation:

$$N_o = M - N_f$$

A Cost Analysis

Cp: annual personnel cost

Cm: annual cost per machine

constant revenue multiplier

No: average number of machines in operation

Mmin: minimum number of machines that need to be in operation for the data center not to have to pay a penalty

Cα: cost

Rα: revenue

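A Cost Analysis

Cost:

$$C_\alpha = N\,C_p + M\,C_m$$

Revenue:

$$R_\alpha = (\text{revenue multiplier}) \cdot (N_o - M_{\min})$$

Profit:

$$P = R_\alpha - C_\alpha = (\text{revenue multiplier}) \cdot (N_o - M_{\min}) - N\,C_p - M\,C_m$$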
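
The cost, revenue and profit relations translate directly into a small function. In the sketch below, every monetary value and the threshold of 100 machines are invented purely for illustration and do not come from the slides; only N = 5 with about 111 operational machines matches the Task 4 result.

```python
def annual_profit(
    n_repair: int,
    n_machines: int,
    n_operational: float,
    m_min: float,
    revenue_multiplier: float,
    cost_per_person: float,
    cost_per_machine: float,
) -> float:
    """Profit = revenue - cost, following the three relations above."""
    cost = n_repair * cost_per_person + n_machines * cost_per_machine
    revenue = revenue_multiplier * (n_operational - m_min)
    return revenue - cost


# All cost figures and m_min below are hypothetical placeholders.
profit = annual_profit(
    n_repair=5,
    n_machines=120,
    n_operational=111,
    m_min=100,
    revenue_multiplier=100_000,
    cost_per_person=50_000,
    cost_per_machine=5_000,
)
print(f"annual profit: {profit}")
```

Evaluating this function for a range of team sizes, each with its own average number of operational machines, would reproduce the kind of profit comparison the cost analysis is after.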

A Cost Analysis

Negative profit for low numbers of personnel, because of low machine availability.

With more than 6 repair people, cost increases more than revenue; thus 6 service personnel are optimal.

References

Lecture notes and talks by Menasce, CS672_Performance:

• cs672-07CaseStudy-III-DataCenter.pdf

• cs672-03QuantifyingPerformanceModels.pdf

Lecture notes SN1

Haverkort: Computer Communication Systems

Performance Analysis