On Modeling the Lifetime Reliability of Homogeneous Manycore Systems

Lin Huang and Qiang Xu

CUhk REliable computing laboratory (CURE)

The Chinese University of Hong Kong

Integrated Circuit (IC) Product Reliability

IC errors can be broadly classified into two categories:

● Soft errors

• Do not fundamentally damage the circuits

● Hard errors

• Permanent once manifest

• E.g., time-dependent dielectric breakdown (TDDB) in the gate oxides, electromigration (EM) and stress migration (SM) in the interconnects, and thermal cycling (TC)

Manycore Systems

State-of-the-art computing systems have started to employ multiple cores on a single die

● General-purpose processors, multi-digital signal processor systems

● Power efficiency

● Short time-to-market

[Figures. Sources: Intel, Nvidia]

Problem Formulation

To model the lifetime reliability of homogeneous manycore systems using a load-sharing nonrepairable k-out-of-n: G system with general failure distributions

Key features

● k-out-of-n: G systems: to provide fault tolerance

● Load-sharing: each embedded core carries only part of the load assigned by the operating system

● Nonrepairable: embedded cores are integrated on a single silicon die

● General failure distribution: embedded cores age in operation
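As a point of reference for the k-out-of-n: G structure, the sketch below evaluates the reliability of such a system under a deliberately simplified assumption: cores fail independently and all share the same survival probability. Load sharing and aging, which the model in these slides is meant to capture, are ignored here, so this is only a baseline. The function name and the example numbers are illustrative.

```python
from math import comb


def k_out_of_n_reliability(k: int, n: int, p: float) -> float:
    """Probability that at least k of n independent cores are good.

    p is the probability that a single core is still good at the time of
    interest. Independence and an identical p are simplifying assumptions.
    """
    if not 0 <= k <= n:
        raise ValueError("require 0 <= k <= n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


# Illustrative numbers only: 8 cores, each good with probability 0.9.
no_redundancy = k_out_of_n_reliability(8, 8, 0.9)  # all 8 cores required
two_redundant = k_out_of_n_reliability(6, 8, 0.9)  # any 6 of the 8 suffice
```

Tolerating two failures raises the system reliability well above the no-redundancy case, which is the fault-tolerance benefit the k-out-of-n: G structure provides.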

Queueing Model for Task Allocation

Embedded cores execute tasks independently and one core can perform at most one task at a time

Consider a manycore system composed of a set of identical embedded cores

● The set of active cores, the set of spare cores, and the set of faulty cores

[Figure: queueing model for task allocation. Labels: Applications, λa, Central Task Allocation Queue, Processor Cores, Set S1, Set S2, Set S3]

Queueing Model for Task Allocation

A general-purpose parallel processing system with a central queue and bulk arrivals is modeled as a queueing system

The probability that a certain active core is occupied by tasks (also called utilization) is computed as

Target system

● Gracefully degrading systems

● Standby redundant systems
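A minimal sketch of how a per-core utilization could be evaluated, assuming the textbook result for a multi-server queue with batch arrivals: offered load divided by total service capacity. This is not necessarily the expression used in the slides, and the parameter names (`batch_rate`, `mean_batch_size`, `service_rate`) and numbers are hypothetical.

```python
def core_utilization(
    batch_rate: float,
    mean_batch_size: float,
    service_rate: float,
    active_cores: int,
) -> float:
    """Long-run fraction of time one active core is busy.

    Assumes batches arrive at batch_rate, each batch carries
    mean_batch_size tasks on average, and every core serves tasks at
    service_rate. All names are illustrative assumptions.
    """
    if active_cores <= 0:
        raise ValueError("at least one active core is required")
    rho = batch_rate * mean_batch_size / (active_cores * service_rate)
    if rho >= 1.0:
        raise ValueError("offered load exceeds capacity; the queue is unstable")
    return rho


# Gracefully degrading: every good core is active, so each failure
# leaves fewer cores to share the same load.
degrading = [core_utilization(2.0, 3.0, 1.0, n) for n in (10, 9, 8)]

# Standby redundant: a spare replaces the failed core, so the number of
# active cores (here 8) and hence the utilization stay the same.
standby = [core_utilization(2.0, 3.0, 1.0, 8) for _ in range(3)]
```

Under these assumptions the utilization of a surviving core rises with each failure in the gracefully degrading system, while it stays constant in the standby redundant system as long as spares remain.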

Lifetime Reliability of Entire System – Gracefully Degrading System

A functioning manycore system may contain good cores

Let be the probability that the system has active cores at time

The system reliability can therefore be expressed as

Thus, the Mean Time to Failure (MTTF) of the system can be written as
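The step from reliability to MTTF uses the standard identity that the mean lifetime equals the integral of the reliability function from zero to infinity. The sketch below approximates that integral numerically; the Weibull reliability function and its parameters are stand-ins chosen only so the result can be compared with a known closed form, and are not the system reliability derived in these slides.

```python
import math
from typing import Callable


def mttf(reliability: Callable[[float], float], horizon: float, steps: int = 100_000) -> float:
    """Trapezoidal approximation of the integral of R(t) over [0, horizon].

    horizon must be large enough that R(horizon) is negligible.
    """
    h = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    for i in range(1, steps):
        total += reliability(i * h)
    return total * h


def weibull_reliability(t: float, scale: float = 2.0, shape: float = 2.0) -> float:
    """Illustrative reliability function with hypothetical parameters."""
    return math.exp(-((t / scale) ** shape))


estimate = mttf(weibull_reliability, horizon=20.0)
# Known mean of a Weibull lifetime: scale * Gamma(1 + 1/shape).
closed_form = 2.0 * math.gamma(1.0 + 1.0 / 2.0)
```

Any system reliability function can be passed in place of `weibull_reliability`, provided the integration horizon is chosen long enough.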

Lifetime Reliability of Entire System – Gracefully Degrading System

To determine

●

●

• Conditional probability

● For any

• Conditional probability

What remains is how to compute

Behavior of Single Processor Core

States of cores

● Spare mode – cold standby

● Active mode

• Processing state

• Wait state – warm standby

The reliability functions in the different states have the same shape but different scale parameters

● E.g.,

[Figure: core state diagram. Labels: Spare (Cold Standby), Active, Wait (Warm Standby), Processing]

Lifetime Reliability of A Single Core – Gracefully Degrading System

Define a core's accumulated time in a certain state as how long it has spent in that state up to a given time

Calculation

Lifetime Reliability of A Single Core – Gracefully Degrading System

Theorem 1 Suppose a manycore system with a gracefully degrading scheme has experienced a number of core failures at successive occurrence times. For any core that has survived until the current time:

● its accumulated time in the processing state up to that time

● its accumulated time as warm standby up to that time
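One way to read these accumulated times, offered as an interpretation rather than the theorem's exact statement: between two consecutive failures the number of active cores is constant, so a surviving core spends a fixed fraction of that interval (its utilization) processing and the rest waiting. The sketch below sums those contributions; the function name, the utilization values and the failure times are hypothetical.

```python
from typing import Sequence, Tuple


def accumulated_times(
    failure_times: Sequence[float],
    utilizations: Sequence[float],
    t: float,
) -> Tuple[float, float]:
    """Return (processing_time, wait_time) of a surviving core at time t.

    failure_times: non-decreasing times of the failures seen so far.
    utilizations:  utilizations[i] applies while i cores have failed, so
                   one more entry than failure_times is needed.
    Assumes the core is busy for a fraction utilizations[i] of interval i.
    """
    if len(utilizations) != len(failure_times) + 1:
        raise ValueError("need exactly one utilization per interval")
    boundaries = [0.0, *failure_times, t]
    if any(later < earlier for earlier, later in zip(boundaries, boundaries[1:])):
        raise ValueError("times must be non-decreasing and t must not precede the last failure")
    processing = sum(
        rho * (end - start)
        for rho, start, end in zip(utilizations, boundaries, boundaries[1:])
    )
    return processing, t - processing


# Hypothetical history: failures at t = 1.0 and t = 2.5, observed at t = 4.0.
processing, waiting = accumulated_times([1.0, 2.5], [0.6, 0.7, 0.8], t=4.0)
```

With these numbers the core has accumulated 0.6·1.0 + 0.7·1.5 + 0.8·1.5 = 2.85 time units of processing and 1.15 of waiting.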

Lifetime Reliability of A Single Core – Gracefully Degrading System

Recall that the reliability functions in the wait and processing states have the same shape but different scale parameters

● General reliability function , abbreviated as

● Reliability function in processing state , denoted as

● Reliability function in wait state , denoted as

● Relationships: and
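To illustrate what "same shape, different scale" implies, the sketch below uses a Weibull reliability function as an example of such a family. The shape and the two scale values are hypothetical and chosen only for illustration.

```python
import math


def weibull_reliability(t: float, scale: float, shape: float) -> float:
    """Weibull reliability R(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


# Hypothetical parameters: one shared shape, and a smaller scale for the
# processing state so that processing wears the core out faster.
SHAPE = 2.5
SCALE_PROCESSING = 5.0
SCALE_WAIT = 10.0


def r_processing(t: float) -> float:
    return weibull_reliability(t, SCALE_PROCESSING, SHAPE)


def r_wait(t: float) -> float:
    return weibull_reliability(t, SCALE_WAIT, SHAPE)


# Because only the scale differs, time spent waiting can be converted
# into an equivalent, shorter time spent processing.
t = 4.0
equivalent_processing_time = t * SCALE_PROCESSING / SCALE_WAIT
```

With these parameters, `r_wait(4.0)` equals `r_processing(2.0)`: four time units of waiting age the core as much as two time units of processing.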

Lifetime Reliability of A Single Core – Gracefully Degrading System

A subdivision of the time :

By the continuity of the reliability function, we have

[Figure: timeline of a core alternating between wait, processing, and wait states]

Accumulated time in the processing state

Accumulated time in the wait state

Lifetime Reliability of A Single Core – Gracefully Degrading System

Theorem 2 Given a gracefully degrading manycore system that has experienced a number of core failures at known occurrence times, the probability that a certain core survives to a later time, provided that it has survived until an earlier time, is given by

where
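A conditional survival probability of this kind can be sketched as a ratio of reliabilities evaluated at the core's accumulated wear. The sketch assumes a cumulative-exposure model, in which processing time and wait time are each divided by their own scale parameter and added into one effective age, together with a Weibull shape. This is a stand-in for illustration, not the theorem's formula, and all names and numbers are hypothetical.

```python
import math


def unit_weibull_reliability(age: float, shape: float) -> float:
    """Weibull reliability with unit scale, evaluated at an effective age."""
    return math.exp(-(age**shape))


def effective_age(
    processing_time: float,
    wait_time: float,
    scale_processing: float,
    scale_wait: float,
) -> float:
    """Combine the two accumulated times into one scale-free age (assumption)."""
    return processing_time / scale_processing + wait_time / scale_wait


def conditional_survival(age_now: float, age_then: float, shape: float) -> float:
    """P(core alive at age_now | core alive at age_then)."""
    if age_now < age_then:
        raise ValueError("age_now must not be smaller than age_then")
    return unit_weibull_reliability(age_now, shape) / unit_weibull_reliability(age_then, shape)


# Hypothetical accumulated times at an earlier and a later time point.
age_then = effective_age(2.0, 1.0, scale_processing=5.0, scale_wait=10.0)
age_now = effective_age(3.5, 1.5, scale_processing=5.0, scale_wait=10.0)
p_survive = conditional_survival(age_now, age_then, shape=2.0)
```

With shape 1 the ratio depends only on the difference between the two ages, which is the memoryless behaviour of the exponential distribution; for other shapes the history of the core matters.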

Lifetime Reliability of Entire System – Standby Redundant System

A standby redundant system is functioning if it contains at least the required number of good cores, of which a fixed number are configured as active ones and the remaining are spares

To determine

● Again, the key point is to compute

Lifetime Reliability of A Single Core – Standby Redundant System

Define a core’s birth time as the time point when it is configured as an active one

Theorem 3 In a standby redundant manycore system, for any core with a given birth time that has survived until the current time:

● its accumulated time in the processing state up to that time

● its accumulated time as warm standby up to that time

Lifetime Reliability of A Single Core – Standby Redundant System

Theorem 4 In a manycore system with a standby redundant scheme, the probability that a certain core with a given birth time survives until a given time is given by

where

Experimental Setup

Lifetime distributions

● Exponential

● Weibull

● Linear failure rate

System parameters

●

Consider a manycore system consisting of cores
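For reference, the three lifetime distributions listed above have the following standard reliability functions. The sketch states them in Python; the parameter names are generic and the parameter values used in the experiments are not reproduced here.

```python
import math


def exponential_reliability(t: float, rate: float) -> float:
    """Constant failure rate: R(t) = exp(-rate * t)."""
    return math.exp(-rate * t)


def weibull_reliability(t: float, scale: float, shape: float) -> float:
    """R(t) = exp(-(t / scale) ** shape); shape > 1 models wear-out."""
    return math.exp(-((t / scale) ** shape))


def linear_failure_rate_reliability(t: float, a: float, b: float) -> float:
    """Failure rate h(t) = a + b * t, hence R(t) = exp(-(a*t + b*t**2 / 2))."""
    return math.exp(-(a * t + 0.5 * b * t * t))
```

The exponential distribution is the special case of both others (Weibull with shape 1, linear failure rate with b = 0), which makes it the only one of the three that cannot represent aging.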

Misleading Results Caused by the Exponential Assumption

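Sojourn time in each failure state and expected lifetime of the system, in years:

| Redundant cores | Redundancy Scheme | 0-Failure State | 1-Failure State | 2-Failure State | 3-Failure State | 4-Failure State | Expected lifetime |
|---|---|---|---|---|---|---|---|
| 0 | — | 0.2188 | — | — | — | — | 0.2188 |
| 1 | Degrading | 0.2121 | 0.2188 | — | — | — | 0.4309 |
| 1 | Standby | 0.2188 | 0.2188 | — | — | — | 0.4376 |
| 2 | Degrading | 0.2059 | 0.2121 | 0.2188 | — | — | 0.6368 |
| 2 | Standby | 0.2188 | 0.2188 | 0.2188 | — | — | 0.6564 |
| 3 | Degrading | 0.2000 | 0.2059 | 0.2121 | 0.2188 | — | 0.8368 |
| 3 | Standby | 0.2188 | 0.2188 | 0.2188 | 0.2188 | — | 0.8752 |
| 4 | Degrading | 0.1944 | 0.2000 | 0.2059 | 0.2121 | 0.2188 | 1.0312 |
| 4 | Standby | 0.2188 | 0.2188 | 0.2188 | 0.2188 | 0.2188 | 1.0940 |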
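The figures in this table are consistent with the expected lifetime being the sum of the sojourn times in the successive failure states. The sketch below checks that arithmetic for every row; the row layout and variable names are illustrative, and the numbers are copied from the table.

```python
# Each row: (redundant cores, scheme, sojourn times in years, reported lifetime).
ROWS = [
    (0, "-", [0.2188], 0.2188),
    (1, "Degrading", [0.2121, 0.2188], 0.4309),
    (1, "Standby", [0.2188, 0.2188], 0.4376),
    (2, "Degrading", [0.2059, 0.2121, 0.2188], 0.6368),
    (2, "Standby", [0.2188, 0.2188, 0.2188], 0.6564),
    (3, "Degrading", [0.2000, 0.2059, 0.2121, 0.2188], 0.8368),
    (3, "Standby", [0.2188, 0.2188, 0.2188, 0.2188], 0.8752),
    (4, "Degrading", [0.1944, 0.2000, 0.2059, 0.2121, 0.2188], 1.0312),
    (4, "Standby", [0.2188, 0.2188, 0.2188, 0.2188, 0.2188], 1.0940),
]

# Collect rows whose sojourn times do not add up to the reported lifetime.
mismatches = [
    (redundant, scheme)
    for redundant, scheme, sojourn, reported in ROWS
    if abs(sum(sojourn) - reported) > 1e-9
]

# Lifetime gained by standby over degrading for each redundancy level.
standby_gain = {
    r: round(s_total - d_total, 4)
    for (r, _, _, d_total), (_, _, _, s_total) in zip(ROWS[1::2], ROWS[2::2])
}
```

No row mismatches, and under the exponential assumption the gap between the two schemes stays small, growing from 0.0067 years with one redundant core to 0.0628 years with four.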

Lifetime Reliability for Non-Exponential Lifetime Distribution

[Figure: (a) Weibull Distribution; (b) Linear Failure Rate Distribution]

Detailed Results for Gracefully Degrading System

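Sojourn time in each failure state and expected lifetime of the system, in years:

| Distribution | Redundant cores | 0-Failure State | 1-Failure State | 2-Failure State | 3-Failure State | 4-Failure State | Expected lifetime |
|---|---|---|---|---|---|---|---|
| Weibull | 0 | 2.2039 | — | — | — | — | 2.2039 |
| Weibull | 1 | 2.2153 | 0.5573 | — | — | — | 2.7726 |
| Weibull | 2 | 2.2260 | 0.5600 | 0.3055 | — | — | 3.0915 |
| Weibull | 3 | 2.2359 | 0.5626 | 0.3142 | 0.1040 | — | 3.2167 |
| Weibull | 4 | 2.2452 | 0.5649 | 0.2988 | 0.0955 | 0.0820 | 3.2864 |
| Linear Failure Rate | 0 | 1.8572 | — | — | — | — | 1.8572 |
| Linear Failure Rate | 1 | 1.8463 | 1.1367 | — | — | — | 2.9830 |
| Linear Failure Rate | 2 | 1.8354 | 1.1325 | 0.8926 | — | — | 3.8605 |
| Linear Failure Rate | 3 | 1.8243 | 1.1282 | 0.8798 | 0.6941 | — | 4.5264 |
| Linear Failure Rate | 4 | 1.8133 | 1.1237 | 0.8762 | 0.7055 | 0.6269 | 5.1456 |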

The Impact of Workload

Comparison Between Gracefully Degrading System and Standby Redundant System

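Columns run from hot standby (left) through warm standby to cold standby (right):

| Distribution | Redundant cores | Redundancy Scheme | Hot Standby | Warm Standby | Warm Standby | Warm Standby | Warm Standby | Cold Standby |
|---|---|---|---|---|---|---|---|---|
| Weibull | 2 | Degrading | 1.5039 | 1.8232 | 2.1497 | 2.2930 | 2.4265 | 2.6258 |
| Weibull | 2 | Standby | 1.5314 | 1.8227 | 2.1133 | 2.2488 | 2.3484 | 2.5309 |
| Weibull | 4 | Degrading | 1.5046 | 1.8521 | 2.2305 | 2.4432 | 2.5771 | 2.8376 |
| Weibull | 4 | Standby | 1.5577 | 1.8545 | 2.1715 | 2.3103 | 2.4266 | 2.6261 |
| Linear Failure Rate | 2 | Degrading | 1.9115 | 2.3197 | 2.7070 | 2.8697 | 3.0105 | 3.2424 |
| Linear Failure Rate | 2 | Standby | 1.9608 | 2.3314 | 2.7330 | 2.8851 | 3.0091 | 3.2146 |
| Linear Failure Rate | 4 | Degrading | 2.1348 | 2.7122 | 3.3642 | 3.6529 | 3.9385 | 4.3590 |
| Linear Failure Rate | 4 | Standby | 2.3008 | 2.7899 | 3.4307 | 3.6015 | 3.8588 | 4.1881 |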

Conclusion

State-of-the-art CMOS technology enables chip-level manycore processors

The lifetime reliability of such large circuits is a major concern

We propose a comprehensive analytical model to estimate the lifetime reliability of manycore systems

Experimental results demonstrate the effectiveness of the proposed model

Thank You for Your Attention!