Olay: Combat the Signs of Aging with Introspective Reliability Management. Authors: Shuguang Feng, Shantanu Gupta, Scott Mahlke. W-QUAD (ISCA-35), June 21, 2008. [Srinivasan, DSN‘04]. [Borkar, MICRO‘05]. Motivation. “Designing Reliable Systems from Unreliable Components…”
University of Michigan, Electrical Engineering and Computer Science
Olay: Combat the Signs of Aging with Introspective Reliability Management
Authors: Shuguang Feng, Shantanu Gupta, Scott Mahlke
W-QUAD (ISCA-35), June 21, 2008
Motivation
“Designing Reliable Systems from Unreliable Components…” - Shekhar Borkar (Intel)
[Srinivasan, DSN‘04] [Borkar, MICRO‘05]
More failures to come
Failures will be wearout induced
Approaches to Reliability
Tolerate Faults (reactive) or Prevent Faults (proactive)
Architecture-level: Detect, Diagnose, Repair/reconfigure/recover
Circuit-level: Margining, Robust cell topologies
Dynamic thermal mgmt (DTM)
Introspective reliability mgmt (IRM)
High-K dielectrics, Passivation
Examples: Diva, Argus, WDU, Heat-and-Run, Reliability Banking, RAMP
Targeted management based on wearout monitoring
Not All Cores Are Created Equal
Chip-multiprocessors will be subject to severe process variation
Dynamic thermal/power budgeting can be suboptimal
  Temperature is only part of the picture
  Need low-level reliability awareness
Low-level sensors measure physical changes
Wearout-aware management improves reliability enhancement
  System reconfiguration
  Dynamic voltage and frequency scaling (DVFS)
  Job assignment
Introspective Reliability Management (IRM)
[Figure: IRM block diagram. Blocks: Raw Sensor Data, Filtering and Analysis, Processed Data, Aggregate Analysis, Virtualization Layer, Reliability Assessment, Management Decisions, OS, Scheduled Jobs, IRM Policy]
Low-level Sensors
  delay
  leakage
  temperature
  etc.
WDU [MICRO`07]
  measure propagation delay
  track statistical trends
Olay
  track the progression of wearout
  profile workload behavior
  generate wearout-aware job schedules
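
The slide describes the WDU as measuring propagation delay and tracking statistical trends. The sketch below shows one possible way to track such a trend, assuming an exponentially weighted moving average over delay samples. The class name `DelayTrend`, the smoothing factor, and the picosecond units are illustrative assumptions, not details of the WDU design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DelayTrend:
    """Smooths noisy propagation-delay samples (hypothetical helper)."""
    alpha: float = 0.2                 # smoothing factor, assumed value
    baseline: Optional[float] = None   # first sample, treated as the fresh-silicon delay
    smoothed: Optional[float] = None   # current smoothed delay estimate

    def update(self, sample_ps: float) -> float:
        """Fold one delay sample (picoseconds) into the running average."""
        if self.smoothed is None:
            self.baseline = sample_ps
            self.smoothed = sample_ps
        else:
            self.smoothed = self.alpha * sample_ps + (1.0 - self.alpha) * self.smoothed
        return self.smoothed

    def degradation(self) -> float:
        """Fractional slowdown relative to the baseline (0.0 before any sample)."""
        if self.smoothed is None or self.baseline is None:
            return 0.0
        return self.smoothed / self.baseline - 1.0
```

A scheduler could rank cores by `degradation()`, treating larger values as more wearout. That ranking rule is also an assumption of this sketch.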
Wearout-aware Scheduling
[Figure: Active Jobs (T0, T1, T2, T3, ..., Tn) are placed on the Available Cores to form a Job Schedule. Three example 16-core schedules are shown, each placing T0 through T11 and leaving four cores Idle.]
[Figure: Per-module Reliability Profile. Activity values shown: 75%, 15%, 25%, 35%, 50%, 25%, 45%, 5%, 10%, 35%, 25%, 85%]
Wearout-aware Scheduling
[Figure: IRM block diagram. Blocks: Raw Sensor Data, Filtering and Analysis, Processed Data, Aggregate Analysis, Virtualization Layer, Reliability Assessment, OS, Scheduled Jobs, IRM Policy]
[Figure: Job-to-Core Binding. Applications T0, T1, T2, T3, ..., Tn range from Lightweight to Heavyweight; cores range from Weak to Strong. Life Remaining per core, on a 100% to 0% scale: 30%, 50%, 30%, 25%, 17%, 35%, 80%, 17%, 60%, 55%, 15%, 75%, 70%, 85%, 10%, 8%]
Wearout-aware Policies
GreedyE
  Optimizes for early life performance
  Minimizes premature failures with wear-leveling
[Figure: GreedyE schedule construction. Cores (C0, C1, ..., Cn) are shown on a Weak-to-Strong axis and Jobs (T0, T1, ..., Tn) on a Light-to-Heavy axis, together with the resulting 16-entry Schedule.]
Wearout-aware Policies
GreedyE
  Optimizes for early life performance
  Minimizes premature failures with wear-leveling
GreedyL
  Optimizes for end of life performance
  Victimizes weak cores to maximize the life of stronger cores
GreedyA
  Hybrid of GreedyE and GreedyL
  Adapts behavior based on system utilization
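
The three policies can be sketched as job-to-core pairings. This is one reading of the slide text, not the authors' implementation: GreedyE is taken to pair the heaviest jobs with the strongest cores (wear-leveling), GreedyL to pair the heaviest jobs with the weakest cores (sacrificing them), and GreedyA to switch between the two on utilization. The function names, the dictionary inputs, the switching direction, and the `threshold` value are all assumptions made for illustration.

```python
def _pair(jobs, cores, strongest_first):
    """Pair jobs (heaviest first) with cores in the requested strength order.

    jobs:  dict mapping job name -> activity in [0, 1] (higher = heavier)
    cores: dict mapping core name -> life remaining in [0, 1] (higher = stronger)
    Assumes len(jobs) <= len(cores); cores left unpaired stay idle.
    """
    heavy_first = sorted(jobs, key=jobs.get, reverse=True)
    ordered_cores = sorted(cores, key=cores.get, reverse=strongest_first)
    return dict(zip(ordered_cores, heavy_first))


def greedy_e(jobs, cores):
    """Heaviest job on the strongest core, so the weakest cores idle first."""
    return _pair(jobs, cores, strongest_first=True)


def greedy_l(jobs, cores):
    """Heaviest job on the weakest core, so the strongest cores are spared."""
    return _pair(jobs, cores, strongest_first=False)


def greedy_a(jobs, cores, threshold=0.5):
    """Hybrid: the switching rule and threshold here are assumed, not sourced."""
    utilization = len(jobs) / len(cores)
    policy = greedy_e if utilization >= threshold else greedy_l
    return policy(jobs, cores)
```

For example, with jobs `{"T0": 0.75, "T1": 0.15, "T2": 0.25}` and cores `{"C0": 0.30, "C1": 0.80, "C2": 0.10, "C3": 0.55}`, `greedy_e` places T0 on C1 and leaves C2 idle, while `greedy_l` places T0 on C2 and leaves C1 idle.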
Lifetime Reliability Simulation (FACE)
Offline Characterization
  Benchmark Suite: SPEC2000 (INT & FP)
  SimAlpha: Execution Trace
  Wattch: Power Trace
  HotSpot: Temperature Trace
  Benchmark Profiles
Synthetic Benchmarks
  representative of SPEC2000 suite
  reduces online profiling complexity
Lifetime Reliability Simulation (FACE)
Offline Characterization: Benchmark Suite, SimAlpha, Wattch, HotSpot, Benchmark Profiles
Online Simulation
  Components: Workload Simulator, CMP Simulator, Olay, Monte Carlo Engine
  Reliability Management
    monitors CMP health
    wearout-aware scheduling
    profiling
    intelligent heuristics
  Simulate CMP Aging
    tracks progression of wearout mechanisms
    hierarchical design
  Workload Generation
    emulates OS scheduler
    temperature traces
    power traces
  Parameter Specification
    Device lifetimes
    Utilization pattern
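
To illustrate what a Monte Carlo engine over device lifetimes might look like, here is a minimal sketch. It is not FACE itself. It assumes core lifetimes are drawn independently from a Weibull distribution whose mean equals a given MTTF, and that the event of interest is the k-th core failure. The slides name neither the distribution nor the system-failure criterion, so both are assumptions, as are the function names and the default shape parameter.

```python
import math
import random


def sample_core_lifetimes(n_cores, mttf_years, shape=2.0, rng=None):
    """Draw one lifetime per core from a Weibull with mean mttf_years (assumed model)."""
    if rng is None:
        rng = random.Random()
    # Weibull mean = scale * Gamma(1 + 1/shape), so solve for the scale.
    scale = mttf_years / math.gamma(1.0 + 1.0 / shape)
    return [rng.weibullvariate(scale, shape) for _ in range(n_cores)]


def time_of_kth_failure(lifetimes, k):
    """Time at which the k-th core (1-based) fails."""
    return sorted(lifetimes)[k - 1]


def monte_carlo(n_trials, n_cores, mttf_years, k, seed=0):
    """Average time of the k-th core failure over n_trials simulated chips."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        lifetimes = sample_core_lifetimes(n_cores, mttf_years, rng=rng)
        total += time_of_kth_failure(lifetimes, k)
    return total / n_trials
```

A scheduling policy would enter such a simulation by changing each core's effective wearout rate over time; this sketch leaves that out and samples static lifetimes only.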
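Wearout Modeling
Mean time to failure (MTTF)
  defines distribution of device lifetimes
  NBTI: $MTTF_{NBTI} \propto \left(\frac{1}{V}\right)^{\gamma} e^{\frac{E_{aNBTI}}{\kappa T}}$
  TDDB: $MTTF_{TDDB} \propto \left(\frac{1}{V}\right)^{(a - bT)} e^{\frac{X + Y/T + ZT}{\kappa T}}$
Damage accumulation
  where α is the degradation rate
  [Equations not recoverable: damage D with indices n, n-1, and 0; MTTF_i and MTTF_qual]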
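
The sketch below shows how MTTF proportionalities of this kind can be used in practice. It assumes the commonly cited forms MTTF_NBTI ∝ (1/V)^γ e^(Ea/κT) and MTTF_TDDB ∝ (1/V)^(a − bT) e^((X + Y/T + ZT)/κT). Because these are proportionalities, only ratios between operating points are meaningful, so the helper `scaled_mttf` scales a known qualification-point MTTF. All function names are hypothetical, and every fitting parameter must be supplied by the caller; no values are taken from the slides.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def relative_mttf_nbti(v, t_kelvin, gamma, ea_ev):
    """Unnormalized NBTI MTTF: (1/V)^gamma * exp(Ea / (k*T))."""
    return (1.0 / v) ** gamma * math.exp(ea_ev / (BOLTZMANN_EV_PER_K * t_kelvin))


def relative_mttf_tddb(v, t_kelvin, a, b, x, y, z):
    """Unnormalized TDDB MTTF: (1/V)^(a - b*T) * exp((X + Y/T + Z*T) / (k*T))."""
    exponent = (x + y / t_kelvin + z * t_kelvin) / (BOLTZMANN_EV_PER_K * t_kelvin)
    return (1.0 / v) ** (a - b * t_kelvin) * math.exp(exponent)


def scaled_mttf(mttf_qual, rel_at_operating_point, rel_at_qual_point):
    """Scale a qualification-point MTTF to another operating point via the ratio."""
    return mttf_qual * rel_at_operating_point / rel_at_qual_point
```

With a positive activation energy and a positive voltage exponent, raising either temperature or voltage lowers the NBTI value, which matches the intuition that hotter, higher-voltage operation wears devices out faster.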
CMP Reliability Simulation
Hierarchy: CMP, Core, Module, Transistor
Transistors:
  multiple mechanisms evolve independently
Modules:
  experience load-dependent stress
  smallest granularity of temperature modeling
Cores:
  Alpha 21264-type processor
CMPs:
  variable number of cores
  model systematic variation
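
The hierarchy can be sketched as nested data structures. This is a simplified illustration, not the simulator's design: it stops at module granularity rather than modeling individual transistors, accumulates damage linearly per mechanism, and declares a module failed once any mechanism's damage reaches 1.0. The class names, the linear accumulation rule, and the failure criterion are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Module:
    """Smallest unit modeled here; each wearout mechanism is tracked independently."""
    name: str
    damage: Dict[str, float] = field(
        default_factory=lambda: {"NBTI": 0.0, "TDDB": 0.0}
    )

    def age(self, rates: Dict[str, float], dt: float) -> None:
        """Add rate * dt of damage per mechanism (linear accumulation, assumed)."""
        for mechanism, rate in rates.items():
            self.damage[mechanism] = self.damage.get(mechanism, 0.0) + rate * dt

    @property
    def failed(self) -> bool:
        return any(d >= 1.0 for d in self.damage.values())


@dataclass
class Core:
    modules: List[Module]

    @property
    def failed(self) -> bool:
        # Assumed criterion: a core fails as soon as any of its modules fails.
        return any(m.failed for m in self.modules)


@dataclass
class CMP:
    cores: List[Core]

    def live_cores(self) -> List[Core]:
        return [c for c in self.cores if not c.failed]
```

Load-dependent stress would enter through the `rates` passed to `age`, for example by deriving them from each module's temperature and activity.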
Evaluation
Policies
  Random (baseline), GreedyE, GreedyL, GreedyA
Figures of merit
  Failure distribution
  Useful work performed prior to system failure
Varied system parameters
  CMP size
  System utilization
  Sensor error
Failure Distribution
[Figure: w/ 16 cores]
Sensitivity to System Utilization
[Figure: w/ 16 cores]
Sensitivity to CMP Size
[Figure: w/ 100% utilization & GreedyE]
Sensitivity to Sensor Error
[Figure: w/ 16 cores, 100% utilization, & GreedyE]
Conclusions
Heterogeneity exists in both CMPs and their workloads
Wearout-aware job assignments effectively exploit this heterogeneity
  Real-time health monitoring (low-level sensors)
  CMPs augmented with Olay perform up to 20% more useful work
Proper high-level analysis and profiling is essential for enhancing lifetime reliability.
Questions?