Scheduling Algorithms for Unpredictably Heterogeneous CMP Architectures
J. Winter and D. Albonesi, Cornell University
International Conference on Dependable Systems and Networks, 2008

Paper Overview
“Uniform cores” are not uniform.
E.g., an 8-core Intel Xeon processor is heterogeneous in the sense that the cores do not perform identically, due to hard errors, process variations, etc.
Scheduling should take this heterogeneity into account, matching applications to the capabilities of degraded cores.
Three algorithms: Hungarian, Global Search, Local Search
Goal: reduce energy-delay-squared (ED^2) relative to naïve assignment.

Why can’t we make uniform processors?
There’s Not So Much Room at the Bottom (with apologies to R. Feynman)
As transistors and wires shrink, the number of hard errors per die increases, and devices also wear out faster.
Yields would be too low if every core with a non-fatal fault were thrown out.
Processors will therefore ship with “unpredictably heterogeneous” cores.

What can we do about unpredictably heterogeneous cores?
Hardware solutions
Redundancy, fault diagnosis, defect tolerance: good solutions to certain aspects of the problem, but they do not address scheduling.
Hardware/software solutions (this paper)
Hardware can provide feedback on performance and power dissipation.
The operating system handles global balancing requirements.

Assumptions and Methodology
Assumptions:
Application behavior changes slowly
Interaction between applications is limited
Methodology:
Reduce scheduling problem to Assignment Problem
Hungarian Algorithm or Iterative Optimization

Related Work
Permanent failure toleration techniques
Redundancy to tolerate hard errors, and fault isolation and diagnosis leading to reconfiguration
Mitigation of manufacturing process variations
System-level and fabrication techniques
Using the operating system to improve CMP energy efficiency
Use Dynamic Voltage and Frequency Scaling (DVFS) based on workload
Thermal control
Most previous work deals with homogeneous chip systems

Scheduling Algorithms
Methodology: Assign applications to cores over a fixed, short period of time. Reassess periodically.
Algorithms use the sampling data for the decision.
Hungarian Algorithm:
Solves the “Assignment Problem” by assuming that threads do not interact and that program performance is static.
Uses normalized energy-delay-squared (ED^2) sample results.
O(N^3) complexity
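
The assignment step can be sketched as follows. This is a compact textbook O(N^3) formulation of the Hungarian (Kuhn-Munkres) method using row and column potentials, not necessarily the implementation used in the paper. The function name `assign_apps_to_cores` and the 3x3 cost matrix are made up for illustration; in the paper's setting the entries would be normalized ED^2 samples.

```python
def assign_apps_to_cores(cost):
    """Return a list where result[app] is the core chosen for that app.

    cost[app][core] is the sampled cost (e.g. normalized ED^2) of running
    the application on that core. The matrix must be square.
    """
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)        # potentials for rows (applications)
    v = [0.0] * (n + 1)        # potentials for columns (cores)
    p = [0] * (n + 1)          # p[j] = row currently matched to column j
    way = [0] * (n + 1)        # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = inf
            j1 = 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:     # reached an unmatched column
                break
        while True:            # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    result = [0] * n
    for j in range(1, n + 1):
        result[p[j] - 1] = j - 1
    return result


# Hypothetical cost matrix: rows are applications, columns are cores.
example_cost = [[4, 1, 3],
                [2, 0, 5],
                [3, 2, 2]]
print(assign_apps_to_cores(example_cost))  # [1, 0, 2], total cost 5
```

For an 8-core chip the matrix is 8x8, so the cubic cost is small compared with the length of a scheduling interval.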

Scheduling Algorithms (continued)
Iterative Optimization Algorithms (using an AI approach)
Simple to implement, greedy.
Global Search
A random schedule is tried each interval, and the OS keeps track of the best configuration.
Plus: Fast exploration. Minus: Does not always provide the optimal solution.
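
A minimal sketch of the global search idea, assuming the OS can obtain one cost sample per interval for whatever assignment is currently running. The `measure` callback and the toy cost table are hypothetical stand-ins for that hardware feedback, and the default of 25 intervals is only a placeholder.

```python
import random


def global_search(initial, measure, intervals=25, seed=0):
    """Try a random assignment each interval and remember the best one.

    initial -- list where initial[app] is the core the app starts on
    measure -- hypothetical callback returning the sampled cost
               (e.g. normalized ED^2) of running one interval
    """
    rng = random.Random(seed)
    best, best_cost = list(initial), measure(initial)
    for _interval in range(intervals - 1):
        candidate = list(initial)
        rng.shuffle(candidate)           # a fresh random schedule
        cost = measure(candidate)
        if cost < best_cost:             # keep track of the best seen
            best, best_cost = candidate, cost
    return best, best_cost


# Toy stand-in for hardware feedback: COST[app][core], made-up numbers.
COST = [[4, 1, 3], [2, 0, 5], [3, 2, 2]]


def toy_measure(assignment):
    return sum(COST[app][core] for app, core in enumerate(assignment))


best, best_cost = global_search([0, 1, 2], toy_measure)
```

Because every candidate is drawn independently, the search covers the space quickly but has no guarantee of ever sampling the optimal assignment.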

Global Search

Scheduling Algorithms (continued)
Local Search
Uses a “neighborhood” of assignments that are closely related to the current configuration (using pair-wise swaps).
During exploration, assignments do not change much, and the scheduler reverts to the previous configuration if it was better.
Plus: more gradual search that steadily improves.
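
The pair-wise swap neighborhood can be sketched as below. This is a simplified illustration, not the paper's exact procedure: it accepts or rejects all swaps of an interval as one batch. The `measure` callback, the `swaps` parameter name, and the toy cost table are assumptions made for the example.

```python
import random


def local_search(initial, measure, intervals=25, swaps=1, seed=0):
    """Explore neighbors reached by pair-wise swaps; revert if not better.

    initial -- list where initial[app] is the core the app starts on
    measure -- hypothetical callback returning the sampled cost
               (e.g. normalized ED^2) of running one interval
    swaps   -- number of pair-wise swaps applied per interval
    """
    rng = random.Random(seed)
    current, current_cost = list(initial), measure(initial)
    for _interval in range(intervals - 1):
        candidate = list(current)
        for _swap in range(swaps):
            a, b = rng.sample(range(len(candidate)), 2)
            candidate[a], candidate[b] = candidate[b], candidate[a]
        cost = measure(candidate)
        if cost < current_cost:
            current, current_cost = candidate, cost   # keep the improvement
        # otherwise revert: `current` is left untouched
    return current, current_cost


# Toy stand-in for hardware feedback: COST[app][core], made-up numbers.
COST = [[4, 1, 3], [2, 0, 5], [3, 2, 2]]


def toy_measure(assignment):
    return sum(COST[app][core] for app, core in enumerate(assignment))


final, final_cost = local_search([0, 1, 2], toy_measure)
```

Since a worse neighbor is never kept, the cost of the running configuration can only stay the same or improve from one interval to the next.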

Local Search

Simulation
Based on the SESC simulator (“a microprocessor architectural simulator”).
Augmented with CACTI, Wattch, HotSpot, and HotLeakage.
4 GHz clock frequency, supply voltage of 1.0 V
Single-threaded applications from SPEC CPU2000

Simulation (continued)
Direct interaction among applications on different cores is limited (as per one assumption)
Intercore heating effects are limited by L2 caches surrounding each core, which act as heat sinks.
Off-chip memory bandwidth is statically partitioned among cores
Bottom line: simulation is of “a multi-core processor using single-core simulations to obtain performance, power, and thermal statistics that are then combined by a higher level chip-wide simulator that performs the role of the OS scheduler”

Simulation (continued)
Advantage: Scales to CMPs with a large number of cores
Baseline: 8-core homogeneous chip multiprocessor with no degradation.
Processor degradation types:
Pipeline component disabled (ALU, ROB entries, etc.)
Frequency degradation from process variations
Leakage current variations

Simulation (continued)
Four different workloads (Table 4)
Each benchmark is used evenly among the four workloads.
OS switches between exploration and steady-state.

Results
Comparisons are made using the ED^2 metric against a baseline with no errors or variations and perfect scheduling.
ED^2 was chosen to balance performance with power dissipation.
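
As a rough illustration of the metric, assuming the usual definition of energy-delay-squared as energy multiplied by the square of execution time (the slides do not spell it out), the comparison against the baseline could be computed as below. The function names and the sample numbers are hypothetical.

```python
def ed2(energy_joules, delay_seconds):
    """Energy-delay-squared product: E * D^2 (lower is better)."""
    return energy_joules * delay_seconds ** 2


def ed2_degradation_percent(scheduled, baseline):
    """Percent increase in ED^2 of a schedule relative to the baseline."""
    return (scheduled / baseline - 1.0) * 100.0


# Hypothetical measurements, for illustration only.
baseline = ed2(energy_joules=10.0, delay_seconds=2.0)
degraded = ed2(energy_joules=10.0, delay_seconds=2.2)
print(round(ed2_degradation_percent(degraded, baseline), 1))  # 21.0
```

Because delay enters squared, a 10% slowdown at equal energy costs about 21% in ED^2, which is why the metric weights performance more heavily than power.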

Results (simple scheduling)
Round Robin and Randomized algorithms degrade ED^2 by 22% on average
Worst-case can degrade ED^2 up to 45%

Results (advanced scheduling)
Hungarian:
12.5 million cycle intervals
Each of the eight apps is executed on each core
Rotated seven times for 8x8 cost matrix
7.3% increase in ED^2
200k cycles to solve cost matrix
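
One way to read the rotation step is sketched below. The cyclic shift pattern is an assumption (the slides only say the assignment is rotated seven times), and `sample` is a hypothetical stand-in for the per-interval ED^2 feedback from the hardware.

```python
def rotation_schedule(n_cores):
    """Yield one assignment per sampling interval.

    In interval k, application i runs on core (i + k) % n_cores, so after
    n_cores intervals (the initial placement plus n_cores - 1 rotations)
    every application has been sampled on every core.
    """
    for k in range(n_cores):
        yield [(app + k) % n_cores for app in range(n_cores)]


def build_cost_matrix(n_cores, sample):
    """Fill cost[app][core] from a hypothetical per-interval sampler.

    sample(app, core) stands in for the normalized ED^2 reported for
    that application on that core during one interval.
    """
    cost = [[None] * n_cores for _ in range(n_cores)]
    for assignment in rotation_schedule(n_cores):
        for app, core in enumerate(assignment):
            cost[app][core] = sample(app, core)
    return cost


schedule = list(rotation_schedule(8))
print(len(schedule))      # 8 intervals: initial placement + 7 rotations
print(8 * 12_500_000)     # 100000000 cycles of sampling in total
```

Under this reading, filling the matrix takes eight intervals of 12.5 million cycles, so the 200k cycles needed to solve it are a small fraction of the sampling time.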

Results (advanced scheduling)
Global and local search:
25 intervals of 4 million cycles
Global:
Tries initial configuration and 24 other random configs.
19.5% degradation over baseline

Results (advanced scheduling)
Local:
Three versions: N = 1, 2, 4, where N is the number of pair-wise swaps
N = 2, 4: beneficial pair-wise swaps are kept, others discarded.
15%, 12.6%, and 7.8% degradation, respectively.

Results (overall)
Comparison between degraded and non-degraded systems
Offline Oracle performs better on ED^2 because some degraded cores operate at lower power.
Hungarian and Local Search 4 perform at almost baseline.

Conclusions
CMPs will be affected by more and more variations and hard errors as the technology scales down, creating heterogeneity in otherwise uniform cores.
Naïve scheduling on such cores significantly degrades ED^2.
Under limited core-to-core interaction, the scheduling problem reduces to the Assignment Problem and can be solved with the Hungarian Algorithm. Certain AI schedulers work well, too.

Comments from Wiki