Scheduling Algorithms for Unpredictably Heterogeneous CMP Architectures J. Winter and D. Albonesi, Cornell University International Conference on Dependable

Scheduling Algorithms for Unpredictably Heterogeneous CMP Architectures

J. Winter and D. Albonesi, Cornell University

International Conference on Dependable Systems and Networks, 2008

Paper Overview

“Uniform cores” are not uniform.

E.g., an 8-core Intel Xeon processor is heterogeneous in the sense that the cores do not perform identically, due to hard errors, process variations, etc.

It would be nice to schedule applications on the cores with the heterogeneity in mind, to match the capabilities of degraded cores with the applications.

Three algorithms: Hungarian, Global Search, Local Search

Goal: reduce ED^2 (energy-delay-squared) relative to naïve assignment.

Why can’t we make uniform processors?

There’s Not So Much Room at the Bottom (with apologies to R. Feynman)

As transistors and wires shrink, the number of hard errors per die increases, and the devices also wear out faster.

Yields would be too low if all cores with non-fatal faults were thrown out.

Processors will therefore ship with “unpredictably heterogeneous” cores.

What can we do about unpredictably heterogeneous cores?

Hardware solutions

Redundancy, fault diagnosis, defect tolerance: good solutions to certain aspects of the problem, but they do not address scheduling.

Hardware/Software solutions (this paper)

Hardware can provide feedback on performance and power dissipation.

The operating system handles global balancing requirements.

Assumptions and Methodology

Assumptions:

Application behavior changes slowly

Interaction between applications is limited

Methodology:

Reduce the scheduling problem to the Assignment Problem

Solve it with the Hungarian Algorithm or Iterative Optimization

Related Work

Permanent failure toleration techniques

Redundancy to tolerate hard errors, and fault isolation and diagnosis leading to reconfiguration

Mitigation of manufacturing process variations

System-level and fabrication techniques

Using the operating system to improve CMP energy efficiency

Use Dynamic Voltage and Frequency Scaling based on workload

Thermal control

Most previous work deals with homogeneous chip systems

Scheduling Algorithms

Methodology: Assign applications to cores for a fixed, short period of time. Reassess periodically.

The algorithms base their decisions on sampling data.

Hungarian Algorithm:

Solves the “Assignment Problem” by assuming that threads do not interact and that program performance is static.

Uses normalized energy-delay-squared (ED^2) sample results.

O(N^3) complexity
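To make the reduction to the Assignment Problem concrete, here is a minimal sketch of an O(N^3) Hungarian-style solver in pure Python. It is a standard textbook formulation using row and column potentials, not necessarily the variant the authors implemented. The function name `hungarian` and the layout `cost[app][core]` (sampled, normalized ED^2 of one application on one core) are assumptions made for illustration.

```python
def hungarian(cost):
    """Return (assignment, total) minimizing sum(cost[app][assignment[app]]).

    cost is a square matrix; cost[app][core] is assumed to hold the sampled,
    normalized ED^2 of running application `app` on core `core`.
    """
    n = len(cost)
    INF = float("inf")
    u = [0.0] * (n + 1)    # potentials for applications (rows), 1-based
    v = [0.0] * (n + 1)    # potentials for cores (columns), 1-based
    p = [0] * (n + 1)      # p[j] = row currently matched to column j (0 = none)
    way = [0] * (n + 1)    # back-pointers used to rebuild the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = INF
            j1 = 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            # Shift potentials so that at least one new zero-cost edge appears.
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:       # reached an unmatched column: path complete
                break
        while j0:                # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [0] * n
    for j in range(1, n + 1):
        assignment[p[j] - 1] = j - 1
    total = sum(cost[i][assignment[i]] for i in range(n))
    return assignment, total
```

For the toy matrix `[[4, 1, 3], [2, 0, 5], [3, 2, 2]]`, the call `hungarian(...)` returns the assignment `[1, 0, 2]` with total cost 5, which matches a brute-force search over all six permutations.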

Scheduling Algorithms (continued)

Iterative Optimization Algorithms (using an AI approach)

Simple to implement, greedy.

Global Search

Tries a random schedule each interval; the OS keeps track of the best configuration.

Plus: Fast exploration. Minus: Does not always provide an optimal solution.
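The Global Search idea can be sketched in a few lines of Python. This is an illustrative sketch only: `measure` is a hypothetical callback standing in for "run this schedule for one interval and report chip-wide ED^2", and `schedule[app]` is assumed to hold the core assigned to application `app`.

```python
import random

def global_search(measure, n_cores, n_intervals=25, seed=0):
    """Try the initial schedule plus random ones; keep the best seen so far."""
    rng = random.Random(seed)
    best = list(range(n_cores))          # initial configuration
    best_cost = measure(best)
    for _ in range(n_intervals - 1):
        candidate = list(range(n_cores))
        rng.shuffle(candidate)           # an unrelated random assignment
        cost = measure(candidate)
        if cost < best_cost:             # the OS remembers the best configuration
            best, best_cost = candidate, cost
    return best, best_cost
```

Because every candidate is drawn independently, the search covers the space quickly but has no guarantee of landing on the optimal assignment within a fixed number of intervals.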



Local Search

Uses a “neighborhood” of assignments that are closely related to the current configuration (generated by pair-wise swaps).

During exploration, assignments change little between intervals, and the scheduler reverts if the previous configuration was better.

Plus: more gradual search that steadily improves.
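A rough Python sketch of Local Search with pair-wise swaps follows. It simplifies the scheme by accepting or rejecting each interval's swaps as a group, based on a single chip-wide cost. The callback `measure` (run a schedule for one interval and return its ED^2) and the parameter name `swaps_per_interval` are hypothetical, introduced only for this illustration.

```python
import random

def local_search(measure, n_cores, n_intervals=25, swaps_per_interval=1, seed=0):
    """Explore neighbors reached by pair-wise swaps; revert when they are worse.

    schedule[app] is the core assigned to application `app`.
    Requires 2 * swaps_per_interval <= n_cores so the swapped pairs are disjoint.
    """
    rng = random.Random(seed)
    current = list(range(n_cores))
    current_cost = measure(current)
    for _ in range(n_intervals - 1):
        candidate = current[:]
        # Pick disjoint pairs of applications and exchange their cores.
        apps = rng.sample(range(n_cores), 2 * swaps_per_interval)
        for a, b in zip(apps[0::2], apps[1::2]):
            candidate[a], candidate[b] = candidate[b], candidate[a]
        cost = measure(candidate)
        if cost < current_cost:
            current, current_cost = candidate, cost   # keep the improvement
        # otherwise fall back to the previous configuration
    return current, current_cost
```

Since each candidate differs from the current schedule by only a few swaps, the cost never gets worse from one accepted configuration to the next, which is the "steadily improves" behavior noted above.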

Simulation

Based on the SESC simulator (“a microprocessor architectural simulator”).

Augmented with CACTI, Wattch, HotSpot, and HotLeakage.

4GHz clock frequency, supply voltage of 1.0V

Single-threaded applications from SPEC CPU2000

Simulation (continued)

Direct interaction among applications on different cores is limited (as per one of the assumptions).

Intercore heating effects are limited by the L2 caches surrounding each core, which act as heat sinks.

Off-chip memory bandwidth is statically partitioned among cores.

Bottom line: the simulation is of “a multi-core processor using single-core simulations to obtain performance, power, and thermal statistics that are then combined by a higher level chip-wide simulator that performs the role of the OS scheduler”


Advantage: Scales to CMPs with a large number of cores

Baseline: 8-core homogeneous chip multiprocessor with no degradation.

Processor degradation types:

Pipeline component disabled (ALU, ROB entries, etc.)

Frequency degradation from process variations

Leakage current variations


Four different workloads (Table 4)

Each benchmark is used equally often across the four workloads.

The OS switches between exploration and steady-state phases.

Results

Comparisons are made using the ED^2 metric against a baseline with no errors or variations and perfect scheduling.

ED^2 was chosen to balance performance against power dissipation.
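As a small worked example of how the metric behaves, the sketch below computes energy times delay squared and the percentage degradation relative to a baseline. The function names and the numbers are illustrative assumptions, not values from the paper.

```python
def ed2(energy, delay):
    """Energy-delay-squared: energy multiplied by the square of execution time."""
    return energy * delay ** 2

def degradation_percent(value, baseline):
    """Percentage by which `value` is worse (higher) than `baseline`."""
    return 100.0 * (value / baseline - 1.0)

# Hypothetical case: a core that runs 10% slower at the same average power.
# Energy = power * time, so energy also grows by 10%, and ED^2 grows by 1.1^3.
baseline = ed2(energy=1.0, delay=1.0)
slower = ed2(energy=1.1, delay=1.1)
print(round(degradation_percent(slower, baseline), 1))
```

The cubic dependence on execution time at fixed power is why ED^2 penalizes slowdowns heavily while still rewarding cores that draw less power.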

Results (simple scheduling)

Round Robin and Randomized algorithms degrade ED^2 by 22% on average.

The worst case can degrade ED^2 by up to 45%.

Results (advanced scheduling)

Hungarian:

12.5 million cycle intervals

Each of the eight applications is sampled on each core: the initial assignment is rotated seven times to fill the 8x8 cost matrix.

7.3% increase in ED^2

200k cycles to solve the cost matrix
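One way to picture the rotation-based sampling is the sketch below, which assumes a simple cyclic rotation: in interval k, application a runs on core (a + k) mod N. The exact rotation order used in the paper is not given on the slide, and `sample` is a hypothetical callback returning the normalized ED^2 measured for one application on one core during an interval.

```python
def rotation_schedule(n):
    """Return n assignments; in interval k, application a runs on core (a + k) % n."""
    return [[(app + k) % n for app in range(n)] for k in range(n)]

def build_cost_matrix(n, sample):
    """Fill cost[app][core] by running the initial assignment and n - 1 rotations."""
    cost = [[None] * n for _ in range(n)]
    for assignment in rotation_schedule(n):
        for app, core in enumerate(assignment):
            cost[app][core] = sample(app, core)   # one measurement per interval
    return cost
```

With N = 8 this gives the initial assignment plus seven rotations, so every application is observed on every core exactly once. At 12.5 million cycles per interval, the eight sampling intervals total 100 million cycles, the same exploration budget as 25 intervals of 4 million cycles.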



Global and local search:

25 intervals of 4 million cycles

Global:

Tries the initial configuration and 24 other random configurations.

19.5% degradation over baseline



Local:

Three versions: N = 1, 2, or 4 pair-wise swaps

N = 2, 4: beneficial pair-wise swaps are kept, others discarded.

15%, 12.6%, and 7.8% degradation, respectively.

Results (overall)

Comparison between degraded and non-degraded systems

The Offline Oracle performs better on ED^2 because some degraded cores operate at lower power.

Hungarian and Local Search 4 perform at almost baseline.

Conclusions

CMPs will be affected by more and more variations and hard errors as the technology scales down, creating heterogeneity in otherwise uniform cores.

Naïve scheduling on such cores degrades ED^2.

Under limited core-to-core interaction, the scheduling problem reduces to the Assignment Problem and can be solved with the Hungarian Algorithm. Certain AI schedulers work well, too.

Comments from Wiki
