ECE 510 Brendan Crowley Paper Review October 31, 2006


“Processor Power Reduction Via Single-ISA Heterogeneous Multi-Core Architectures”

Rakesh Kumar, Keith Farkas, Norman P. Jouppi, Partha Ranganathan, Dean M. Tullsen

Presentation Overview

Introduction
The Architecture
Modeling the Architecture
Results
Critical Analysis / Conclusion

Introduction

Background
  Processors continue to gain speed and transistor count as transistor sizes decrease
  This increases power consumption, which causes problems:
    Heat dissipation
    Chip failure
    Battery life
  Designers are always searching for new ways to decrease power consumption

Introduction (2)

Most work on reducing power consumption falls under one of two categories:
  Voltage and frequency scaling
  “Gating”: the ability to turn portions of the core on or off
Some designs have included multiple identical (homogeneous) cores
Others have included processors with co-processors that run a different instruction set

Introduction (3)

The Main Idea
  Different software applications have different resource requirements
  This leads the authors to argue that core diversity is of greater value than uniformity
  The proposed design is therefore a single-ISA heterogeneous multi-core architecture
  Each core runs the same instruction set but has different capabilities and performance characteristics

The Architecture

One method is to take a family of previously designed cores, modify their interfaces, and combine them on one die
Each core executes the same instruction set but contains different resources, and therefore achieves different performance and energy efficiency on the same application

The Architecture (2)

The operating system determines the application’s requirements and decides which core is best to use (that is, which core will be the most energy efficient)
To accommodate a wide variety of applications, the cores should span a wide range of performance

The Architecture (3)

The authors chose a 5-core design, using existing cores with a few changes:
  A hypothetical single-threaded version of the EV8 (Alpha 21464), which they call the “EV8-”
  MIPS R4700
  EV4 (Alpha 21064)
  EV5 (Alpha 21164)
  EV6 (Alpha 21264)

The Architecture (4)

Assumptions
  Each core has private L1 data and instruction caches
  All cores share an L2 cache, phase-locked-loop circuitry, and pins
  Implemented in 0.10 micron technology
  One application running at a time (one thread running)

The Architecture (5)

[Figure: relative core sizes]

The Architecture (6)

Different parts of a program may require different resources
To take full advantage of the core diversity, it is necessary to switch between cores in the middle of program execution
This is done at operating system timeslice intervals, when user state has already been saved to memory
If the OS decides to switch cores, the data is saved to the shared L2 cache, where the next core can retrieve it

The Architecture (7)

The authors assume the unused cores are powered down to avoid static leakage and dynamic switching power
This means time must be spent powering up the cores
Experimental results show that this doesn’t affect performance when core switching is done at OS timer intervals, even with pessimistic assumptions about power-up time and software overhead

Modeling the Architecture

Data on the EV8 was based on some predictions and reported data
Data on the other cores was from published literature
All of the Alpha cores are assumed to run at 2.1 GHz (since a 0.10 micron process is assumed), and the R4700 at 1 GHz

Modeling the Architecture (2)

All architectures were modeled as accurately as possible on a highly detailed instruction-level simulator, using the configurations in the table below

Modeling the Architecture (3)

The table below shows the area and peak power statistics of the cores
Areas were found from die photos
Total die area is approximately 400 mm²

Modeling the Architecture (4)

Benchmark execution was simulated using SMTSIM
The simulator was modified to simulate a multi-core processor with a shared L2 cache
Assume a single thread running on one core at a time
Switching cores requires flushing the active core’s pipeline and writing back the L1 cache lines to the L2 cache
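The switching steps described so far (pipeline flush, L1 write-back to the shared L2, and power-up of the next core) can be put into a rough cost model to see why switching at OS timeslice boundaries is cheap. This is only a sketch under stated assumptions: the function and parameter names are hypothetical, the costs are simply added in series, and apart from the 2.1 GHz clock none of the example values come from the paper.

```python
def switch_overhead_fraction(l1_lines_to_write, writeback_cycles_per_line,
                             flush_cycles, powerup_cycles,
                             clock_hz, timeslice_s):
    """Fraction of one OS timeslice consumed by a core switch (toy model).

    All costs are summed serially, which is pessimistic: in practice the
    next core could power up while the old core writes back its L1 lines.
    """
    overhead_cycles = (flush_cycles
                       + l1_lines_to_write * writeback_cycles_per_line
                       + powerup_cycles)
    timeslice_cycles = clock_hz * timeslice_s
    return overhead_cycles / timeslice_cycles


# Illustrative values only (assumptions, not measurements from the paper):
# 1024 L1 lines at 20 cycles each, a 50-cycle flush, a deliberately large
# 1,000,000-cycle power-up, and a 10 ms timeslice at 2.1 GHz.
fraction = switch_overhead_fraction(1024, 20, 50, 1_000_000, 2.1e9, 0.010)
print(f"switch overhead: {fraction:.2%} of one timeslice")
```

Even with a power-up cost chosen to be large, the overhead in this model stays a small fraction of the interval, because a timeslice spans tens of millions of cycles.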

Results

The following figure shows results for the SPEC application applu
The Y-axis, IPS²/W, is basically the inverse of power-delay product
Constraint:
  Never choose a core that sacrifices more than 50% performance relative to EV8- over an interval
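The selection rule implied by this slide can be sketched as follows: for each interval, discard any core that falls below the performance floor relative to EV8-, then pick the remaining core with the highest IPS²/W. This is a minimal sketch, not the authors' implementation; the `CoreSample` type, the `pick_core` function, and every number in the sample data are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreSample:
    name: str
    ips: float    # instructions per second over the interval
    watts: float  # average power over the interval


def pick_core(samples, baseline="EV8-", max_perf_loss=0.5):
    """Return the eligible core with the highest IPS^2/W for one interval."""
    base_ips = next(s.ips for s in samples if s.name == baseline)
    floor = (1.0 - max_perf_loss) * base_ips
    # The baseline core always meets the floor, so this list is never empty.
    eligible = [s for s in samples if s.ips >= floor]
    return max(eligible, key=lambda s: s.ips ** 2 / s.watts)


# Hypothetical per-interval measurements (not data from the paper).
samples = [
    CoreSample("EV8-", ips=2.0e9, watts=90.0),
    CoreSample("EV6", ips=1.5e9, watts=20.0),
    CoreSample("EV5", ips=0.9e9, watts=5.0),
    CoreSample("R4700", ips=0.3e9, watts=1.0),
]
print(pick_core(samples).name)                     # EV6
print(pick_core(samples, max_perf_loss=0.6).name)  # EV5
```

With the 50% floor, EV5 is excluded despite its better IPS²/W; loosening the floor to 60% lets it win, which shows how the constraint trades performance for efficiency without any hardware change.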

Results (2)

Results (3)

Compared to a single-core architecture, this design could ideally reduce the PDP by 74%
  Combination of 25% performance loss and 81% energy savings
Could change the constraint to achieve greater PDP savings (sacrificing performance, of course)
Another design point gives 36% energy savings with 4% performance loss
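The combined figure can be sanity-checked with a few lines of arithmetic. The sketch below assumes the reported reduction is in energy multiplied by delay, which is what the inverse of IPS²/W measures per instruction; that reading, the interpretation of "performance loss" as a drop in throughput, and the function name are all assumptions rather than statements from the paper.

```python
def energy_delay_reduction(perf_loss, energy_savings):
    """Fractional reduction in (energy x delay) relative to the baseline core.

    perf_loss is treated as a fractional drop in throughput, so run time
    grows by a factor of 1 / (1 - perf_loss).
    """
    relative_energy = 1.0 - energy_savings
    relative_delay = 1.0 / (1.0 - perf_loss)
    return 1.0 - relative_energy * relative_delay


print(f"{energy_delay_reduction(0.25, 0.81):.1%}")  # 74.7%
print(f"{energy_delay_reduction(0.04, 0.36):.1%}")  # 33.3%
```

Under these assumptions, 25% performance loss with 81% energy savings reproduces a reduction of roughly 74%, while the second design point gives about a one-third reduction at almost no performance cost.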

Results (4)

Other metrics besides PDP could be optimized, depending on the design goals
Different power and performance tradeoffs can be made simply by changing the core-switching algorithm (no need to change the hardware)

Critical Analysis / Conclusion

The paper makes a lot of assumptions about things like frequency scaling, power consumption of cores, etc.
This paper only reports results for one benchmark application
Multiple cores/threads running at the same time would likely be used in practice
  How would this affect the core-switching complexity and latency?

Critical Analysis / Conclusion (2)

This technique seems like a very good one
  Homogeneous multi-core chips are already on the market
  Potential for significant energy savings
