ECE 510
Brendan Crowley
Paper Review
October 31, 2006
“Processor Power Reduction Via Single-ISA
Heterogeneous Multi-Core Architectures”
Rakesh Kumar, Keith Farkas, Norman P. Jouppi, Partha
Ranganathan, Dean M. Tullsen
Presentation Overview
Introduction
The Architecture
Modeling the Architecture
Results
Critical Analysis / Conclusion
Introduction
Background
Processors continue to have increased speed and transistor count as transistor sizes decrease
This leads to increased power consumption which causes problems
Heat dissipation
Chip failure
Battery life
Designers are always searching for new ways to decrease power consumption
Introduction (2)
Most work on reducing power consumption falls under one of two categories:
Voltage and frequency scaling
“Gating” – the ability to turn on/off portions of the core
Some designs have included the use of multiple identical (homogeneous) cores
Others have included processors with co-processors that run a different instruction set
Introduction (3)
The Main Idea
Different software applications have different resource requirements
This fact leads the authors to believe that core diversity is of greater value than uniformity
Therefore, the proposed design is a single-ISA heterogeneous multi-core architecture
Each core runs the same instruction set, but has different abilities and performance characteristics
The Architecture
One method is to take a family of previously designed cores, modify their interfaces, and combine them on one die
Each core executes the same instruction set, but contains different resources, and therefore achieves different performance and energy efficiency on the same application
The Architecture (2)
The operating system determines the application’s requirements and decides which core is best to use (which core will be the most energy efficient)
To accommodate a wide variety of applications, the cores should span a wide range of performance levels
The Architecture (3)
The authors chose a 5-core design, using existing cores with a few changes:
Hypothetical single-threaded version of the EV8 (Alpha 21464), which they call the “EV8-”
MIPS R4700
EV4 (Alpha 21064)
EV5 (Alpha 21164)
EV6 (Alpha 21264)
The Architecture (4)
Assumptions
Each core has a private L1 data and instruction cache
All cores share an L2 cache, phase-locked-loop circuitry and pins
Implemented in 0.10 micron technology
One application running at a time (one thread running)
The Architecture (5)
Relative core sizes
The Architecture (6)
Different parts of a program may require different resources
To take full advantage of the core diversity it is necessary to switch between cores in the middle of program execution
This is done at operating system timeslice intervals, with user-state already saved to memory
If the OS decides to switch cores, the data is saved to the shared L2 cache, where the next core can retrieve it
The Architecture (7)
The authors assume the unused cores are powered down to avoid static leakage and dynamic switching power
This means time must be spent powering up the cores
Experimental results show that this doesn’t affect performance when core-switching is done at OS timer intervals, even with pessimistic assumptions about power-up time and software overhead
Modeling the Architecture
Data on the EV8 was based on some predictions and reported data
Data on the other cores was from published literature
The authors assume all of the Alpha cores run at 2.1 GHz (given the assumed 0.10 micron process), and the R4700 runs at 1 GHz
Modeling the Architecture (2)
All architectures were modeled as accurately as possible on a highly detailed instruction-level simulator, using the configurations in the table below
Modeling the Architecture (3)
The table below shows the area and peak power statistics of the cores
Areas were found from die photos
Total die area is approximately 400 mm²
Modeling the Architecture (4)
Benchmark execution was simulated using SMTSIM
The simulator was modified to simulate a multi-core processor with a shared L2 cache
Assume a single thread running on one core at a time
Switching cores requires flushing the active core’s pipeline and writing back its L1 cache lines to the L2 cache
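The switch steps listed above can be sketched as a toy model. This is only an illustration under stated assumptions: the `Core` class, the `switch_core` helper, and the dictionary standing in for the shared L2 cache are hypothetical names, not part of SMTSIM or the paper, and no timing or power-up latency is modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Hypothetical stand-in for one core of the heterogeneous chip."""
    name: str
    powered: bool = False
    pipeline: list = field(default_factory=list)  # in-flight instructions
    l1_dirty: dict = field(default_factory=dict)  # address -> modified data


def switch_core(active: Core, target: Core, shared_l2: dict) -> Core:
    """Move the single running thread from `active` to `target`."""
    if target is active:
        return active
    target.powered = True              # power-up latency is not modeled here
    active.pipeline.clear()            # flush the active core's pipeline
    shared_l2.update(active.l1_dirty)  # write dirty L1 lines back to shared L2
    active.l1_dirty.clear()
    active.powered = False             # the unused core is powered down
    return target
```

After a switch, the new core finds the thread's modified data in the shared L2 cache, and only one core is powered at a time, matching the single-thread assumption.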
Results
The following figure shows results for the SPEC application applu
The Y-axis, IPS²/W, is basically the inverse of power-delay product
Constraint:
Never choose a core that sacrifices more than 50% performance relative to EV8- over an interval
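The constrained choice above can be sketched as a small selection function. This is a minimal sketch, not the authors' algorithm: `pick_core` and its parameters are hypothetical names, and the per-interval IPS and power samples in the usage example are made-up numbers chosen only to show the mechanics.

```python
def pick_core(samples, baseline="EV8-", max_perf_loss=0.5):
    """Pick the core with the highest IPS^2/W for one interval.

    samples: {core_name: (ips, watts)} for the interval (hypothetical input).
    Cores that lose more than `max_perf_loss` of the baseline core's
    performance are never chosen.
    """
    base_ips = samples[baseline][0]
    eligible = {
        name: (ips, watts)
        for name, (ips, watts) in samples.items()
        if ips >= (1.0 - max_perf_loss) * base_ips
    }
    # The baseline itself always passes the filter, so `eligible` is non-empty.
    return max(eligible, key=lambda n: eligible[n][0] ** 2 / eligible[n][1])


# Made-up samples for one interval: (instructions per second, watts).
interval = {
    "EV8-": (2.0e9, 40.0),
    "EV6": (1.5e9, 10.0),
    "EV5": (0.9e9, 3.0),
    "EV4": (0.5e9, 2.0),
}
choice_strict = pick_core(interval)                     # 50% limit
choice_loose = pick_core(interval, max_perf_loss=0.6)   # relaxed limit
```

With these invented numbers, the 50% limit excludes EV5 and EV4, so EV6 is chosen. Relaxing the limit to 60% admits EV5, which then wins on IPS²/W. Only the policy changed, not the hardware.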
Results (2)
Results (3)
Compared to a single-core architecture, this design could ideally reduce the PDP by 74%
Combination of 25% performance loss and 81% energy savings
Could change the constraint to achieve greater PDP savings (sacrificing performance, of course)
Another design point gives 36% energy savings with 4% performance loss
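A quick arithmetic check of how the quoted numbers combine. It rests on two assumptions that the slides do not state: that the 74% figure is the reduction in the product of energy and execution time, and that a 25% performance loss means throughput drops to 0.75x, so execution time grows by a factor of 1/0.75. The function name is hypothetical.

```python
def energy_delay_reduction(perf_loss, energy_savings):
    """Fractional reduction in (energy x execution time) versus the baseline.

    Assumes perf_loss is a fractional throughput loss, so relative
    execution time is 1 / (1 - perf_loss).
    """
    relative_delay = 1.0 / (1.0 - perf_loss)
    relative_energy = 1.0 - energy_savings
    return 1.0 - relative_energy * relative_delay


# 25% performance loss combined with 81% energy savings
reduction = energy_delay_reduction(0.25, 0.81)
```

Under these assumptions the product falls to 0.19 / 0.75 ≈ 0.253 of the baseline, a reduction of roughly 74%, which is consistent with the figure quoted above.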
Results (4)
Could optimize other metrics besides PDP, depending on the design goals
Different power and performance tradeoffs can be made simply by changing the core switching algorithm (no need to change the hardware)
Critical Analysis / Conclusion
There are a lot of assumptions made about things like frequency scaling, power consumption of cores, etc.
This paper only reports results for one benchmark application
Multiple cores/threads running at the same time would likely be used in practice
How would this affect the core switching complexity and latency?
Critical Analysis / Conclusion (2)
This technique seems like a very good one
Homogeneous multi-core chips are already on the market
Potential for significant energy savings