Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
SCALCORE: DESIGNING A CORE
FOR VOLTAGE SCALABILITY
Bhargava Gopireddy, Choungki Song, Josep Torrellas, Nam Sung Kim, Aditya Agrawal, Asit Mishra University of Illinois, University of Wisconsin, Nvidia, Intel
Motivation 2
¨ Application characteristics vary dynamically
¨ Goal: Single core design that attains ¤ High performance at nominal voltage (~0.9V) ¤ High energy efficiency at low voltage (~0.5V)
A Voltage-Scalable Core
Observations 3
¨ SRAM vs Logic delay scaling ¨ Small increase in V à large improvement in delay
0
1
2
3
4
5
6
7
8
9
10
11
12
0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45
Nor
mal
ized
Del
ay
Vdd (V)
SRAMDelay LogicDelay
~2x
Vmin
ScalCore Idea 4
¨ Design a Voltage-Scalable core based on the two observations
100
1000
10000
0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45
Fre
quen
cy (
MH
z)
Vdd (V)
SRAM-Freq Logic-Freq
Vop
fop ~ 1200
Vm
in
fmin ~ 900
Vlo
gic
flogic ~ 600
Vno
m
fnom ~ 3500
Vm
in
fmin ~ 900
ScalCore: A Voltage-Scalable Core 5
¨ Decouple Vdd of logic and storage structures in the pipeline ¤ Improve energy efficiency
¨ Raise Vdd of storage structures a little ¤ Reconfigure the pipeline to take advantage of faster storage structures ¤ Further improve performance and reduce leakage energy !!
¨ Evaluated the proposed design under various scenarios ¤ Reduce execution time by 31%, energy by 48% and ED by 60% over
the state of the art
Truly Voltage Scalable Core !!
ScalCore Design 6
¨ Goal: Maximize the energy-efficiency at low-voltage (EEMode)
¨ Constraint: No impact on performance or energy at nominal voltage (HPMode)
¨ Approach: In EEMode, ¤ Provide different Vdds for storage and logic stages
n Storage stage ~2x faster than logic stage
¤ Reconfigure pipeline in one of the two ways n Fuse storage stages in the pipeline (e.g. Accessing Register File) n Increase storage structure sizes
Fusing Two Pipeline Stages into One 7
HPMode
Lat
ch
Storage Stage 2a L
atch
Storage Stage 2b L
atch
CLK
Logic Stage 3 L
atch
Logic Stage 1 L
atch
Vnom Vnom Vnom Vnom
Fusing Two Pipeline Stages into One 8
Enable Flow-through
0
HPMode
Lat
ch
2a
Lat
ch
Lat
ch
CLK
Logic Stage 3 L
atch
Logic Stage 1 L
atch
2b
Vlogic Vlogic Vop Vop Enable Flow-through
1
EEMode
Lat
ch
Storage Stage 2a L
atch
Storage Stage 2b L
atch
CLK
Logic Stage 3 L
atch
Logic Stage 1 L
atch
Vnom Vnom Vnom Vnom
Increasing Size of Structures 9
Array 0 Vnom
Decoder
Array 1 Disabled
Sense Amp Sense Amp
Data Select
HPMode 0 1
0 1
Array 0 Vop
Decoder
Array 1 Vop
Sense Amp Sense Amp
Data Select
EEMode 1 0
1 0
Array 0 Vnom
Decoder
Sense Amp
Data Select
Original
ScalCore Pipeline 10
Next PC
Decode
BranchPred.
Allocate
Register F
ile R
egister File
EX
EX
EX
LSU
ROB/Completion
Table
Main storage structures
Fetch Decode Rename Dispatch Wakeup Select Data Read
Source Drive Ex Ex Write
back
L1-I
L1-D
Register F
ile R
egister File LSU
ROB/Completion
Table
Allocate
BranchPred.
Pipeline Structure Changes 11
Fuse Stages Increase Size
Register File Array access + Source drive 1.5X PRF
Allocation Rename + Dispatch ---
Load Store Unit Addr. generation + Memory disambiguation
1.5X Load/Store Queue and Store buffer
Reorder Buffer --- 1.5X ROB
Branch Prediction --- ---
Controlpath Changes 12
¨ Programmable counters for variable latencies ¤ Execution schedules ¤ Wakeup logic
¨ Programmable counters for variable sizes ¤ List of free registers ¤ ROB, Load/Store queue sizes
Circuit Issues 13
EEMode Storage Stage 2a
Storage Stage 2b
Logic Stage 1
Lat
ch w
/ L
evel
C
onv.
Lat
ch
Lat
ch
Logic Stage 3
Vlogic
Vop
Vnom
HPMode
Storage Stage 2a
Storage Stage 2b L
atch
F.Ishihara, F.Sheikh, and B.Nikolic,“Level Conversion for Dual-Supply Systems,” IEEE Transactions on VLSI Systems, February 2004
Circuit Issues 14
Vlogic
Vop
d q (inv)
ck
ck clk
EEMode Storage Stage 2a
Storage Stage 2b
Logic Stage 1
Lat
ch w
/ L
evel
C
onv.
Lat
ch
Lat
ch
Logic Stage 3
Vlogic
Vop
Vnom
HPMode
Storage Stage 2a
Storage Stage 2b L
atch
F.Ishihara, F.Sheikh, and B.Nikolic,“Level Conversion for Dual-Supply Systems,” IEEE Transactions on VLSI Systems, February 2004
Evaluation Methodology 15
¨ 16 OOO cores ¤ 64K I-L1/D-L1, 1MB L2
¨ Voltages and frequencies ¤ HPMode
n Vnom = 0.90 V, f=3.5GHz
¤ EEMode n Vlogic = 0.50 V, f=600MHz n Vop = 0.65 V, f=600MHz n Vmin = 0.6V, f=900MHz
100
1000
10000
0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45
Fre
quen
cy (
MH
z)
Vdd (V)
Vop
fop ~ 1200
Vm
in
fmin ~ 900
Vlo
gic
flogic ~ 600
Vno
m
fnom ~ 3500
Vm
in
fmin ~ 900
Configurations 16
¨ Baseline: ¤ HPRef: Conventional HP (3.5GHz,0.9V) ¤ DVFS: Most aggressive (900MHz,0.6V)
¨ ScalCore: ¤ HPMode: HPRef with penalty (3.3GHz,0.9V) ¤ EEMode (600MHz,0.5V,0.65V)
n Pipe2Vdd: Separate voltages only n SC: Fuse the two stages of RF, Allocate ;
Increase size ROB, LSQ structures
Iso-Power Comparison for Parallel Programs 17
2.6 2.7
0
0.2
0.4
0.6
0.8
1
1.2
1.4
Average Execution Time 7.0 7.5 1.9
0
0.2
0.4
0.6
0.8
1
1.2
1.4
Average Energy-Delay
Reduced execution time by 31%; ED by 60% compared to DVFS
15% 31% 60%
23%
Dynamic Workloads 18
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
Execution Time Energy Energy-Delay Product
Dyn-Baseline
Static-16-SC
Dyn-SC
Dyn-SC reduces execution time by 31%, energy by 22% and ED by 46% compared to conventional DVFS
31%
22%
46%
28%
19%
15%
Also in the Paper 19
¨ ScalCore complexities and overheads ¨ Comparison with Intel Claremont ¨ Impact on different classes of applications ¨ More results
¤ Unconstrained power budget ¤ Intermediate design points of ScalCore
Conclusion 20
¨ Presented ScalCore, a core designed for voltage scalability
¨ Designed a voltage-scalable core by ¤ Decoupling Vdd of logic and storage structures in the pipeline ¤ Raising Vdd of storage structures a little
¤ Reconfiguring the pipeline to take advantage of faster storage structures and improved performance, energy
SCALCORE: DESIGNING A CORE
FOR VOLTAGE SCALABILITY
Bhargava Gopireddy, Choungki Song, Josep Torrellas, Nam Sung Kim, Aditya Agrawal, Asit Mishra University of Illinois, University of Wisconsin, Nvidia, Intel
Why not Big/Little ? 22
¨ Heterogeneous cores on a chip ideally suited for different tasks
¨ Fixed partitioning of cores ¨ A fraction of chip unused ¨ Migration overhead
Configurations 23
¨ Baseline: ¤ HPRef: Conventional HP (3.5GHz,0.9V) ¤ DVFS: Most aggressive (900MHz,0.6V)
¨ ScalCore: ¤ HPMode: HPRef with penalty (3.3GHz,0.9V) ¤ EEMode (600MHz,0.5V,0.65V)
n Pipe2Vdd: Separate voltages only n SCspeed: Fuse the two stages of RF, LSU, Allocate n SC: Fuse the two stages of RF, Allocate
Increase size ROB, LSQ structures