SCALCORE: DESIGNING A CORE FOR VOLTAGE SCALABILITYiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_hpca16_2.pdf · 2016-03-16 · ScalCore: A Voltage-Scalable Core 5 ! Decouple V dd

SCALCORE: DESIGNING A CORE

FOR VOLTAGE SCALABILITY

Bhargava Gopireddy, Choungki Song, Josep Torrellas, Nam Sung Kim, Aditya Agrawal, Asit Mishra University of Illinois, University of Wisconsin, Nvidia, Intel

Motivation 2

¨  Application characteristics vary dynamically

¨  Goal: Single core design that attains ¤ High performance at nominal voltage (~0.9V) ¤ High energy efficiency at low voltage (~0.5V)

A Voltage-Scalable Core

Observations 3

¨  SRAM vs Logic delay scaling ¨  Small increase in V à large improvement in delay

0

1

2

3

4

5

6

7

8

9

10

11

12

0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45

Nor

mal

ized

Del

ay

Vdd (V)

SRAMDelay LogicDelay

~2x

Vmin

ScalCore Idea 4

¨  Design a Voltage-Scalable core based on the two observations

100

1000

10000

0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45

Fre

quen

cy (

MH

z)

Vdd (V)

SRAM-Freq Logic-Freq

Vop

fop ~ 1200

Vm

in

fmin ~ 900

Vlo

gic

flogic ~ 600

Vno

m

fnom ~ 3500

Vm

in

fmin ~ 900

ScalCore: A Voltage-Scalable Core 5

¨  Decouple Vdd of logic and storage structures in the pipeline ¤  Improve energy efficiency

¨  Raise Vdd of storage structures a little ¤  Reconfigure the pipeline to take advantage of faster storage structures ¤  Further improve performance and reduce leakage energy !!

¨  Evaluated the proposed design under various scenarios ¤  Reduce execution time by 31%, energy by 48% and ED by 60% over

the state of the art

Truly Voltage Scalable Core !!

ScalCore Design 6

¨  Goal: Maximize the energy-efficiency at low-voltage (EEMode)

¨  Constraint: No impact on performance or energy at nominal voltage (HPMode)

¨  Approach: In EEMode, ¤  Provide different Vdds for storage and logic stages

n  Storage stage ~2x faster than logic stage

¤  Reconfigure pipeline in one of the two ways n  Fuse storage stages in the pipeline (e.g. Accessing Register File) n  Increase storage structure sizes

Fusing Two Pipeline Stages into One 7

HPMode

Lat

ch

Storage Stage 2a L

atch

Storage Stage 2b L

atch

CLK

Logic Stage 3 L

atch

Logic Stage 1 L

atch

Vnom Vnom Vnom Vnom

Fusing Two Pipeline Stages into One 8

Enable Flow-through

0

HPMode

Lat

ch

2a

Lat

ch

Lat

ch

CLK

Logic Stage 3 L

atch

Logic Stage 1 L

atch

2b

Vlogic Vlogic Vop Vop Enable Flow-through

1

EEMode

Lat

ch

Storage Stage 2a L

atch

Storage Stage 2b L

atch

CLK

Logic Stage 3 L

atch

Logic Stage 1 L

atch

Vnom Vnom Vnom Vnom

Increasing Size of Structures 9

Array 0 Vnom

Decoder

Array 1 Disabled

Sense Amp Sense Amp

Data Select

HPMode 0 1

0 1

Array 0 Vop

Decoder

Array 1 Vop

Sense Amp Sense Amp

Data Select

EEMode 1 0

1 0

Array 0 Vnom

Decoder

Sense Amp

Data Select

Original

ScalCore Pipeline 10

Next PC

Decode

BranchPred.

Allocate

Register F

ile R

egister File

EX

EX

EX

LSU

ROB/Completion

Table

Main storage structures

Fetch Decode Rename Dispatch Wakeup Select Data Read

Source Drive Ex Ex Write

back

L1-I

L1-D

Register F

ile R

egister File LSU

ROB/Completion

Table

Allocate

BranchPred.

Pipeline Structure Changes 11

Fuse Stages Increase Size

Register File Array access + Source drive 1.5X PRF

Allocation Rename + Dispatch ---

Load Store Unit Addr. generation + Memory disambiguation

1.5X Load/Store Queue and Store buffer

Reorder Buffer --- 1.5X ROB

Branch Prediction --- ---

Controlpath Changes 12

¨  Programmable counters for variable latencies ¤ Execution schedules ¤ Wakeup logic

¨  Programmable counters for variable sizes ¤ List of free registers ¤ ROB, Load/Store queue sizes

Circuit Issues 13

EEMode Storage Stage 2a

Storage Stage 2b

Logic Stage 1

Lat

ch w

/ L

evel

C

onv.

Lat

ch

Lat

ch

Logic Stage 3

Vlogic

Vop

Vnom

HPMode

Storage Stage 2a

Storage Stage 2b L

atch

F.Ishihara, F.Sheikh, and B.Nikolic,“Level Conversion for Dual-Supply Systems,” IEEE Transactions on VLSI Systems, February 2004

Circuit Issues 14

Vlogic

Vop

d q (inv)

ck

ck clk

EEMode Storage Stage 2a

Storage Stage 2b

Logic Stage 1

Lat

ch w

/ L

evel

C

onv.

Lat

ch

Lat

ch

Logic Stage 3

Vlogic

Vop

Vnom

HPMode

Storage Stage 2a

Storage Stage 2b L

atch

F.Ishihara, F.Sheikh, and B.Nikolic,“Level Conversion for Dual-Supply Systems,” IEEE Transactions on VLSI Systems, February 2004

Evaluation Methodology 15

¨  16 OOO cores ¤  64K I-L1/D-L1, 1MB L2

¨  Voltages and frequencies ¤  HPMode

n  Vnom = 0.90 V, f=3.5GHz

¤  EEMode n  Vlogic = 0.50 V, f=600MHz n  Vop = 0.65 V, f=600MHz n  Vmin = 0.6V, f=900MHz

100

1000

10000

0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45

Fre

quen

cy (

MH

z)

Vdd (V)

Vop

fop ~ 1200

Vm

in

fmin ~ 900

Vlo

gic

flogic ~ 600

Vno

m

fnom ~ 3500

Vm

in

fmin ~ 900

Configurations 16

¨  Baseline: ¤ HPRef: Conventional HP (3.5GHz,0.9V) ¤ DVFS: Most aggressive (900MHz,0.6V)

¨  ScalCore: ¤ HPMode: HPRef with penalty (3.3GHz,0.9V) ¤ EEMode (600MHz,0.5V,0.65V)

n Pipe2Vdd: Separate voltages only n SC: Fuse the two stages of RF, Allocate ;

Increase size ROB, LSQ structures

Iso-Power Comparison for Parallel Programs 17

2.6 2.7

0

0.2

0.4

0.6

0.8

1

1.2

1.4

Average Execution Time 7.0 7.5 1.9

0

0.2

0.4

0.6

0.8

1

1.2

1.4

Average Energy-Delay

Reduced execution time by 31%; ED by 60% compared to DVFS

15% 31% 60%

23%

Dynamic Workloads 18

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

Execution Time Energy Energy-Delay Product

Dyn-Baseline

Static-16-SC

Dyn-SC

Dyn-SC reduces execution time by 31%, energy by 22% and ED by 46% compared to conventional DVFS

31%

22%

46%

28%

19%

15%

Also in the Paper 19

¨  ScalCore complexities and overheads ¨  Comparison with Intel Claremont ¨  Impact on different classes of applications ¨  More results

¤ Unconstrained power budget ¤  Intermediate design points of ScalCore

Conclusion 20

¨  Presented ScalCore, a core designed for voltage scalability

¨  Designed a voltage-scalable core by ¤  Decoupling Vdd of logic and storage structures in the pipeline ¤  Raising Vdd of storage structures a little

¤  Reconfiguring the pipeline to take advantage of faster storage structures and improved performance, energy

SCALCORE: DESIGNING A CORE

FOR VOLTAGE SCALABILITY

Bhargava Gopireddy, Choungki Song, Josep Torrellas, Nam Sung Kim, Aditya Agrawal, Asit Mishra University of Illinois, University of Wisconsin, Nvidia, Intel

Why not Big/Little ? 22

¨  Heterogeneous cores on a chip ideally suited for different tasks

¨  Fixed partitioning of cores ¨  A fraction of chip unused ¨  Migration overhead

Configurations 23

¨  Baseline: ¤ HPRef: Conventional HP (3.5GHz,0.9V) ¤ DVFS: Most aggressive (900MHz,0.6V)

¨  ScalCore: ¤ HPMode: HPRef with penalty (3.3GHz,0.9V) ¤  EEMode (600MHz,0.5V,0.65V)

n  Pipe2Vdd: Separate voltages only n  SCspeed: Fuse the two stages of RF, LSU, Allocate n  SC: Fuse the two stages of RF, Allocate

Increase size ROB, LSQ structures

Documents

SCALCORE: DESIGNING A CORE FOR VOLTAGE SCALABILITYiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_hpca16_2.pdf · 2016-03-16 · ScalCore: A Voltage-Scalable Core 5 ! Decouple V dd