Dynamic Frequency-Voltage Scaling for Multiple Clock Domain Processors and Implications on Asymmetric Multiple Core Processors Avshalom Elyada



DVS, Avshalom Elyada, EE Faculty, Technion

Based primarily on the work of

Greg Semeraro, David H. Albonesi et al.

University of Rochester, NY.

And also

Diana Marculescu et al.

Carnegie Mellon University, PA.


Outline
• Multiple Clock Domains
• Inter-domain communication and synchronization
• Dynamic Frequency-Voltage Scaling
• Scaling algorithms
  – Offline, Attack-Decay, Dynamic Profiling
  – Results comparison
• DVS in Multiple Core Processors...?


End of the Road for Globally-Synchronous
• Global hi-freq clock does not scale well
  – Low clock reachability within a single clock cycle
• Interconnect does not scale well
• Clock-tree complexity, skew, power-inefficiency


Multiple-Clock-Domains or Globally Asynchronous Locally Synchronous

• Divide core into separate clock domains

• Synchronize communication between synchronous “islands”

• Speed up the frequency of separate, smaller domains

• Good inter-domain communication design
  – To minimize synchronization performance costs

• Retain traditional synchronous knowledge-base


MCD Processor (Alpha 21264-like Model, Rochester DVS research)


Multi-Synchronous

[Diagram labels: Globally Synchronous (single clock); Multi-Synchronous (each domain has a separate clock at the same frequency); MCD]


Dynamic Frequency-Voltage Scaling

• If all domains always run at max freq, this is usually a waste of power

• Only critical domain need run at max freq, others can run slower

• This saves power

• Performance degradation should be minimal


MCD and GALS

[Diagram labels: Globally Synchronous (single clock); Multi-Synchronous (each domain has a separate clock at the same frequency); Globally Async Locally Sync (async domains: different frequency per domain); MCD]


Integer Dominated


Load-Store Dominated


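D(F)VS Continued
• 20-40% Energy-Delay improvement
• Voltage scales down with freq, saving additional power:
  – Potential for x3 savings

$\text{Power} \propto f \cdot V_{DD}^{2}$

$\text{Delay} \propto \frac{1}{V_{DD}} \propto \frac{1}{f}$

$V_{DD} \propto f \;\Rightarrow\; \text{Power}(f) \propto f \cdot f^{2} = f^{3}$

• Careful: wrong scaling is catastrophic for performance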
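
A quick numeric check of the power relations on this slide, as a minimal sketch. It assumes the idealized model in which V_DD scales linearly with f, so relative power goes as the cube of the frequency ratio. The function names are illustrative and not taken from the talk.

```python
def relative_power(f_ratio: float) -> float:
    """Power relative to full speed, assuming Power ∝ f * V_DD^2 and V_DD ∝ f."""
    if not 0.0 < f_ratio <= 1.0:
        raise ValueError("f_ratio must be in (0, 1]")
    v_ratio = f_ratio  # assumption: voltage scales linearly with frequency
    return f_ratio * v_ratio ** 2


def relative_energy(f_ratio: float) -> float:
    """Energy for a fixed amount of work whose run time scales as 1/f."""
    return relative_power(f_ratio) / f_ratio


for f in (1.0, 0.8, 0.5):
    print(f"f={f:.1f}  power={relative_power(f):.3f}  energy={relative_energy(f):.3f}")
```

Under these assumptions, halving a domain's frequency cuts its power to one eighth and the energy for the same work to one quarter, which is why slowing only the non-critical domains is attractive.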


Scaling is Gradual and Occurs During Regular Operation

• F may be decreased before V is decreased

• V must be increased before F may increase

[Figure: F-V working points. Axes: Voltage and Freq (MHz). Labeled values: 1.000V, 1.172V; 727.3, 729.6]
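
The ordering rule on this slide (lower F before V, raise V before F) can be sketched as a small helper that returns the order of operations for a change of working point. This is an illustration only: `transition_steps` and the action names are hypothetical, and real hardware would step gradually through intermediate points rather than jump.

```python
def transition_steps(cur_f, cur_v, new_f, new_v):
    """Return the ordered (action, value) list for a safe F-V change.

    The lower frequency must always be in effect while the voltage is low,
    so the voltage is raised first when speeding up and lowered last when
    slowing down.
    """
    steps = []
    if new_f > cur_f:
        if new_v != cur_v:
            steps.append(("set_voltage", new_v))    # raise V first
        steps.append(("set_frequency", new_f))      # then raise F
    elif new_f < cur_f:
        steps.append(("set_frequency", new_f))      # lower F first
        if new_v != cur_v:
            steps.append(("set_voltage", new_v))    # then lower V
    return steps


print(transition_steps(500.0, 0.85, 750.0, 1.00))
print(transition_steps(750.0, 1.00, 500.0, 0.85))
```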


MCD and GALS

[Diagram labels: Globally Synchronous (single clock); Multi-Synchronous (each domain has a separate clock at the same frequency); GALS (async domains: different frequency per domain, autonomous); DVS (C-GALS) (different frequency per domain, centrally controlled); MCD]


Configuration Parameters (XScale-like)
• 320 Frequency-Voltage working points
• Freq range: 250-1000 MHz
• Voltage range: 0.65-1.20 V
  – Step between working points: 1.72 mV / 2.34 MHz
  – Change rate: 0.172 µs / step (55 µs end-to-end)
• Time step: change every 50K cycles
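
A short arithmetic check of the step sizes implied by these ranges, as a sketch. It assumes a linear F-V map and divides each range by 320, as the slide's per-step figures appear to do. The constant and function names are illustrative.

```python
N_POINTS = 320
F_MIN_MHZ, F_MAX_MHZ = 250.0, 1000.0
V_MIN, V_MAX = 0.65, 1.20
STEP_TIME_US = 0.172  # time to move one working point

f_step_mhz = (F_MAX_MHZ - F_MIN_MHZ) / N_POINTS
v_step_mv = (V_MAX - V_MIN) / N_POINTS * 1000.0
end_to_end_us = STEP_TIME_US * N_POINTS


def working_point(i):
    """Return (frequency in MHz, voltage in V) for index i, assuming a linear map."""
    if not 0 <= i <= N_POINTS:
        raise ValueError("index out of range")
    return F_MIN_MHZ + i * f_step_mhz, V_MIN + i * v_step_mv / 1000.0


print(f"step: {f_step_mhz:.2f} MHz / {v_step_mv:.2f} mV")
print(f"end-to-end: {end_to_end_us:.1f} us")
print("midpoint:", working_point(N_POINTS // 2))
```

The ranges work out to about 2.34 MHz and 1.72 mV per step, and a full sweep takes about 55 µs, so a large frequency change spans a noticeable part of a 50K-cycle time step.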


DVS per domain - Considerations
• Scaling algorithm:
  – Determine F-V point of each domain at any time
  – Temporal granularity
    • How often to change the F-V point
• Synchronization
  – Multi-Sync: all domains run at the same freq
    • Simple sync solutions exist (phase compensation)
  – GALS: different and changing frequencies
    • Asynchronous sync solution, impedes performance
    • Or think of better solutions…


Power-bounded DVS
• Given power envelope
• Mobilize energy between domains to attain max performance

[Chart: share of Energy (0% to 100%) per Time-step, by domain: Integer, Front-end, Floating-point, External Memory, Memory Domain]


Scaling Algorithm
• Input: A serial program
• Output: Parallel, temporal specification of which domains are slowed and by how much
• Temporal Granularity
  – Time-step should be short enough to be dynamic
  – Too short a time-step is ineffective due to:
    • Gradual scaling
    • Overhead of the change


Scaling Algorithms
• ‘Offline’ Algorithm
  – Full preparation on a simulator
  – Insert F-V config instructions for actual run
• ‘Online’ (Attack-Decay)
  – Done entirely in hardware
  – Rescale F-V according to internal queue levels
• Dynamic Profiling
  – Short profile run, find program phases
  – Rescale F-V on phase transitions


Offline Algorithm
• Run the program on a simulator at max speed, trace Primitive Events
  – Primitive event = work performed in a single domain on behalf of a single instruction
• Construct Directed Acyclic Graph
  – Functional and data dependencies between primitive events
  – Arcs represent time between events


Offline Algorithm Contd.

Stretch Slack

• Slack appears on non-critical paths

• Stretch events that are not on the critical time path
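
The slack idea can be illustrated with a standard critical-path computation over a dependency DAG. This is a minimal sketch, not the Rochester tool: it puts durations on the events rather than on the arcs as a simplification, and the event names and durations are made up for the example.

```python
from graphlib import TopologicalSorter


def event_slack(duration, preds):
    """Slack per event: how long it can be stretched without delaying the end.

    duration: {event: time}; preds: {event: set of events it depends on}.
    """
    order = list(TopologicalSorter(preds).static_order())
    # Forward pass: earliest start of each event.
    earliest = {}
    for e in order:
        earliest[e] = max((earliest[p] + duration[p] for p in preds.get(e, ())), default=0)
    total = max(earliest[e] + duration[e] for e in order)
    # Backward pass: latest start that still finishes by `total`.
    succs = {e: [] for e in order}
    for e, ps in preds.items():
        for p in ps:
            succs[p].append(e)
    latest = {}
    for e in reversed(order):
        latest[e] = min((latest[s] for s in succs[e]), default=total) - duration[e]
    return {e: latest[e] - earliest[e] for e in order}


duration = {"fetch": 1, "int_alu": 4, "fp_alu": 1, "retire": 1}
preds = {"fetch": set(), "int_alu": {"fetch"}, "fp_alu": {"fetch"}, "retire": {"int_alu", "fp_alu"}}
print(event_slack(duration, preds))
```

Here only `fp_alu` has slack (3 time units on a 1-unit event), so in this toy case the floating-point work could be stretched up to fourfold without lengthening the run.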


Offline Algorithm Contd.
• Now we have the desired scale-down of single primitive events
• Need to scale down domains per time-step
  – Construct Event Histograms per domain per time-step: H(domain, time-step)
  – Assign tolerable performance degradation p%
  – Determine actual scale-down per domain according to (H, p)
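
The slide does not spell out how (H, p) is turned into a frequency, so the following is only one plausible reading and not necessarily the procedure used in the original work. It assumes the histogram maps each desired frequency to the number of cycles of work that wants it, and picks the lowest frequency whose added delay stays within the fraction p of the full-speed time. All names are hypothetical.

```python
def choose_domain_frequency(hist, p, f_max=1000.0):
    """Pick a domain frequency (MHz) for one time-step.

    hist: {desired_freq_mhz: cycles of work that want that frequency}
    p:    tolerable slowdown as a fraction (0.05 means 5%)
    """
    baseline = sum(hist.values()) / f_max  # time at full speed (cycles / MHz = µs)
    for f in sorted(hist):                 # try the slowest candidate first
        # Work that wanted a higher frequency than f is slowed down by running at f.
        extra = sum(c * (1.0 / f - 1.0 / g) for g, c in hist.items() if g > f)
        if extra <= p * baseline:
            return f
    return f_max


H = {1000: 100, 500: 900}  # made-up histogram: most events can run at half speed
print(choose_domain_frequency(H, 0.05))
print(choose_domain_frequency(H, 0.15))
```

With this made-up histogram, running the domain at 500 MHz costs about 10% extra time, so it is rejected at p = 5% and accepted at p = 15%.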


Online Algorithm
• Each time step, sample input queue levels
  – Attack: if queue level up by ~2%, inc freq by 6%
  – Decay: if level unchanged, dec freq ~0.2%
• Simple, HW only, results ~70% of offline
• Watch out for perturbations, local minima, over-activism & other feedback-related pitfalls

[Figure labels: freq, queue full]
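
The attack and decay rules can be sketched as a per-time-step update. This is a software illustration of a mechanism the slide describes as hardware-only, under stated assumptions: queue level is a fraction of capacity, "unchanged" means a change smaller than the attack threshold, and a large drop leaves the frequency alone because the slide does not specify that case. The frequency limits are the 250-1000 MHz range from the configuration slide.

```python
F_MIN_MHZ, F_MAX_MHZ = 250.0, 1000.0
ATTACK_THRESHOLD = 0.02  # queue level rise (~2% of capacity) that triggers an attack
ATTACK_GAIN = 0.06       # raise frequency by 6%
DECAY_RATE = 0.002       # lower frequency by ~0.2%


def next_frequency(freq, prev_level, level):
    """Return the domain frequency for the next time step."""
    change = level - prev_level
    if change >= ATTACK_THRESHOLD:
        freq *= 1.0 + ATTACK_GAIN   # attack: queue is filling, speed up
    elif abs(change) < ATTACK_THRESHOLD:
        freq *= 1.0 - DECAY_RATE    # decay: nothing happening, drift down
    # A drop of ATTACK_THRESHOLD or more is left unhandled in this sketch.
    return min(F_MAX_MHZ, max(F_MIN_MHZ, freq))


freq = 500.0
for prev, cur in [(0.40, 0.40), (0.40, 0.45), (0.45, 0.45)]:
    freq = next_frequency(freq, prev, cur)
    print(f"{freq:.1f} MHz")
```

The asymmetry (fast attack, slow decay) keeps the domain from starving its queue while still drifting toward lower power when demand is flat.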


Dynamic Profiling
• Execution shows repeating Program Phases
  – Phase often delimited by subroutine call or loop
• Dynamic Profiling:
  – Identify phases by a short profiling run
  – Insert phase marks and F-V config into program
  – When program reaches a mark, reconfigure F-V


Results Comparison


Improved Dynamic Profiling

• Each program will carry its phase information as initial setup data
  – Assuming phase info is not processor-specific
  – Alternatively, processor-specific compilation

• Or, the processor itself will perform the profile run
  – HW-based dynamic profiling, eliminating the need for a simulation pre-run


DVS in ACCMP

• Conceptual Difference:
  – MCD Processor: sub-units run at different frequencies
  – MCP: threads run at different frequencies

• ACCMP - different size cores

• ACCMP with DVS - Cores also dynamically change frequency


DVS - Degree of Freedom
• ACCMP
  – Allocate thread to static-strength processor: L, M, or S
• ACCMP with DVS
  – Scale processor to performance needs
  – Dynamically accommodate

[Diagram: L, M, S processors on a performance axis. With DVS each processor covers a range ("Stretch-fit"); range labels: 40-50, 36-44, 32-38]


Dynamic Thread Allocation

[Chart: Power vs. Performance for Large, Medium, and Small processors]

• 3 sizes of DVS processors
• Thread “wants” performance between M & L processors
• Allocate to M only: hurts performance, but still better than static ACCMP
• Allocate to L only: wastes power
• Or migrate between both, according to performance needs
• What is best?


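Migration
• k migrations between M↔L processors
• Phases φM, φL on each of the processors

$\text{Energy} = k\,E_{mig} + \int_{T_M} Pwr_M(f_M)\,dt + \int_{T_L} Pwr_L(f_L)\,dt$

$\text{Energy} = k\,E_{mig} + \sum_{\varphi_M} Pwr_M(\varphi_M)\,T_M(\varphi_M) + \sum_{\varphi_L} Pwr_L(\varphi_L)\,T_L(\varphi_L)$

$\text{Delay} = k\,T_{mig} + \sum_{\varphi_M} T_M(\varphi_M) + \sum_{\varphi_L} T_L(\varphi_L)$

$\min_{k}\; \text{Energy} \cdot \text{Delay}$

($E_{mig}$, $T_{mig}$: energy and time cost of a single migration)

[Chart: Power vs. Performance for Large, Medium, and Small processors]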
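
A minimal sketch of evaluating the energy-delay product for a candidate migration schedule, following the per-phase sums on this slide. It assumes each phase has a constant average power, and it charges each migration a fixed energy `e_mig` and a fixed time `t_mig`. These names and all the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    power: float  # average power during the phase (W)
    time: float   # duration of the phase (s)


def energy_delay(phases_m, phases_l, k, e_mig, t_mig):
    """Return (energy, delay, energy * delay) for k migrations between M and L."""
    phases = list(phases_m) + list(phases_l)
    energy = k * e_mig + sum(p.power * p.time for p in phases)
    delay = k * t_mig + sum(p.time for p in phases)
    return energy, delay, energy * delay


# Whole thread on M, no migration.
m_only = energy_delay([Phase(1.0, 10.0)], [], k=0, e_mig=0.5, t_mig=0.25)
# Demanding part moved to L and back: two migrations.
mixed = energy_delay([Phase(1.0, 4.0)], [Phase(3.0, 2.0)], k=2, e_mig=0.5, t_mig=0.25)
print(m_only)
print(mixed)
```

In this made-up case migrating costs slightly more energy but finishes sooner, so its energy-delay product is lower. With larger migration costs or more migrations the comparison can flip, which is the trade-off that minimizing over k captures.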


The End


DVS in Multiple Core Processors

• Asymmetric Cores
  – Asymmetric-size cores suggested to better utilize die area when there are too few threads
    • But research shows symmetric cores perform better when there are enough threads
  – With DVS, a core’s performance dynamically varies according to freq.
    • Viewed in a Performance/Energy metric, this is a more flexible kind of asymmetry …
    • Also simplifies the SW decision of which thread to assign to which asymmetric core


Inter-Domain Communication
• In order to minimize synchronization penalty
  – Divide area into domains where there inherently exists a dual-port queue structure
    • Dual-port FIFO synchronization solution
  – Otherwise divide where there is minimum inter-domain communication

[Diagram: Dual-Port FIFO synchronizer between Producer Domain (wclk, wen, wdata, full) and Consumer Domain (rclk, ren, rdata, empty)]


Dual-Port FIFO

• Producer/Consumer domains can write/read independently as long as FIFO is not full or empty

• Full & Empty are the only signals that need syncing

• Therefore sync penalty incurred only when FIFO is full or empty
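
A behavioral sketch of the dual-port FIFO interface, to show why only `full` and `empty` couple the two sides. It is a plain software model: it does not model clocks, metastability, or the synchronization of the flags themselves, and the class is illustrative rather than a description of any specific hardware design.

```python
class DualPortFifo:
    """Ring buffer with a write port (producer) and a read port (consumer)."""

    def __init__(self, depth):
        self._buf = [None] * depth
        self._depth = depth
        self._wptr = 0  # advanced only by the producer
        self._rptr = 0  # advanced only by the consumer

    @property
    def full(self):
        return self._wptr - self._rptr == self._depth

    @property
    def empty(self):
        return self._wptr == self._rptr

    def write(self, data):
        """Producer side: returns False (stall) when the FIFO is full."""
        if self.full:
            return False
        self._buf[self._wptr % self._depth] = data
        self._wptr += 1
        return True

    def read(self):
        """Consumer side: returns None (stall) when the FIFO is empty."""
        if self.empty:
            return None
        data = self._buf[self._rptr % self._depth]
        self._rptr += 1
        return data


fifo = DualPortFifo(depth=2)
print(fifo.write("a"), fifo.write("b"), fifo.write("c"))  # third write stalls
print(fifo.read(), fifo.read(), fifo.read())              # third read stalls
```

Each side touches only its own pointer in the common case, so the producer and consumer interact only at the full and empty boundaries, which is where the slide places the synchronization penalty.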


Syncing Periodic Domains
– Synchronization solutions which exploit no knowledge of clock relations are sub-optimal
  • Examples: two-flop and even dual-port FIFO
– DVS: clock relations are Periodic, Dynamic, and Known
  • Predictive Synchronizer can predict when conflict will occur between different periodic clocks
    – But conflict prediction sometimes adapts slowly to freq changes
– DVS makes it possible to exploit the fact that domain frequencies are Known
  • Propose a multi-freq. synchronizer that can detect conflict by knowing at which freq. its provider and consumer run


Gradual Scaling
• Device works throughout the change
• Necessary for 2 reasons
  – Online algorithm based on steadily changing feedback control
  – ? Synchronizers can’t cope with step-change
• Using Dynamic Profiling + adequate synchronizers, can do instant scaling