Variable Word Width Computation for Low Power By Bret Victor Sayf Alalusi


Page 1: Variable Word Width Computation for Low Power

Variable Word Width Computation for Low Power

By Bret Victor and Sayf Alalusi

Page 2: Variable Word Width Computation for Low Power

Motivation

• 32 bit architecture required for most general purpose computing

• However, many applications don’t need a full 32 bit data word:

– Video: 24 bit

– Audio: 16 bit

– Text: 8 bit

– Logic: 1 bit

• How can we exploit this to save power?

Page 3: Variable Word Width Computation for Low Power

Possibilities

• Architecture that supports 32, 24, 16, 8, and 1 bit operations? Or some subset?

• Switch processor between modes, or specify width for each instruction? Global or distributed control?

• Gated clocks? Don’t drive unused outputs? Power down unused blocks?

Page 4: Variable Word Width Computation for Low Power

Implementation

• Based on MIPS architecture and ISA

• Two widths: 16 bit and 32 bit

• Width chosen on an instruction-by-instruction basis

• Flag bit in instruction word selects width

• Modified ISA:

– arithmetic: add16, add32; mul16, mul32

– logical: and16, and32

– memory: lw16, lw32; sw16, sw32

– branch compare: beq16, beq32
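As a rough illustration of per-instruction width selection, the sketch below models an ALU that reads the operand width from the mnemonic suffix. The helper name `execute`, the operator table, and the choice to truncate operands and result to the selected width are assumptions made for illustration; the slides only state that a flag bit in the instruction word selects the width.

```python
import operator

# Hypothetical operator table; the mnemonics follow the modified ISA on this slide.
OPS = {"add": operator.add, "mul": operator.mul, "and": operator.and_}

def execute(mnemonic, a, b):
    """Run e.g. 'add16' or 'and32' and mask the result to the chosen width."""
    name, width = mnemonic[:-2], int(mnemonic[-2:])
    if name not in OPS or width not in (16, 32):
        raise ValueError(f"unsupported instruction: {mnemonic}")
    mask = (1 << width) - 1
    # Assumption: operands and result are truncated to the selected width.
    return OPS[name](a & mask, b & mask) & mask
```

For example, `execute("add16", 0xFFFF, 1)` wraps around within 16 bits, while `execute("add32", 0xFFFF, 1)` carries into the high word.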

Page 5: Variable Word Width Computation for Low Power

Energy

• Energy consumption occurs when a node transitions, and is proportional to the capacitance at that node.

• Prevent nodes from transitioning unnecessarily.

• Energy savings can be calculated by adding all the capacitance that is switching.
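The accounting described above can be sketched by counting how many bus bits toggle between consecutive values and charging each toggle a fixed energy. The function names, the uniform per-node capacitance, and the use of the standard CMOS estimate of 1/2 C V^2 per transition are assumptions for illustration, not figures from the slides.

```python
def toggled_bits(prev, cur, width=32):
    """Number of bit positions (within `width`) that differ between two values."""
    mask = (1 << width) - 1
    return bin((prev ^ cur) & mask).count("1")

def switching_energy(values, width=32, cap_per_node=1.0, vdd=1.0):
    """Estimate energy for a bus that takes on `values` in sequence.

    Assumption: every node has the same capacitance and each transition
    dissipates 0.5 * C * Vdd^2.
    """
    toggles = sum(toggled_bits(p, c, width) for p, c in zip(values, values[1:]))
    return 0.5 * cap_per_node * vdd ** 2 * toggles

bus = [0x00000000, 0xFFFFFFFF]
full_width = switching_energy(bus, width=32)   # all 32 nodes toggle
low_word = switching_energy(bus, width=16)     # high word held still
```

Holding the high 16 nodes still halves the toggle count in this worst-case example, which is the effect the design tries to exploit.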

Page 6: Variable Word Width Computation for Low Power

Where We Save Energy

• Our design saves energy over a traditional processor in three main areas:

– Clock and control line energy

– HWTE (High Word Transition Energy)

– Memory control energy

• We will see these three areas as we step through the pipeline.

Page 7: Variable Word Width Computation for Low Power

Pipeline Overview

[Figure: five-stage pipeline datapath. IF: PC, +4 adder, MUX, and I$. ID: register file (srcA, srcB, dest, data, outA, outB), branch address adder and comparator, immed: 16, dest reg: 5. EX: forwarding MUXes (reg A or reg B, fwd from MEM, fwd from WB, immed) feeding the ALU. MEM: data memory (addr, wr data, rd data). WB: MUX selecting the ALU result or memory data. Data buses are 32 bits and register specifiers 5 bits; the stages are separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers.]

Page 8: Variable Word Width Computation for Low Power

IF Stage

[Figure: IF stage. PC, +4 adder, MUX selecting between PC + 4 and the 32 bit branch address, I$, and the IF/ID pipeline register; all buses are 32 bits.]

• Instruction words and addresses must be 32 bits.

• Can’t modify much.

Page 9: Variable Word Width Computation for Low Power

ID Stage

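• We can:

– gate the clocks of the pipeline register

– only drive high words out of register file if 32 bit operation

[Figure: ID stage between the IF/ID and ID/EX registers. Register file (srcA, srcB, dest, data, outA, outB) with 5 bit register specifiers and 32 bit outputs, branch address adder (PC + 4 plus branch offset), equality comparator, immed: 16, dest reg: 5.]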

Page 10: Variable Word Width Computation for Low Power

Pipeline Register (ID)

• Fit gating into clock distribution network.

• Little energy overhead and helps control skew.

• On ID stage, gating reduces clock energy by:

– 56% on 16-bit operations

– 19% on 32-bit non-immediate operations

[Figure: ID/EX pipeline register fields (reg A: low 16, reg B: low 16, reg A: high 16, reg B: high 16, destReg: 5, immed: 16) and their clocks: UngatedClock, WidthGatedClock (Clock gated by Width), and ImmedGatedClock (from instruction word).]
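The two percentages on this slide can be reproduced with a simple bit count. The sketch below assumes that clock energy is proportional to the number of latched bits and that the ID/EX register holds only the fields labeled in the figure (85 bits in total). This is a reconstruction for illustration; the slides do not show the calculation, and the dictionary keys are made-up names.

```python
# Field widths in bits, read off the ID/EX pipeline register figure.
FIELDS = {
    "regA_low": 16, "regA_high": 16,
    "regB_low": 16, "regB_high": 16,
    "dest_reg": 5, "immed": 16,
}

def clock_savings(gated_fields):
    """Fraction of register bits whose clock is gated off."""
    total = sum(FIELDS.values())
    gated = sum(FIELDS[name] for name in gated_fields)
    return gated / total

# 16-bit operation: both high words and the immediate are not latched.
savings_16 = clock_savings(["regA_high", "regB_high", "immed"])
# 32-bit non-immediate operation: only the immediate is not latched.
savings_32 = clock_savings(["immed"])
```

Under these assumptions the results round to 56% and 19%, matching the slide.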

Page 11: Variable Word Width Computation for Low Power

Register File Read Port (ID)

• Decoder selects register to drive output bus.

• We add one AND gate per register.

• Switching capacitance dominated by output bus.

• 16 bit operation takes 50% less energy than 32 bit operation...

• Not necessarily savings!

[Figure: register file read port. A DECODER selects one register; each register (Reg 0, Reg 1, ...) has separate low 16 and high 16 output drivers onto 16 bit buses, with the high-word driver additionally qualified by Width.]

Page 12: Variable Word Width Computation for Low Power

EX Stage

• Modify the ALU to perform 16 bit operations.

• Prevent the high word output of the MUXes from changing on 16 bit operations.

• Gate the clock of the pipeline register:

– Only latch high word of ALU result on 32 bit operations

– Only latch reg B on “store word” operations

[Figure: EX stage. Two forwarding MUXes (reg A, fwd from MEM, fwd from WB; reg B, fwd from MEM, fwd from WB, immed) feed the ALU over 32 bit buses; dest reg: 5 and data for SW: 32 pass through to the next pipeline register.]

Page 13: Variable Word Width Computation for Low Power

Logical Inst.’s (EX)

• Just don’t let the unused bits (high 16) transition.

• If they don’t transition, they will not drive the next stage either.

• 50% less energy

e.g. X AND Y

[Figure: one AND gate per bit pair: X0 and Y0, X1 and Y1, ..., X31 and Y31.]

Page 14: Variable Word Width Computation for Low Power

Adder (EX)

• The 4CLA blocks just get replicated for the number of bits, but the upper level CLA structure will grow with the number of bits.

• 16 bits: 58% less energy

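[Figure: 32 bit carry-lookahead adder built from eight 4 bit blocks (bits 0..3, 4..7, 8..11, 12..15, 16..19, 20..23, 24..27, 28..31) under an Upper Level CLA Generation block; inputs A0 B0 ... An Bn, outputs S0 ... Sn.]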

Page 15: Variable Word Width Computation for Low Power

Multiplier (EX)

• Multiply complexity grows as N^2, so a 16 bit multiply takes 77% less energy.

• Even if upper 16 bits = 0, a 32 bit multiply does 16 extra shifts.

32 bit multiply: 32 x 32 bit adds, 32 x 32 bit reg. writes, 32 shifts, in 32 cycles

vs.

16 bit multiply: 16 x 16 bit adds, 16 x 16 bit reg. writes, 16 shifts, in 16 cycles
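The cycle counts above correspond to a shift-and-add multiplier that always runs one cycle per bit of width. The sketch below is a minimal model of that behavior; the function names are hypothetical, and the N^2 work estimate ignores everything except adder bit operations.

```python
def shift_add_multiply(a, b, width):
    """Multiply two width-bit values, one shift per cycle."""
    mask = (1 << width) - 1
    a, b = a & mask, b & mask
    product = 0
    cycles = 0
    for bit in range(width):       # always runs `width` cycles, even if high bits are 0
        if (b >> bit) & 1:
            product += a << bit    # conditional add of the shifted multiplicand
        cycles += 1
    return product, cycles

def relative_adder_work(width):
    """Adder bit operations: `width` adds of `width` bits each."""
    return width * width

ratio = relative_adder_work(16) / relative_adder_work(32)
```

A pure N^2 estimate gives a 16 bit multiply one quarter of the work, a 75% reduction, which is close to the 77% quoted on the slide; the remaining difference presumably comes from the authors' more detailed energy model.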

Page 16: Variable Word Width Computation for Low Power

HWTE

• Two types of data in 16 bit application:

– Computational data (16-bit): high word = 0

– Pointers and addresses (32-bit): high word = C

• Assume C “mostly constant” (memory accesses mostly in 64K block)

• Traditional processor only consumes more datapath energy than our processor when transitioning between these data types.

• HWTE = High Word Transition Energy

Page 17: Variable Word Width Computation for Low Power

HWTE

• With such a model, our processor effectively only executes “16 bit operations”.

• Traditional processor executes “32 bit operations” only when transitioning between data types.

• E32 = energy of 32 bit operation

• E16 = energy of 16 bit operation

• N = average number of consecutive instructions that use the same data type

• HWTE = ( E32 - E16 ) / N
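A small sketch of how the HWTE formula might be evaluated from an instruction trace. The trace encoding ("D" for computational data, "P" for pointers) and both function names are assumptions for illustration; N is taken as the mean length of runs of consecutive instructions that use the same data type.

```python
from itertools import groupby

def average_run_length(data_types):
    """Mean length of runs of identical consecutive entries (the N in the formula)."""
    runs = [len(list(group)) for _, group in groupby(data_types)]
    if not runs:
        raise ValueError("trace must not be empty")
    return sum(runs) / len(runs)

def hwte(e32, e16, n):
    """High Word Transition Energy per instruction: (E32 - E16) / N."""
    return (e32 - e16) / n

trace = "DDDDPPDD"             # runs of length 4, 2, 2
n = average_run_length(trace)  # 8 / 3
extra = hwte(1.0, 0.5, n)      # energies in arbitrary units
```

The longer the runs of same-typed data, the smaller the per-instruction energy that a traditional processor spends on high-word transitions.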

Page 18: Variable Word Width Computation for Low Power

Barrel Shifter (EX)

• Big win will come from not driving the control lines to the upper 16 bits.

• Save about 50% in energy

[Figure: 4 bit barrel shifter with control lines SH0 to SH3 and data signals A0 to A3 and B0 to B3.]

Page 19: Variable Word Width Computation for Low Power

MEM Stage

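• This is a big, regular memory (SRAM) structure that can easily be segmented into blocks.

– Exploit this fact

[Figure: MEM stage. Data memory with addr, wr data, and rd data ports (32 bit buses); dest reg: 5 and ALU result: 32 pass through to the next pipeline register.]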

Page 20: Variable Word Width Computation for Low Power

DCache (MEM)

• 2-way set associative, write-back

• Blocks are 2 x 32b or 4 x 16b, i.e. the 16b data values are aligned on 16b boundaries, 32b on 32b.

[Figure: cache array in which Width and Block # select the word line. Only drive the word line that you need!]

Page 21: Variable Word Width Computation for Low Power

DCache (MEM)

• Only drive the word lines that are needed.

– Need a little bit of logic to figure out what the correct lines are, but large capacitance of WL dominates.

• Block size is larger for 16 bit values, better exploits spatial locality

• Associativity does not change from 16 bit to 32 bit word lengths

• Energy savings: 50% – Control Line Savings, no HWTE!
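The "little bit of logic" for choosing word lines might look like the sketch below. It assumes byte addressing, an 8 byte block (2 x 32b or 4 x 16b), and one separately driven word line segment per 16 bit slot; none of these details are spelled out in the slides, and `segments_to_drive` is a hypothetical name.

```python
BLOCK_BYTES = 8  # one block holds 2 x 32b or 4 x 16b values

def segments_to_drive(addr, width):
    """Return the 16-bit segment indices (0..3) that an access must activate."""
    offset = addr % BLOCK_BYTES
    if width == 16:
        if offset % 2:
            raise ValueError("16b values must be aligned on 16b boundaries")
        return [offset // 2]
    if width == 32:
        if offset % 4:
            raise ValueError("32b values must be aligned on 32b boundaries")
        first = offset // 2
        return [first, first + 1]
    raise ValueError("width must be 16 or 32")
```

A 16 bit access activates one segment and a 32 bit access activates two, which is consistent with the 50% figure quoted above.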

Page 22: Variable Word Width Computation for Low Power

WB Stage

• On a 16 bit operation, we can:

– Only drive the low word out of the MUX

  • Capacitive load on register write port is large

  • Driving 16 bits out of the MUX consumes 50% less energy than driving 32 bits... HWTE formula applies.

– Only latch the low word into the register?

[Figure: WB stage. After the MEM/WB register, a MUX selects between Mem data: 32 and ALU result: 32 and drives the register file data input; Dest reg: 5 drives the dest input. Register file ports: srcA, srcB, dest, data, outA, outB.]

Page 23: Variable Word Width Computation for Low Power

Reg. File Write Port (WB)

• We can add one AND gate for each register.

• But 16 bit write uses same amount of clock energy as 32 bit write without modifications.

• Little savings from not writing into the register, because the high word would not change in a 16 bit application.

• Not worth it!

[Figure: register file write port. A DECODER generates a Write enable per register; each register (Reg 0, Reg 1, ...) is split into low 16 and high 16 latches (D, C) fed by 16 bit buses, with the low word clocked by Write and the high word by HiWrite, where HiWrite combines Write and Width.]

Page 24: Variable Word Width Computation for Low Power

Summary

• Typical power distribution in core (non-memory), each share multiplied by the fraction of energy our design still uses:

– ALU: 34% x 66%

– I-decode: 23% x 100%

– Register file: 13% x 66%

– Clock: 10% x 50%

– Shifter: 11% x 50%

– Pipeline: 9% x 74%

• Core energy reduced by 29%.

Page 25: Variable Word Width Computation for Low Power

Summary

• Typical power distribution in memory:

– Instruction cache: 60% x 100%

– Data cache: 40% x 50%

• Cache energy reduced by 20%.

• Total processor power consumption:

– Cache: 66% x 80%

– Core: 33% x 71%

• Total energy reduced by 24% when executing a 16 bit application.
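The summary figures can be checked by treating each line as (share of power) x (fraction of energy remaining) and summing. The sketch below uses the numbers exactly as they appear on the two summary slides; the function name is hypothetical.

```python
def remaining_fraction(parts):
    """Sum of share * remaining over (share, remaining) pairs."""
    return sum(share * remaining for share, remaining in parts)

# Core (non-memory): ALU, I-decode, register file, clock, shifter, pipeline.
core = remaining_fraction([
    (0.34, 0.66), (0.23, 1.00), (0.13, 0.66),
    (0.10, 0.50), (0.11, 0.50), (0.09, 0.74),
])
# Memory: instruction cache, data cache.
cache = remaining_fraction([(0.60, 1.00), (0.40, 0.50)])
# Whole processor, using the rounded 80% and 71% figures from the slide.
total = remaining_fraction([(0.66, 0.80), (0.33, 0.71)])

core_saving = round((1 - core) * 100)
cache_saving = round((1 - cache) * 100)
total_saving = round((1 - total) * 100)
```

The weighted sums come out to roughly 0.71, 0.80, and 0.76, which round to the 29%, 20%, and 24% reductions stated on the slides.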

Page 26: Variable Word Width Computation for Low Power

Conclusions

• Primary drawback is modification of ISA.

• Energy savings are reasonable.

• Our modifications are fairly easy to implement, and can be fit into existing processor designs with minimal area increase.

Page 27: Variable Word Width Computation for Low Power

Where do we go from here?

• More accurate capacitance models and SPICE simulation

• More accurate models of instruction mix