Variable Word Width Computation for Low Power

By Bret Victor and Sayf Alalusi

Motivation

• 32 bit architecture required for most general purpose computing

• However, many applications don’t need a full 32 bit data word:

– Video: 24 bit

– Audio: 16 bit

– Text: 8 bit

– Logic: 1 bit

• How can we exploit this to save power?

Possibilities

• Architecture that supports 32, 24, 16, 8, and 1 bit operations? Or some subset?

• Switch processor between modes, or specify width for each instruction? Global or distributed control?

• Gated clocks? Don’t drive unused outputs? Power down unused blocks?

Implementation

• Based on MIPS architecture and ISA

• Two widths: 16 bit and 32 bit

• Width chosen on instruction-by-instruction basis.

• Flag bit in instruction word selects width

• Modified ISA:

– arithmetic: add16, add32; mul16, mul32

– logical: and16, and32

– memory: lw16, lw32; sw16, sw32

– branch compare: beq16, beq32

Energy

• Energy consumption occurs when a node transitions, and is proportional to the capacitance at that node.

• Prevent nodes from transitioning unnecessarily.

• Energy savings can be calculated by adding all the capacitance that is switching.
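
As a rough illustration of that bookkeeping, the sketch below sums the capacitance of every node whose value changes between two cycles and converts it to energy using the standard cost of 1/2 C V^2 per transition. The node names, capacitance values and supply voltage are made-up placeholders, not figures from this design.

```python
def switched_capacitance(prev, curr, cap):
    """Total capacitance (farads) of the nodes whose logic value changed."""
    return sum(cap[node] for node in cap if prev[node] != curr[node])


def transition_energy(prev, curr, cap, vdd):
    """Energy dissipated by those transitions at 1/2 * C * Vdd^2 each."""
    return 0.5 * switched_capacitance(prev, curr, cap) * vdd ** 2


# Hypothetical four-node example: capacitance in farads, supply in volts.
CAP = {"bus_lo": 2e-12, "bus_hi": 2e-12, "ctl": 1e-12, "clk": 3e-12}
PREV = {"bus_lo": 0, "bus_hi": 0, "ctl": 0, "clk": 0}
CURR = {"bus_lo": 1, "bus_hi": 0, "ctl": 1, "clk": 1}

# bus_hi does not transition, so its capacitance costs nothing this cycle.
energy = transition_energy(PREV, CURR, CAP, vdd=2.5)
```

Holding "bus_hi" still is the whole trick: a node that does not toggle contributes no capacitance to the sum.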

Where We Save Energy

• Our design saves energy over a traditional processor in three main areas:

– Clock and control line energy

– HWTE (High Word Transition Energy)

– Memory control energy

• We will see these three areas as we step through the pipeline.

Pipeline Overview

[Figure: five-stage pipeline datapath. IF: PC, +4 adder, MUX and I$. ID: register file (srcA, srcB, dest, data, outA, outB), branch address adder, equality comparator, 16-bit immediate and 5-bit dest reg. EX: two MUXes with forwarding from MEM and WB feeding the ALU. MEM: data memory with addr, wr data and rd data. WB: result MUX. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB. Data and address paths are 32 bits wide.]

IF Stage

[Figure: IF stage. PC, +4 adder, MUX selecting between PC + 4 and the 32-bit branch address, I$, and the IF/ID register.]

• Instruction words and addresses must be 32 bits.

• Can’t modify much.

ID Stage

• We can:

– gate the clocks of the pipeline register

– only drive high words out of register file if 32 bit operation

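[Figure: ID stage. Register file (srcA, srcB, dest, data, outA, outB) with 5-bit register selects and 32-bit outputs, branch address adder (PC + 4 plus branch offset), equality comparator, 16-bit immediate and 5-bit dest reg, between the IF/ID and ID/EX registers.]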

Pipeline Register (ID)

• Fit gating into clock distribution network.

• Little energy overhead and helps control skew.

• On ID stage, gating reduces clock energy by:

– 56% on 16-bit operations

– 19% on 32-bit non-immediate operations

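[Figure: ID/EX pipeline register built from D flip-flops. Fields: reg A low 16, reg B low 16, reg A high 16, reg B high 16, destReg (5 bits), immed (16 bits). Clocks: UngatedClock, WidthGatedClock (Clock gated by Width) for the high words, and ImmedGatedClock (gated from the instruction word) for immed.]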
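
The two percentages are consistent with a simple bit count, assuming clock energy is proportional to the number of flip-flops clocked and that the gated register holds only the fields drawn on this slide. The sketch below reproduces them under that assumption. Which fields are gated for each instruction class is inferred from the figure rather than stated on the slide.

```python
# Field widths of the ID/EX pipeline register, as drawn on the slide.
FIELDS = {
    "regA_lo": 16, "regA_hi": 16,
    "regB_lo": 16, "regB_hi": 16,
    "dest_reg": 5, "immed": 16,
}
TOTAL_BITS = sum(FIELDS.values())  # 85 clocked bits


def clock_saving(gated_fields):
    """Fraction of clocked bits that are gated off (assumed ~ clock energy)."""
    return sum(FIELDS[name] for name in gated_fields) / TOTAL_BITS


# Assumed: a 16-bit non-immediate operation gates both high words and immed.
saving_16 = clock_saving(["regA_hi", "regB_hi", "immed"])
# Assumed: a 32-bit non-immediate operation gates only immed.
saving_32 = clock_saving(["immed"])

print(f"{saving_16:.0%} {saving_32:.0%}")
```

Under these assumptions 48 of 85 bits are gated on a 16-bit operation and 16 of 85 on a 32-bit non-immediate operation.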

Register File Read Port (ID)

• Decoder selects register to drive output bus.

• We add one AND gate per register.

• Switching capacitance dominated by output bus.

• 16 bit operation takes 50% less energy than 32 bit operation....

• Not necessarily savings!

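[Figure: register file read port. For each register (Reg 0, Reg 1, ...), the DECODER output enables the low 16 bits onto the output bus, and an AND with Width enables the high 16 bits; each half drives a 16-bit section of the bus.]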

EX Stage

• Modify the ALU to perform 16 bit operations.

• Prevent the high word output of the MUXes from changing on 16 bit operations.

• Gate the clock of the pipeline register:

– Only latch high word of ALU result on 32 bit operations

– Only latch reg B on “store word” operations

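[Figure: EX stage. One MUX selects among reg A, fwd from MEM and fwd from WB; a second selects among reg B, fwd from MEM, fwd from WB and immed. Both feed the 32-bit ALU. The EX/MEM register latches the 32-bit ALU result, the 5-bit dest reg and the 32-bit data for SW.]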

Logical Instructions (EX)

• Just don’t let the unused bits (high 16) transition.

• If they don’t transition, they will not drive the next stage either.

• 50% less energy

[Figure: e.g. X AND Y, with one independent gate per bit position, from X0, Y0 through X31, Y31.]

Adder (EX)

• The 4-bit CLA blocks are simply replicated as the bit count grows, but the upper-level CLA structure grows with the number of bits.

• 16 bits: 58% less energy

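[Figure: 32-bit adder built from eight 4-bit CLA blocks (bits 0..3, 4..7, 8..11, 12..15, 16..19, 20..23, 24..27, 28..31) with inputs A0 B0 … An Bn and outputs S0 … Sn, connected to an upper-level CLA generation block.]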
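
A functional sketch of that block structure, assuming the usual generate/propagate formulation of a carry-lookahead adder. The function names are illustrative, and the loops compute carries sequentially in software, so this models the logic rather than the timing or the energy of the circuit.

```python
def group_signals(a4, b4):
    """Per-bit and group generate/propagate for one 4-bit CLA block."""
    g = [(a4 >> i) & (b4 >> i) & 1 for i in range(4)]
    p = [((a4 >> i) ^ (b4 >> i)) & 1 for i in range(4)]
    group_g = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
               | (p[3] & p[2] & p[1] & g[0]))
    group_p = p[3] & p[2] & p[1] & p[0]
    return g, p, group_g, group_p


def cla_add(a, b, width):
    """Add two width-bit values with replicated 4-bit blocks.

    Returns (sum modulo 2**width, number of 4-bit blocks used).
    """
    if width % 4 != 0:
        raise ValueError("width must be a multiple of 4")
    n_blocks = width // 4
    blocks = [group_signals((a >> (4 * k)) & 0xF, (b >> (4 * k)) & 0xF)
              for k in range(n_blocks)]
    carry = 0   # carry into the current block, produced by the upper level
    result = 0
    for k, (g, p, group_g, group_p) in enumerate(blocks):
        c = carry
        for i in range(4):
            result |= (p[i] ^ c) << (4 * k + i)   # sum bit = propagate XOR carry
            c = g[i] | (p[i] & c)
        # Upper level: only the group signals are needed to reach the next block.
        carry = group_g | (group_p & carry)
    return result, n_blocks


sum16, blocks16 = cla_add(0x1234, 0x0FF0, 16)   # uses 4 blocks
sum32, blocks32 = cla_add(0x1234, 0x0FF0, 32)   # uses 8 blocks
```

A 16-bit add exercises half of the replicated blocks and a smaller upper level, which is the direction of the saving quoted above; the sketch does not attempt to reproduce the 58% figure.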

Multiplier (EX)

• Multiply complexity grows as N^2, so a 16 bit multiply takes 77% less energy.

• Even if upper 16 bits = 0, a 32 bit multiply does 16 extra shifts.

32 bit multiply: 32 x 32-bit adds, 32 x 32-bit reg. writes, 32 shifts, in 32 cycles

vs.

16 bit multiply: 16 x 16-bit adds, 16 x 16-bit reg. writes, 16 shifts, in 16 cycles
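
The comparison above can be sketched as a shift-and-add multiplier that performs one step per multiplier bit. This is an illustrative model, assuming an unsigned multiply that does the add, register write and shift on every cycle regardless of the operand values; the operation counter is a crude stand-in for energy, not the authors' capacitance model.

```python
def shift_add_multiply(a, b, width):
    """Unsigned shift-and-add multiply, one multiplier bit per cycle.

    Returns (product, cycles, bit_adds); each cycle is charged one
    width-bit add, counted as `width` single-bit adds.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    product = 0
    cycles = 0
    bit_adds = 0
    for i in range(width):
        # The datapath is exercised every cycle, even if this bit of b is 0.
        if (b >> i) & 1:
            product += a << i
        cycles += 1
        bit_adds += width
    return product, cycles, bit_adds


p32, cycles32, adds32 = shift_add_multiply(300, 200, 32)
p16, cycles16, adds16 = shift_add_multiply(300, 200, 16)
ratio = adds16 / adds32
```

Both calls return the same product, but the 16-bit version takes 16 cycles instead of 32 and a quarter of the bit-level adds. That plain N^2 count gives a 75% reduction, close to the 77% quoted above; the slides do not say where the remaining difference comes from.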

HWTE

• Two types of data in 16 bit application:

– Computational data (16-bit): high word = 0

– Pointers and addresses (32-bit): high word = C

• Assume C “mostly constant” (memory accesses mostly in 64K block)

• Traditional processor only consumes more datapath energy than our processor when transitioning between these data types.

• HWTE = High Word Transition Energy

HWTE

• With such a model, our processor effectively only executes “16 bit operations”.

• Traditional processor executes “32 bit operations” only when transitioning between data types.

• E32 = energy of 32 bit operation

• E16 = energy of 16 bit operation

• N = average number of consecutive instructions that use the same data type

• HWTE = ( E32 - E16 ) / N
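
A minimal sketch of the formula, to show how quickly the traditional processor's penalty shrinks as runs of same-type instructions get longer. The energy values are arbitrary placeholder units, not measurements.

```python
def hwte(e32, e16, n):
    """High Word Transition Energy: extra energy per instruction that a
    traditional 32-bit datapath spends, averaged over runs of n
    consecutive instructions using the same data type."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (e32 - e16) / n


# Hypothetical energies in arbitrary units.
E32, E16 = 1.0, 0.5
for n in (1, 4, 16):
    print(n, hwte(E32, E16, n))
```

With N = 1 the full difference E32 - E16 is saved on every instruction; with long runs the saving per instruction approaches zero.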

Barrel Shifter (EX)

• Big win will come from not driving the control lines to the upper 16 bits.

• Save about 50% in energy

[Figure: 4-bit barrel shifter array with shift control lines SH0, SH1, SH2, SH3 and data lines A0 to A3 and B0 to B3.]

MEM Stage

• This is a big, regular memory (SRAM) structure that can easily be segmented into blocks.

– Exploit this fact

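[Figure: MEM stage. Data memory with addr, wr data and rd data ports (32 bits each), addressed by the ALU result; the MEM/WB register latches the 32-bit read data, the 32-bit ALU result and the 5-bit dest reg.]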

DCache (MEM)

• 2-way set associative, write-back

• Blocks are 2 x 32b or 4 x 16b, i.e. 16b data values are aligned on 16b boundaries and 32b values on 32b boundaries.

[Figure: cache array whose word lines are selected by Width and Block #. Only drive the word line that you need!]

DCache (MEM)

• Only drive the word lines that are needed.

– Need a little bit of logic to figure out what the correct lines are, but large capacitance of WL dominates.

• Block size is larger for 16 bit values, better exploits spatial locality

• Associativity does not change from 16 bit to 32 bit word lengths

• Energy savings: 50%. These are control line savings, so the HWTE formula does not apply!
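
The "little bit of logic" might look like the sketch below. It assumes an 8-byte block (2 x 32b or 4 x 16b, as stated earlier) whose word lines are split into four 16-bit segments; the segment numbering and function name are hypothetical and not taken from the design.

```python
BLOCK_BYTES = 8      # 2 x 32b or 4 x 16b per block
SEGMENT_BYTES = 2    # assumed: one word line segment per 16 bits


def segments_to_drive(addr, width):
    """Indices (0..3) of the 16-bit segments whose word lines are driven."""
    if width not in (16, 32):
        raise ValueError("width must be 16 or 32")
    access_bytes = width // 8
    offset = addr % BLOCK_BYTES
    if offset % access_bytes != 0:
        raise ValueError("access must be aligned to its own width")
    first = offset // SEGMENT_BYTES
    return list(range(first, first + access_bytes // SEGMENT_BYTES))


lw16_segments = segments_to_drive(0x1006, 16)   # one segment
lw32_segments = segments_to_drive(0x1004, 32)   # two segments
```

A 16-bit access drives one segment where a 32-bit access drives two, which matches the 50% figure if word line capacitance dominates.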

WB Stage

• On a 16 bit operation, we can:

– Only drive the low word out of the MUX

• Capacitive load on register write port is large

• Driving 16 bits out of the MUX consumes 50% less energy than driving 32 bits… HWTE formula applies.

– Only latch the low word into the register?

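[Figure: WB stage. A MUX selects between the 32-bit Mem data and the 32-bit ALU result held in the MEM/WB register and drives the register file write port (dest, data); Dest reg is 5 bits.]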

Reg. File Write Port (WB)

• We can add one AND gate for each register.

• But 16 bit write uses same amount of clock energy as 32 bit write without modifications.

• Little savings from not writing into the register, because the high word would not change in a 16 bit application.

• Not worth it!

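[Figure: register file write port. For each register (Reg 0, Reg 1, ...), the DECODER output is combined with the Write and HiWrite signals to clock the low 16 and high 16 bit halves separately; HiWrite is derived from Write and Width. Data enters in 16-bit halves.]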

Summary

• Typical power distribution in core (non-memory), given as share of core power x energy relative to a 32 bit design:

– ALU: 34% x 66%

– I-decode: 23% x 100%

– Register file: 13% x 66%

– Clock: 10% x 50%

– Shifter: 11% x 50%

– Pipeline: 9% x 74%

• Core energy reduced by 29%.
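
The 29% figure follows from a weighted sum of the list above. The short check below treats each entry as (share of core power, energy relative to a 32 bit design), which is the reading of the "x" notation assumed here.

```python
# (share of core power, relative energy on a 16 bit application), from the slide.
CORE = {
    "ALU": (0.34, 0.66),
    "I-decode": (0.23, 1.00),
    "Register file": (0.13, 0.66),
    "Clock": (0.10, 0.50),
    "Shifter": (0.11, 0.50),
    "Pipeline": (0.09, 0.74),
}

core_remaining = sum(share * scale for share, scale in CORE.values())
core_reduction = 1.0 - core_remaining

print(f"remaining {core_remaining:.0%}, reduced by {core_reduction:.0%}")
```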

Summary

• Typical power distribution in memory:

– Instruction cache: 60% x 100%

– Data cache: 40% x 50%

• Cache energy reduced by 20%.

• Total processor power consumption:

– Cache: 66% x 80%

– Core: 33% x 71%

• Total energy reduced by 24% when executing a 16 bit application.
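
The cache and whole-processor figures can be checked the same way. The sketch uses the slide's rounded numbers as given, including the 66% / 33% split, which sums to 99%.

```python
def weighted(parts):
    """Sum of share * relative energy over (share, relative_energy) pairs."""
    return sum(share * scale for share, scale in parts)


# Cache: instruction cache unchanged, data cache energy halved.
cache_remaining = weighted([(0.60, 1.00), (0.40, 0.50)])

# Whole processor: cache at its remaining fraction, core at the 71% quoted above.
total_remaining = weighted([(0.66, cache_remaining), (0.33, 0.71)])

print(f"cache reduced by {1 - cache_remaining:.0%}, "
      f"total reduced by {1 - total_remaining:.0%}")
```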

Conclusions

• Primary drawback is modification of ISA.

• Energy savings are reasonable.

• Our modifications are fairly easy to implement, and can be fit into existing processor designs with minimal area increase.

Where do we go from here?

• More accurate capacitance models and SPICE simulation

• More accurate models of instruction mix
