Warped-Compression: Enabling Power Efficient GPUs through Register Compression Sangpil Lee, Keunsoo Kim, Won Woo Ro (Yonsei University*) Gunjae Koo, Hyeran Jeon, Murali Annavaram (USC) (*Work done while visiting USC)

Warped-Compression - University of Southern California


Page 1: Warped-Compression - University of Southern California

Warped-Compression: Enabling Power Efficient GPUs through Register Compression

Sangpil Lee, Keunsoo Kim, Won Woo Ro (Yonsei University*)

Gunjae Koo, Hyeran Jeon, Murali Annavaram (USC)

(*Work done while visiting USC)

Page 2: Warped-Compression - University of Southern California

Short Summary

Target: Register File on GPUs

Problem: Energy Consumption of Register File

Solution: Data Compression on Register File

Results: 25% Reduction in Register File Energy Consumption

Page 3: Warped-Compression - University of Southern California

Motivation: Register Power Consumption

▪ GPUs Need Large Register Files to Maximize TLP

▪ Register File Contributes a Significant Portion of the Total GPU Chip Power

▪ Register File Size Has Been Growing

Register file size by GPU generation (figure data):

Tesla (G80/G92): 512KB
Tesla (GT200): 1920KB
Fermi (GF110): 2048KB
Kepler (GK110): 3840KB
Maxwell (GM200): 6144KB

[Figure: Estimated GeForce GTX480 (Fermi) Component Power Consumption*]

*Leng et al., “GPUWattch: Enabling Energy Optimizations in GPGPUs”

Page 4: Warped-Compression - University of Southern California

Motivation: GPU Register Characteristics

▪ Warp: A Bundle of 32 Threads

▪ Operands of a Warp: A Bundle of 32 Thread Registers

• This bundle of registers is treated as a single instruction operand in GPUs

[Figure: the warp instruction add.u32 %r0, %r1, %r6 has dst = %r0, src1 = %r1, src2 = %r6; each warp operand holds one 32-bit register per thread T0 ... T31, i.e. 32-bit Registers X 32 (128-byte).]

Page 5: Warped-Compression - University of Southern California

Baseline Register File

▪ Multi-banked Register File*

• 4KB per bank, 32 banks

• 128-bit wide single read/write port – Provides 4 thread operands per bank

• Access 8 banks for collecting a warp operand

[Figure: Bank0 ... Bank7, each a 4KB Bank (128-bit Wide), each deliver 16 bytes through the Bank Arbiter to the Operand Collector Buffer (32-bit X 32).]

*Gebhart et al., “Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors”

Page 6: Warped-Compression - University of Southern California

Register File Access Energy

▪ Accessing Warp Operand Registers Activates Multiple Banks

• Bank access energy + wire energy

4KB SRAM Access Energy (1): 7pJ
128-bit Wire Energy (2): 9.6pJ/mm
Access Energy / Warp Operand: (7 + 9.6) * 8 = 132.8pJ

(1) CACTI (1.0V, 45nm)
(2) Gebhart et al., “Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors” (1.0V, 40nm)

[Figure: Bank0 ... Bank7 each deliver 16 bytes over a 1mm wire through the Bank Arbiter to the Operand Collector Buffer (32-bit X 32).]

Register File Access is Power Hungry!

How Can We Reduce Register File Access Energy?
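The per-operand energy on this slide can be reproduced with a few lines of arithmetic. The following is a minimal Python sketch, assuming each accessed bank drives 1 mm of 128-bit wire (the length labeled in the figure); the names `banks_needed` and `warp_operand_energy_pj` are hypothetical helpers, not part of the original work.

```python
import math

BANK_ACCESS_PJ = 7.0    # 4KB SRAM bank access energy (slide value)
WIRE_PJ_PER_MM = 9.6    # 128-bit wire energy per mm (slide value)
WIRE_LENGTH_MM = 1.0    # assumption: 1 mm of wire per accessed bank
BANK_WIDTH_BYTES = 16   # 128-bit port delivers 16 bytes per access


def banks_needed(operand_bytes):
    """Number of 16-byte bank accesses needed to move an operand."""
    return math.ceil(operand_bytes / BANK_WIDTH_BYTES)


def warp_operand_energy_pj(operand_bytes=128):
    """Bank plus wire energy for one warp-operand access, in pJ."""
    per_bank = BANK_ACCESS_PJ + WIRE_PJ_PER_MM * WIRE_LENGTH_MM
    return per_bank * banks_needed(operand_bytes)


if __name__ == "__main__":
    # An uncompressed 128-byte warp operand touches 8 banks.
    print(banks_needed(128), round(warp_operand_energy_pj(128), 1))
```

Under these assumptions, energy scales linearly with the number of banks touched, which is why shrinking the stored size of a warp operand is attractive.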

Page 7: Warped-Compression - University of Southern California

Opportunity: Similarity of Register Values

▪ Value Similarity is Frequently Observed in a Warp Operand

• Constant Value: all thread registers in a warp have the same value

  src: 1 1 1 1 … 1 1 1 1 (T0 T1 T2 T3 … T28 T29 T30 T31)

• Index Values: all thread registers have incremental values

  src: 0 1 2 3 … 28 29 30 31 (T0 T1 T2 T3 … T28 T29 T30 T31)

• Low Dynamic Range: values of all thread registers are bounded in a limited range

  src: 127 156 156 157 … 172 173 168 162 (T0 T1 T2 T3 … T28 T29 T30 T31)

  Dynamic Range: 46 (min=127, max=173)
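The three similarity patterns can be told apart with simple checks over the per-thread values. This is an illustrative Python sketch only: the function name `classify_warp_operand` and the `max_range` threshold are assumptions made for the example, not definitions from the slides.

```python
def classify_warp_operand(values, max_range=255):
    """Label per-thread register values with a similarity class.

    max_range is a hypothetical cutoff for "low dynamic range".
    """
    # Constant value: every thread holds the same value.
    if all(v == values[0] for v in values):
        return "constant"
    # Index values: consecutive threads differ by a fixed step.
    step = values[1] - values[0]
    if all(b - a == step for a, b in zip(values, values[1:])):
        return "index"
    # Low dynamic range: all values fall inside a narrow window.
    if max(values) - min(values) <= max_range:
        return "low dynamic range"
    return "random"
```

For the slide's third example, `max(values) - min(values)` is 173 - 127 = 46, so the operand is labeled as low dynamic range under this threshold.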

Page 8: Warped-Compression - University of Southern California

Source of Value Similarity: pathfinder*

__global__ void pathfinder_kernel(int iteration, ...) {
    ...
    int tx = threadIdx.x;
    int bx = blockIdx.x;
    int small_block_cols = BLOCKSIZE-iteration*HALO*2;
    int blkX = small_block_cols*bx-border;
    int xidx = blkX+tx;
    ...
    for (int i=0; i<iteration ; i++){
        computed = false;
        if( IN_RANGE(tx, i+1, BLOCKSIZE-i-2) && isValid){
            computed = true;
            int left = prev[W];
            int up = prev[tx];
            int right = prev[E];
            int shortest = MIN(left, up);
            shortest = MIN(shortest, right);
            int index = cols*(startStep+i)+xidx;
            result[tx] = shortest + wall[index];
            ...
        }
    }
    ...
}

*from Rodinia Benchmark Suite

[Slide annotations mark the sources of Constant Values, Index Values, and Low Dynamic Range in the kernel: Thread Index (0 ~ 1023), Thread Block Index (0 ~ 65535), Application Input Data (0 ~ 9).]

Page 9: Warped-Compression - University of Southern California

How Much is This Opportunity?

▪ On Average, 70% of Thread Registers are Not Random

• Zero: neighboring registers have the same value

• 128 bin: neighboring registers differ by at most |128|

• 32K bin: neighboring registers differ by at most |2^15|

[Figure: Arithmetic Distance Distribution (y-axis 0 to 1); legend: Zero, 128 bin, 32K bin, Random.]
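The binning described on this slide can be sketched as follows. This is a hedged Python illustration: it assumes "neighboring registers" means the registers of adjacent threads within one warp operand, and the names `distance_bin` and `distance_distribution` are hypothetical.

```python
from collections import Counter

BINS = ("zero", "128 bin", "32K bin", "random")


def distance_bin(a, b):
    """Bin the arithmetic distance between two neighboring registers."""
    d = abs(b - a)
    if d == 0:
        return "zero"
    if d <= 128:
        return "128 bin"
    if d <= 2 ** 15:
        return "32K bin"
    return "random"


def distance_distribution(values):
    """Fraction of neighboring pairs per bin (needs at least two values)."""
    pairs = list(zip(values, values[1:]))
    counts = Counter(distance_bin(a, b) for a, b in pairs)
    return {name: counts[name] / len(pairs) for name in BINS}
```

Running `distance_distribution` over every warp operand written by a kernel and averaging would give a breakdown of the kind plotted in the figure, under the stated assumption.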

Page 10: Warped-Compression - University of Southern California

Exploiting Value Similarity for Register Compression

Page 11: Warped-Compression - University of Southern California

Register Compression

[Figure: the Compressor sits on the Writeback path (32-bit X 32) in front of Bank0 ... Bank7. A warp operand that is 50% Compressed is stored as four 16-byte compressed entries (Comp 16B), so Only 50% of RF & Wire are Active on an access. Decompression behind the Bank Arbiter restores the Warp Operand (32-bit X 32).]

Page 12: Warped-Compression - University of Southern California

But Is It Practical?

▪ Energy Consumption

• Compression & Decompression consume extra energy

▪ Register File Access Latency

• Compression & Decompression increase register file access latency

▪ Requirements for Register Compression

• Low Latency Compression

• Low Energy Compression

• High Compression Ratio

Page 13: Warped-Compression - University of Southern California

Low Latency/Energy Compression

▪ Base-Delta-Immediate (BΔI) Compression

• Optimized for zero and similar value compression

• Use “base” and “delta” to represent original value

Example: BΔI Compression Data Representation (Base-4, Delta-1) of a Warp Operand (32 Thread Registers)

Original Data (128-byte): 100,000,000 100,000,001 100,000,002 … 100,000,031 (4-byte each)

Compressed Data (35-byte): Base Value 100,000,000 (4-byte) + 31 Delta Values 1 2 … 31 (1-byte each)

Register File: Base and Δ values occupy Bank0, Bank1, Bank2 (3 Banks Used); Bank3 ... Bank7 are not accessed (5 Banks Unused)
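The Base-4/Delta-1 example above can be replayed in a few lines. This is a minimal Python sketch, assuming the first thread's register is used as the base (as in the slide's example); the function names are hypothetical and the code models only the data layout, not the hardware.

```python
def bdi_compress_delta1(values):
    """Return (base, deltas) if every delta fits in one signed byte, else None."""
    base = values[0]                      # assumption: thread 0 supplies the base
    deltas = [v - base for v in values[1:]]
    if all(-128 <= d <= 127 for d in deltas):
        return base, deltas
    return None


def bdi_decompress(base, deltas):
    """Rebuild the original per-thread values from base and deltas."""
    return [base] + [base + d for d in deltas]


def compressed_bytes(deltas, base_bytes=4, delta_bytes=1):
    """Stored size: one base plus one delta per remaining thread."""
    return base_bytes + delta_bytes * len(deltas)


if __name__ == "__main__":
    operand = [100_000_000 + i for i in range(32)]
    base, deltas = bdi_compress_delta1(operand)
    size = compressed_bytes(deltas)
    banks = -(-size // 16)                # ceiling division by the 16-byte bank width
    print(size, banks)
```

With the slide's values the sketch yields a 4-byte base plus 31 one-byte deltas, which fits in three 16-byte bank entries.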

Page 14: Warped-Compression - University of Southern California

BΔI Compression Parameters

▪ BΔI Can Use Various “Base” and “Delta” Sizes

• Base: 2, 4, 8-byte / Delta: 0, 1, 2-byte

• Various Base and Delta sizes can improve compression ratio

• But they also increase complexity of compression/decompression

▪ Use Single Base, Various Delta

• Most registers can be compressed by using a 4-byte Base (Base 4)

• Various Delta sizes improve compression ratio

• We use 4-byte Base and 0/1/2-byte Delta

[Figure: BDI Base/Delta Type Ratio (AVG), y-axis 0 to 1; legend: Not Compressed, Base 8/Delta 4, Base 8/Delta 2, Base 8/Delta 1, Base 8/Delta 0, Base 4/Delta 2, Base 4/Delta 1, Base 4/Delta 0.]

[Figure: Compression Ratio (AVG), y-axis 0 to 2; legend: Base 4/Delta 0 only, Base 4/Delta 1 only, Base 4/Delta 2 only, Base 4/Delta 0,1,2.]
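Choosing among the 0/1/2-byte delta sizes for a fixed 4-byte base can be sketched as picking the narrowest width that covers every delta. The Python below is an illustration under assumptions: thread 0 supplies the base, one delta is stored per remaining thread, and the helper names are hypothetical.

```python
BANK_BYTES = 16   # 128-bit bank entry
BASE_BYTES = 4    # fixed 4-byte base


def pick_delta_bytes(values):
    """Smallest delta width in {0, 1, 2} bytes covering all deltas, else None."""
    base = values[0]
    deltas = [v - base for v in values[1:]]
    if all(d == 0 for d in deltas):
        return 0
    if all(-2 ** 7 <= d < 2 ** 7 for d in deltas):
        return 1
    if all(-2 ** 15 <= d < 2 ** 15 for d in deltas):
        return 2
    return None


def stored_bytes(values):
    """Bytes kept in the register file for this warp operand."""
    width = pick_delta_bytes(values)
    if width is None:                     # not compressible: store as-is
        return BASE_BYTES * len(values)
    return BASE_BYTES + width * (len(values) - 1)


def banks_used(values):
    """Number of 16-byte bank entries occupied."""
    return -(-stored_bytes(values) // BANK_BYTES)
```

Under this layout a 32-thread operand occupies 1, 3, or 5 bank entries for Delta 0, 1, or 2, and 8 entries when left uncompressed; these bank counts are derived from the sketch's assumptions rather than quoted from the slides.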

Page 15: Warped-Compression - University of Southern California

Warped-Compression Architecture

▪ Compressor

• Inserted in front of the register file bank

▪ Decompressor

• Inserted in front of the operand collectors

▪ Bank Arbiter

• Tracks which register is compressed

• Tracks what compression parameters are used

[Figure: Warp Scheduler and Issue, Bank Arbiter with Compression Range Indicator Vector, Register Bank0 ... Register Bank31, Interconnect, Decompressor Unit Array, Operand Collectors, SIMD EXE Units, Compressor Unit Array.]

Page 16: Warped-Compression - University of Southern California

Dealing with Branch Divergence

▪ Branch Divergence

• Partially update destination registers in a warp using the active mask

• If the destination registers are compressed, registers cannot be updated using active mask

[Figure: for a divergent branch, if (threadId % 2) add r0, r1, r6 on one path and sub r0, r1, r6 on the other, each path (True / False) writes r0 only for threads whose active-mask bit is 1, shown for T0 ... T7 as 1 0 1 0 1 0 1 0 and 0 1 0 1 0 1 0 1, with 1 1 1 1 1 1 1 1 outside the branch. Execution Results selected by an Active Mask such as 1 1 1 0 1 0 1 1 cannot be merged directly into a Compressed Destination Register stored as Base Δ Δ Δ Δ.]

Page 17: Warped-Compression - University of Southern California

Simplifying Branch Divergence Handling

▪ Compression Ratio in Divergent Region is Low

• Thread registers in a diverged warp can have different values according to their execution path

▪ Simple Solution: Disable Compression in Divergent Region

▪ But What If a Destination Register is Already Compressed?

• Decompress it using dummy MOV instructions

[Figure: Compression Ratio (y-axis 0 to 6, one off-scale bar labeled 8) for Non-divergent Region, Divergent Region, and Overall; six entries are marked N/A.]

Page 18: Warped-Compression - University of Southern California

Handling Branch Divergence (1)

▪ Turn Off Register Compression

• Compression unit is disabled when the active mask contains any ‘zero’ values

▪ Decompress Destination Operand Register

• Bank arbiter injects a dummy MOV instruction to the execution pipeline when a destination register is compressed

• This dummy MOV instruction has the same src/dest register

[Figure: the instruction add r0, r1, r6 flows through the Warp Scheduler, Bank Arbiter, Register File, Decompressor, Operand Collector, SIMD EXE Units, and Compressor in the following steps.]

① Register Access Request to Read Input Operands (Access Request: r1, r6)

② Divergence Check

③ Destination Reg. r0 Check (Dest. Reg is Compressed: B Δ Δ Δ Δ)

④ If Destination Register is Compressed, Suspend Original Request & Inject ‘Dummy MOV’ Instruction (mov r0, r0)

⑤ Read & Decompress

Page 19: Warped-Compression - University of Southern California

Handling Branch Divergence (2)

▪ Update Register File

• Write uncompressed register value by the dummy MOV instruction

• At this point, the destination register on the register file is uncompressed

▪ Resume The Suspended Request

• Bank arbiter processes the suspended access request to the destination register as conventional register access

[Figure: Bank Arbiter, Register File, Decompressor, Operand Collector, SIMD EXE Units, and Compressor (Disabled); the destination register changes from “Dest. Reg is Compressed” (B Δ Δ Δ Δ) to “Dest. Reg is Uncompressed” in the following steps.]

⑥ Writeback Uncompressed Destination Register Value

⑦ Bank Arbiter Grants Register Write for Uncompressed Register Value

⑧ Bank Arbiter Restarts Suspended Register Access Request (Access Request: r1, r6)
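The divergence-handling steps on the two slides above can be summarized as a small decision procedure. The following Python sketch is a behavioral illustration only: `handle_write` and its arguments are hypothetical names, and the step strings paraphrase the slide text rather than describe actual hardware signals.

```python
def handle_write(active_mask, dest_compressed, result_compressible):
    """Return (steps, dest_compressed_after) for one warp-register write.

    active_mask: one 0/1 flag per thread of the warp.
    dest_compressed: whether the destination register is currently compressed.
    result_compressible: whether the new value could be compressed.
    """
    steps = []
    if all(active_mask):
        # No divergence: the compressor stays enabled.
        steps.append("write compressed" if result_compressible
                     else "write uncompressed")
        return steps, result_compressible

    # Divergent region: the active mask contains a zero.
    steps.append("disable compressor")
    if dest_compressed:
        steps.append("suspend original request")
        steps.append("inject dummy MOV (same src/dest register)")
        steps.append("read and decompress destination")
        steps.append("write back uncompressed destination")
        steps.append("resume suspended request")
    steps.append("masked write of uncompressed result")
    return steps, False
```

For example, `handle_write([1, 0] * 16, True, True)` walks through the full dummy MOV sequence, while the same call with `dest_compressed=False` goes straight to the masked write.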

Page 20: Warped-Compression - University of Southern California

Register File Energy Saving

▪ Average Register File Energy Consumption: Reduced by 25%

• Dynamic energy consumption: Reduced by register compression

• Leakage energy consumption: Reduced by unused bank-level power-gating

▪ Extra Energy Consumption of Compressor/Decompressor: Insignificant

[Figure: Register File Energy (AVG), y-axis 0 to 1, for Base and WC, broken down into RF Leakage, RF Dynamic, Compressor, Decompressor.]

Page 21: Warped-Compression - University of Southern California

Impact on Performance

▪ Performance Degradation: Negligible

• 2 cycle compression + 1 cycle decompression latency = 0.1% performance loss

• Dummy MOV instructions account for less than 2% of the total instruction count

[Figure: Execution Time (y-axis 0 to 1.2) for Baseline and Warped-Compression.]

Page 22: Warped-Compression - University of Southern California

Conclusion

▪ Register Files are Power Hungry

▪ But Register File Data Exhibits Strong Value Similarity

▪ Use BΔI Compression to Exploit Value Similarity to Compress Register Data

▪ Compression is Effective

• Reduce the size of a warp operand to 60%

▪ Compression is Energy Efficient

• Save 25% of total register file energy consumption

▪ Compression Has Negligible Performance Impact

• 0.1% degradation

Page 23: Warped-Compression - University of Southern California

Backup Slides


Page 24: Warped-Compression - University of Southern California

Evaluation Environment

▪ Simulation Parameters

Parameter                 Value
Clock Frequency           1.4GHz
SMs / GPU                 15
Warp Schedulers / SM      2
Warp Scheduling Policy    GTO
SIMT Lane Width           32
Max # of Warps / SM       48
Max # of Threads / SM     1536
Register File Size        128 KB
Max Registers / SM        32,768
# of Register Banks       32
Bit Width / Bank          128-bit
# of Entries / Bank       256
# of Compressors          2
# of Decompressors        4
Compression Latency       2 cycle
Decompression Latency     1 cycle
Bank Wakeup Latency       10 cycle

Parameter                                 Value
Operating Voltage                         1.0 V
Wire Capacitance (45nm)                   300 fF/mm
Wire Energy (128-bit)                     9.6 pJ/mm
Access Energy / Bank (45nm)               7pJ
Leakage Power / Bank (45nm)               5.8 mW
Compression Unit Energy / Activation      23 pJ
Compression Unit Leakage Power            0.12 mW
Decompression Unit Energy / Activation    21 pJ
Decompression Unit Leakage Power          0.08 mW

▪ Benchmarks

• GPGPU-sim, Rodinia benchmark suite, Parboil benchmark suite
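The parameters above are enough for two quick sanity checks: that the bank geometry adds up to the stated register file size, and how the compressor and decompressor activation energies compare with the bank energy they avoid. This is a rough Python sketch under assumptions not stated on the slide: 1 mm of wire per accessed bank and exactly one unit activation per compressed access. All names are hypothetical.

```python
BANKS = 32
ENTRIES_PER_BANK = 256
BANK_WIDTH_BYTES = 16                       # 128-bit entries
RF_BYTES = BANKS * ENTRIES_PER_BANK * BANK_WIDTH_BYTES

PER_BANK_PJ = 7.0 + 9.6 * 1.0               # bank access + 1 mm wire (assumed length)
COMPRESS_PJ = 23.0                          # compression unit energy / activation
DECOMPRESS_PJ = 21.0                        # decompression unit energy / activation


def read_energy_pj(banks, compressed):
    """Dynamic energy to read one warp operand from `banks` bank entries."""
    return banks * PER_BANK_PJ + (DECOMPRESS_PJ if compressed else 0.0)


def write_energy_pj(banks, compressor_active):
    """Dynamic energy to write one warp operand into `banks` bank entries."""
    return banks * PER_BANK_PJ + (COMPRESS_PJ if compressor_active else 0.0)


if __name__ == "__main__":
    print(RF_BYTES // 1024, "KB")
    print(round(read_energy_pj(8, False), 1), round(read_energy_pj(3, True), 1))
```

In this simplified model a compressed read that touches three banks still costs less than an uncompressed eight-bank read even after adding the decompressor activation; leakage and power-gating effects are not modeled.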

Page 25: Warped-Compression - University of Southern California

Compression & Decompression Unit

▪ Simplifying BΔI

• GPU Register: 32-bit

• Only use 4-byte “base” and 0/1/2-byte “delta” for compressing register values

• Only need 32-bit Adder/Subtractors, bit comparators

[Figure, Compressor: 32-bit Subtractors take the 128-byte Original Data and the 4-Byte Base and produce deltas Δ0 ... Δ30; Sign Extension Comparators decide “Compressible?”. If Yes, the packed data (4-Byte Base plus deltas) is sent as Compressed Data out; if No, the Original Data is sent out.]

[Figure, Decompressor: 32-bit Adders add the 4-Byte Base to each delta to rebuild the 128-byte Original Data.]
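The subtractor, sign-extension comparator, and adder stages can be mimicked with 32-bit wraparound arithmetic. The Python below is one plausible reading of what a "sign extension comparator" checks (a delta fits in n bytes if its upper bits are all copies of the sign bit of the low n bytes); the function names are hypothetical and the code is a sketch, not the authors' circuit.

```python
MASK32 = 0xFFFFFFFF


def sub32(a, b):
    """32-bit wraparound subtraction, as a hardware subtractor would produce."""
    return (a - b) & MASK32


def add32(base, delta32):
    """32-bit wraparound addition used on the decompression side."""
    return (base + delta32) & MASK32


def fits_sign_extended(delta32, nbytes):
    """True if a 32-bit delta is the sign extension of its low `nbytes` bytes."""
    if nbytes == 0:
        return delta32 == 0
    bits = 8 * nbytes
    low_mask = (1 << bits) - 1
    low = delta32 & low_mask
    if low >> (bits - 1):
        # Sign bit set: every upper bit must be 1.
        return delta32 == (low | (MASK32 & ~low_mask))
    # Sign bit clear: every upper bit must be 0.
    return delta32 == low
```

A warp operand would be compressible with an n-byte delta when `fits_sign_extended(sub32(v, base), n)` holds for every thread value `v`, and `add32(base, delta)` recovers `v` on the way out.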

Page 26: Warped-Compression - University of Southern California

How Much is This Opportunity?

▪ On Average, 79% of Thread Registers are Not Random

• Zero: neighboring registers have the same value

• 128 bin: neighboring registers differ by at most |128|

• 32K bin: neighboring registers differ by at most |2^15|

[Figure: Arithmetic Distance Distribution (y-axis 0 to 1), with Non-div and Div bars for each of LIB, AES, BFS, CP, LPS, STO, backp, hots, path, srad, dwt2d, cutcp, mri-q, sad, sgemm, spmv, stencil, and Avg; legend: Zero, 128 bin, 32K bin, Random; six bars are marked N/A.]

Page 27: Warped-Compression - University of Southern California

Register Compression

▪ Compressed Register Data Reduces the Number of Register File Accesses

[Figure: a 50% Compressed warp operand is stored as Comp 16B entries in Bank0 ... Bank3; Bank4 ... Bank7 Do Not Need to be Accessed, so Only 50% of RF & Wire are Active. Decompression behind the Bank Arbiter restores the Warp Operand (32-bit X 32).]

Page 28: Warped-Compression - University of Southern California

BΔI Compression Parameters

▪ BΔI Can Use Various “Base” and “Delta” Sizes

• Base: 2, 4, 8-byte / Delta: 0, 1, 2-byte

• Various Base and Delta sizes can improve compression ratio

• But they increase complexity of compression/decompression

▪ Use Fixed Base, Various Delta

• Most registers can be compressed by using a 4-byte Base (Base-4)

• GPU register granularity is 32-bit, so a 2-byte or 8-byte Base is not needed

• Various Delta sizes improve compression ratio

• We use 4-byte Base and 0/1/2-byte Delta

[Figure: BDI Base/Delta Type Ratio (AVG), y-axis 0 to 1; legend: Not Compressed, Base 8/Delta 4, Base 8/Delta 2, Base 8/Delta 1, Base 8/Delta 0, Base 4/Delta 2, Base 4/Delta 1, Base 4/Delta 0.]

[Figure: Compression Ratio, y-axis 0 to 3 with one off-scale bar labeled 5.6; legend: Base 4/Delta 0 only, Base 4/Delta 1 only, Base 4/Delta 2 only, Base 4/Delta 0,1,2.]

Page 29: Warped-Compression - University of Southern California

Handling Branch Divergence

▪ Compression Ratio in Divergent Region is Low

▪ Solution: Disable Compression & Decompress Register Before Access

• Dummy MOV instruction (which has same source-destination) used for decompressing registers when the destination register is compressed

[Figure: Compression Ratio (y-axis 0 to 6, one off-scale bar labeled 8) for Non-divergent Region, Divergent Region, and Overall; six entries are marked N/A.]

[Flowchart nodes: Writeback; Active Mask Has ‘0’; Disable Compressor; Destination Register is Compressed?; Inject Dummy MOV (Target Register Writeback Suspended); Decompress Destination Register; Resume Register Write; Complete Writeback.]

Page 30: Warped-Compression - University of Southern California

Register File Energy Saving

▪ Average Register File Energy Consumption: Reduced by 25%

• Dynamic energy consumption: Reduced by register compression

• Leakage energy consumption: Reduced by unused bank-level power-gating

▪ Extra Energy Consumption of Compressor/Decompressor: Insignificant

[Figure: Register File Energy (y-axis 0 to 1), with Base and WC bars for each of LIB, AES, BFS, CP, LPS, STO, back, hot, path, srad, dwt2d, cutcp, mri-q, sad, sgemm, spmv, stencil, and AVG; legend: RF Leakage, RF Dynamic, Compressor, Decompressor.]