Warped-Compression: Enabling Power Efficient GPUs through Register Compression
Sangpil Lee, Keunsoo Kim, Won Woo Ro (Yonsei University*)
Gunjae Koo, Hyeran Jeon, Murali Annavaram (USC)
(*Work done while visiting USC)
Short Summary
Target: Register File on GPUs
Problem: Energy Consumption of Register File
Solution: Data Compression on Register File
Results: Reducing 25% of Register File Energy Consumption
Motivation: Register Power Consumption
▪ GPUs Need Large Register Files to Maximize TLP
▪ Register File Contributes a Significant Portion of the Total GPU Chip Power
▪ Register File Size Has Been Growing
Figure: register file size by GPU generation: Tesla (G80/G92) 512KB, Tesla (GT200) 1920KB, Fermi (GF110) 2048KB, Kepler (GK110) 3840KB, Maxwell (GM200) 6144KB
Figure: Estimated GeForce GTX480 (Fermi) Component Power Consumption*
*Leng et al., “GPUWattch : Enabling Energy Optimizations in GPGPUs”
Motivation: GPU Register Characteristics
▪ Warp: A Bundle of 32 Threads
▪ Operands of a Warp: A Bundle of 32 Thread Registers
• This bundle of registers is treated as a single instruction operand in GPUs
Figure: Warp Instruction (add.u32 %r0, %r1, %r6) with dst = r0, src1 = r1, src2 = r6; each operand holds one register per thread T0 to T31, i.e. 32-bit Registers X 32 (128-byte)
Baseline Register File
▪ Multi-banked Register File*
• 4KB per bank, 32 banks
• 128-bit wide single read/write port – Provides 4 thread operands per bank
• Access 8 banks for collecting a warp operand
Figure: eight 4KB banks (Bank0 to Bank7, each 128-bit wide) each supply 16 bytes through the Bank Arbiter to the Operand Collector Buffer (32-bit X 32)
*Gebhart et al., “Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors”
Register File Access Energy
▪ Accessing Warp Operand Registers Activates Multiple Banks
• Bank access energy + wire energy
4KB SRAM Access Energy (1): 7pJ
128-bit Wire Energy (2): 9.6pJ/mm (wire length labeled in the figure: 1mm)
Access Energy / Warp Operand: (7 + 9.6) * 8 = 132.8pJ
(1) CACTI (1.0V, 45nm)
(2) Gebhart et al., “Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors” (1.0V, 40nm)
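The per-operand estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the function name `warp_operand_access_energy_pj` is hypothetical, and the 1mm default wire length is an assumption read off the figure label.

```python
BANK_ACCESS_PJ = 7.0     # 4KB SRAM bank access energy (CACTI, 45nm), from the slide
WIRE_PJ_PER_MM = 9.6     # 128-bit wire energy per mm, from the slide
BANKS_PER_OPERAND = 8    # 8 banks x 16 bytes = one 128-byte warp operand


def warp_operand_access_energy_pj(banks=BANKS_PER_OPERAND, wire_mm=1.0):
    """Energy in pJ to fetch one warp operand: (bank access + wire) per activated bank.

    wire_mm is an assumed wire length per bank access (1mm in the figure).
    """
    return (BANK_ACCESS_PJ + WIRE_PJ_PER_MM * wire_mm) * banks


full = warp_operand_access_energy_pj()          # all 8 banks active
half = warp_operand_access_energy_pj(banks=4)   # only 4 banks active
print(round(full, 1), round(half, 1))
```

Because the energy scales linearly with the number of activated banks, any scheme that touches fewer banks per operand reduces this cost proportionally.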
Register File Access is Power Hungry!
How Can We Reduce Register File Access Energy?
Opportunity: Similarity of Register Values
▪ Value Similarity is Frequently Observed on a Warp Operand
• Constant Value: all thread registers in a warp have the same value
• Index Values: all thread registers have incremental values
• Low Dynamic Range: values of all thread registers are bounded in a limited range
Constant value example (T0 … T31): src = 1 1 1 1 … 1 1 1 1
Index value example (T0 … T31): src = 0 1 2 3 … 28 29 30 31
Low dynamic range example (T0 … T31): src = 127 156 156 157 … 172 173 168 162
Dynamic Range: 46 (min = 127, max = 173)
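The three patterns above can be told apart with a simple check over the thread registers of one warp operand. The sketch below is illustrative only; the function name `classify_warp_operand` and the returned labels are assumptions, not terms from the slides.

```python
def classify_warp_operand(values):
    """Label a warp operand by the kind of value similarity it shows."""
    if len(set(values)) == 1:
        return "constant"                      # every thread holds the same value
    steps = {b - a for a, b in zip(values, values[1:])}
    if len(steps) == 1:
        return "index"                         # same increment between neighboring threads
    # Otherwise report how wide the value range is across the warp.
    return "dynamic range %d" % (max(values) - min(values))


print(classify_warp_operand([1] * 32))
print(classify_warp_operand(list(range(32))))
print(classify_warp_operand([127, 156, 156, 157, 172, 173, 168, 162]))
```

The last call uses the eight values visible on the slide, whose range is 173 - 127 = 46.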
Source of Value Similarity: pathfinder*
__global__ void pathfinder_kernel(int iteration, ...) {
    ...
    int tx = threadIdx.x;   // thread index
    int bx = blockIdx.x;    // thread block index
    int small_block_cols = BLOCKSIZE - iteration * HALO * 2;
    int blkX = small_block_cols * bx - border;
    int xidx = blkX + tx;
    ...
    for (int i = 0; i < iteration; i++) {
        computed = false;
        if (IN_RANGE(tx, i + 1, BLOCKSIZE - i - 2) && isValid) {
            computed = true;
            int left = prev[W];
            int up = prev[tx];
            int right = prev[E];
            int shortest = MIN(left, up);
            shortest = MIN(shortest, right);
            int index = cols * (startStep + i) + xidx;
            result[tx] = shortest + wall[index];
            ...
        }
    }
    ...
}
Callouts on the slide:
• Index Values: Thread Index (0 ~ 1023)
• Constant Values: Thread Block Index (0 ~ 65535)
• Low Dynamic Range: Application Input Data (0 ~ 9)
*from Rodinia Benchmark Suite
How Much is This Opportunity?
▪ On Average, 70% of Thread Registers are Not Random
• Zero: neighboring registers have the same value
• 128 bin: neighboring registers differ by at most |128|
• 32K bin: neighboring registers differ by at most |2^15|
Figure: Arithmetic Distance Distribution (y-axis 0 to 1); legend: Zero, 128 bin, 32K bin, Random
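The binning described above can be sketched as follows. It assumes the distance is measured between registers of neighboring threads in thread order, which is how the bullets read; the helper names `distance_bin` and `distance_distribution` are hypothetical.

```python
from collections import Counter


def distance_bin(a, b):
    """Bin the arithmetic distance between two neighboring thread registers."""
    d = abs(a - b)
    if d == 0:
        return "Zero"
    if d <= 128:
        return "128 bin"
    if d <= 2 ** 15:
        return "32K bin"
    return "Random"


def distance_distribution(values):
    """Fraction of neighboring register pairs that fall into each bin."""
    bins = Counter(distance_bin(a, b) for a, b in zip(values, values[1:]))
    total = len(values) - 1
    return {name: count / total for name, count in bins.items()}


print(distance_distribution([5, 5, 6, 200, 100000]))
```

In this toy input each of the four neighboring pairs lands in a different bin, so every bin gets a share of 0.25.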
Exploiting Value Similarity for Register Compression
Register Compression
Figure: the Compressor packs a Writeback (32-bit X 32) into four 16-byte compressed chunks (Comp16B), i.e. 50% Compressed; on a read, the Bank Arbiter fetches only those chunks and Decompression rebuilds the Warp Operand (32-bit X 32), so Only 50% of RF & Wire are Active
But Is It Practical?
▪ Energy Consumption
• Compression & Decompression consume extra energy
▪ Register File Access Latency
• Compression & Decompression increase register file access latency
▪ Requirements for Register Compression
• Low Latency Compression
• Low Energy Compression
• High Compression Ratio
Low Latency/Energy Compression
▪ Base-Delta-Immediate (BΔI) Compression
• Optimized for zero and similar value compression
• Use “base” and “delta” to represent original value
Example (Base-4, Delta-1):
Original Data (128-byte): 32 values of 4 bytes each: 100,000,000 100,000,001 100,000,002 … 100,000,031
BΔI Compression Data Representation (35-byte): Base Value 100,000,000 (4-byte) followed by 31 Delta Values 1, 2, …, 31 (1-byte each)
Register File: the Warp Operand (32 Thread Registers) is stored as Base and Δ values in Bank0 to Bank2 (3 Banks Used); Bank3 to Bank7 are not accessed (5 Banks Unused)
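The Base-4/Delta-1 example above can be worked through in code. This is a minimal sketch, not the hardware design: it assumes the first thread's register is the base, treats deltas as signed, ignores 32-bit wraparound, and the names `bdi_compress` and `bdi_decompress` are hypothetical.

```python
def bdi_compress(values, delta_bytes=1):
    """Return (base, deltas) if every delta fits in delta_bytes signed bytes, else None."""
    base = values[0]                              # assumed: first thread's value is the base
    deltas = [v - base for v in values[1:]]
    limit = 1 << (8 * delta_bytes - 1)            # 128 for 1 byte, 32768 for 2 bytes
    if all(-limit <= d < limit for d in deltas):
        return base, deltas
    return None


def bdi_decompress(base, deltas):
    """Rebuild the original thread registers from the base and the deltas."""
    return [base] + [base + d for d in deltas]


operand = [100_000_000 + t for t in range(32)]    # the values shown on the slide
base, deltas = bdi_compress(operand)
compressed_bytes = 4 + len(deltas) * 1            # 4-byte base + 31 one-byte deltas
banks_used = -(-compressed_bytes // 16)           # ceiling division by the 16-byte bank width
print(compressed_bytes, banks_used)
```

With 16 bytes delivered per bank, the 35-byte representation fits in 3 banks, which matches the "3 Banks Used" label.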
BΔI Compression Parameters
▪ BΔI Can Use Various “Base” and “Delta” Sizes
• Base: 2, 4, 8-byte / Delta: 0, 1, 2-byte
• Various Base and Delta sizes can improve compression ratio
• But they also increase complexity of compression/decompression
▪ Use Single Base, Various Delta
• Most registers can be compressed by using a 4-byte Base (Base 4)
• Various Delta sizes improve compression ratio
• We use 4-byte Base and 0/1/2-byte Delta
Figure: BDI Base/Delta Type Ratio (AVG, y-axis 0 to 1); legend: Not Compressed, Base 8/Delta 4, Base 8/Delta 2, Base 8/Delta 1, Base 8/Delta 0, Base 4/Delta 2, Base 4/Delta 1, Base 4/Delta 0
Figure: Compression Ratio (AVG, y-axis 0 to 2); series: Base 4/Delta 0 only, Base 4/Delta 1 only, Base 4/Delta 2 only, Base 4/Delta 0,1,2
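To see why offering several delta sizes helps, the sketch below picks the smallest delta that covers a warp operand and derives how many 16-byte banks the result occupies. It is an illustration under stated assumptions: the base is the first thread's value, deltas are signed, and the helper names are hypothetical.

```python
BASE_BYTES = 4     # 4-byte base, as chosen on the slide
BANK_BYTES = 16    # each bank delivers 128 bits
THREADS = 32


def smallest_delta_bytes(values):
    """Return 0, 1 or 2 for the smallest usable delta size, or None if incompressible."""
    base = values[0]
    deltas = [v - base for v in values[1:]]
    if all(d == 0 for d in deltas):
        return 0
    for nbytes in (1, 2):
        limit = 1 << (8 * nbytes - 1)
        if all(-limit <= d < limit for d in deltas):
            return nbytes
    return None


def banks_needed(delta_bytes):
    """Banks touched by one warp operand stored with the given delta size."""
    if delta_bytes is None:
        return THREADS * 4 // BANK_BYTES               # uncompressed: 128 bytes
    size = BASE_BYTES + (THREADS - 1) * delta_bytes
    return -(-size // BANK_BYTES)                      # ceiling division


for choice in (0, 1, 2, None):
    print(choice, banks_needed(choice))
```

Under these assumptions a constant operand needs a single bank, while a 2-byte delta still saves three of the eight banks.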
Warped-Compression Architecture
▪ Compressor
• Inserted in front of the register file bank
▪ Decompressor
• Inserted in front of the operand collectors
▪ Bank Arbiter
• Tracks which register is compressed
• Tracks which compression parameters are used
Figure: Warped-Compression pipeline with Issue / Warp Scheduler, Bank Arbiter (with Compression Range Indicator Vector), Register Bank0 to Register Bank31, Interconnect, Decompressor Unit Array, Operand Collectors, SIMD EXE Units, and Compressor Unit Array
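The bookkeeping role of the bank arbiter can be pictured as a small table of per-register state. The sketch below is hypothetical: the 2-bit encoding, the class name `BankArbiterSketch`, and the bank counts (derived from a 4-byte base plus 31 deltas over 16-byte banks) are assumptions for illustration, not details given on the slides.

```python
# Hypothetical 2-bit state per warp register.
UNCOMPRESSED, DELTA0, DELTA1, DELTA2 = range(4)

# Banks to activate per state: ceil((4 + 31 * delta_bytes) / 16), or 8 when uncompressed.
BANKS_TO_READ = {UNCOMPRESSED: 8, DELTA0: 1, DELTA1: 3, DELTA2: 5}


class BankArbiterSketch:
    """Tracks which warp registers are compressed and with which delta size."""

    def __init__(self):
        self.indicator = {}                      # (warp_id, reg_id) -> state

    def record_write(self, warp_id, reg_id, state):
        self.indicator[(warp_id, reg_id)] = state

    def banks_for_read(self, warp_id, reg_id):
        state = self.indicator.get((warp_id, reg_id), UNCOMPRESSED)
        return BANKS_TO_READ[state]

    def needs_decompressor(self, warp_id, reg_id):
        return self.indicator.get((warp_id, reg_id), UNCOMPRESSED) != UNCOMPRESSED


arbiter = BankArbiterSketch()
arbiter.record_write(warp_id=0, reg_id=1, state=DELTA1)
print(arbiter.banks_for_read(0, 1), arbiter.needs_decompressor(0, 1))
```

Keeping this state next to the arbiter lets a read request activate only the banks that hold compressed data and route the result through a decompressor when needed.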
Dealing with Branch Divergence
▪ Branch Divergence
• Partially update destination registers in a warp using the active mask
• If the destination registers are compressed, registers cannot be updated using active mask
Figure: for "if (threadId % 2)", add r0, r1, r6 runs on the True path and sub r0, r1, r6 on the False path; the two paths update r0 of threads T0 to T7 under complementary active masks (1 0 1 0 1 0 1 0 and 0 1 0 1 0 1 0 1), and the mask is 1 1 1 1 1 1 1 1 once both paths have executed
Figure: Execution Results under a partial Active Mask (1 1 1 0 1 0 1 1) cannot be written into a Compressed Destination Register (Base Δ Δ Δ Δ)
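A small example shows why a lane-wise masked write does not mix with a compressed destination. The sketch uses 8 lanes as in the figure; the values and the helper names `masked_update` and `fits_one_byte_delta` are made up for illustration.

```python
def masked_update(old_values, results, active_mask):
    """Per-lane writeback: only lanes whose mask bit is 1 take the new result."""
    return [new if bit else old
            for old, new, bit in zip(old_values, results, active_mask)]


def fits_one_byte_delta(values):
    """True if every value is within a signed 1-byte delta of the first value."""
    base = values[0]
    return all(-128 <= v - base < 128 for v in values[1:])


r0 = list(range(8))                        # index values: compressible
results = [1000 + t for t in range(8)]     # results from the active lanes
mask = [1, 1, 1, 0, 1, 0, 1, 1]            # active mask shown on the slide
merged = masked_update(r0, results, mask)
print(merged, fits_one_byte_delta(r0), fits_one_byte_delta(merged))
```

The merge needs the old value of every inactive lane, so the register must be available in uncompressed form, and the merged result mixes old and new values that no longer share a small delta.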
Simplifying Branch Divergence Handling
▪ Compression Ratio in Divergent Region is Low
• Thread registers in a diverged warp can have different values according to their execution path
▪ Simple Solution: Disable Compression in Divergent Region
▪ But What If a Destination Register is Already Compressed?
• Using dummy MOV instructions
Figure: Compression Ratio (y-axis 0 to 6); series: Non-divergent Region, Divergent Region, Overall; six entries are marked N/A and one bar is labeled 8
Handling Branch Divergence (1)
▪ Turn Off Register Compression
• Compression unit is disabled when the active mask contains any ‘zero’ values
▪ Decompress Destination Operand Register
• Bank arbiter injects a dummy MOV instruction to the execution pipeline when a destination register is compressed
• This dummy MOV instruction has the same src/dest register
Figure: steps for add r0, r1, r6 across Warp Scheduler, Bank Arbiter, Register File, Decompressor, Operand Collector, SIMD EXE Units, and Compressor
① Register Access Request to Read Input Operands (Access Request: r1, r6)
② Divergence Check
③ Destination Reg. r0 Check (Dest. Reg is Compressed: B Δ Δ Δ Δ)
④ If Destination Register is Compressed, Suspend Original Request & Inject ‘Dummy MOV’ Instruction (mov r0, r0)
⑤ Read & Decompress
Handling Branch Divergence (2)
▪ Update Register File
• Write uncompressed register value by the dummy MOV instruction
• At this point, the destination register on the register file is uncompressed
▪ Resume The Suspended Request
• Bank arbiter processes the suspended access request to the destination register as conventional register access
Figure: continuation of the steps, with the Compressor disabled
⑥ Writeback Uncompressed Destination Register Value
⑦ Bank Arbiter Grants Register Write for Uncompressed Register Value
⑧ Bank Arbiter Restarts Suspended Register Access Request (Access Request: r1, r6)
Destination register state: Compressed (B Δ Δ Δ Δ) becomes Uncompressed
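Steps ② to ⑧ can be traced with a toy register file model. This is a sketch under simplifying assumptions: registers are stored as ("bdi", (base, deltas)) or ("raw", values), only 1-byte deltas are tried, timing is ignored, and all function names are hypothetical.

```python
def decompress(entry):
    kind, payload = entry
    if kind == "raw":
        return list(payload)
    base, deltas = payload
    return [base] + [base + d for d in deltas]


def try_compress(values):
    base = values[0]
    deltas = [v - base for v in values[1:]]
    if all(-128 <= d < 128 for d in deltas):
        return ("bdi", (base, deltas))
    return ("raw", list(values))


def issue(regfile, dest, active_mask, log):
    """Steps 2 to 8: before a divergent write, make sure dest is stored uncompressed."""
    divergent = not all(active_mask)                       # step 2: divergence check
    if divergent and regfile[dest][0] == "bdi":            # step 3: destination check
        log.append("mov %s, %s" % (dest, dest))            # step 4: inject dummy MOV
        values = decompress(regfile[dest])                 # step 5: read and decompress
        regfile[dest] = ("raw", values)                    # steps 6 and 7: uncompressed writeback
    return divergent                                       # step 8: original request resumes


def writeback(regfile, dest, results, active_mask, divergent):
    if not divergent:
        regfile[dest] = try_compress(results)              # compressor enabled
    else:
        old = regfile[dest][1]                             # compressor disabled: lane-wise merge
        merged = [n if m else o for o, n, m in zip(old, results, active_mask)]
        regfile[dest] = ("raw", merged)


regfile = {"r0": try_compress(list(range(8)))}
log = []
mask = [1, 0, 1, 0, 1, 0, 1, 0]
div = issue(regfile, "r0", mask, log)
writeback(regfile, "r0", [100] * 8, mask, div)
print(log, regfile["r0"])
```

A second divergent write to the same register would find it already uncompressed, so no further dummy MOV is injected in this model.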
Register File Energy Saving
▪ Average Register File Energy Consumption: Reduced by 25%
• Dynamic energy consumption: Reduced by register compression
• Leakage energy consumption: Reduced by unused bank-level power-gating
▪ Extra Energy Consumption of Compressor/Decompressor: Insignificant
Figure: Register File Energy (AVG, y-axis 0 to 1), Base vs. WC; components: RF Leakage, RF Dynamic, Compressor, Decompressor
Impact on Performance
▪ Performance Degradation: Negligible
• 2 cycle compression + 1 cycle decompression latency = 0.1% performance loss
• Dummy MOV instructions account for less than 2% of the total instruction count
Figure: Execution Time (y-axis 0 to 1.2), Baseline vs. Warped-Compression
Conclusion
▪ Register Files are Power Hungry
▪ But Register File Data Exhibits Strong Value Similarity
▪ Use BΔI Compression to Exploit Value Similarity to Compress Register Data
▪ Compression is Effective
• Reduce the size of a warp operand to 60%
▪ Compression is Energy Efficient
• Save 25% of total register file energy consumption
▪ Compression Has Negligible Performance Impact
• 0.1% degradation
Backup Slides
Evaluation Environment
▪ Simulation Parameters
Parameter: Value
Clock Frequency: 1.4GHz
SMs / GPU: 15
Warp Schedulers / SM: 2
Warp Scheduling Policy: GTO
SIMT Lane Width: 32
Max # of Warps / SM: 48
Max # of Threads / SM: 1536
Register File Size: 128 KB
Max Registers / SM: 32,768
# of Register Banks: 32
Bit Width / Bank: 128-bit
# of Entries / Bank: 256
# of Compressors: 2
# of Decompressors: 4
Compression Latency: 2 cycles
Decompression Latency: 1 cycle
Bank Wakeup Latency: 10 cycles

Parameter: Value
Operating Voltage: 1.0 V
Wire Capacitance (45nm): 300 fF/mm
Wire Energy (128-bit): 9.6 pJ/mm
Access Energy / Bank (45nm): 7 pJ
Leakage Power / Bank (45nm): 5.8 mW
Compression Unit Energy / Activation: 23 pJ
Compression Unit Leakage Power: 0.12 mW
Decompression Unit Energy / Activation: 21 pJ
Decompression Unit Leakage Power: 0.08 mW
▪ Benchmarks
• GPGPU-sim, Rodinia benchmark suite, Parboil benchmark suite
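The parameters above are enough for a rough dynamic-energy comparison of one warp operand read. This is a back-of-the-envelope sketch, not the evaluation methodology: it assumes a 1mm wire per bank access, one decompressor activation per compressed read, and ignores leakage; the function name `read_energy_pj` is hypothetical.

```python
BANK_ACCESS_PJ = 7.0      # Access Energy / Bank
WIRE_PJ_PER_MM = 9.6      # Wire Energy (128-bit)
DECOMPRESS_PJ = 21.0      # Decompression Unit Energy / Activation

# Capacity check from the table: 32 banks x 256 entries x 128 bits.
register_file_bytes = 32 * 256 * 128 // 8


def read_energy_pj(banks, compressed, wire_mm=1.0):
    """Dynamic energy in pJ to read one warp operand stored in `banks` banks."""
    energy = banks * (BANK_ACCESS_PJ + WIRE_PJ_PER_MM * wire_mm)
    if compressed:
        energy += DECOMPRESS_PJ       # assumed: one decompressor activation per read
    return energy


print(register_file_bytes)
print(round(read_energy_pj(8, compressed=False), 1))
print(round(read_energy_pj(3, compressed=True), 1))
```

Under these assumptions a read that touches 3 banks plus the decompressor costs about 70.8 pJ against 132.8 pJ for the uncompressed 8-bank read.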
Compression & Decompression Unit
▪ Simplifying BΔI
• GPU Register: 32-bit
• Only use 4-byte “base” and 0/1/2-byte “delta” for compressing register values
• Only need 32-bit Adder/Subtractors, bit comparators
Figure (Compressor): 128-byte Original Data feeds 32-bit Subtractors against the 4-Byte Base, producing Δ0, Δ1, Δ2, Δ3, …, Δ30; Sign Extension Comparators decide "Compressible?"; Yes: Data Packing and Compressed Data out (4-Byte Base, Δ0 … Δn-1); No: Original Data out
Figure (Decompressor): 4-Byte Base and Δ0 … Δn-1 feed 32-bit Adders, which rebuild the 128-byte Original Data
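The subtractor and sign-extension comparator pair can be modeled on 32-bit words. The sketch below is one plausible reading of the figure, not the actual circuit: a delta is taken to fit in n bytes when sign-extending its low n bytes reproduces the full 32-bit difference. All function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def sub32(a, b):
    """32-bit wrapping subtraction, as a hardware subtractor would produce."""
    return (a - b) & MASK32


def add32(a, b):
    """32-bit wrapping addition, used on the decompression side."""
    return (a + b) & MASK32


def sign_extend(value, nbytes):
    """Sign-extend the low nbytes (1 or 2) of value to 32 bits."""
    bits = 8 * nbytes
    low = value & ((1 << bits) - 1)
    if low & (1 << (bits - 1)):                  # sign bit of the truncated delta
        low |= MASK32 & ~((1 << bits) - 1)
    return low


def fits(delta32, nbytes):
    """Comparator: does the 32-bit delta survive truncation to nbytes?"""
    if nbytes == 0:
        return delta32 == 0
    return sign_extend(delta32, nbytes) == delta32


base, value = 7, 5
delta = sub32(value, base)                       # 0xFFFFFFFE, i.e. -2
stored = delta & 0xFF                            # 1-byte delta kept in the register file
restored = add32(base, sign_extend(stored, 1))
print(hex(delta), fits(delta, 1), restored)
```

The same check with nbytes = 2 covers the 2-byte delta case, and a zero delta needs no storage at all.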
How Much is This Opportunity?
▪ On Average, 79% of Thread Registers are Not Random
• Zero: neighboring registers have the same value
• 128 bin: neighboring registers differ by at most |128|
• 32K bin: neighboring registers differ by at most |2^15|
Figure: Arithmetic Distance Distribution (y-axis 0 to 1), Non-div vs. Div for LIB, AES, BFS, CP, LPS, STO, backp, hots, path, srad, dwt2d, cutcp, mri-q, sad, sgemm, spmv, stencil, Avg; legend: Zero, 128 bin, 32K bin, Random; six entries are marked N/A
Register Compression
▪ Compressed Register Data Reduces the Number of Register File Accesses
Figure: a 50% Compressed warp operand is stored as four Comp16B chunks in Bank0 to Bank3; Bank4 to Bank7 Do Not Need to be Accessed, so Only 50% of RF & Wire are Active; data read through the Bank Arbiter passes through Decompression to rebuild the Warp Operand (32-bit X 32)
BΔI Compression Parameters
▪ BΔI Can Use Various “Base” and “Delta” Sizes
• Base: 2, 4, 8-byte / Delta: 0, 1, 2-byte
• Various Base and Delta sizes can improve compression ratio
• But this increases complexity of compression/decompression
▪ Use Fixed Base, Various Delta
• Most registers can be compressed by using a 4-byte Base (Base-4)
• GPU register granularity is 32-bit, so a 2-byte or 8-byte Base is not needed
• Various Delta sizes improve compression ratio
• We use 4-byte Base and 0/1/2-byte Delta
Figure: BDI Base/Delta Type Ratio (AVG, y-axis 0 to 1); legend: Not Compressed, Base 8/Delta 4, Base 8/Delta 2, Base 8/Delta 1, Base 8/Delta 0, Base 4/Delta 2, Base 4/Delta 1, Base 4/Delta 0
Figure: Compression Ratio (y-axis 0 to 3); series: Base 4/Delta 0 only, Base 4/Delta 1 only, Base 4/Delta 2 only, Base 4/Delta 0,1,2; one bar is labeled 5.6
Handling Branch Divergence
▪ Compression Ratio in Divergent Region is Low
▪ Solution: Disable Compression & Decompress Register Before Access
• Dummy MOV instruction (which has same source-destination) used for decompressing registers when the destination register is compressed
Figure: Compression Ratio (y-axis 0 to 6); series: Non-divergent Region, Divergent Region, Overall; six entries are marked N/A and one bar is labeled 8
Flowchart nodes: Writeback; Active Mask Has ‘0’; Disable Compressor; Destination Register is Compressed?; Target Register Writeback Suspended; Inject Dummy MOV; Decompress Destination Register; Resume Register Write; Complete Writeback
Register File Energy Saving
▪ Average Register File Energy Consumption: Reduced by 25%
• Dynamic energy consumption: Reduced by register compression
• Leakage energy consumption: Reduced by unused bank-level power-gating
▪ Extra Energy Consumption of Compressor/Decompressor: Insignificant
Figure: Register File Energy (y-axis 0 to 1), Base vs. WC for LIB, AES, BFS, CP, LPS, STO, back, hot, path, srad, dwt2d, cutcp, mri-q, sad, sgemm, spmv, stencil, AVG; components: RF Leakage, RF Dynamic, Compressor, Decompressor