Synonymous Address Compaction for Energy Reduction in Data TLB
Chinnakrishnan Ballapuram
Hsien-Hsin S. Lee
Milos Prvulovic
School of Electrical and Computer Engineering
College of Computing
Georgia Institute of Technology
Atlanta, GA 30332
2Ballapuram et al., Georgia Tech
Background
Address translation
  A major contributor to processor power
  I-TLB and D-TLB are looked up for every instruction and memory reference
  TLBs are highly associative
  Multi-porting increases power consumption
Outline
Motivation
  Unique access behavior and locality are analyzed for energy reduction opportunities
Synonymous Address Compaction
  Intra-Cycle Compaction
  Inter-Cycle Compaction
  Implementation Details
Performance/Energy Evaluation
Conclusions
Breakdown of d-TLB accesses
More than one d-TLB lookup per cycle for 58% of accesses (4-wide machine)
They often access the same page (intra-cycle synonymous accesses)
[Figure: stacked bar chart. Y-axis: % of data TLB accesses (0% to 100%). Legend: 1, 2, 3, or 4 d-TLB accesses / clk.]
Breakdown of Synonymous Intra-cycle Accesses in d-TLB
~30% of accesses have synonyms, indicating redundancy
With intra-cycle compaction, 1/2 of syn(1) accesses, 2/3 of syn(2) accesses, and 3/4 of syn(3) accesses can be eliminated
[Figure: stacked bar chart over MiBench and SPEC2000. Y-axis: % of data TLB accesses (0% to 100%). Legend: syn(0) in 1, 2, 3, or 4 mem ref / clk; syn(1) in 2, 3, or 4 mem ref / clk; syn(2) in 3 or 4 mem ref / clk; syn(3) in 4 mem ref / clk.]
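The 1/2, 2/3, and 3/4 figures follow from simple counting, assuming syn(k) means that a reference has k same-page companions in the same cycle, so one lookup can serve all k + 1 references. A minimal sketch under that reading (the function name is hypothetical, not from the slides):

```python
from fractions import Fraction


def eliminated_fraction(k: int) -> Fraction:
    """Fraction of lookups saved when k + 1 same-cycle references share a page.

    Assumption: syn(k) denotes k synonyms, i.e. k + 1 references to one page,
    of which only one needs a real d-TLB lookup.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return Fraction(k, k + 1)


if __name__ == "__main__":
    for k in range(4):
        print(f"syn({k}): {eliminated_fraction(k)} of lookups eliminated")
```

With k = 0 there is nothing to eliminate, which matches the syn(0) categories in the chart legend.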
Inter-cycle Reuse of d-TLB Translations
Inter-cycle synonymous accesses
  68% of accesses could reuse the last address translation
More reuse can be achieved by partitioning the d-TLB into stack (99%), global (82%), and heap (75%)
[Figure: bar chart. Y-axis: % of data TLB accesses (40% to 100%). Bars: Baseline, stack-TLB, global-TLB, heap-TLB.]
Dynamic Data Memory Distribution
~40% of dynamic memory accesses go to the stack, which is concentrated on only a few pages
4 memory accesses ~= 2 stack, 1 global, and 1 heap
[Figure: stacked bar chart. Y-axis: 0% to 100%. Legend: stack, global, heap, text, env.]
Semantic-Aware Memory Architecture
Most of the memory accesses go to the smaller stack and global TLBs/caches, reducing power
[Figure: block diagram. Labels: Virtual address, Data Address Router, ld_data_base_reg, ld_env_base_reg, ld_data_bound_reg, sTLB (entries 0 to 1), gTLB (entries 0 to 3), uTLB (entries 0 to 63), sCache, gCache, hCache, Unified L2 Cache, To Processor.]
VPN compaction mechanisms
Virtual address access sequence
  Cycle i:      0xdeadbeee  0xdeadbeef  0xdeadbef0
  Cycle (i+1):  0xdeadbef2  0xdeadbeef  0x12345678
  Additional entries shown: 0xffffffff, -----
VPN translation lookup in d-TLB
  Cycle i:      0xdeadb  0xdeadb  0xdeadb
  Cycle (i+1):  0xdeadb  0xdeadb  0x12345
  Additional entries shown: 0xfffff, -----
Intra-cycle compaction
VPNs after intra-cycle compaction
  Cycle i:      0xdeadb  -----  -----
  Cycle (i+1):  0xdeadb  -----  0x12345
  Additional entries shown: 0xffffffff, -----
Inter-cycle compaction
VPNs after inter-cycle compaction
  Cycle i:      0xdeadb  0xdeadb  0xdeadb
  Cycle (i+1):  -----  -----  0x12345
  Additional entries shown: 0xfffff, -----
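The two mechanisms can be replayed on the example addresses from this slide. The sketch below is illustrative only and rests on assumptions not stated in the slides: 4 KB pages (a 12-bit offset, giving 20-bit VPNs for 32-bit addresses), inter-cycle compaction that compares each VPN only against a latch holding the last VPN of the previous cycle, and only the three addresses per cycle that the slide assigns unambiguously. All function names are hypothetical.

```python
PAGE_SHIFT = 12  # assumption: 4 KB pages, so a 32-bit address has a 20-bit VPN


def vpn(addr):
    """Virtual page number of a virtual address."""
    return addr >> PAGE_SHIFT


def intra_cycle(vpns):
    """Keep one copy of each VPN issued in the same cycle (first-seen order)."""
    return list(dict.fromkeys(vpns))


def inter_cycle(vpns, latch):
    """Drop VPNs that match the latch (assumed: last VPN of the previous cycle)."""
    return [v for v in vpns if v != latch]


def run(cycles, use_intra, use_inter):
    """Return, per cycle, the VPNs that still need a real d-TLB lookup."""
    latch = None
    lookups = []
    for addrs in cycles:
        vpns = [vpn(a) for a in addrs]
        pending = intra_cycle(vpns) if use_intra else vpns
        if use_inter:
            pending = inter_cycle(pending, latch)
        lookups.append(pending)
        if vpns:
            latch = vpns[-1]  # assumption: latch tracks the cycle's last VPN
    return lookups


CYCLES = [
    [0xDEADBEEE, 0xDEADBEEF, 0xDEADBEF0],  # cycle i
    [0xDEADBEF2, 0xDEADBEEF, 0x12345678],  # cycle i+1
]

if __name__ == "__main__":
    for name, intra, inter in [("intra", True, False), ("inter", False, True)]:
        result = run(CYCLES, intra, inter)
        print(name, [[hex(v) for v in cyc] for cyc in result])
```

Under these assumptions, intra-cycle compaction leaves one 0xdeadb lookup in cycle i and two lookups (0xdeadb, 0x12345) in cycle i+1, while inter-cycle compaction alone leaves cycle i untouched and reduces cycle i+1 to the single 0x12345 lookup, matching the rows shown on the slide.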
Intra-cycle compaction mechanism
[Figure: block diagram. Labels: Reservation Station, AGUs, IUs, FPUs, Load Buffer, Store Buffer, Memory Order Buffer, Six 20-bit comparators, 32-entry fully-associative Data TLBs, Physical Address.]
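Comparator Logic
[Figure: comparator logic diagram attached to the Memory Order Buffer.
  Pairwise comparisons: VPN1 == VPN2, VPN1 == VPN3, VPN1 == VPN4, VPN2 == VPN3, VPN2 == VPN4, VPN3 == VPN4
  Pair match signals: 1 = 2, 1 = 3, 1 = 4, 2 = 3, 2 = 4, 3 = 4
  Triple match signals: 1 = 2 = 3, 1 = 2 = 4, 1 = 3 = 4, 2 = 3 = 4
  Four-way match signal: 1 = 2 = 3 = 4]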
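Four VPNs give exactly six distinct pairs, which is why six comparators suffice. The sketch below models those comparisons and derives which slots still need their own lookup. The policy that a slot reuses the translation of the earliest matching slot, and the assumption that all four slots hold valid references, are illustrative choices rather than details taken from the slides; the function names are hypothetical.

```python
from itertools import combinations


def pairwise_matches(vpns):
    """Model the comparators: one equality result per unordered pair of slots."""
    return {(i, j): vpns[i] == vpns[j]
            for i, j in combinations(range(len(vpns)), 2)}


def lookup_mask(vpns):
    """True for slots that need a real d-TLB lookup.

    Assumption: a slot is suppressed when any earlier slot carries the same
    VPN, so the earliest slot of each group performs the lookup.
    """
    eq = pairwise_matches(vpns)
    return [not any(eq[(i, j)] for i in range(j)) for j in range(len(vpns))]


if __name__ == "__main__":
    slots = [0xDEADB, 0xDEADB, 0x12345, 0xDEADB]
    print(len(pairwise_matches(slots)), "comparisons")
    print(lookup_mask(slots))
```

For the sample slots, the first and third references perform lookups and the other two reuse the first result.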
Inter-cycle Compaction Mechanism
[Figure: block diagram of the semantic-aware memory architecture with MRU latches added for last access reuse. Labels: Virtual address, Data Address Router, ld_data_base_reg, ld_env_base_reg, ld_data_bound_reg, sTLB (entries 0 to 1), gTLB (entries 0 to 3), uTLB (labels 0, 32), MRU Latch (three instances), last access reuse, sCache, gCache, hCache, Unified L2 Cache, To Processor.]
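A rough way to see why per-partition latches reuse more translations than a single latch: interleaved stack, global, and heap references keep evicting each other from one shared latch, but not from separate ones. The sketch below is a toy model, not the authors' hardware. It assumes the latch is updated on every access, and the trace, the region tags, and the `fake_tlb` translation function are all made up for illustration.

```python
class MruLatch:
    """Holds the most recently translated VPN and its physical page number."""

    def __init__(self):
        self.vpn = None
        self.ppn = None
        self.hits = 0     # translations served from the latch
        self.lookups = 0  # translations that required a TLB lookup

    def translate(self, vpn, tlb_lookup):
        if vpn == self.vpn:
            self.hits += 1
            return self.ppn
        self.lookups += 1
        self.vpn, self.ppn = vpn, tlb_lookup(vpn)
        return self.ppn


def fake_tlb(vpn):
    """Hypothetical stand-in for a real TLB lookup."""
    return vpn + 0x100


# Hypothetical trace of (region, VPN) pairs with interleaved regions.
TRACE = [("stack", 0x7FFF0), ("heap", 0x10010), ("stack", 0x7FFF0),
         ("global", 0x08040), ("stack", 0x7FFF0), ("heap", 0x10010)]

single = MruLatch()
per_region = {"stack": MruLatch(), "global": MruLatch(), "heap": MruLatch()}
for region, v in TRACE:
    single.translate(v, fake_tlb)
    per_region[region].translate(v, fake_tlb)

partitioned_hits = sum(latch.hits for latch in per_region.values())

if __name__ == "__main__":
    print("single latch hits:", single.hits)
    print("per-region latch hits:", partitioned_hits)
```

On this made-up trace the shared latch never hits, while the per-region latches serve three of the six references.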
Simulation Parameters
Parameter                          Value
Execution Engine                   Out-of-Order
Fetch / Decode / Issue / Commit    4 / 4 / 4 / 4
L1 / L2 / Memory Latency           1 / 6 / 150
TLB hit / miss latency             1 / 30
L1 Cache baseline                  DM 32KB, 32B
L2 Cache                           4w 512KB, 32B
Number of TLB entries              32
Each 20-bit comparator power       300 uW
Each MRU latch power in TLB        140 uW
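The per-component power figures in the parameters translate into the added-hardware overhead by simple multiplication. The sketch below uses the two listed values and the six comparators named on the intra-cycle mechanism slide; the number of MRU latches is left as a parameter because it depends on how many TLBs carry one, which is an assumption here. Function and constant names are hypothetical.

```python
COMPARATOR_POWER_UW = 300  # from the parameters: each 20-bit comparator
MRU_LATCH_POWER_UW = 140   # from the parameters: each MRU latch in a TLB
NUM_COMPARATORS = 6        # six 20-bit comparators for a 4-wide machine


def comparator_overhead_uw():
    """Power added by the intra-cycle comparators, in microwatts."""
    return NUM_COMPARATORS * COMPARATOR_POWER_UW


def latch_overhead_uw(num_latches):
    """Power added by the inter-cycle MRU latches, in microwatts."""
    return num_latches * MRU_LATCH_POWER_UW


if __name__ == "__main__":
    print("comparators:", comparator_overhead_uw(), "uW")
    for n in (1, 3):  # assumption: one latch, or one per partitioned TLB
        print(f"{n} MRU latch(es):", latch_overhead_uw(n), "uW")
```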
Energy Savings via Synonymous Compaction
Intra-cycle compaction: 27%
Inter-cycle compaction: 42%
Inter-cycle semantic-aware: 56%
[Figure: bar chart. Y-axis: data TLB energy savings % (0% to 80%). Legend: Inter-Cycle: semantic-aware stack-only; Inter-Cycle: semantic-aware stack + global; Inter-Cycle: semantic-aware stack + heap; Inter-Cycle: all semantic-aware; Inter-Cycle: applied to baseline d-TLB; Intra-Cycle: applied to baseline d-TLB.]
Performance Impact w/ Synonymous Compaction
Intra-cycle compaction: 9%
Inter-cycle compaction: 8%
Inter-cycle semantic-aware: 4%
[Figure: bar chart. Y-axis: performance speedup (0.80 to 1.00). Legend: Inter-Cycle: semantic-aware stack only; Inter-Cycle: semantic-aware stack + global; Inter-Cycle: semantic-aware stack + heap; Inter-Cycle: all semantic-aware; Inter-Cycle: applied to baseline d-TLB; Intra-Cycle: applied to baseline d-TLB.]
I- and d-TLB Energy Savings via Synonymous Compaction
Combining compaction for iTLB and dTLB gives 85% and 52% energy savings, respectively
Overall 70% TLB energy savings
Using semantic-aware, overall 76% energy savings
[Figure: bar chart. Y-axis: TLB energy savings % (0% to 100%). Legend: Intra-Cycle + Inter-Cycle: applied to baseline iTLB; Intra-Cycle + Inter-Cycle: applied to baseline dTLB; Intra-Cycle + Inter-Cycle: applied to both iTLB + baseline dTLB; Intra-Cycle + Inter-Cycle: applied to both iTLB + all semantic-aware.]
I- and d-TLB Performance Impact w/ Synonymous Compaction
Combining compaction for iTLB and dTLB has 5% and 13% performance impact, respectively
Using semantic-aware, overall 13% performance impact
[Figure: bar chart. Y-axis: performance speedup (0.80 to 1.00). Legend: Intra-Cycle + Inter-Cycle: applied to baseline iTLB; Intra-Cycle + Inter-Cycle: applied to baseline dTLB; Intra-Cycle + Inter-Cycle: iTLB + dTLB; Intra-Cycle + Inter-Cycle: iTLB + all semantic-aware.]
Conclusions
Consecutive TLB accesses are highly synonymous
Proposed synonymous address compaction to exploit this behavior
  Reduces energy for d-TLB and i-TLB
Energy savings and performance impact
  Intra-cycle: 27% and 9%
  Inter-cycle: 42% and 8%
  Semantic-aware: 56% and 4%
Q and A