Exploring Design Space for 3D Clustered Architectures
Manu Awasthi, Rajeev Balasubramonian
University of Utah
3D Technologies
• Multiple layers of active devices
• Vertical interconnects between layers
[Figure: 2D chip (one device layer on silicon) vs. 3D chip (Layer 1 and Layer 2, with Device Layer 2 joined by a vertical interconnect). Labels: "Very small", "~10 µm". Courtesy: K. Bernstein, IBM]
Benefits of 3D
• Reduction of global interconnect
• Delay/power reduction
• Bandwidth
• Mixed-technology integration
Previous Proposals
• Previously in 3D…
  – Break and stack (folding) [Puttaswamy et al.]
• Vertical stacking of active devices
[Figure: RegFile, break and stack. Annotations: "All are active", "HEAT!!!", "Reduced intra-block latency"]
An alternative approach?
[Figure: 2D chip vs. 3D chip with Die 0 and Die 1]
Prudent Stacking Can:
• Improve Performance
• Result in better thermal profile
Wire Delays and Performance
[Figure: Impact of wire delays. x-axis: extra delay (in clock cycles), 0 to 8; y-axis: percent slowdown, 0 to 50. Curves: DCACHE-INTALU, IQ-INTALU, RENAME-IQ, L1D-L2, BPRED-ICACHE, ICACHE-DECODE, DECODE-RENAME, DCACHE-FPALU, FPALU-INTALU]
Clustered Architectures
• Centralized front-end
  – I-Cache & D-Cache
  – LSQ, Rename, Decode
  – Branch Predictor
• Clustered back-end
  – Issue Queue
  – Regfile, FUs
[Figure labels: L1 DCache, Cluster, Crossbar/Router, Front-End]
Higher clock frequency, high ILP!!
Decentralized Cache Banks
• Multiple L1 DCache banks instead of one: possibly better performance
• Replicated cache banks
[Figure: several L1 DCache banks]
• Word-interleaved cache banks
[Figure: two L1 DCache banks, labeled "Odd Words" and "Even Words"]
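The odd/even split on the word-interleaved slide can be sketched as a simple address-to-bank mapping. This is only an illustration: the 8-byte word size and the function name `bank_for_address` are assumptions, since the slides do not state a word size or any implementation detail.

```python
# Minimal sketch of word-interleaved bank selection.
# ASSUMPTIONS (not from the slides): 8-byte words; bank 0 holds even
# words and bank 1 holds odd words.
WORD_BYTES = 8
NUM_BANKS = 2


def bank_for_address(addr: int,
                     word_bytes: int = WORD_BYTES,
                     num_banks: int = NUM_BANKS) -> int:
    """Return the index of the bank holding the word that contains addr."""
    word_index = addr // word_bytes   # which word the byte falls in
    return word_index % num_banks     # even words -> 0, odd words -> 1


if __name__ == "__main__":
    for addr in (0x00, 0x08, 0x10, 0x18):
        bank = bank_for_address(addr)
        label = "even-word bank" if bank == 0 else "odd-word bank"
        print(hex(addr), label)
```

With this mapping, consecutive words alternate between the two banks, so each bank stores only half of the words.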
Outline
• Introduction
  – Motivation
  – 3D Architectures
  – Clustered Architectures
• Proposals
• Results
• Conclusions
Architecture 1: Cache-on-cluster
Architecture 2: Cluster-on-cluster
Architecture 3: Staggered
[Figures: each architecture is drawn across Die 0 and Die 1. Legend: cache bank, cluster, inter-die interconnect, intra-die interconnect]
Experimental Setup
• Framework
  – SimpleScalar, Wattch and HotSpot 3.0
  – Wire model: 8x global metal plane
• Benchmarks
  – SPEC 2K, single threaded
• Processor Configuration
  – 8 clusters
  – 64 kB L1 I/D caches, 2-way set-associative
• L1 data cache: word-interleaved or replicated
• 2D centralized cache: base case
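As a rough illustration of the design space these settings span, the sketch below lists the configurations that the results slides compare: the 2D centralized base case, the two decentralized 2D bank types, and each 3D architecture with each bank type. The names `SETUP`, `BANK_TYPES`, `ARCHS_3D` and `design_points` are hypothetical and are not taken from the authors' simulation scripts.

```python
from itertools import product

# Settings copied from the slide; the dictionary layout is an assumption.
SETUP = {
    "clusters": 8,
    "l1_size_kb": 64,
    "l1_assoc": 2,
    "benchmarks": "SPEC 2K, single threaded",
    "wire_model": "8x global metal plane",
}

BANK_TYPES = ("Replicated", "WI")           # WI = word-interleaved
ARCHS_3D = ("Arch 1", "Arch 2", "Arch 3")   # the three 3D proposals


def design_points():
    """Return (layout, L1 data cache organization) pairs to evaluate."""
    points = [("2D", "Centralized")]                      # base case
    points += [("2D", bank) for bank in BANK_TYPES]       # decentralized 2D
    points += [(f"3D {arch}", bank)
               for arch, bank in product(ARCHS_3D, BANK_TYPES)]
    return points


if __name__ == "__main__":
    for layout, cache in design_points():
        print(f"{layout:10s} {cache}")
```

Under these assumptions the space has nine points: one base case, two decentralized 2D configurations, and six 3D configurations.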
Base Case Performances
[Figure: Performance improvement wrt 2D centralized cache. x-axis: cache bank type (Replicated, WI); y-axis: performance improvement, 0.0 to 9.0. Annotation: "Best Case 2D Config"]
The 3D Effect
[Figure: Average performance improvement, 3D Replicated vs 2D Centralized. x-axis: Arch 1, Arch 2, Arch 3; y-axis: percentage improvement over 2D centralized, 0 to 16]
The 3D Effect
[Figure: Average performance improvement, 3D WI vs 2D Centralized. x-axis: Arch 1, Arch 2, Arch 3; y-axis: percentage improvement over centralized, 0 to 25]
Comparisons
[Figure, three panels, y-axis: IPC improvement, 0 to 25]
• 2D case: base case performance comparisons; x-axis: Replicated, WI. Annotation: "Best Case 2D"
• 3D Replicated: average performance improvement wrt 2D centralized; x-axis: Arch 1, Arch 2, Arch 3. Annotation: "Best Case 3D - Rep"
• 3D WI: average performance improvement wrt 2D centralized; x-axis: Arch 1, Arch 2, Arch 3. Annotation: "Best Case 3D - WI"
12% improvement for best case 3D vs. best case 2D
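The charts report improvement as a percentage over a baseline. A minimal sketch of that arithmetic follows, assuming the metric is relative IPC gain (the slides label the axes "IPC Improvement" and "Percentage improvement over 2D Centralized" but do not spell out the formula). The function name and the input values are placeholders, not measured results.

```python
def percent_improvement(ipc_new: float, ipc_base: float) -> float:
    """Relative IPC gain of a configuration over a baseline, in percent."""
    if ipc_base <= 0:
        raise ValueError("baseline IPC must be positive")
    return (ipc_new - ipc_base) / ipc_base * 100.0


if __name__ == "__main__":
    # Placeholder IPC values chosen only to exercise the formula;
    # they do not come from the slides.
    print(percent_improvement(ipc_new=2.5, ipc_base=2.0))
```

Note that the same configuration yields different percentages depending on the baseline: the bar charts use the 2D centralized cache, while the 12% figure compares the best 3D case against the best 2D case.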
Thermal Analysis
• Wattch for power numbers
• HotSpot 3.0 for thermal model (grid)
  – 500x500 grid resolution
• Interconnect power modeling
  – Attributed to functional units
  – 8X plane wires
  – Router + crossbar modeled as separate entity
Thermal Profiles
[Figure: Peak temperature of the hottest on-chip unit (Celsius). x-axis: Base, Arch 1, Arch 2, Arch 3; y-axis: 0 to 120]
Conclusions
• Wire delays are critical to performance
  – Some are more important than others
• Prudent block stacking
  – Performance improvement of up to 12% over 2D
    • WI banks + Arch 3 (3D)
  – Better thermal profiles compared to folding
Backup Slides
4 Cluster Arrangements
[Figure: Die 0 and Die 1 layouts for (a) Arch-1 (cache-on-cluster), (b) Arch-2 (cluster-on-cluster), (c) Arch-3 (staggered). Legend: cluster, cache bank, intra-die horizontal wire, inter-die vertical wire]