MacSim Tutorial (In ISCA-39, 2012)
Architecture Studies Using MacSim
Front-end:
• Thread fetch policies
• Branch predictor
Memory System:
• Software and hardware prefetcher
• Cache studies (sharing, inclusion)
• DRAM scheduling
• Interconnection studies
Misc.:
• Power model
Prefetcher Study
[Diagram: the trace generator (PIN, GPUOcelot) feeds MacSim's frontend with software prefetch instructions (PTX: prefetch, prefetchu; x86: prefetcht0, prefetcht1, prefetchnta); the memory system's hardware prefetcher (stream, stride, GHB, …) issues hardware prefetch requests.]
• Many-thread Aware Prefetching Mechanism [Lee et al. MICRO-43, 2010]
• When Prefetching Works, When It Doesn’t, and Why [Lee et al. ACM TACO, 2012]
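As a concrete illustration of the hardware prefetchers named above, here is a minimal per-PC stride prefetcher sketch. It is not MacSim's implementation; the table organization, confidence threshold, and prefetch degree are illustrative assumptions.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Minimal per-PC stride prefetcher sketch (illustrative, not MacSim's code).
// Each table entry remembers the last address seen for a PC and the last
// observed stride; once the same stride repeats enough times, prefetches
// are issued ahead of the demand stream.
class StridePrefetcher {
 public:
  // Called on every demand access; returns addresses to prefetch.
  std::vector<uint64_t> Access(uint64_t pc, uint64_t addr) {
    std::vector<uint64_t> prefetches;
    Entry& e = table_[pc];  // assumption: unbounded table, no eviction
    int64_t stride =
        static_cast<int64_t>(addr) - static_cast<int64_t>(e.last_addr);
    if (e.valid && stride == e.stride && stride != 0) {
      if (e.confidence < kThreshold) ++e.confidence;
    } else {
      e.stride = stride;
      e.confidence = 0;
    }
    if (e.valid && e.confidence >= kThreshold) {
      // Prefetch kDegree addresses ahead along the detected stride.
      for (int i = 1; i <= kDegree; ++i)
        prefetches.push_back(addr + static_cast<uint64_t>(i * e.stride));
    }
    e.last_addr = addr;
    e.valid = true;
    return prefetches;
  }

 private:
  struct Entry {
    uint64_t last_addr = 0;
    int64_t stride = 0;
    int confidence = 0;
    bool valid = false;
  };
  static constexpr int kThreshold = 2;  // assumed confidence threshold
  static constexpr int kDegree = 2;     // assumed prefetch degree
  std::unordered_map<uint64_t, Entry> table_;
};
```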
Cache and NoC Studies
• Cache studies – sharing, inclusion property
• On-chip interconnection studies
• TLP-Aware Cache Management Policy [Lee and Kim, HPCA-18, 2012]
[Diagram: per-core private caches vs. a shared cache, each connected through the on-chip interconnection.]
Heterogeneity Aware NoC
• Heterogeneous link configuration
• Different topologies
[Diagram: a ring network connecting CPU cores (C0–C2), GPU cores (G0–G2), L3 tiles, and memory controllers (M0–M1, MC); a tiled C/G/M layout and two ring orderings of the same nodes illustrate the different topologies.]
• On-chip Interconnection for CPU-GPU Heterogeneous Architecture [Lee et al. under review]
Instruction Fetch and DRAM Scheduling
[Diagram: the trace generator (GPUOcelot) feeds the frontend, whose fetch policy is configurable (RR, ICOUNT, FAIR, LRF, …); execution sends memory requests to DRAM, whose scheduling policy is also configurable (FCFS, FRFCFS, FAIR, …).]
• Effect of Instruction Fetch and Memory Scheduling on GPU Performance [Lakshminarayana and Kim, LCA-GPGPU, 2010]
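As a sketch of one of the listed fetch policies, the snippet below implements ICOUNT-style selection: each cycle, fetch from the ready thread with the fewest in-flight instructions. The struct fields and tie-break rule are illustrative assumptions, not MacSim's code.

```cpp
#include <cstddef>
#include <vector>

// Minimal ICOUNT fetch-policy sketch (illustrative; not MacSim's code).
// Each cycle, fetch from the ready thread with the fewest in-flight
// instructions, so slow threads do not clog the pipeline.
struct ThreadState {
  bool ready;           // eligible to fetch this cycle (no stall/barrier)
  int in_flight_insts;  // instructions fetched but not yet retired
};

// Returns the index of the thread to fetch from, or -1 if none is ready.
int SelectThreadICOUNT(const std::vector<ThreadState>& threads) {
  int best = -1;
  for (std::size_t i = 0; i < threads.size(); ++i) {
    if (!threads[i].ready) continue;
    if (best < 0 ||
        threads[i].in_flight_insts < threads[best].in_flight_insts)
      best = static_cast<int>(i);  // ties broken by lower thread id
  }
  return best;
}
```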
DRAM Scheduling in GPGPUs
[Diagram: a DRAM controller in front of a DRAM bank holds per-core request queues W0–W3 for Core-0 and Core-1 (RH = row-hit request, RM = row-miss request); each non-empty queue is headed by a row hit. Tolerance(Core-0) < Tolerance(Core-1).]
Potential of requests from Core-0 = |W0|^α + |W1|^α + |W2|^α + |W3|^α = 4^α + 3^α + 5^α (α < 1)
Reduction in potential if a row hit from a queue of length L is serviced next: L^α - (L - 1)^α
Reduction in potential if a row miss from a queue of length L is serviced next: L^α - (L - 1/m)^α, where m = (cost of servicing a row miss) / (cost of servicing a row hit)
Since Tolerance(Core-0) < Tolerance(Core-1), select Core-0. Servicing a row hit from W1 (of Core-0) results in the greatest reduction in potential, so row hits from W1 are serviced next.
• DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function [Lakshminarayana et al. IEEE CAL, 2011]
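To make the potential-function arithmetic concrete, the sketch below evaluates L^α - (L - 1)^α for Core-0's queues and picks the queue whose head row hit yields the greatest reduction. The queue lengths (4, 3, 5) come from the slide; α = 0.5 is an illustrative assumption (the policy only requires α < 1).

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Potential of a core's queues: sum over queues of |W_i|^alpha (alpha < 1).
double Potential(const std::vector<int>& qlens, double alpha) {
  double p = 0.0;
  for (int len : qlens) p += std::pow(len, alpha);
  return p;
}

int main() {
  const double alpha = 0.5;            // assumed; the slide only requires alpha < 1
  std::vector<int> qlens = {4, 3, 5};  // Core-0's queue lengths from the slide

  std::printf("Potential(Core-0) = %.4f\n", Potential(qlens, alpha));

  // Reduction in potential if the head row hit of a length-L queue is
  // serviced next: L^alpha - (L-1)^alpha. For alpha < 1 this is largest
  // for the shortest queue, which is why W1 (length 3) wins on the slide.
  int best = -1;
  double best_reduction = -1.0;
  for (std::size_t i = 0; i < qlens.size(); ++i) {
    double L = qlens[i];
    double reduction = std::pow(L, alpha) - std::pow(L - 1, alpha);
    std::printf("W%zu: len=%d reduction=%.4f\n", i, qlens[i], reduction);
    if (reduction > best_reduction) {
      best_reduction = reduction;
      best = static_cast<int>(i);
    }
  }
  std::printf("Service row hit from W%d next\n", best);
  return 0;
}
```

With α = 0.5 the reductions are 0.268 (W0), 0.318 (W1), and 0.236 (W2), so the sketch reproduces the slide's conclusion that W1's row hit is serviced next.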
Power Research & Validation
• Verifying the simulator against GTX580
• Modeling x86-CPU power
• Modeling GPU power
Still on-going research
[Chart: GPU power breakdown – EX_fpu 48%, L1 26%, EX_alu 6%, RF 4%, Fetch 3%, Schedule 3%, EX_LD/ST 3%, Decode 1%, EX_SFU 1%, SharedMem 1%, ConstCache 1%, TextureCache 1%, Execution 0%, MMU 0%.]