MacSim Tutorial (In ISCA-39, 2012)
Architecture Studies Using MacSim
Front-end:
• Thread fetch policies
• Branch predictor
Memory System:
• Software and hardware prefetcher
• Cache studies (sharing, inclusion)
• DRAM scheduling
• Interconnection studies
Misc.:
• Power model
Prefetcher Study
[Diagram: the trace generator (PIN, GPUOcelot) feeds MacSim's frontend with software prefetch instructions (PTX: prefetch, prefetchu; x86: prefetcht0, prefetcht1, prefetchnta); the memory system's hardware prefetcher (stream, stride, GHB, …) issues hardware prefetch requests.]
• Many-thread Aware Prefetching Mechanism [Lee et al. MICRO-43, 2010]
• When Prefetching Works, When It Doesn’t, and Why [Lee et al. ACM TACO, 2012]
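As a concrete illustration of the hardware prefetchers named above, here is a minimal per-PC stride prefetcher sketch. It is not MacSim's implementation; the table organization, confidence threshold, and prefetch degree are illustrative assumptions.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Minimal per-PC stride prefetcher sketch (illustrative, not MacSim's code).
// Each table entry remembers the last address seen for a PC and the last
// observed stride; once the same stride repeats enough times, prefetches
// are issued ahead of the demand stream.
class StridePrefetcher {
 public:
  // Called on every demand access; returns addresses to prefetch.
  std::vector<uint64_t> Access(uint64_t pc, uint64_t addr) {
    std::vector<uint64_t> prefetches;
    Entry& e = table_[pc];  // assumption: unbounded table, no eviction
    int64_t stride =
        static_cast<int64_t>(addr) - static_cast<int64_t>(e.last_addr);
    if (e.valid && stride == e.stride && stride != 0) {
      if (e.confidence < kThreshold) ++e.confidence;
    } else {
      e.stride = stride;
      e.confidence = 0;
    }
    if (e.valid && e.confidence >= kThreshold) {
      // Prefetch kDegree addresses ahead along the detected stride.
      for (int i = 1; i <= kDegree; ++i)
        prefetches.push_back(addr + static_cast<uint64_t>(i * e.stride));
    }
    e.last_addr = addr;
    e.valid = true;
    return prefetches;
  }

 private:
  struct Entry {
    uint64_t last_addr = 0;
    int64_t stride = 0;
    int confidence = 0;
    bool valid = false;
  };
  static constexpr int kThreshold = 2;  // assumed confidence threshold
  static constexpr int kDegree = 2;     // assumed prefetch degree
  std::unordered_map<uint64_t, Entry> table_;
};
```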
Cache and NoC Studies
• Cache studies – sharing, inclusion property
• On-chip interconnection studies
• TLP-Aware Cache Management Policy [Lee and Kim, HPCA-18, 2012]
[Diagram: per-core private caches vs. a shared cache, each connected through the on-chip interconnection.]
Heterogeneity Aware NoC
• Heterogeneous link configuration
• Different topologies
[Diagram: a ring network connecting CPU cores (C0–C2), GPU cores (G0–G2), L3 tiles, and memory controllers (M0–M1, MC); a tiled C/G/M layout and two ring orderings of the same nodes illustrate the different topologies.]
• On-chip Interconnection for CPU-GPU Heterogeneous Architecture [Lee et al. under review]
Instruction Fetch and DRAM Scheduling
[Diagram: the trace generator (GPUOcelot) feeds the frontend, whose fetch policy is configurable (RR, ICOUNT, FAIR, LRF, …); execution sends memory requests to DRAM, whose scheduling policy is also configurable (FCFS, FRFCFS, FAIR, …).]
• Effect of Instruction Fetch and Memory Scheduling on GPU Performance [Lakshminarayana and Kim, LCA-GPGPU, 2010]
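As a sketch of one of the listed fetch policies, the snippet below implements ICOUNT-style selection: each cycle, fetch from the ready thread with the fewest in-flight instructions. The struct fields and tie-break rule are illustrative assumptions, not MacSim's code.

```cpp
#include <cstddef>
#include <vector>

// Minimal ICOUNT fetch-policy sketch (illustrative; not MacSim's code).
// Each cycle, fetch from the ready thread with the fewest in-flight
// instructions, so slow threads do not clog the pipeline.
struct ThreadState {
  bool ready;           // eligible to fetch this cycle (no stall/barrier)
  int in_flight_insts;  // instructions fetched but not yet retired
};

// Returns the index of the thread to fetch from, or -1 if none is ready.
int SelectThreadICOUNT(const std::vector<ThreadState>& threads) {
  int best = -1;
  for (std::size_t i = 0; i < threads.size(); ++i) {
    if (!threads[i].ready) continue;
    if (best < 0 ||
        threads[i].in_flight_insts < threads[best].in_flight_insts)
      best = static_cast<int>(i);  // ties broken by lower thread id
  }
  return best;
}
```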
DRAM Scheduling in GPGPUs
[Diagram: a DRAM controller in front of a DRAM bank holds per-core request queues W0–W3 for Core-0 and Core-1 (RH = row-hit request, RM = row-miss request); each non-empty queue is headed by a row hit. Tolerance(Core-0) < Tolerance(Core-1).]
Potential of requests from Core-0 = |W0|^α + |W1|^α + |W2|^α + |W3|^α = 4^α + 3^α + 5^α (α < 1)
Reduction in potential if a row hit from a queue of length L is serviced next: L^α - (L - 1)^α
Reduction in potential if a row miss from a queue of length L is serviced next: L^α - (L - 1/m)^α, where m = (cost of servicing a row miss) / (cost of servicing a row hit)
Since Tolerance(Core-0) < Tolerance(Core-1), select Core-0. Servicing a row hit from W1 (of Core-0) results in the greatest reduction in potential, so row hits from W1 are serviced next.
• DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function [Lakshminarayana et al. IEEE CAL, 2011]
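To make the potential-function arithmetic concrete, the sketch below evaluates L^α - (L - 1)^α for Core-0's queues and picks the queue whose head row hit yields the greatest reduction. The queue lengths (4, 3, 5) come from the slide; α = 0.5 is an illustrative assumption (the policy only requires α < 1).

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Potential of a core's queues: sum over queues of |W_i|^alpha (alpha < 1).
double Potential(const std::vector<int>& qlens, double alpha) {
  double p = 0.0;
  for (int len : qlens) p += std::pow(len, alpha);
  return p;
}

int main() {
  const double alpha = 0.5;            // assumed; the slide only requires alpha < 1
  std::vector<int> qlens = {4, 3, 5};  // Core-0's queue lengths from the slide

  std::printf("Potential(Core-0) = %.4f\n", Potential(qlens, alpha));

  // Reduction in potential if the head row hit of a length-L queue is
  // serviced next: L^alpha - (L-1)^alpha. For alpha < 1 this is largest
  // for the shortest queue, which is why W1 (length 3) wins on the slide.
  int best = -1;
  double best_reduction = -1.0;
  for (std::size_t i = 0; i < qlens.size(); ++i) {
    double L = qlens[i];
    double reduction = std::pow(L, alpha) - std::pow(L - 1, alpha);
    std::printf("W%zu: len=%d reduction=%.4f\n", i, qlens[i], reduction);
    if (reduction > best_reduction) {
      best_reduction = reduction;
      best = static_cast<int>(i);
    }
  }
  std::printf("Service row hit from W%d next\n", best);
  return 0;
}
```

With α = 0.5 the reductions are 0.268 (W0), 0.318 (W1), and 0.236 (W2), so the sketch reproduces the slide's conclusion that W1's row hit is serviced next.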
Power Research & Validation
• Verifying the simulator against GTX580
• Modeling x86-CPU power
• Modeling GPU power
Still on-going research
[Chart: GPU power breakdown – EX_fpu 48%, L1 26%, EX_alu 6%, RF 4%, Fetch 3%, Schedule 3%, EX_LD/ST 3%, Decode 1%, EX_SFU 1%, SharedMem 1%, ConstCache 1%, TextureCache 1%, Execution 0%, MMU 0%.]