SCIMA: Software Controlled Integrated Memory Architecture for HPC


• Background
– Memory wall problem
– Conventional cache is not well suited to HPC
   – unwanted line conflicts
   – fixed size of Off-Chip Memory accesses

• Solution: SCIMA (Software Controlled Integrated Memory Architecture)
– strategy: software controllability
– addressable On-Chip Memory in addition to conventional cache
– On-Chip Memory and cache are reconfigurable
– explicit data transfer between On-Chip Memory and Off-Chip Memory by page-load/page-store instructions
– burst transfer and stride transfer are supported
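A minimal sketch of what explicit burst and stride transfers could look like from the software side. The poster does not give the operands of page-load/page-store, so the function names, argument order, and the choice to pack strided data into consecutive On-Chip Memory slots are all assumptions made for illustration; plain Python lists stand in for the two memories.

```python
def page_load(on_chip, dst, off_chip, src, count, stride=1):
    """Copy `count` elements from off_chip, starting at `src` and taking every
    `stride`-th element, into consecutive on_chip slots starting at `dst`.
    stride == 1 models a burst transfer; stride > 1 models a stride transfer.
    (Hypothetical interface, not the real instruction format.)"""
    if count < 0 or stride < 1 or src < 0 or dst < 0:
        raise ValueError("count, src, dst must be >= 0 and stride >= 1")
    if count == 0:
        return
    last = src + (count - 1) * stride
    if last >= len(off_chip) or dst + count > len(on_chip):
        raise IndexError("transfer exceeds memory bounds")
    on_chip[dst:dst + count] = off_chip[src:last + 1:stride]


def page_store(off_chip, dst, on_chip, src, count, stride=1):
    """Inverse of page_load: write `count` consecutive on_chip elements back
    to off_chip, scattering them with the given stride."""
    if count < 0 or stride < 1 or src < 0 or dst < 0:
        raise ValueError("count, src, dst must be >= 0 and stride >= 1")
    if count == 0:
        return
    last = dst + (count - 1) * stride
    if last >= len(off_chip) or src + count > len(on_chip):
        raise IndexError("transfer exceeds memory bounds")
    off_chip[dst:last + 1:stride] = on_chip[src:src + count]


# Usage: treat off-chip memory as a 4x4 row-major matrix and fetch column 1.
off_chip = list(range(16))
on_chip = [0] * 8
page_load(on_chip, 0, off_chip, 1, 4, stride=4)
print(on_chip[:4])  # [1, 5, 9, 13]
```

Under this assumption the strided column lands densely in On-Chip Memory, so the compute loop touches only consecutive on-chip addresses and only the useful elements cross the off-chip interface.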

[Figure: Address Space]

• Schematic View

[Figure: Overview of SCIMA. Labels: ALU, FPU, register, L1 cache, cache, On-Chip Mem., Memory (DRAM), NIA, Network]

Advantages of SCIMA

On-Chip Memory Features                   Tb   Tl   Tt
software controllability                  -    -    ↓
page-load/page-store (burst)              ↑    ↓    -
page-load/page-store (stride)             ↑    ↓    ↓
scheduling for page-load/page-store       -    ↓    -

Latency Tolerating Techniques of Cache    Tb   Tl   Tt
larger cache line                         -    ↓    ↑
lock-up free cache                        -    ↓    ↑
cache prefetching                         ↑    ↓    ↑

Tb: CPU busy time
Tl: Latency stall time
Tt: Throughput stall time
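Read together with the stacked execution-cycle charts, the three quantities Tb, Tl, and Tt appear to partition the total execution time. A compact way to write that reading is given here; the symbol T_exec is introduced only for illustration and does not appear on the poster.

```latex
T_{\mathrm{exec}} = T_b + T_l + T_t
```

In this form, an entry of ↓ under Tl or Tt in the feature tables corresponds to shrinking one term of the sum, while ↑ under Tb means extra instructions are paid for in the first term.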

SCIMA: Experimental Results

• SCIMA provides various data placement and utilization schemes according to the characteristics of data access

• Evaluation Results
– Benchmarks: FT and QCD
– Configurations: Throughput ratio = 2:1, Latency = 40; and, following the future technology trend, Throughput ratio = 8:1, Latency = 160
– latency-stall reduction by burst transfer
– throughput-stall reduction by software controllability
– latency and throughput-stall reduction by stride transfer
– Latency and throughput stalls are reduced for a wide variety of data access patterns
– SCIMA is robust to the large throughput ratio and long memory access latency caused by the current technology trend of the CPU-memory speed gap

[Figure: Four bar charts of execution cycles comparing Cache and SCIMA at 64K and 512K. Each bar is broken down into CPU busy time, latency stall, and throughput stall.]

Data placement by access characteristics (rows: consecutive-ness, columns: reusability)

                reusable                                not-reusable
consecutive     reserve On-Chip Mem. for reused data    use On-Chip Mem. as a stream buffer
stride          reserve On-Chip Mem. for reused data    use On-Chip Mem. as a stream buffer
irregular       reserve On-Chip Mem. for reused data    use cache
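The placement scheme above can be sketched as a small decision function. The assignment of options to cells is an interpretation of the flattened table (three "reserve" entries for the reusable column, two "stream buffer" entries for non-reusable consecutive and stride accesses, one "use cache" entry for non-reusable irregular accesses), and the names `Pattern` and `choose_placement` are hypothetical.

```python
from enum import Enum


class Pattern(Enum):
    CONSECUTIVE = "consecutive"
    STRIDE = "stride"
    IRREGULAR = "irregular"


def choose_placement(pattern: Pattern, reusable: bool) -> str:
    """Return the placement option for one array, given how it is accessed.
    The mapping is an assumed reading of the poster's placement table."""
    if reusable:
        # Reused data is kept resident so it is fetched from off-chip once.
        return "reserve On-Chip Mem. for reused data"
    if pattern in (Pattern.CONSECUTIVE, Pattern.STRIDE):
        # Regular, single-use data streams through a small on-chip region.
        return "use On-Chip Mem. as a stream buffer"
    # Irregular, single-use data is left to the conventional cache.
    return "use cache"


for p in Pattern:
    for r in (True, False):
        print(p.value, "reusable" if r else "not-reusable", "->", choose_placement(p, r))
```

Keeping the decision per array rather than per program reflects the poster's point that placement follows the characteristics of each data access.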

Throughput Ratio = ratio between on-chip and off-chip memory throughput
Latency = memory access latency for off-chip memory (latency for the first data)
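To make the two parameters concrete, here is a rough, non-overlapped cost model of an off-chip transfer. It is a sketch under stated assumptions, not the simulator behind the evaluation: latency is taken to be in cycles and paid once per request, off-chip throughput is derived as on-chip throughput divided by the throughput ratio, and the 8 bytes/cycle on-chip figure, the 32-byte line size, and the 4096-byte page size are all hypothetical.

```python
def transfer_cycles(num_bytes, request_bytes, latency, throughput_ratio,
                    on_chip_bytes_per_cycle=8.0):
    """Estimate cycles to move num_bytes from off-chip memory when data is
    fetched in requests of request_bytes, each paying `latency` once.
    No overlap between requests is modeled (a deliberate simplification)."""
    off_chip_bytes_per_cycle = on_chip_bytes_per_cycle / throughput_ratio
    requests = -(-num_bytes // request_bytes)  # ceiling division
    return requests * latency + num_bytes / off_chip_bytes_per_cycle


PAGE = 4096  # hypothetical transfer size in bytes
LINE = 32    # hypothetical cache line size in bytes

for ratio, latency in ((2, 40), (8, 160)):
    by_line = transfer_cycles(PAGE, LINE, latency, ratio)
    by_burst = transfer_cycles(PAGE, PAGE, latency, ratio)
    print(f"ratio={ratio}:1 latency={latency}: "
          f"line-sized={by_line:.0f} cycles, burst={by_burst:.0f} cycles")
```

In this toy model the latency term scales with the number of requests, which is why a single large burst benefits more as latency grows, while the throughput term depends only on how many bytes actually cross the off-chip interface.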
