
SCIMA: Software Controlled Integrated Memory Architecture for HPC


Page 1: SCIMA: Software Controlled Integrated Memory Architecture for HPC

SCIMA: Software Controlled Integrated Memory Architecture for HPC

• Background
  – Memory wall problem
  – Conventional cache is not well suited to HPC
    – unwanted line conflicts
    – fixed size of Off-Chip Memory accesses

• Solution: SCIMA (Software Controlled Integrated Memory Architecture)
  – strategy: software controllability
  – addressable On-Chip Memory in addition to conventional cache
  – On-Chip Memory and cache are reconfigurable
  – explicit data transfer between On-Chip Memory and Off-Chip Memory by page-load/page-store instructions
  – burst transfer and stride transfer are supported
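The explicit transfers listed above can be illustrated with a small software model. This is only a sketch under stated assumptions: the function names `page_load` and `page_store`, their parameters, the page size `PAGE_WORDS`, and the choice to pack strided words contiguously in the on-chip page are all hypothetical and are not taken from the slides or from the SCIMA instruction set.

```python
# Toy model of SCIMA-style explicit transfers (all names are hypothetical).
PAGE_WORDS = 8  # assumed on-chip page size, in words


def page_load(off_chip, src, on_chip, dst, count=PAGE_WORDS, stride=1):
    """Copy `count` words from off-chip memory into on-chip memory.

    stride=1 models a burst transfer of consecutive words.
    stride>1 gathers every `stride`-th word and packs the words
    contiguously on chip (an assumption of this sketch).
    """
    for i in range(count):
        on_chip[dst + i] = off_chip[src + i * stride]


def page_store(on_chip, src, off_chip, dst, count=PAGE_WORDS, stride=1):
    """Copy `count` contiguous on-chip words back, scattering by `stride`."""
    for i in range(count):
        off_chip[dst + i * stride] = on_chip[src + i]


off_chip = list(range(64))   # stand-in for off-chip DRAM
on_chip = [0] * PAGE_WORDS   # stand-in for one on-chip page

page_load(off_chip, 0, on_chip, 0)            # burst: consecutive words
burst = list(on_chip)

page_load(off_chip, 0, on_chip, 0, stride=8)  # stride: every 8th word
strided = list(on_chip)

on_chip[:] = [w * 2 for w in strided]         # compute on the on-chip copy
page_store(on_chip, 0, off_chip, 0, stride=8) # write results back by stride
```

In this model a strided load moves only the words the program will use, which is the intuition behind listing stride transfer as a separate feature from burst transfer.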

Advantages of SCIMA

On-Chip Memory Features                  Tb   Tl   Tt
software controllability                 -    -    ↓
page-load/page-store (burst)             ↑    ↓    -
page-load/page-store (stride)            ↑    ↓    ↓
scheduling for page-load/page-store      -    ↓    -

Latency Tolerating Techniques of Cache   Tb   Tl   Tt
larger cache line                        -    ↓    ↑
lock-up free cache                       -    ↓    ↑
cache prefetching                        ↑    ↓    ↑

Tb: CPU busy time
Tl: Latency stall time
Tt: Throughput stall time

• Schematic View

[Figure: Overview of SCIMA. Labels: Address Space, register, ALU, FPU, L1 cache, On-Chip Mem., Memory (DRAM), NIA, Network.]

Page 2: SCIMA: Software Controlled Integrated Memory Architecture for HPC

SCIMA: Experimental Results

• SCIMA provides various data placement and utilization schemes according to the characteristics of data access

• Evaluation Results

[Figures: execution cycles for FT and QCD, Cache vs. SCIMA at 64K and 512K, each bar split into CPU busy time, latency stall, and throughput stall. Two settings per benchmark: Throughput ratio = 2:1, Latency = 40; and, along the future technology trend, Throughput ratio = 8:1, Latency = 160.]

Figure annotations:
  – latency-stall reduction by burst transfer
  – throughput-stall reduction by software controllability
  – latency & throughput-stall reduction by stride transfer

  – Latency/throughput stall is reduced for a wide variety of data access
  – SCIMA is robust to the large throughput ratio and long memory access latency caused by the current technology trend of the CPU-memory speed gap

[Figure: data placement by access characteristics. Axes: consecutive-ness (consecutive, stride, irregular) and reusability (reusable, not-reusable). Placements shown: use cache; use On-Chip Mem. as a stream buffer; reserve On-Chip Mem. for reused data.]
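A placement decision of this kind could be written as a small lookup. The mapping below is an assumption: the slide text does not preserve which cell holds which placement, so this sketch uses one plausible reading (reused data is kept on chip, regular non-reused streams go through a stream buffer, and irregular non-reused accesses are left to the cache). The function name `choose_placement` is hypothetical.

```python
PATTERNS = ("consecutive", "stride", "irregular")


def choose_placement(pattern, reusable):
    """Return a placement label for one array (assumed mapping, see lead-in)."""
    if pattern not in PATTERNS:
        raise ValueError(f"unknown access pattern: {pattern!r}")
    if reusable:
        # Assumption: reused data is pinned on chip regardless of pattern.
        return "reserve On-Chip Mem. for reused data"
    if pattern in ("consecutive", "stride"):
        # Assumption: regular, non-reused data is streamed through on-chip memory.
        return "use On-Chip Mem. as a stream buffer"
    # Assumption: irregular, non-reused data is left to the conventional cache.
    return "use cache"


placements = {(p, r): choose_placement(p, r)
              for p in PATTERNS for r in (True, False)}
```

With this reading, the six cells contain three "reserve", two "stream buffer", and one "use cache" entry, which matches the number of times each label appears in the figure.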

Throughput Ratio = Ratio between on-chip and off-chip memory throughput
Latency = Memory access latency for off-chip memory (latency for the first data)
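To see how these two parameters feed into latency stall and throughput stall, a deliberately simple cost model can help. It is a sketch under assumptions that do not come from the slides: every off-chip transfer pays the full latency once, each word moved costs `throughput_ratio` cycles, and nothing overlaps with computation. The function `stalls` and its parameters are hypothetical, and the numbers are illustrative rather than measured results.

```python
import math


def stalls(useful_words, words_per_transfer, latency, throughput_ratio,
           useful_per_transfer=None):
    """Return (latency_stall, throughput_stall) in cycles for a toy model.

    useful_per_transfer: how many words of each transfer the program
    actually uses (defaults to all of them).
    """
    if useful_per_transfer is None:
        useful_per_transfer = words_per_transfer
    transfers = math.ceil(useful_words / useful_per_transfer)
    words_moved = transfers * words_per_transfer
    latency_stall = transfers * latency           # one latency per transfer
    throughput_stall = words_moved * throughput_ratio
    return latency_stall, throughput_stall


# Strided access to 1024 words with Latency = 40, Throughput ratio = 2:1.
# Cache: assumed 4-word lines, only 1 word of each line is useful.
cache = stalls(1024, 4, 40, 2, useful_per_transfer=1)
# Stride page-load: assumed 256 useful words gathered per transfer.
scima = stalls(1024, 256, 40, 2)
```

In this model the strided page-load needs far fewer transfers (lower latency stall) and moves no unused words (lower throughput stall), which is consistent with the direction of the arrows in the feature table.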