SCIMA: Software Controlled Integrated Memory Architecture for HPC
• Background
  – Memory wall problem
  – Conventional cache is not well suited to HPC
    – unwanted line conflicts
    – fixed granularity of Off-Chip Memory access
• Solution: SCIMA (Software Controlled Integrated Memory Architecture)
  – strategy: software controllability
  – addressable On-Chip Memory in addition to the conventional cache
  – On-Chip Memory and cache are reconfigurable
  – explicit data transfer between On-Chip Memory and Off-Chip Memory
    by page-load/page-store instructions
  – burst transfer and stride transfer are supported
Address Space
On-Chip Memory Features               Tb  Tl  Tt
software controllability               -   -   ↓
page-load/page-store (burst)           ↑   ↓   -
page-load/page-store (stride)          ↑   ↓   ↓
scheduling for page-load/page-store    -   ↓   -

Latency Tolerating Techniques of Cache  Tb  Tl  Tt
larger cache line                        -   ↓   ↑
lock-up free cache                       -   ↓   ↑
cache prefetching                        ↑   ↓   ↑
• Schematic View: advantages of SCIMA
[Figure: Overview of SCIMA. On chip: register, ALU, FPU, L1 cache, and On-Chip Memory; off chip: Memory (DRAM) and the Network, reached via the NIA]
Tb: CPU busy time, Tl: Latency stall time, Tt: Throughput stall time
SCIMA: Experimental Results
• SCIMA provides various data placement and utilization schemes according to the characteristics of data access
• Evaluation Results
[Charts: Execution Cycles for FT and QCD, broken down into CPU busy time, latency stall, and throughput stall, comparing Cache vs. SCIMA with 64K and 512K on-chip memory, under two configurations: Throughput ratio = 2:1 with Latency = 40, and, following the future technology trend, Throughput ratio = 8:1 with Latency = 160]
• latency-stall reduction by burst transfer
• throughput-stall reduction by software controllability
• latency- and throughput-stall reduction by stride transfer
• Latency/Throughput stall is reduced for a wide variety of data accesses
• SCIMA is robust to the large throughput ratio and long memory access latency caused by the current technology trend of the growing CPU-memory speed gap
Data placement scheme (consecutive-ness vs. reusability):

                 reusable                                not-reusable
consecutive      reserve On-Chip Mem. for reused data    use On-Chip Mem. as a stream buffer
stride           reserve On-Chip Mem. for reused data    use On-Chip Mem. as a stream buffer
irregular        reserve On-Chip Mem. for reused data    use cache
Throughput Ratio = ratio between on-chip and off-chip memory throughput
Latency = memory access latency for off-chip memory (latency for the first data)