HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
PIs: Alvin M. Despain and Jean-Luc Gaudiot
University of Southern California
http://www-pdpc.usc.edu
May 2001
Outline
• HiDISC Project Description
• Experiments and Accomplishments
• Work in Progress
• Summary
HiDISC: Hierarchical Decoupled Instruction Set Computer
[Figure: HiDISC system concept. Sensor inputs drive applications (FLIR, SAR, video, ATR/SLD, scientific); a decoupling compiler maps the application onto the HiDISC processor, which places a processor at each level of the memory hierarchy (registers, cache, memory); a dynamic database supports situational awareness.]
Technological Trend: Memory latency is getting longer relative to microprocessor speed (40% per year)
Problem: Some SPEC benchmarks spend more than half of their time stalling
Domain: benchmarks with large data sets: symbolic, signal processing and scientific programs
Present Solutions: Multithreading (Homogeneous), Larger Caches, Prefetching, Software Multithreading
Present Solutions
Solutions and their limitations:
• Larger caches: slow; work well only if the working set fits in the cache and there is temporal locality
• Hardware prefetching: cannot be tailored for each application; behavior is based on past and present execution-time behavior
• Software prefetching: prefetch overhead must not outweigh the benefit, which forces conservative prefetching; adaptive prefetching is required to change the prefetch distance at run time; hard to insert prefetches for irregular access patterns (a sketch of fixed-distance software prefetching follows this list)
• Multithreading: solves the throughput problem, not the memory latency problem
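As a concrete illustration of the software-prefetching limitation, here is a minimal C sketch of fixed-distance prefetching using GCC's __builtin_prefetch; the function name, array, and distance D are illustrative assumptions, not taken from the slides. The distance must be tuned to the memory latency, which is why adaptive schemes adjust it at run time.

    /* Fixed-distance software prefetching over a regular array (illustrative). */
    #define D 16                                     /* prefetch distance, in elements */

    double sum_with_prefetch(const double *a, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; ++i) {
            if (i + D < n)
                __builtin_prefetch(&a[i + D], 0, 1); /* read access, low temporal locality */
            s += a[i];
        }
        return s;
    }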
The HiDISC Approach
Observation:
• Software prefetching impacts compute performance
• PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching
Approach:
• Add a processor to manage prefetching, hiding the prefetch overhead
• The compiler explicitly manages the memory hierarchy
• The prefetch distance adapts to the program's runtime behavior
What is HiDISC?
A dedicated processor for each level of the memory hierarchy
Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
Hide memory latency by converting data access predictability to data access locality (Just in Time Fetch)
Exploit instruction-level parallelism without extensive scheduling hardware
Zero overhead prefetches for maximal computation throughput
[Figure: HiDISC block diagram. The compiler splits the program into three instruction streams: computation instructions for the Computation Processor (CP), access instructions for the Access Processor (AP), and cache management instructions for the Cache Management Processor (CMP). The CP operates on the registers, the AP on the cache, and the CMP on the 2nd-level cache and main memory.]
Decoupled Architectures
[Figure: side-by-side comparison of four architectures, each built on registers, a cache, and a 2nd-level cache plus main memory. MIPS (conventional): a single 8-issue processor. DEAP (decoupled): a 5-issue Computation Processor (CP) with a 3-issue Access Processor (AP). CAPP: a 5-issue Access Processor (AP) with a 3-issue Cache Management Processor (CMP). HiDISC (new decoupled): a 2-issue CP, a 3-issue AP, and a 3-issue CMP, communicating through the LQ, SDQ, SAQ, and SCQ queues.]
DEAP: [Kurian, Hulina, & Coraor '94]
PIPE: [Goodman '85]
Other decoupled processors: ACRI, ZS-1, WM
Slip Control Queue
The Slip Control Queue (SCQ) adapts dynamically:
• Late prefetches: prefetched data arrived after the load had been issued
• Useful prefetches: prefetched data arrived before the load had been issued

    if (prefetch_buffer_full())
        Don't change size of SCQ;
    else if ((2 * late_prefetches) > useful_prefetches)
        Increase size of SCQ;
    else
        Decrease size of SCQ;
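A minimal C rendering of the SCQ sizing policy above; the struct fields and function name are illustrative, and in the real design these counters would come from the prefetch hardware rather than software.

    /* Sketch of the adaptive SCQ policy (illustrative, not the HiDISC hardware). */
    typedef struct {
        int scq_size;               /* current slip (prefetch) distance         */
        int late_prefetches;        /* data arrived after the load had issued   */
        int useful_prefetches;      /* data arrived before the load had issued  */
        int prefetch_buffer_full;   /* non-zero if the prefetch buffer is full  */
    } scq_state;

    void adapt_scq(scq_state *s)
    {
        if (s->prefetch_buffer_full)
            return;                                        /* don't change the SCQ size */
        if (2 * s->late_prefetches > s->useful_prefetches)
            s->scq_size++;                                 /* slip further ahead */
        else
            s->scq_size--;                                 /* pull back */
    }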
Decoupling Programs for HiDISC (Discrete Convolution, Inner Loop)

Inner loop convolution (original C):
    for (j = 0; j < i; ++j)
        y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor code:
    while (not EOD)
        y = y + (x * h);
    send y to SDQ

Access Processor code:
    for (j = 0; j < i; ++j) {
        load (x[j]);
        load (h[i-j-1]);
        GET_SCQ;
    }
    send (EOD token)
    send address of y[i] to SAQ

Cache Management code:
    for (j = 0; j < i; ++j) {
        prefetch (x[j]);
        prefetch (h[i-j-1]);
        PUT_SCQ;
    }

SAQ: Store Address Queue, SDQ: Store Data Queue, SCQ: Slip Control Queue, EOD: End of Data
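For reference, a self-contained C sketch of the sequential discrete convolution that the inner loop above is taken from; the function signature and the zero-initialization of y here are illustrative assumptions.

    /* Sequential discrete convolution (illustrative wrapper around the inner loop above). */
    void convolve(const double *x, const double *h, double *y, int n)
    {
        for (int i = 0; i < n; ++i) {
            y[i] = 0.0;
            for (int j = 0; j < i; ++j)
                y[i] = y[i] + (x[j] * h[i - j - 1]);
        }
    }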
Where We Were
HiDISC compiler:
• Frontend selection (Gcc)
• Single thread running without conditionals
Hand compiling of benchmarks:
• Livermore Loops, Tomcatv, MXM, Cholsky, Vpenta, and Qsort
Benchmarks
Benchmark   Source of Benchmark               Lines of Source Code   Description                                   Data Set Size
LLL1        Livermore Loops [45]              20                     1024-element arrays, 100 iterations           24 KB
LLL2        Livermore Loops                   24                     1024-element arrays, 100 iterations           16 KB
LLL3        Livermore Loops                   18                     1024-element arrays, 100 iterations           16 KB
LLL4        Livermore Loops                   25                     1024-element arrays, 100 iterations           16 KB
LLL5        Livermore Loops                   17                     1024-element arrays, 100 iterations           24 KB
Tomcatv     SPECfp95 [68]                     190                    33x33-element matrices, 5 iterations          <64 KB
MXM         NAS kernels [5]                   113                    Unrolled matrix multiply, 2 iterations        448 KB
CHOLSKY     NAS kernels                       156                    Cholsky matrix decomposition                  724 KB
VPENTA      NAS kernels                       199                    Invert three pentadiagonals simultaneously    128 KB
Qsort       Quicksort sorting algorithm [14]  58                     Quicksort                                     128 KB
Simulation Parameters

Parameter                 Value                      Parameter                  Value
L1 cache size             4 KB                       L2 cache size              16 KB
L1 cache associativity    2                          L2 cache associativity     2
L1 cache block size       32 B                       L2 cache block size        32 B
Memory latency            Variable (0-200 cycles)    Memory contention time     Variable
Victim cache size         32 entries                 Prefetch buffer size       8 entries
Load queue size           128                        Store address queue size   128
Store data queue size     128                        Total issue width          8
Simulation Results
[Figure: performance versus main memory latency (0 to 200 cycles) for LLL3, Tomcatv, Cholsky, and Vpenta, each comparing MIPS, DEAP, CAPP, and HiDISC.]
Accomplishments
2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
7.4x speedup for matrix multiply (MXM) over an in-order issue superscalar processor (similar operations are used in ATR/SLD)
2.6x speedup for matrix decomposition/substitution (Cholsky) over an in-order issue superscalar processor
Reduced memory latency for systems that have high memory bandwidths (e.g. PIMs, RAMBUS)
Allows the compiler to solve indexing functions for irregular applications
Reduced system cost for high-throughput scientific codes
Work in Progress
• Silicon Space for VLSI Layout
• Compiler Progress
• Simulator Integration
• Hand-compiling of Benchmarks
• Architectural Enhancement for Data Intensive Applications
VLSI Layout Overhead
Goal: evaluate the layout effectiveness of the HiDISC architecture (cache has become a major portion of the chip area)
Methodology: extrapolate the HiDISC VLSI layout from the MIPS R10000 processor (0.35 μm, 1996)
Result: the space overhead is 11.3% over a comparable MIPS processor
VLSI Layout Overhead
Component                          Original MIPS R10K (0.35 μm)   Extrapolation (0.15 μm)   HiDISC (0.15 μm)
D-Cache (32 KB)                    26 mm²                         6.5 mm²                   6.5 mm²
I-Cache (32 KB)                    28 mm²                         7 mm²                     14 mm²
TLB Part                           10 mm²                         2.5 mm²                   2.5 mm²
External Interface Unit            27 mm²                         6.8 mm²                   6.8 mm²
Instruction Fetch Unit and BTB     18 mm²                         4.5 mm²                   13.5 mm²
Instruction Decode Section         21 mm²                         5.3 mm²                   5.3 mm²
Instruction Queue                  28 mm²                         7 mm²                     0 mm²
Reorder Buffer                     17 mm²                         4.3 mm²                   0 mm²
Integer Functional Unit            20 mm²                         5 mm²                     15 mm²
FP Functional Units                24 mm²                         6 mm²                     6 mm²
Clocking & Overhead                73 mm²                         18.3 mm²                  18.3 mm²
Total Size without L2 Cache        292 mm²                        73.2 mm²                  87.9 mm²
Total Size with on-chip L2 Cache                                  129.2 mm²                 143.9 mm²
Compiler Progress
Preprocessor:
• Support for library calls (GNU pre-processor)
• Support for special compiler directives (#include, #define)
Nested loops:
• Nested loops without data dependencies (for and while)
• Support for conditional statements where the index variable of an inner loop is not a function of some outer loop computation
Conditional statements:
• The CP performs all computations
• The condition needs to be moved to the AP
Nested Loops (Assembly)

High-level code (C):
    for (i=0;i<10;++i)
        for(k=0;k<10;++k)
            ++j;

Assembly code:
    # 6     for(i=0;i<10;++i)
            sw $0, 12($sp)
    $32:
    # 7     for(k=0;k<10;++k)
            sw $0, 4($sp)
    $33:
    # 8     ++j;
            lw $15, 8($sp)
            addu $24, $15, 1
            sw $24, 8($sp)
            lw $25, 4($sp)
            addu $8, $25, 1
            sw $8, 4($sp)
            blt $8, 10, $33
            lw $9, 12($sp)
            addu $10, $9, 1
            sw $10, 12($sp)
            blt $10, 10, $32
Nested Loops
CP stream:
    # 6     for(i=0;i<10;++i)
    $32:
            b_eod loop_i
    # 7     for(k=0;k<10;++k)
    $33:
            b_eod loop_k
    # 8     ++j;
            addu $15, LQ, 1
            sw $15, SDQ
            b $33
    loop_k:
            b $32
    loop_i:

AP stream:
    # 6     for(i=0;i<10;++i)
            sw $0, 12($sp)
    $32:
    # 7     for(k=0;k<10;++k)
            sw $0, 4($sp)
    $33:
    # 8     ++j;
            lw LQ, 8($sp)
            get SCQ
            sw 8($sp), SAQ
            lw $25, 4($sp)
            addu $8, $25, 1
            sw $8, 4($sp)
            blt $8, 10, $33
            s_eod
            lw $9, 12($sp)
            addu $10, $9, 1
            sw $10, 12($sp)
            blt $10, 10, $32
            s_eod

CMP stream:
    # 6     for(i=0;i<10;++i)
            sw $0, 12($sp)
    $32:
    # 7     for(k=0;k<10;++k)
            sw $0, 4($sp)
    $33:
    # 8     ++j;
            pref 8($sp)
            put SCQ
            lw $25, 4($sp)
            addu $8, $25, 1
            sw $8, 4($sp)
            blt $8, 10, $33
            lw $9, 12($sp)
            addu $10, $9, 1
            sw $10, 12($sp)
            blt $10, 10, $32
HiDISC Compiler
Gcc backend:
• Use input from the parsing phase to perform loop optimizations
• Extend the compiler to generate MIPS4 code
• Handle loops with dependencies: nested loops where the computation depends on the indices of more than one loop, e.g. X(i,j) = i*Y(j,i), where i and j are index variables and j is a function of i (see the sketch below)
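A minimal C sketch of the kind of dependent loop nest described above, where the inner index range and the computation both depend on the outer index; the function and array names are hypothetical, not taken from the HiDISC benchmarks.

    /* Dependent nested loop: j runs as a function of i, and X(i,j) = i*Y(j,i). */
    void dependent_nest(int n, double X[n][n], const double Y[n][n])
    {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j <= i; ++j)       /* inner bound depends on outer index i */
                X[i][j] = i * Y[j][i];
    }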
HiDISC Stream Separator
[Compiler flow: Sequential Source → Program Flow Graph → Classify Address Registers → Allocate Instructions to Streams, producing a Computation Stream and an Access Stream. Each stream then passes through: Fix Conditional Statements → Move Queue Accesses into Instructions → Move Loop Invariants out of the Loop → Add Slip Control Queue Instructions → Produce Assembly Code, yielding the Computation Assembly Code and the Access Assembly Code. A further pass (Substitute Prefetches for Loads, Remove Global Stores, and Reverse SCQ Direction) produces the Cache Management Assembly Code, and global data communication and synchronization are added. The figure marks which stages are previous work and which are current work.]
Simulator Integration
• Based on the MIPS RISC pipeline architecture (dsim)
• Supports the MIPS1 and MIPS2 ISAs
• Supports dynamic linking for shared libraries (the shared library is loaded in the simulator)
• Hand compiling:
  – Use SGI cc or gcc as the front-end (cc -mips2 -S compiles each .c file to .s)
  – Split the code into 3 streams, modify the three .s files into HiDISC assembly, and convert the resulting .hs files back to .s (hs2s), giving .cp.s, .ap.s, and .cmp.s
  – Use SGI cc as the back-end (cc -mips2 -o) to produce the .cp, .ap, and .cmp executables plus the shared library, which run on dsim
DIS Benchmarks
Atlantic Aerospace DIS benchmark suite:
• Application-oriented benchmarks
• Many defense applications employ large data sets with non-contiguous memory access and no temporal locality
• Too large for hand-compiling; wait until the compiler is ready
• Requires a linker that can handle multiple object files
Atlantic Aerospace Stressmark suite:
• Smaller and more specific procedures
• Seven individual data-intensive benchmarks
• Directly illustrate particular elements of the DIS problem
Stressmark Suite
Stressmark          Problem                                                                     Memory Access
Pointer             Pointer following                                                           Small blocks at unpredictable locations; can be parallelized
Update              Pointer following with memory update                                        Small blocks at unpredictable locations
Matrix              Conjugate gradient simultaneous equation solver                             Dependent on matrix representation; likely to be irregular or mixed, with mixed levels of reuse
Neighborhood        Calculate image texture measures by finding sum and difference histograms   Regular access to pairs of words at arbitrary distances
Field               Collect statistics on a large field of words                                Regular, with little reuse
Corner-Turn         Matrix transposition                                                        Block movement between processing nodes with practically nil computation
Transitive Closure  Find the all-pairs shortest-path solution for a directed graph              Dependent on matrix representation, but requires reads and writes to different matrices concurrently

* DIS Stressmark Suite Version 1.0, Atlantic Aerospace Division
Example of Stressmarks
Pointer Stressmark:
• Basic idea: repeatedly follow pointers to randomized locations in memory (a sketch follows below)
• The memory access pattern is unpredictable
• The randomized access pattern offers insufficient temporal and spatial locality for conventional cache architectures
• The HiDISC architecture provides lower memory access latency
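A minimal C sketch of the pointer-following access pattern (illustrative only, not the DIS reference implementation): each load address is computed from the value just loaded, so the chain of accesses hops to unpredictable locations and defeats conventional caching.

    /* Pointer-chasing kernel: follow 'hops' links through a field of w words.
       Assumes w > 0 and that field holds at least w elements. */
    unsigned int follow(const unsigned int *field, unsigned int w,
                        unsigned int start, unsigned int hops)
    {
        unsigned int index = start % w;
        while (hops--)
            index = field[index] % w;          /* next location comes from the data itself */
        return index;
    }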
Decoupling of Pointer Stressmarks
Inner loop for the next indexing (original C):
    for (i=j+1; i<w; i++) {
        if (field[index+i] > partition) balance++;
    }
    if (balance+high == w/2) break;
    else if (balance+high > w/2) {
        min = partition;
    } else {
        max = partition;
        high++;
    }

Computation Processor code:
    while (not EOD)
        if (field > partition) balance++;
    if (balance+high == w/2) break;
    else if (balance+high > w/2) {
        min = partition;
    } else {
        max = partition;
        high++;
    }

Access Processor code:
    for (i=j+1; i<w; i++) {
        load (field[index+i]);
        GET_SCQ;
    }
    send (EOD token)

Cache Management code:
    for (i=j+1; i<w; i++) {
        prefetch (field[index+i]);
        PUT_SCQ;
    }
Stressmarks
Hand-compile the 7 individual benchmarks:
• Use gcc as the front-end
• Manually partition each of the three instruction streams and insert synchronizing instructions
Evaluate architectural trade-offs:
• Updated simulator characteristics such as out-of-order issue
• Large L2 cache and enhanced main memory systems such as Rambus and DDR
Architectural Enhancement for Data Intensive Applications
Enhanced memory systems such as RAMBUS DRAM and DDR DRAM:
• Provide high memory bandwidth
• Latency does not improve significantly
The decoupled access processor can fully utilize the enhanced memory bandwidth:
• More requests are issued by the access processor
• The prefetching mechanism hides memory access latency
Flexi-DISC
Fundamental characteristics:
• Inherently highly dynamic at execution time
• Dynamically reconfigurable central computation kernel (CK)
• Multiple levels of caching and processing around the CK, with adjustable prefetching
• Multiple processors on a chip, providing flexible adaptation from multiple to single processors and horizontal sharing of the existing resources
[Figure: the Computation Kernel surrounded by a Low Level Cache Access Ring and a Memory Interface Ring]
Flexi-DISC
• Partitioning of the Computation Kernel: it can be allocated to different portions of the application or to different applications
• The CK requires separation of the next ring to feed it with data
• The variety of target applications makes the memory accesses unpredictable
• Identical processing units for the outer rings: highly efficient dynamic partitioning of the resources and their run-time allocation can be achieved
[Figure: the Computation Kernel, Low Level Cache Access Ring, and Memory Interface Ring partitioned among Application 1, Application 2, and Application 3]
Summary
Gcc backend:
• Use the parsing tree to extract the load instructions
• Handle loops with dependencies where the index variable of an inner loop is not a function of some outer loop computation
A robust compiler is being designed for experimenting with and analyzing additional benchmarks; it will eventually be extended to the DIS benchmarks
Additional architectural enhancements have been introduced to make HiDISC amenable to the DIS benchmarks
HiDISC: Hierarchical Decoupled Instruction Set Computer
New Ideas
• A dedicated processor for each level of the memory hierarchy
• Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
• Hide memory latency by converting data access predictability to data access locality
• Exploit instruction-level parallelism without extensive scheduling hardware
• Zero overhead prefetches for maximal computation throughput
Impact
• 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
• 7.4x speedup for matrix multiply over an in-order issue superscalar processor
• 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
• Reduced memory latency for systems that have high memory bandwidths (e.g. PIMs, RAMBUS)
• Allows the compiler to solve indexing functions for irregular applications
• Reduced system cost for high-throughput scientific codes
Schedule
Start: April 2001
• Defined benchmarks
• Completed simulator
• Performed instruction-level simulations on hand-compiled benchmarks
• Continue simulations of more benchmarks (SAR)
• Define HiDISC architecture
• Benchmark results
• Update simulator
• Develop and test a full decoupling compiler
• Generate performance statistics and evaluate design
[Figure: HiDISC system diagram, as on the earlier "HiDISC: Hierarchical Decoupled Instruction Set Computer" slide]