HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
PIs: Alvin M. Despain and Jean-Luc Gaudiot
University of Southern California
http://www-pdpc.usc.edu
May 2001
Outline
• HiDISC Project Description
• Experiments and Accomplishments
• Work in Progress
• Summary
HiDISC: Hierarchical Decoupled Instruction Set Computer
[Figure: HiDISC system concept. Sensor inputs drive applications (FLIR, SAR, video, ATR/SLD, scientific); a decoupling compiler maps the application onto the HiDISC processor, which places a processor at each level of the memory hierarchy (registers, cache, memory); a dynamic database supports situational awareness.]
Technological Trend: Memory latency is getting longer relative to microprocessor speed (40% per year)
Problem: Some SPEC benchmarks spend more than half of their time stalling
Domain: benchmarks with large data sets: symbolic, signal processing and scientific programs
Present Solutions: Multithreading (Homogeneous), Larger Caches, Prefetching, Software Multithreading
Present Solutions
Solutions and their limitations:
• Larger caches: slow; work well only if the working set fits in the cache and there is temporal locality
• Hardware prefetching: cannot be tailored for each application; behavior is based on past and present execution-time behavior
• Software prefetching: prefetch overhead must not outweigh the benefit, which forces conservative prefetching; adaptive prefetching is required to change the prefetch distance at run time; hard to insert prefetches for irregular access patterns (a sketch of fixed-distance software prefetching follows this list)
• Multithreading: solves the throughput problem, not the memory latency problem
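As a concrete illustration of the software-prefetching limitation, here is a minimal C sketch of fixed-distance prefetching using GCC's __builtin_prefetch; the function name, array, and distance D are illustrative assumptions, not taken from the slides. The distance must be tuned to the memory latency, which is why adaptive schemes adjust it at run time.

    /* Fixed-distance software prefetching over a regular array (illustrative). */
    #define D 16                                     /* prefetch distance, in elements */

    double sum_with_prefetch(const double *a, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; ++i) {
            if (i + D < n)
                __builtin_prefetch(&a[i + D], 0, 1); /* read access, low temporal locality */
            s += a[i];
        }
        return s;
    }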
The HiDISC Approach
Observation:
• Software prefetching impacts compute performance
• PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching
Approach:
• Add a processor to manage prefetching, hiding the prefetch overhead
• The compiler explicitly manages the memory hierarchy
• The prefetch distance adapts to the program's runtime behavior
What is HiDISC?
A dedicated processor for each level of the memory hierarchy
Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
Hide memory latency by converting data access predictability to data access locality (Just in Time Fetch)
Exploit instruction-level parallelism without extensive scheduling hardware
Zero overhead prefetches for maximal computation throughput
[Figure: HiDISC block diagram. The compiler splits the program into three instruction streams: computation instructions for the Computation Processor (CP), access instructions for the Access Processor (AP), and cache management instructions for the Cache Management Processor (CMP). The CP operates on the registers, the AP on the cache, and the CMP on the 2nd-level cache and main memory.]
Decoupled Architectures
[Figure: side-by-side comparison of four architectures, each built on registers, a cache, and a 2nd-level cache plus main memory. MIPS (conventional): a single 8-issue processor. DEAP (decoupled): a 5-issue Computation Processor (CP) with a 3-issue Access Processor (AP). CAPP: a 5-issue Access Processor (AP) with a 3-issue Cache Management Processor (CMP). HiDISC (new decoupled): a 2-issue CP, a 3-issue AP, and a 3-issue CMP, communicating through the LQ, SDQ, SAQ, and SCQ queues.]
DEAP: [Kurian, Hulina, & Coraor '94]
PIPE: [Goodman '85]
Other decoupled processors: ACRI, ZS-1, WM
Slip Control Queue
The Slip Control Queue (SCQ) adapts dynamically:
• Late prefetches: prefetched data arrived after the load had been issued
• Useful prefetches: prefetched data arrived before the load had been issued

    if (prefetch_buffer_full())
        Don't change size of SCQ;
    else if ((2 * late_prefetches) > useful_prefetches)
        Increase size of SCQ;
    else
        Decrease size of SCQ;
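A minimal C rendering of the SCQ sizing policy above; the struct fields and function name are illustrative, and in the real design these counters would come from the prefetch hardware rather than software.

    /* Sketch of the adaptive SCQ policy (illustrative, not the HiDISC hardware). */
    typedef struct {
        int scq_size;               /* current slip (prefetch) distance         */
        int late_prefetches;        /* data arrived after the load had issued   */
        int useful_prefetches;      /* data arrived before the load had issued  */
        int prefetch_buffer_full;   /* non-zero if the prefetch buffer is full  */
    } scq_state;

    void adapt_scq(scq_state *s)
    {
        if (s->prefetch_buffer_full)
            return;                                        /* don't change the SCQ size */
        if (2 * s->late_prefetches > s->useful_prefetches)
            s->scq_size++;                                 /* slip further ahead */
        else
            s->scq_size--;                                 /* pull back */
    }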
Decoupling Programs for HiDISC (Discrete Convolution, Inner Loop)

Inner loop convolution (original C):
    for (j = 0; j < i; ++j)
        y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor code:
    while (not EOD)
        y = y + (x * h);
    send y to SDQ

Access Processor code:
    for (j = 0; j < i; ++j) {
        load (x[j]);
        load (h[i-j-1]);
        GET_SCQ;
    }
    send (EOD token)
    send address of y[i] to SAQ

Cache Management code:
    for (j = 0; j < i; ++j) {
        prefetch (x[j]);
        prefetch (h[i-j-1]);
        PUT_SCQ;
    }

SAQ: Store Address Queue, SDQ: Store Data Queue, SCQ: Slip Control Queue, EOD: End of Data
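For reference, a self-contained C sketch of the sequential discrete convolution that the inner loop above is taken from; the function signature and the zero-initialization of y here are illustrative assumptions.

    /* Sequential discrete convolution (illustrative wrapper around the inner loop above). */
    void convolve(const double *x, const double *h, double *y, int n)
    {
        for (int i = 0; i < n; ++i) {
            y[i] = 0.0;
            for (int j = 0; j < i; ++j)
                y[i] = y[i] + (x[j] * h[i - j - 1]);
        }
    }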
Where We Were
HiDISC compiler:
• Frontend selection (Gcc)
• Single thread running without conditionals
Hand compiling of benchmarks:
• Livermore Loops, Tomcatv, MXM, Cholsky, Vpenta, and Qsort
Benchmarks
Benchmark   Source of Benchmark               Lines of Source Code   Description                                   Data Set Size
LLL1        Livermore Loops [45]              20                     1024-element arrays, 100 iterations           24 KB
LLL2        Livermore Loops                   24                     1024-element arrays, 100 iterations           16 KB
LLL3        Livermore Loops                   18                     1024-element arrays, 100 iterations           16 KB
LLL4        Livermore Loops                   25                     1024-element arrays, 100 iterations           16 KB
LLL5        Livermore Loops                   17                     1024-element arrays, 100 iterations           24 KB
Tomcatv     SPECfp95 [68]                     190                    33x33-element matrices, 5 iterations          <64 KB
MXM         NAS kernels [5]                   113                    Unrolled matrix multiply, 2 iterations        448 KB
CHOLSKY     NAS kernels                       156                    Cholsky matrix decomposition                  724 KB
VPENTA      NAS kernels                       199                    Invert three pentadiagonals simultaneously    128 KB
Qsort       Quicksort sorting algorithm [14]  58                     Quicksort                                     128 KB
Simulation Parameters

Parameter                 Value                      Parameter                  Value
L1 cache size             4 KB                       L2 cache size              16 KB
L1 cache associativity    2                          L2 cache associativity     2
L1 cache block size       32 B                       L2 cache block size        32 B
Memory latency            Variable (0-200 cycles)    Memory contention time     Variable
Victim cache size         32 entries                 Prefetch buffer size       8 entries
Load queue size           128                        Store address queue size   128
Store data queue size     128                        Total issue width          8
Simulation Results
[Figure: performance versus main memory latency (0 to 200 cycles) for LLL3, Tomcatv, Cholsky, and Vpenta, each comparing MIPS, DEAP, CAPP, and HiDISC.]
Accomplishments
2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
7.4x speedup for matrix multiply (MXM) over an in-order issue superscalar processor (similar operations are used in ATR/SLD)
2.6x speedup for matrix decomposition/substitution (Cholsky) over an in-order issue superscalar processor
Reduced memory latency for systems that have high memory bandwidths (e.g. PIMs, RAMBUS)
Allows the compiler to solve indexing functions for irregular applications
Reduced system cost for high-throughput scientific codes
Work in Progress
• Silicon Space for VLSI Layout
• Compiler Progress
• Simulator Integration
• Hand-compiling of Benchmarks
• Architectural Enhancement for Data Intensive Applications
VLSI Layout Overhead
Goal: evaluate the layout effectiveness of the HiDISC architecture (cache has become a major portion of the chip area)
Methodology: extrapolate the HiDISC VLSI layout from the MIPS R10000 processor (0.35 μm, 1996)
Result: the space overhead is 11.3% over a comparable MIPS processor
VLSI Layout Overhead
Component                          Original MIPS R10K (0.35 μm)   Extrapolation (0.15 μm)   HiDISC (0.15 μm)
D-Cache (32 KB)                    26 mm²                         6.5 mm²                   6.5 mm²
I-Cache (32 KB)                    28 mm²                         7 mm²                     14 mm²
TLB Part                           10 mm²                         2.5 mm²                   2.5 mm²
External Interface Unit            27 mm²                         6.8 mm²                   6.8 mm²
Instruction Fetch Unit and BTB     18 mm²                         4.5 mm²                   13.5 mm²
Instruction Decode Section         21 mm²                         5.3 mm²                   5.3 mm²
Instruction Queue                  28 mm²                         7 mm²                     0 mm²
Reorder Buffer                     17 mm²                         4.3 mm²                   0 mm²
Integer Functional Unit            20 mm²                         5 mm²                     15 mm²
FP Functional Units                24 mm²                         6 mm²                     6 mm²
Clocking & Overhead                73 mm²                         18.3 mm²                  18.3 mm²
Total Size without L2 Cache        292 mm²                        73.2 mm²                  87.9 mm²
Total Size with on-chip L2 Cache                                  129.2 mm²                 143.9 mm²
Compiler Progress
Preprocessor:
• Support for library calls (GNU pre-processor)
• Support for special compiler directives (#include, #define)
Nested loops:
• Nested loops without data dependencies (for and while)
• Support for conditional statements where the index variable of an inner loop is not a function of some outer loop computation
Conditional statements:
• The CP performs all computations
• The condition needs to be moved to the AP
Nested Loops (Assembly)

High-level code (C):
    for (i=0;i<10;++i)
        for(k=0;k<10;++k)
            ++j;

Assembly code:
    # 6     for(i=0;i<10;++i)
            sw $0, 12($sp)
    $32:
    # 7     for(k=0;k<10;++k)
            sw $0, 4($sp)
    $33:
    # 8     ++j;
            lw $15, 8($sp)
            addu $24, $15, 1
            sw $24, 8($sp)
            lw $25, 4($sp)
            addu $8, $25, 1
            sw $8, 4($sp)
            blt $8, 10, $33
            lw $9, 12($sp)
            addu $10, $9, 1
            sw $10, 12($sp)
            blt $10, 10, $32
Nested Loops
CP stream:
    # 6     for(i=0;i<10;++i)
    $32:
            b_eod loop_i
    # 7     for(k=0;k<10;++k)
    $33:
            b_eod loop_k
    # 8     ++j;
            addu $15, LQ, 1
            sw $15, SDQ
            b $33
    loop_k:
            b $32
    loop_i:

AP stream:
    # 6     for(i=0;i<10;++i)
            sw $0, 12($sp)
    $32:
    # 7     for(k=0;k<10;++k)
            sw $0, 4($sp)
    $33:
    # 8     ++j;
            lw LQ, 8($sp)
            get SCQ
            sw 8($sp), SAQ
            lw $25, 4($sp)
            addu $8, $25, 1
            sw $8, 4($sp)
            blt $8, 10, $33
            s_eod
            lw $9, 12($sp)
            addu $10, $9, 1
            sw $10, 12($sp)
            blt $10, 10, $32
            s_eod

CMP stream:
    # 6     for(i=0;i<10;++i)
            sw $0, 12($sp)
    $32:
    # 7     for(k=0;k<10;++k)
            sw $0, 4($sp)
    $33:
    # 8     ++j;
            pref 8($sp)
            put SCQ
            lw $25, 4($sp)
            addu $8, $25, 1
            sw $8, 4($sp)
            blt $8, 10, $33
            lw $9, 12($sp)
            addu $10, $9, 1
            sw $10, 12($sp)
            blt $10, 10, $32
HiDISC Compiler
Gcc backend:
• Use input from the parsing phase to perform loop optimizations
• Extend the compiler to generate MIPS4 code
• Handle loops with dependencies: nested loops where the computation depends on the indices of more than one loop, e.g. X(i,j) = i*Y(j,i), where i and j are index variables and j is a function of i (see the sketch below)
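A minimal C sketch of the kind of dependent loop nest described above, where the inner index range and the computation both depend on the outer index; the function and array names are hypothetical, not taken from the HiDISC benchmarks.

    /* Dependent nested loop: j runs as a function of i, and X(i,j) = i*Y(j,i). */
    void dependent_nest(int n, double X[n][n], const double Y[n][n])
    {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j <= i; ++j)       /* inner bound depends on outer index i */
                X[i][j] = i * Y[j][i];
    }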
HiDISC Stream Separator
[Compiler flow: Sequential Source → Program Flow Graph → Classify Address Registers → Allocate Instructions to Streams, producing a Computation Stream and an Access Stream. Each stream then passes through: Fix Conditional Statements → Move Queue Accesses into Instructions → Move Loop Invariants out of the Loop → Add Slip Control Queue Instructions → Produce Assembly Code, yielding the Computation Assembly Code and the Access Assembly Code. A further pass (Substitute Prefetches for Loads, Remove Global Stores, and Reverse SCQ Direction) produces the Cache Management Assembly Code, and global data communication and synchronization are added. The figure marks which stages are previous work and which are current work.]
Simulator Integration
• Based on the MIPS RISC pipeline architecture (dsim)
• Supports the MIPS1 and MIPS2 ISAs
• Supports dynamic linking for shared libraries (the shared library is loaded in the simulator)
• Hand compiling:
  – Use SGI cc or gcc as the front-end (cc -mips2 -S compiles each .c file to .s)
  – Split the code into 3 streams, modify the three .s files into HiDISC assembly, and convert the resulting .hs files back to .s (hs2s), giving .cp.s, .ap.s, and .cmp.s
  – Use SGI cc as the back-end (cc -mips2 -o) to produce the .cp, .ap, and .cmp executables plus the shared library, which run on dsim
DIS Benchmarks
Atlantic Aerospace DIS benchmark suite:
• Application-oriented benchmarks
• Many defense applications employ large data sets with non-contiguous memory access and no temporal locality
• Too large for hand-compiling; wait until the compiler is ready
• Requires a linker that can handle multiple object files
Atlantic Aerospace Stressmark suite:
• Smaller and more specific procedures
• Seven individual data-intensive benchmarks
• Directly illustrate particular elements of the DIS problem
Stressmark Suite
Stressmark          Problem                                                                     Memory Access
Pointer             Pointer following                                                           Small blocks at unpredictable locations; can be parallelized
Update              Pointer following with memory update                                        Small blocks at unpredictable locations
Matrix              Conjugate gradient simultaneous equation solver                             Dependent on matrix representation; likely to be irregular or mixed, with mixed levels of reuse
Neighborhood        Calculate image texture measures by finding sum and difference histograms   Regular access to pairs of words at arbitrary distances
Field               Collect statistics on a large field of words                                Regular, with little reuse
Corner-Turn         Matrix transposition                                                        Block movement between processing nodes with practically nil computation
Transitive Closure  Find the all-pairs shortest-path solution for a directed graph              Dependent on matrix representation, but requires reads and writes to different matrices concurrently

* DIS Stressmark Suite Version 1.0, Atlantic Aerospace Division
Example of Stressmarks
Pointer Stressmark:
• Basic idea: repeatedly follow pointers to randomized locations in memory (a sketch follows below)
• The memory access pattern is unpredictable
• The randomized access pattern offers insufficient temporal and spatial locality for conventional cache architectures
• The HiDISC architecture provides lower memory access latency
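A minimal C sketch of the pointer-following access pattern (illustrative only, not the DIS reference implementation): each load address is computed from the value just loaded, so the chain of accesses hops to unpredictable locations and defeats conventional caching.

    /* Pointer-chasing kernel: follow 'hops' links through a field of w words.
       Assumes w > 0 and that field holds at least w elements. */
    unsigned int follow(const unsigned int *field, unsigned int w,
                        unsigned int start, unsigned int hops)
    {
        unsigned int index = start % w;
        while (hops--)
            index = field[index] % w;          /* next location comes from the data itself */
        return index;
    }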
Decoupling of Pointer Stressmarks
Inner loop for the next indexing (original C):
    for (i=j+1; i<w; i++) {
        if (field[index+i] > partition) balance++;
    }
    if (balance+high == w/2) break;
    else if (balance+high > w/2) {
        min = partition;
    } else {
        max = partition;
        high++;
    }

Computation Processor code:
    while (not EOD)
        if (field > partition) balance++;
    if (balance+high == w/2) break;
    else if (balance+high > w/2) {
        min = partition;
    } else {
        max = partition;
        high++;
    }

Access Processor code:
    for (i=j+1; i<w; i++) {
        load (field[index+i]);
        GET_SCQ;
    }
    send (EOD token)

Cache Management code:
    for (i=j+1; i<w; i++) {
        prefetch (field[index+i]);
        PUT_SCQ;
    }
Stressmarks
Hand-compile the 7 individual benchmarks:
• Use gcc as the front-end
• Manually partition each of the three instruction streams and insert synchronizing instructions
Evaluate architectural trade-offs:
• Updated simulator characteristics such as out-of-order issue
• Large L2 cache and enhanced main memory systems such as Rambus and DDR
Architectural Enhancement for Data Intensive Applications
Enhanced memory systems such as RAMBUS DRAM and DDR DRAM:
• Provide high memory bandwidth
• Latency does not improve significantly
The decoupled access processor can fully utilize the enhanced memory bandwidth:
• More requests are issued by the access processor
• The prefetching mechanism hides memory access latency
Flexi-DISC
Fundamental characteristics:
• Inherently highly dynamic at execution time
• Dynamically reconfigurable central computation kernel (CK)
• Multiple levels of caching and processing around the CK, with adjustable prefetching
• Multiple processors on a chip, providing flexible adaptation from multiple to single processors and horizontal sharing of the existing resources
[Figure: the Computation Kernel surrounded by a Low Level Cache Access Ring and a Memory Interface Ring]
Flexi-DISC
• Partitioning of the Computation Kernel: it can be allocated to different portions of the application or to different applications
• The CK requires separation of the next ring to feed it with data
• The variety of target applications makes the memory accesses unpredictable
• Identical processing units for the outer rings: highly efficient dynamic partitioning of the resources and their run-time allocation can be achieved
[Figure: the Computation Kernel, Low Level Cache Access Ring, and Memory Interface Ring partitioned among Application 1, Application 2, and Application 3]
Summary
Gcc backend:
• Use the parsing tree to extract the load instructions
• Handle loops with dependencies where the index variable of an inner loop is not a function of some outer loop computation
A robust compiler is being designed for experimenting with and analyzing additional benchmarks; it will eventually be extended to the DIS benchmarks
Additional architectural enhancements have been introduced to make HiDISC amenable to the DIS benchmarks
HiDISC: Hierarchical Decoupled Instruction Set Computer
New Ideas
• A dedicated processor for each level of the memory hierarchy
• Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
• Hide memory latency by converting data access predictability to data access locality
• Exploit instruction-level parallelism without extensive scheduling hardware
• Zero overhead prefetches for maximal computation throughput
Impact
• 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
• 7.4x speedup for matrix multiply over an in-order issue superscalar processor
• 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
• Reduced memory latency for systems that have high memory bandwidths (e.g. PIMs, RAMBUS)
• Allows the compiler to solve indexing functions for irregular applications
• Reduced system cost for high-throughput scientific codes
Schedule
Start: April 2001
• Defined benchmarks
• Completed simulator
• Performed instruction-level simulations on hand-compiled benchmarks
• Continue simulations of more benchmarks (SAR)
• Define HiDISC architecture
• Benchmark results
• Update simulator
• Develop and test a full decoupling compiler
• Generate performance statistics and evaluate design
[Figure: HiDISC system diagram, as on the earlier "HiDISC: Hierarchical Decoupled Instruction Set Computer" slide]