14907_sharc

7/31/2019 14907_sharc


Clare Smtih SHARC Presentation 1

The SHARC

Super Harvard Architecture Computer


The SHARC

Developed by Analog Devices

Optimized for demanding DSP and imaging applications.

32-bit floating point, with 40-bit extended floating-point capabilities.

Large on-chip memory.

Ideal for scalable multi-processing applications.


Harvard Architecture

Program memory can store data.

Able to simultaneously read or write data at one location and get instructions from another place in memory.

2 buses:

1. Data memory bus.

2. Program bus.

Either two separate memories or a single dual-port memory.



Super Harvard Architecture

Many processors employ a Harvard Architecture by having two separate memories or caches integrated into the processor chip.

The SHARC is unique in that its internal memory is capable of holding a large program as well as a large amount of data.

This is what makes it SUPER!!!



DSP

Digital Signal Processor.

High-speed, low-overhead data movement and rapid computations required.

Usually has a small on-board ROM, RAM and single-cycle multiply.

Designed to run single-line, serial-in, serial-out signal processing applications very fast.



DSP Computations

The inner product of two vectors is a common computation for determining energy or correlation.

The following C code is an example:

for (n=0; n
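The listing above is truncated. A minimal sketch of such an inner product, assuming `float` vectors and the hypothetical names `inner_product`, `x`, `y` and `n` (none of which come from the original slide), might look like this:

```c
#include <assert.h>
#include <stddef.h>

/* Inner product of two length-n vectors: the sum of x[i] * y[i].
   Each loop iteration is one multiply followed by one accumulate.
   All names here are illustrative, not taken from the slide. */
float inner_product(const float *x, const float *y, size_t n)
{
    assert(x != NULL && y != NULL);

    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Calling it with the same vector for both arguments gives the energy of that vector; calling it with two different vectors gives their correlation at zero lag.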


Floating Point and Extended Floating Point

The SHARC supports floating-point, extended floating-point and fixed-point (non-floating-point) formats.

No additional clock cycles for floating-point computations.

Data is automatically truncated or zero-padded when moved between 32-bit memory and the 40-bit internal registers.

Not accurate enough for scientific algorithms. Excellent signal-to-noise ratio.
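The truncation and zero padding described above can be sketched in C. This assumes a 40-bit extended word is the 32-bit word with 8 extra low-order mantissa bits appended; the layout, the macro `EXT40_MASK` and both function names are assumptions made for illustration, not taken from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* A 40-bit extended word is held in the low 40 bits of a uint64_t.
   Assumed layout: upper 32 bits = the 32-bit word,
   lower 8 bits = extra mantissa precision. */
#define EXT40_MASK 0xFFFFFFFFFFULL

/* Register -> 32-bit memory: the 8 low-order bits are dropped (truncation). */
uint32_t ext40_to_word32(uint64_t ext40)
{
    assert((ext40 & ~EXT40_MASK) == 0);  /* must fit in 40 bits */
    return (uint32_t)(ext40 >> 8);
}

/* 32-bit memory -> register: the 8 low-order bits are filled with zeros. */
uint64_t word32_to_ext40(uint32_t word32)
{
    return (uint64_t)word32 << 8;
}
```

Under this assumption a round trip from 32 bits to 40 bits and back is lossless, while a 40-bit value written to 32-bit memory loses only its 8 least significant mantissa bits.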


SHARC's Internal Memory

Makes SHARC unique.

Size: Allows many complex functions to be performed on-chip, eliminating the need to move data between internal and external memory.

Memory size is significantly larger than most other high-speed computational devices.

Dual-block, dual-port: Optimizes the Harvard Architecture by allowing the fetch of instructions while performing data memory accesses.


Multiply and Accumulate Instructions on the SHARC

Like most DSPs, the SHARC is able to compute a product and add the product to a running total in a single clock cycle.

The SHARC's super instruction is that it can multiply and accumulate while adding, subtracting, or averaging data in two other registers.

These instructions give the SHARC its 120 megaflop rating.


Zero Overhead Looping on the SHARC

A single instruction outside the loop performs loop set-up, informing the SHARC that there is a loop approaching.

The instruction also includes the iteration count and termination condition.

This causes the pipeline to remain full during loop execution and also allows the termination condition to be tested in parallel.


DAGs on the SHARC

Data Address Generators (DAGs) are integer computation units that manage the index registers used to address memory.

A DAG allows the SHARC to fetch a value and update the index value.

If the updated value exceeds a limit, the DAG adjusts the index so that it wraps.

This occurs in the same clock cycle as the read or write.
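The fetch, update and wrap behaviour described above can be imitated in software. The following is a minimal sketch under assumed names (`circ_buf`, `circ_read_post_modify`, `circ_write` are all hypothetical); on the SHARC the slides say this happens in hardware within the same cycle as the access, whereas in portable C it is an explicit sequence of steps.

```c
#include <assert.h>
#include <stddef.h>

/* Software model of one address generator working on a circular buffer.
   The struct and function names are illustrative only. */
typedef struct {
    float *base;    /* start of the buffer */
    size_t length;  /* number of elements in the buffer */
    size_t index;   /* current position, always < length */
} circ_buf;

/* Advance the index by `step` and wrap it back into the buffer. */
static void circ_advance(circ_buf *cb, size_t step)
{
    assert(step <= cb->length);
    cb->index += step;
    if (cb->index >= cb->length) {
        cb->index -= cb->length;  /* one subtraction is enough: step <= length */
    }
}

/* Fetch the value at the current index, then update the index. */
float circ_read_post_modify(circ_buf *cb, size_t step)
{
    assert(cb != NULL && cb->length > 0 && cb->index < cb->length);
    float value = cb->base[cb->index];
    circ_advance(cb, step);
    return value;
}

/* Store a new sample at the current index, then move on by one. */
void circ_write(circ_buf *cb, float sample)
{
    assert(cb != NULL && cb->length > 0 && cb->index < cb->length);
    cb->base[cb->index] = sample;
    circ_advance(cb, 1);
}
```

With a three-element buffer, three reads with a step of 1 return the elements in order and leave the index back at 0, so no data ever has to be shifted.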



DAG Capabilities

Circular Buffering

Rather than actually moving data in and out of a vector, circular buffers are used.

By updating the index modulo the buffer length, the oldest entry can be conveniently replaced by the newest entry.

Bit Reverse Addressing

The bit pattern of a vector index is reversed.

Done automatically by the SHARC.

Required for the Fast Fourier Transform (FFT), which is often critical to DSP applications.
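As an illustration of bit-reversed addressing, here is a small software sketch; the function name `bit_reverse` and its parameters are hypothetical. The slides say the SHARC does this automatically, so code like this is only needed on processors without that support.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the lowest `bits` bits of `index`.
   For an FFT of length 2^bits, element i is exchanged with
   element bit_reverse(i, bits). Names are illustrative only. */
uint32_t bit_reverse(uint32_t index, unsigned bits)
{
    assert(bits >= 1 && bits <= 32);

    uint32_t reversed = 0;
    for (unsigned i = 0; i < bits; i++) {
        reversed = (reversed << 1) | (index & 1u);  /* move the lowest bit over */
        index >>= 1;
    }
    return reversed;
}
```

For an 8-point FFT (3 index bits), index 1 (binary 001) maps to 4 (binary 100) and index 3 (binary 011) maps to 6 (binary 110).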



SHARC DSP

What Makes the SHARC unique?

It also has some features not directly related to optimizing numeric computations.

Pipelining

Handling Branches

Why has this not emerged sooner?

Technology has only recently become available to make it economical to integrate general single computing devices.



SHARC's Pipeline

3 stages:

1. Instruction Fetch

2. Decode

3. Execution

It takes three clock cycles for an instruction to propagate through the pipeline.

The processor execution speed is one instruction per clock cycle even though each instruction requires three clock cycles.
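The throughput claim above follows from simple arithmetic: the first instruction needs all three stages, and each later one completes one cycle after its predecessor. A minimal sketch, assuming an ideal pipeline with no stalls or branches (the function name `pipeline_cycles` is hypothetical):

```c
#include <assert.h>

/* Cycles needed to complete n instructions on an ideal pipeline with
   `stages` stages: the first instruction takes `stages` cycles, every
   further instruction adds one cycle. Stalls and branches are ignored. */
unsigned long pipeline_cycles(unsigned long n, unsigned stages)
{
    assert(stages >= 1);
    if (n == 0) {
        return 0;
    }
    return n + stages - 1;
}
```

For the three-stage pipeline, 100 instructions take 102 cycles under this model, which approaches one instruction per cycle as the instruction count grows.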


SHARC's Handling of Branches

Delayed Branching

When a branch instruction is encountered, the two instructions which have already been loaded and decoded are executed before the branch is taken.

This keeps the pipeline full and avoids junking those two instructions and reloading the pipeline.

Beneficial in situations such as loops of only a few instructions, where the ratio of wasted clock cycles to instructions is significant.



SHARC's Handling of Branches

Non-delayed Branching

Traditional branching.

If the instructions cannot be reordered to use delayed branching, non-delayed branching saves space.

Uses only one word of storage.

However, it takes three cycles as the pipeline gets reloaded.
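To see why the choice matters for short loops, the two branch styles can be compared with a rough cycle count. This is a simplified model, not a description of the actual sequencer: it assumes a delayed branch effectively costs one cycle because the two instructions behind it are useful loop-body work, and that a non-delayed branch costs the three cycles stated above. The function `loop_cycles` and its parameters are hypothetical.

```c
#include <assert.h>

/* Rough cycle count for a loop closed by a branch instruction.
   body       : instructions per iteration, excluding the branch itself
   iterations : number of times the loop runs
   delayed    : nonzero for delayed branching, zero for non-delayed
   Model assumption: delayed branch = 1 cycle, non-delayed branch = 3 cycles. */
unsigned long loop_cycles(unsigned long body, unsigned long iterations, int delayed)
{
    /* A delayed branch needs two body instructions to execute after it. */
    assert(!delayed || body >= 2);

    unsigned long branch_cost = delayed ? 1UL : 3UL;
    return iterations * (body + branch_cost);
}
```

Under this model a four-instruction loop run ten times takes 50 cycles with delayed branching and 70 without, so the shorter the loop body, the larger the share of cycles lost to reloading the pipeline.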



Multi-processing

SHARC is uniquely equipped for multi-processing.

Link ports provide very powerful multi-processing capabilities.

Two main program models, depending on the application.

Adapts well to different multi-processing architectures.



Multi-processing: SHARC Links

SHARC has 6 link ports that can transport data at rates up to 40 Mbytes/sec.

Links are designed for point-to-point connections.

Data can be transmitted in either direction, but not both simultaneously.



Multi-processing Program Model: MIMD

Multiple instruction, multiple data.

Good for applications that require multiple instruction threads to execute concurrently.

Processors operate individually.

Each processor executes different code.

Typically used for image reconstruction and multi-channel DSP.



Multi-processing Program Model: SIMD

Single instruction, multiple data.

Works best when all processors execute identical instruction sequences.

Does not require overhead for inter-processor synchronization.

Typically used for synthetic aperture radar and automatic target recognition.



Multi-processing Architectures: Cluster Design

Groups of up to 6 in a cluster.

Most common for joining multiple SHARCs.

All processors, global I/O and global memory are connected to a common cluster bus.

Each SHARC can drive the bus.


Multi-processing Architectures: Mesh Design

All SHARCs are joined by their link ports and are connected to a common bus.

In SIMD mode, a single master SHARC drives the bus.

In MIMD mode, the mesh architecture cannot function if the data is larger than the available on-chip memory.

Advantageous scalability over a wider range of applications.



Summary of what makes the SHARC Super

It performs excellently for DSP applications.

Employs a Harvard Architecture with very large on-chip memory.

Respectable Megaflop rating.

Its multiprocessing capabilities.



How optimal is the SHARC for non-DSP Applications?

It is obviously geared for DSP applications.

While it may fare better than other processors, it is still behind those which are designed specifically for non-DSP applications.



Sources

www.alacron.com/news/tp_mimd_simd.htm

www.analog.com

www.cs.seas.gwu.edu/~cs339/cs339-lecture2.pdf

www.ixthos.aa.psiweb.com/technical/notes_articles/articles
