14907_sharc

  • 7/31/2019 14907_sharc

    1/26

    Clare Smtih SHARC Presentation 1

    The SHARC

    Super Harvard Architecture Computer

    The SHARC

    Developed by Analog Devices.

    Optimized for demanding DSP and imaging applications.

    32-bit floating point, with 40-bit extended floating-point capability.

    Large on-chip memory.

    Ideal for scalable multi-processing applications.

    Harvard Architecture

    Program memory can store data.

    Able to simultaneously read or write data at one location and fetch instructions from another place in memory.

    Two buses:

    1. Data memory bus.

    2. Program bus.

    Either two separate memories or a single dual-port memory.

    Super Harvard Architecture

    Many processors employ a Harvard Architecture by having two separate memories or caches integrated into the processor chip.

    The SHARC is unique in that its internal memory is capable of holding a large program as well as a large amount of data.

    This is what makes it SUPER!!!

    DSP

    Digital Signal Processor.

    High-speed, low-overhead data movement and rapid computations required.

    Usually has a small on-board ROM and RAM, and a single-cycle multiply.

    Designed to run single-line, serial-in, serial-out signal processing applications very fast.

    DSP Computations

    The inner product of two vectors is a

    common computation for determining

    energy or correlation.

    The following C code is an example:

    for (n=0; n
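The loop above is cut off in this copy of the slides. As a minimal sketch of the inner product the slide describes (not the original code; the names `inner_product`, `x`, `y`, and `len` are assumptions chosen for illustration), each iteration performs one multiply and one accumulate:

```c
#include <stddef.h>

/* Inner product of two vectors of length len.
   Hypothetical helper; the names are illustrative, not taken from the slides. */
float inner_product(const float *x, const float *y, size_t len)
{
    float sum = 0.0f;
    for (size_t n = 0; n < len; n++) {
        sum += x[n] * y[n];   /* one multiply-accumulate per element */
    }
    return sum;
}
```

Passing the same vector for both arguments gives the energy of a signal; passing two different signals gives their correlation at zero lag.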

    Floating Point and Extended Floating Point

    The SHARC supports floating-point, extended floating-point, and fixed-point data.

    No additional clock cycles for floating-point computations.

    Data is automatically truncated and zero-padded when moved between 32-bit memory and internal registers.

    Not accurate enough for scientific algorithms. Excellent signal-to-noise ratio.
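The truncation and zero padding can be pictured with a small bit-level model. This is a sketch only: it assumes the 40-bit register holds the 32-bit word in its upper 32 bits with 8 extra low-order mantissa bits, and it models the register in a `uint64_t`. The function names are hypothetical.

```c
#include <stdint.h>

/* Model a 40-bit register in the low 40 bits of a uint64_t
   (an assumption of this sketch, not a hardware description). */

/* Memory -> register: the 32-bit word fills the upper 32 bits,
   and the low 8 bits are zero-padded. */
uint64_t load_to_register(uint32_t word)
{
    return (uint64_t)word << 8;
}

/* Register -> memory: the low 8 bits are dropped (truncated, not rounded). */
uint32_t store_to_memory(uint64_t reg40)
{
    return (uint32_t)((reg40 & 0xFFFFFFFFFFULL) >> 8);
}
```

In this model a value that goes from memory to a register and back is unchanged, while any extra precision gained in the register is lost on the store.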

    SHARC's Internal Memory

    Makes the SHARC unique.

    Size

    Allows many complex functions to be performed on-chip, eliminating the need to move data between internal and external memory.

    Memory size is significantly larger than that of most other high-speed computational devices.

    Dual-block, dual-port

    Optimizes the Harvard Architecture by allowing instructions to be fetched while data memory accesses are performed.

    Multiply and Accumulate Instructions on the SHARC

    Like most DSPs, the SHARC is able to compute a product and add the product to a running total in a single clock cycle.

    The SHARC's "super" instruction can multiply and accumulate while adding, subtracting, or averaging data in two other registers.

    These instructions give the SHARC its 120-megaflop rating.
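As a rough illustration of what one such combined instruction produces, the sketch below returns a multiply-accumulate result together with the sum and difference of two other operands. C executes these statements one after another, so this only shows the results, not the single-cycle timing. The type and function names are hypothetical.

```c
/* Results of one modeled combined instruction. */
typedef struct {
    float acc;   /* running total after the multiply-accumulate */
    float sum;   /* c + d */
    float diff;  /* c - d */
} mac_result;

/* Hypothetical helper: multiply-accumulate on (a, b) alongside
   an add and a subtract on two other operands (c, d). */
mac_result mac_with_add_sub(float acc, float a, float b, float c, float d)
{
    mac_result r;
    r.acc  = acc + a * b;
    r.sum  = c + d;
    r.diff = c - d;
    return r;
}
```

For example, `mac_with_add_sub(1.0f, 2.0f, 3.0f, 5.0f, 1.0f)` yields a running total of 7, a sum of 6, and a difference of 4.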

    Zero Overhead Looping on the SHARC

    A single instruction outside the loop performs loop set-up, informing the SHARC that a loop is approaching.

    The instruction also includes the iteration count and termination condition.

    This keeps the pipeline full during loop execution and also allows the termination condition to be tested in parallel.
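A simple cycle-count model shows why this matters for short loops. The sketch assumes one cycle per instruction and a fixed number of overhead instructions (counter update, compare, branch) per iteration in a conventional loop; both are assumptions for illustration, not measured SHARC figures, and the function names are hypothetical.

```c
/* Cycle counts under a simplified model (assumed, not measured):
   every instruction takes one cycle once the pipeline is full. */

/* Conventional loop: each iteration pays for the body plus
   overhead instructions (counter update, compare, branch). */
unsigned long software_loop_cycles(unsigned long iterations,
                                   unsigned long body,
                                   unsigned long overhead)
{
    return iterations * (body + overhead);
}

/* Zero-overhead loop: one set-up instruction, then only the body. */
unsigned long zero_overhead_loop_cycles(unsigned long iterations,
                                        unsigned long body)
{
    return 1 + iterations * body;
}
```

With a two-instruction body, 100 iterations, and two overhead instructions per iteration, the model gives 400 cycles for the conventional loop and 201 for the zero-overhead loop.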

    DAGs on the SHARC

    Data Address Generators (DAGs) are integer computation units that manage the index registers used for memory addressing.

    They allow the SHARC to fetch a value and update the index value.

    If the updated value exceeds a limit, the DAG adjusts the index so that it wraps.

    This occurs in the same clock cycle as the read or write.

    DAG Capabilities

    Circular Buffering

    Rather than actually moving data in and out of a vector, circular buffers are used.

    By updating the index modulo the buffer length, the oldest entry can be conveniently replaced by the newest entry.

    Bit-Reverse Addressing

    The bit pattern of a vector index is reversed.

    Done automatically by the SHARC.

    Required for the Fast Fourier Transform (FFT), which is often critical to DSP applications.
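Both addressing modes are easy to express in software, which makes clear what the DAG hardware does for free. The following is a minimal sketch; the function names are hypothetical, and the circular update assumes the step is smaller than the buffer length.

```c
#include <stdint.h>

/* Post-modify an index and wrap it inside a buffer of the given length.
   Assumes 0 <= index < length and 0 <= modify < length. */
uint32_t circ_update(uint32_t index, uint32_t modify, uint32_t length)
{
    index += modify;
    if (index >= length) {
        index -= length;   /* wrap back to the start of the buffer */
    }
    return index;
}

/* Reverse the low `bits` bits of an index, as used to reorder FFT data. */
uint32_t bit_reverse(uint32_t index, unsigned bits)
{
    uint32_t reversed = 0;
    for (unsigned i = 0; i < bits; i++) {
        reversed = (reversed << 1) | (index & 1u);
        index >>= 1;
    }
    return reversed;
}
```

For an 8-point FFT (3 index bits), `bit_reverse` maps indices 0 through 7 to 0, 4, 2, 6, 1, 5, 3, 7.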

    SHARC DSP

    What Makes the SHARC unique?

    It also has some features not directly related to optimizing numeric computations.

    Pipelining

    Handling Branches

    Why has this not emerged sooner?

    Technology has only recently become available to make it economical to integrate general single computing devices.

    SHARC's Pipeline

    3 stages:

    1. Instruction Fetch

    2. Decode

    3. Execution

    It takes three clock cycles for an instruction to propagate through the pipeline.

    The processor's execution speed is one instruction per clock cycle, even though each instruction requires three clock cycles.
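The relationship between latency and throughput can be checked with a short calculation. Assuming no stalls or branches, the first instruction needs as many cycles as there are stages, and each later instruction completes one cycle after the previous one. The function name is hypothetical.

```c
/* Cycles to run n instructions on a pipeline with the given number
   of stages, assuming no stalls or branches. */
unsigned long pipeline_cycles(unsigned long n, unsigned long stages)
{
    if (n == 0) {
        return 0;
    }
    return n + stages - 1;
}
```

On a three-stage pipeline, one instruction takes 3 cycles but 100 instructions take 102, which approaches one instruction per cycle.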

    SHARC's Handling of Branches

    Delayed Branching

    When a branch instruction is encountered, the two instructions that have been loaded and decoded are executed before the branch is taken.

    This keeps the pipeline full and avoids discarding those two instructions and reloading the pipeline.

    Beneficial in situations such as loops of only a few instructions, where the ratio of wasted clock cycles to instructions is significant.

    SHARC's Handling of Branches

    Non-delayed Branching

    Traditional branching.

    If the code cannot be reordered to use delayed branching, non-delayed branching saves space.

    Uses only one word of storage.

    However, it takes three cycles, as the pipeline gets reloaded.
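The cost difference between the two branch types can be estimated with a simplified model. It takes the three-cycle figure for a non-delayed branch from the slide, and assumes a delayed branch costs one cycle because the two instructions after it do useful work from the loop body. The function names are hypothetical and the model ignores all other stalls.

```c
/* Cycles for a loop that ends each iteration with a non-delayed branch:
   the body instructions plus three cycles for the branch. */
unsigned long loop_cycles_non_delayed(unsigned long iterations,
                                      unsigned long body)
{
    return iterations * (body + 3);
}

/* Cycles for the same loop with a delayed branch, assuming two of the
   body instructions can be placed after the branch so that only the
   branch itself adds a cycle. */
unsigned long loop_cycles_delayed(unsigned long iterations,
                                  unsigned long body)
{
    return iterations * (body + 1);
}
```

For a two-instruction body run 100 times, the model gives 500 cycles with non-delayed branches and 300 with delayed branches, which is why the saving is most visible in short loops.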

    Multi-processing

    The SHARC is uniquely equipped for multi-processing.

    Its link ports provide very powerful multi-processing capabilities.

    Two main program models, depending on the application.

    Adapts well to different multi-processing architectures.

    Multi-processing

    SHARC Links

    The SHARC has 6 link ports that can transport data at rates up to 40 Mbytes/sec.

    Links are designed for point-to-point connections.

    Data can be transmitted in either direction, but not both simultaneously.
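A back-of-the-envelope estimate of transfer time over one link follows from the peak rate. The sketch takes 40 Mbytes/sec as 40,000,000 bytes per second (an assumption about the unit) and ignores protocol overhead. Because a link carries data in only one direction at a time, an exchange is modeled as a send followed by a receive. The names are hypothetical.

```c
/* Peak link-port rate from the slide, taken here as 40e6 bytes/sec. */
#define LINK_BYTES_PER_SEC 40000000.0

/* Best-case time to move a block over one link in one direction. */
double link_transfer_seconds(double bytes)
{
    return bytes / LINK_BYTES_PER_SEC;
}

/* Best-case time for a two-way exchange over one link: the two
   directions cannot overlap, so their times add. */
double link_exchange_seconds(double bytes_out, double bytes_in)
{
    return link_transfer_seconds(bytes_out) + link_transfer_seconds(bytes_in);
}
```

Under these assumptions, sending 20 Mbytes and receiving 20 Mbytes over the same link takes at least one second.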

    Multi-processing Program Model

    MIMD

    Multiple instruction, multiple data.

    Good for applications that require multiple

    instruction threads to execute concurrently.

    Processors operate individually. Each processor executes different code.

    Typically used for image reconstruction and

    multi-channel DSP.

    Multi-processing Program Model

    SIMD

    Single instruction, multiple data.

    Works best when all processors execute identical instruction sequences.

    Does not require overhead for inter-processor synchronization.

    Typically used for synthetic aperture radar and automatic target recognition.

    Multi-processing Architectures

    Cluster Design

    Groups of up to 6 in a cluster.

    Most common design for joining multiple SHARCs.

    All processors, global I/O, and global memory are connected to a common cluster bus.

    Each SHARC can drive the bus.

    Multi-processing Architectures

    Mesh Design

    All SHARCs are joined by their link ports and are connected to a common bus.

    In SIMD mode, a single master SHARC drives the bus.

    In MIMD mode, a mesh architecture cannot function if the data is larger than the available on-chip memory.

    Advantageous scalability over a wider range of applications.

    Summary of what makes the SHARC Super

    It performs excellently for DSP applications.

    Employs a Harvard Architecture with very large on-chip memory.

    Respectable megaflop rating.

    Its multiprocessing capabilities.

    How optimal is the SHARC for non-DSP Applications?

    It is obviously geared for DSP applications.

    While it may fare better than other processors, it is still behind those which are designed specifically for non-DSP applications.

    Sources

    www.alacron.com/news/tp_mimd_simd.htm

    www.analog.com

    www.cs.seas.gwu.edu/~cs339/cs339-lecture2.pdf

    www.ixthos.aa.psiweb.com/technical/notes_articles/articles