26-27 SIMD Architecture

  • 8/19/2019 26-27 SIMD Architecture

    Rehan Azmat
    Lecture 26-27

    SIMD Architecture

    Introduction and Motivation
    Architecture classification
    Performance of Parallel Architectures
    Interconnection Network

    Array processors
    Vector processors
    Cray X1
    Multimedia extensions

    Manipulation of arrays or vectors is a common operation in scientific and engineering applications.

    Typical operations on array-oriented data include:
    ◦ Processing one or more vectors to produce a scalar result.
    ◦ Combining two vectors to produce a third one.
    ◦ Combining a scalar and a vector to generate a vector.
    ◦ A combination of the above three operations.

    Two architectures suitable for vector processing have evolved:
    ◦ Pipelined vector processors (implemented in many supercomputers)
    ◦ Parallel array processors

    The compiler does some of the difficult work of finding parallelism, so the hardware doesn’t have to.
    ◦ Data parallelism.
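The four kinds of array operation listed above can be illustrated with a minimal Python sketch. The function names (`dot`, `vadd`, `scale`, `axpy`) are chosen here for illustration and do not come from the lecture; plain lists stand in for vectors.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Two vectors in, one scalar out (a reduction)."""
    return sum(x * y for x, y in zip(a, b))


def vadd(a: List[float], b: List[float]) -> List[float]:
    """Two vectors combined element by element into a third vector."""
    return [x + y for x, y in zip(a, b)]


def scale(s: float, a: List[float]) -> List[float]:
    """A scalar and a vector combined into a vector."""
    return [s * x for x in a]


def axpy(s: float, a: List[float], b: List[float]) -> List[float]:
    """A combination of the operations above: s*a + b."""
    return vadd(scale(s, a), b)
```

Every element of each result is computed independently of the others (apart from the final sum in `dot`), which is the data parallelism a vector or array machine exploits.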

    Strictly speaking, vector processors are not parallel processors.
    ◦ They only behave like SIMD computers.

    A vector processor does not contain several CPUs running in parallel.
    ◦ They are SISD processors with vector instructions executed on pipelined functional units.

    Vector computers usually have vector registers, each of which can store from 64 up to 128 words.

    Examples of vector instructions:
    ◦ Load vector from memory into vector register
    ◦ Store vector into memory
    ◦ Arithmetic and logic operations between vectors
    ◦ Operations between vectors and scalars

    Programmers can use operations on vectors in their programs, and the compiler translates these operations into vector instructions at machine level.

    A vector unit typically consists of
    ◦ pipelined functional units
    ◦ vector registers

    Vector registers:
    ◦ n general purpose vector registers Ri, 0 ≤ i ≤ n - 1;
    ◦ vector length register VL: stores the length l (0 ≤ l ≤ s) of the currently processed vector; s is the length of the vector registers.
    ◦ mask register M: stores a set of l bits, one for each element in a vector, interpreted as Boolean values; vector instructions can be executed in masked mode so that vector register elements corresponding to a false value in M are ignored.
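The register set described above can be modelled with a short Python sketch. The class `VectorUnit` and its method names are hypothetical and only mirror the slide's terminology (registers Ri, length register VL, mask register M); real machines differ in detail, and the loop over elements stands in for a pipelined functional unit.

```python
class VectorUnit:
    """Toy model of a vector register file with VL and mask registers."""

    def __init__(self, n_regs: int = 8, s: int = 64):
        self.s = s                                   # length of each vector register
        self.R = [[0] * s for _ in range(n_regs)]    # registers R0 .. R(n-1)
        self.VL = 0                                  # vector length register
        self.M = [True] * s                          # mask register, one Boolean per element

    def set_vl(self, length: int) -> None:
        if not 0 <= length <= self.s:
            raise ValueError("VL must satisfy 0 <= l <= s")
        self.VL = length

    def load(self, i: int, memory: list, base: int) -> None:
        """Load VL consecutive words from memory into register Ri."""
        for k in range(self.VL):
            self.R[i][k] = memory[base + k]

    def store(self, i: int, memory: list, base: int) -> None:
        """Store the first VL elements of register Ri into memory."""
        for k in range(self.VL):
            memory[base + k] = self.R[i][k]

    def add(self, dst: int, a: int, b: int, masked: bool = False) -> None:
        """R[dst] := R[a] + R[b] on the first VL elements.

        In masked mode, elements whose mask bit is False are skipped,
        so the destination keeps its previous value there.
        """
        for k in range(self.VL):
            if masked and not self.M[k]:
                continue
            self.R[dst][k] = self.R[a][k] + self.R[b][k]
```

A usage example: with VL set to 4 and the mask `[True, False, True, False]`, a masked add of `[1, 2, 3, 4]` and `[10, 20, 30, 40]` into a cleared register updates only elements 0 and 2.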

    Consider an element-by-element addition of two N-element vectors A and B to create the sum vector C.

    On an SISD machine, this computation will be implemented as:

        for i = 0 to N-1 do
            C[i] := A[i] + B[i];

    ◦ There will be N*K instruction fetches (K instructions are needed for each iteration) and N additions.
    ◦ There will also be N conditional branches, if loop unrolling is not used.

    A compiler for a vector computer generates something like:

        C[0:N-1] ← A[0:N-1] + B[0:N-1];

    ◦ Even though N additions will still be performed, there will only be K′ instruction fetches (e.g., Load A, Load B, Add, Write C = 4 instructions).
    ◦ No conditional branch is needed.
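The instruction-fetch comparison above can be made concrete with a small counting model in Python. This is only a sketch: the value K = 5 used in the usage note is an assumption (the slide leaves K unspecified), and the vector version assumes the whole vector fits in one vector register, so no loop over register-sized chunks is modelled.

```python
def scalar_add(a, b, k):
    """SISD loop: k instruction fetches and one conditional branch per element.

    k is the number of instructions per iteration (an assumed parameter).
    Returns (c, instruction_fetches, conditional_branches).
    """
    c = [0] * len(a)
    fetches = 0
    branches = 0
    for i in range(len(a)):
        c[i] = a[i] + b[i]
        fetches += k
        branches += 1
    return c, fetches, branches


def vector_add(a, b, k_prime=4):
    """Vector version: Load A, Load B, Add, Write C (k_prime = 4 fetches in total).

    Returns (c, instruction_fetches, conditional_branches).
    """
    c = [x + y for x, y in zip(a, b)]
    return c, k_prime, 0
```

With N = 100 and an assumed K = 5, the scalar loop fetches 500 instructions and takes 100 branches, while the vector version fetches 4 instructions and takes none; both still perform 100 additions.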

    Advantages
    ◦ Quick fetch and decode of a single instruction for multiple operations.
    ◦ The instruction provides the processor with a regular source of data, which can arrive at each cycle and be processed regularly in a pipelined fashion.
    ◦ The compiler does the work for you.

    Memory-to-memory operation mode
    ◦ no registers.
    ◦ can process very long vectors, but start-up time is large.
    ◦ appeared in the 70s and died in the 80s.

    Register-to-register operations are more common with new machines.

    An array processor is composed of N identical processing elements (PEs) under the control of a single control unit, together with a number of memory modules.
    ◦ The PEs execute instructions in lock-step mode.

    Processing units and memory elements communicate with each other through an interconnection network.
    ◦ Different topologies can be used.

    The complexity of the control unit is comparable to that of a uniprocessor system.

    The control unit is a computer with its own high-speed registers, local memory and arithmetic logic unit.

    The main memory is the aggregate of the memory modules.

    Processing element complexity
    ◦ Single-bit processors
      Connection Machine (CM-2): 65536 PEs connected by a hypercube network (by Thinking Machines Corporation).
    ◦ Multi-bit processors
      ILLIAC IV (64-bit), MasPar MP-1 (32-bit)

    Processor-memory interconnection
    ◦ Dedicated memory organization
      ILLIAC IV, CM-2, MP-1
    ◦ Global memory organization
      Bulk Synchronous Parallel (BSP) computer

    Control and scalar type instructions are executed in the control unit.

    Vector instructions are performed in the processing elements.

    Data structuring and detection of parallelism in a program are the major issues in the application of array processors.

    Operations such as C(i) = A(i) × B(i), 1 ≤ i ≤ n, could be executed in parallel if the elements of the arrays A and B are distributed properly among the processors or memory modules.
    ◦ Ex. PEi is assigned the task of computing C(i).
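The element-wise product above, with PEi computing C(i), can be sketched as a toy lock-step simulation in Python. The classes `PE` and `ControlUnit` and their methods are hypothetical names for illustration; a dedicated memory organization (one local memory per PE) is assumed, and the sequential `for` loop inside `broadcast` stands in for all PEs executing the same instruction at the same time.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul}


class PE:
    """A processing element with its own local memory."""

    def __init__(self):
        self.mem = {}

    def execute(self, op, dst, src1, src2):
        self.mem[dst] = OPS[op](self.mem[src1], self.mem[src2])


class ControlUnit:
    """Single control unit that broadcasts each vector instruction to every PE."""

    def __init__(self, n_pes):
        self.pes = [PE() for _ in range(n_pes)]

    def distribute(self, name, values):
        """Place element i of an array in the local memory of PE i."""
        if len(values) != len(self.pes):
            raise ValueError("one element per PE is assumed")
        for pe, v in zip(self.pes, values):
            pe.mem[name] = v

    def broadcast(self, op, dst, src1, src2):
        """All PEs execute the same instruction on their own data (lock-step)."""
        for pe in self.pes:
            pe.execute(op, dst, src1, src2)

    def gather(self, name):
        return [pe.mem[name] for pe in self.pes]
```

For example, distributing A = [1, 2, 3, 4] and B = [5, 6, 7, 8] over four PEs and broadcasting one `mul` instruction leaves C = [5, 12, 21, 32] spread across the local memories.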

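    To compute $Y = \sum_{i=1}^{N} A(i) \times B(i)$

    Assuming:
    ◦ A dedicated memory organization.
    ◦ Elements of A and B are properly and perfectly distributed among processors (the compiler can help here).

    We have:
    ◦ The product terms are generated in parallel.
    ◦ Additions can be performed in log2 N iterations.

    Speed-up factor (S), comparing the 2N - 1 sequential operations (N multiplications and N - 1 additions) with the 1 + log2 N parallel steps: $S = \frac{2N - 1}{1 + \log_2 N}$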
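A minimal Python sketch of this computation follows, under stated assumptions: the quantity being computed is taken to be the inner product of A and B (the sum of the products A(i) × B(i)), N is a power of two, there is one PE per element, and every multiplication or addition costs one time step. The function names are illustrative only.

```python
import math


def parallel_inner_product(a, b):
    """Return (result, parallel_steps) for sum(a[i] * b[i]).

    Step 1: all N products are formed at once (one parallel step).
    Then pairs of partial sums are added, halving the list each
    iteration, which takes log2(N) further steps.
    """
    n = len(a)
    if n == 0 or n & (n - 1) != 0 or len(b) != n:
        raise ValueError("this sketch assumes equal lengths and N a power of two")
    terms = [x * y for x, y in zip(a, b)]   # one parallel multiplication step
    steps = 1
    while len(terms) > 1:
        half = len(terms) // 2
        # each addition in this list would run on a different PE
        terms = [terms[i] + terms[i + half] for i in range(half)]
        steps += 1
    return terms[0], steps


def speedup(n):
    """Sequential cost (N multiplications + N - 1 additions) over parallel steps."""
    return (2 * n - 1) / (1 + math.log2(n))
```

For N = 8 the sequential version needs 15 operations and the parallel one needs 4 steps, so the speed-up is 3.75; the ratio grows roughly as 2N / log2 N for large N.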

    ILLIAC IV development started in the late 1960s; the machine was fully operational in 1975.

    SIMD computer for array processing.

    Control Unit + 64 Processing Elements.
    ◦ 2K words memory per PE.

    CU can access all memory.

    PEs can access local memory and communicate with neighbors.

    CU reads program and broadcasts instructions to PEs.

    Cray combines several technologies in the X1 (2003):
    ◦ 12.8 Gflop/s vector processors
    ◦ Shared caches
    ◦ 4-processor nodes sharing up to 64 GB of memory
    ◦ Multi-streaming vector processing
    ◦ Multiple node architecture

    MSP: Multi-Streaming vector Processor
    ◦ Formed by 4 SSPs (each a 2-pipe vector processor)
    ◦ Balance computations across SSPs.
    ◦ Compiler will try to vectorize/parallelize across the MSP, achieving “streaming”

    Many levels of parallelism
    ◦ Within a processor: vectorization
    ◦ Within an MSP: streaming
    ◦ Within a node: shared memory
    ◦ Across nodes: message passing

    Some are automated by the compiler, some require work by the programmer
    ◦ This is a common trend
    ◦ The more complex the architecture, the more difficult it is for the programmer to exploit it

    Hard to fit this machine into a simple taxonomy!

    How do we extend general purpose microprocessors so that they can handle multimedia applications efficiently?

    Analysis of the need:

    Video and audio applications very often deal with large arrays of small data types (8 or 16 bits).

    Such applications exhibit a large potential for SIMD (vector) parallelism.
    ◦ Data parallelism.

    Solutions:

    New generations of general purpose microprocessors are equipped with special instructions to exploit this parallelism.

    The specialized multimedia instructions perform vector computations on bytes, half-words, or words.

    Several vendors have extended the instruction set of their processors in order to improve performance with multimedia applications:
    ◦ MMX for Intel x86 family
    ◦ VIS for UltraSparc
    ◦ MDMX for MIPS
    ◦ MAX-2 for Hewlett-Packard PA-RISC

    The Pentium line provides 57 MMX instructions. They treat data in a SIMD fashion to improve the performance of
    ◦ Computer-aided design
    ◦ Internet applications
    ◦ Computer visualization
    ◦ Video games
    ◦ Speech recognition

    The basic idea: sub-word execution

    Use the entire width of a processor data path (32 or 64 bits), even when processing the small data types used in signal processing (8, 12, or 16 bits).

    With a word size of 64 bits, an adder can be used to implement eight 8-bit additions in parallel.

    MMX technology allows a single instruction to work on multiple pieces of data.

    Consequently, we have in practice a kind of SIMD parallelism, at a reduced scale.
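Sub-word execution can be imitated in software to show what the hardware does. The Python sketch below treats a 64-bit integer as eight packed 8-bit lanes and adds two such words with wrap-around per lane, using ordinary integer operations and masks. It is an illustration of the idea only, not MMX code; the helper names are made up for this example.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
HIGH = 0x8080808080808080   # top bit of every 8-bit lane
LOW7 = 0x7F7F7F7F7F7F7F7F   # low seven bits of every 8-bit lane


def pack_bytes(values):
    """Pack eight values (each taken modulo 256) into one 64-bit word, lane 0 lowest."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack_bytes(word):
    """Split a 64-bit word back into its eight 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]


def packed_add_bytes(a, b):
    """Eight independent 8-bit additions (wrap-around) in one 64-bit operation.

    The low 7 bits of each lane are added first, so no carry can cross
    a lane boundary; the top bit of each lane is then fixed up with XOR.
    """
    low = (a & LOW7) + (b & LOW7)
    return (low ^ ((a ^ b) & HIGH)) & MASK64
```

In hardware the same effect is obtained by cutting the carry chain of the 64-bit adder at every lane boundary, which is why one adder can serve eight 8-bit additions at once.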

    Three packed data types are defined for parallel operations: packed byte, packed word, packed double word.
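The three packed types are three ways of reading the same 64 bits: eight 8-bit bytes, four 16-bit words, or two 32-bit double words. A short Python sketch using the standard `struct` module shows the three views of one value; the function name `packed_views` and the dictionary keys are illustrative, and little-endian byte order is used as on x86.

```python
import struct


def packed_views(qword: int) -> dict:
    """Interpret one 64-bit value as packed bytes, words and double words."""
    raw = struct.pack("<Q", qword)          # 8 raw bytes, little-endian
    return {
        "packed_byte": list(struct.unpack("<8B", raw)),        # 8 x 8 bits
        "packed_word": list(struct.unpack("<4H", raw)),        # 4 x 16 bits
        "packed_doubleword": list(struct.unpack("<2I", raw)),  # 2 x 32 bits
    }
```

The instruction, not the register, decides which view applies: the same register contents are treated as eight, four or two elements depending on the operation executed.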

    The following shows the performance of Pentium processors with and without MMX technology:

    Vector processors are SISD processors whose instruction set includes operations on vectors.
    ◦ They are implemented using pipelined functional units.
    ◦ They behave like SIMD machines.

    Array processors, being typical SIMD machines, execute the same operation on a set of interconnected processing units.

    Both vector and array processors are specialized for numerical problems expressed in matrix or vector formats.

    Many modern architectures, such as the Cray X1, deploy several parallel architecture concepts at the same time.

    Multimedia applications exhibit a large potential for SIMD parallelism.
    ◦ The instruction set of modern microprocessors has been extended to support SIMD-style parallelism with operations on short vectors.

    End of the Lecture