DSP Processor Fundamentals

  • 7/22/2019 DSP Processor Fundamentals

    Slide: 1

    DSP Processor Fundamentals

    Subhasish Mukherjee

    Slide: 2

    Salient Features of DSP Processors

    Fast multiply and accumulate
    Multiple access memory architecture
    Specialized addressing modes
    Specialized execution control
    Peripherals and I/O interfaces

    Slide: 3

    DSP Processor Embodiments

    Multichip modules
        Multiple dies in a single package
        Increased operating speed & reduced power dissipation
    Multiple processors on chip
    Chip sets
        Dividing the processor into two or more packages
        Makes sense when the processor is very complex & has a large number of I/O pins
        Saves cost
    DSP cores

    Slide: 4

    Fixed-Point vs. Floating Point

    Most DSPs are fixed-point
    Fixed-point DSPs support integer and fractional arithmetic
        Limited dynamic range and precision
        Cheaper too
        Mostly use a 16-bit format, though some use 20- or 24-bit formats
    Floating-point DSPs use a mantissa-and-exponent representation
        They provide good dynamic range and precision
        Mostly use a 32-bit format
        Easier to program
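
    The fractional arithmetic mentioned above can be illustrated with a minimal sketch. It assumes the common Q15 convention (16-bit words, values in [-1, 1)); the slides do not name a specific format, and the helper names below are hypothetical.

```python
# Sketch of 16-bit fractional (Q15) arithmetic; Q15 is an assumed format.
Q15_MIN, Q15_MAX = -32768, 32767


def saturate(value: int) -> int:
    """Clamp an integer to the 16-bit signed range."""
    return max(Q15_MIN, min(Q15_MAX, value))


def float_to_q15(x: float) -> int:
    """Quantize a value in [-1, 1) to Q15, saturating out-of-range inputs."""
    return saturate(int(round(x * 32768)))


def q15_to_float(q: int) -> float:
    """Convert a Q15 integer back to a floating-point value."""
    return q / 32768.0


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 values; the Q30 product is shifted back to Q15."""
    return saturate((a * b) >> 15)


half = float_to_q15(0.5)
quarter = q15_mul(half, half)
print(half, quarter, q15_to_float(quarter))
```

    Inputs outside [-1, 1) saturate, which is one face of the limited dynamic range noted above.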

    Slide: 5

    Fixed Point Data Path

    Slide: 6

    Content of Fixed Point Data Path

    Typically incorporates a multiplier, an ALU, shifters, operand registers & accumulators.
    Single-cycle multipliers are central to programmable DSPs
        Often integrated with an adder to make a multiply-accumulate unit.

    Slide: 7

    Accumulator

    Holds intermediate and final results of MAC operations
    Most DSP processors provide multiple accumulators.
    Accumulators have guard bits so that a number of values can be accumulated without overflow
    Guard bits provide greater flexibility than scaling.
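
    A small sketch of why guard bits help. It assumes 16 x 16-bit products (32 bits wide) and 8 guard bits; both widths are illustrative assumptions, not figures from the slides.

```python
# Sketch: accumulating products with and without guard bits.
PRODUCT_BITS = 32   # width of a 16 x 16-bit product
GUARD_BITS = 8      # assumed number of guard bits


def wrap_signed(value: int, bits: int) -> int:
    """Wrap an integer into a two's-complement register of the given width."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


def accumulate(products, acc_bits: int) -> int:
    """Sum products in an accumulator that wraps at acc_bits."""
    acc = 0
    for p in products:
        acc = wrap_signed(acc + p, acc_bits)
    return acc


largest = 32767 * 32767          # largest positive 16 x 16-bit product
products = [largest] * 4
no_guard = accumulate(products, PRODUCT_BITS)
with_guard = accumulate(products, PRODUCT_BITS + GUARD_BITS)
print(no_guard, with_guard)
```

    Without guard bits the sum wraps to a negative value; with them the accumulator holds the true sum and no input scaling is needed.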

    Slide: 8

    ALU

    Implements basic arithmetic and logical operations in a single instruction cycle.
        Common operations include add, subtract, increment, negate, logical AND, OR, NOT.
    Processors differ in the word size used for logical operations.

    Slide: 9

    Shifter

    Used for scaling the input by a power of 2
    Either eliminates the possibility of overflow or reduces it to an acceptably low level.
    Trade-off is loss of precision and dynamic range.
    Barrel shifters offer more flexibility

    Slide: 10

    Memory Architecture & Addressing Schemes

    Slide: 11

    Motivation

    An FIR filter involves the following operations
        Fetch the MAC instruction
        Fetch coefficient h_m
        Fetch delayed input x(n-m)
        Multiply both
        Add to the previous result
        Shift data in the delay line
    The above set of operations is done for all the taps of the filter for each sample

    [Figure: tapped delay line of z^-1 elements with coefficients h0, h1, h2, ..., hN-1, hN; input x(n), output y(n).]

    $y(n) = \sum_{m=0}^{N} h_m \, x(n-m)$
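
    The per-tap operations listed on this slide can be sketched in plain Python. This is only an illustration of the arithmetic, not DSP code; the function name is hypothetical.

```python
# Sketch of a direct-form FIR filter: y(n) = sum over m of h[m] * x(n - m).
def fir_filter(h, x):
    """Filter the sample sequence x with the coefficients h."""
    delay_line = [0.0] * len(h)
    y = []
    for sample in x:
        # Shift data in the delay line; the newest sample enters at index 0.
        delay_line = [sample] + delay_line[:-1]
        acc = 0.0
        for coeff, delayed in zip(h, delay_line):
            acc += coeff * delayed   # one multiply-accumulate per tap
        y.append(acc)
    return y


impulse_response = fir_filter([0.5, 0.25, 0.125], [1.0, 0.0, 0.0, 0.0])
print(impulse_response)
```

    Feeding a unit impulse returns the coefficients themselves, which is a quick sanity check on the tap ordering.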

    Slide: 12

    Motivation

    Conventional processors need more than 5 cycles/tap/sample to implement the above FIR filter
    DSP architectures try to reduce the cycles needed to compute this primitive
    This is accomplished by
        Harvard architecture
        Efficient addressing modes

    Slide: 13

    Harvard Architecture

    Basic Harvard Architecture
        Separate program and data buses
        Different from the von Neumann architecture
    Modification 1
        Data fetches possible from program memory
        Opcode and one data fetch done in parallel

    [Figure: Basic Harvard Architecture. Program Memory on the P BUS; Data Memory on the D BUS.]

    [Figure: Harvard Architecture Modification #1. Program/Data Memory on the P BUS; Data Memory on the D BUS.]

    Slide: 14

    Harvard Architecture

    [Figure: Harvard Architecture Modification 2. Program Memory on the P BUS; Multi-Port Data Memory on D BUS 1 and D BUS 2.]

    Modification 2
        One program memory
        One dual-ported data memory
        3 buses for the internal memory
            2 for data
            1 for program
        2 data words can be fetched in parallel with an instruction

    Slide: 15

    Harvard Architecture

    [Figure: Harvard Architecture Modification 3. Program Memory and Program Cache on the P BUS; Data Memory 1 on D BUS 1; Data Memory 2 on D BUS 2.]

    Modification 3
        One program memory & program cache
        Two data memories
        3 buses for the internal memory
            2 for data & 1 for program
        2 data words can be fetched in parallel with an instruction

    Slide: 16

    Addressing Mode: Circular Addressing

    Avoids shifting of data in the delay line
    Oldest element is overwritten by the new element
    Pointer wraps around once it crosses the start or the end of the circular buffer
    Need to maintain 5 parameters for circular buffer operation

    [Figure: Circular buffer example holding X(n), X(n-1), ..., X(n-7). X(n) is the most recent sample at time instant n and becomes the 2nd most recent sample at instant n+1. X(n-7) is the oldest sample at instant n and will be overwritten by the new sample at instant n+1. A second diagram shows a buffer from Start to End holding X(n-m), X(n), X(n-m-1), ..., X(n-N).]
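
    A minimal software sketch of the wrap-around behaviour described on this slide. On a DSP the address generation hardware does this; the class and method names here are hypothetical.

```python
# Sketch of a circular buffer used as an FIR delay line.
class CircularBuffer:
    def __init__(self, length: int):
        self.data = [0] * length
        self.newest = length - 1   # index of the most recent sample

    def push(self, sample):
        """Advance the pointer with wrap-around and overwrite the oldest element."""
        self.newest = (self.newest + 1) % len(self.data)
        self.data[self.newest] = sample

    def delayed(self, m: int):
        """Return x(n - m), where m = 0 is the most recent sample."""
        return self.data[(self.newest - m) % len(self.data)]


buf = CircularBuffer(4)
for sample in [10, 20, 30, 40, 50]:
    buf.push(sample)
print(buf.data, buf.delayed(0), buf.delayed(3))
```

    No element is moved when a new sample arrives; only the pointer changes, which is what removes the delay-line shift from the FIR inner loop.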

    Slide: 17

    Multiple Access Memories

    Supports multiple, sequential accesses per instruction cycle
    Can be combined with Harvard architecture for better performance
    Supporting off-chip memory means introducing significant additional delay between the processor core and memory

    Slide: 18

    Multiported Memories

    Have multiple independent sets of address and data connections
    Can provide multiple simultaneous accesses
    Costly
    Supporting off-chip memory means a larger and more expensive package

    Slide: 19

    Program Cache

    Simplest type is a single-instruction repeat buffer
    Can be extended to a multi-word repeat buffer
    Another type is a single-sector instruction cache
    Can be extended to a cache with multiple independent sectors
    Used only for program instructions and not for data

    Slide: 20

    Wait States

    State in which the processor waits to access memory
    Conflict wait states
        Multiple accesses to a memory that cannot handle multiple accesses
    Externally requested wait states
        Multiple processors sharing a data bus
        TMS320C5x has a special READY pin which can be used by external hardware to signal the processor that it must wait before accessing external memory.

    Slide: 21

    Multiprocessor Support - Memory Interface

    Multiple external memory ports
    Sometimes multiple processors share one external memory bus
        Bus arbitration required
        Two pins can be configured to act as bus request and bus grant signals
    TMS320C5x allows external access to on-chip memory through BR and IAQ signals
        Helpful for multiprocessor communication without shared memory

    Slide: 22

    Direct Memory Access

    Handled by a DMA controller
    Coupled with the Bus Request and Bus Grant pins of the processor
    Some sophisticated DMA controllers reside on-chip and access on-chip memory
    Multiple-channel DMA controllers handle multiple memory transfers in parallel

    Slide: 23

    Memory Addressing Schemes

    Implied addressing
        Operand addresses are implied
        P = X * Y
    Immediate data
        Operand itself is encoded in the instruction
        AX0 = 1234
    Memory direct addressing
        The address of the data in memory is encoded in the instruction word
        AX0 = DM(1000)

    Slide: 24

    Memory Addressing Schemes

    Register direct addressing
        Data being addressed resides in a register
        SUBF R1, R2
    Register indirect addressing
        Data resides in memory and the address resides in the register
        A0 = A0 + *R5

    [Figure: an address register holding 0x1000 points to memory location 0x1000, which holds the value 7.]

    Slide: 25

    Memory Addressing Schemes

    Register indirect addressing with pre- and post-increment
        A0 = A0 + *R5++ (Post Increment)
        A0 = A0 + *R5++R17 (Post Increment)
            Address incremented by the value stored in register R17
        MOVE X: -(R0), A1 (Pre Decrement)

    Slide: 26

    Memory Addressing Schemes

    Register indirect addressing with indexing
        Values stored in two address registers are added to form an effective address
        Does not change the content of any of the address registers
        MOVE Y1, X: (R6 + N6)
        LDI *-AR1(1), R7

    Slide: 27

    Memory Addressing Schemes

    Register addressing with bit reversal
        Used for FFT
        The output or input is in a scrambled order

        000 = 0
        100 = 4
        010 = 2
        110 = 6
        001 = 1
        101 = 5
        011 = 3
        111 = 7
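
    The scrambled order listed above can be reproduced with a short sketch. A DSP generates these addresses in hardware; the function below is only a software illustration with a hypothetical name.

```python
# Sketch: bit-reversed index sequence for an 8-point FFT (3 address bits).
def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of index."""
    reversed_index = 0
    for _ in range(bits):
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


order = [bit_reverse(i, 3) for i in range(8)]
print(order)
```

    Counting 0 to 7 and reversing the three address bits yields the sequence on the slide.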

    Slide: 28

    Instruction Set

    Slide: 29

    Instruction Types

    Arithmetic & multiplication
    Logic operations
    Shifting
    Rotation
    Comparison
    Looping
    Branching, subroutine calls and returns
    Conditional instructions
    Special function instructions
        Block floating point instructions, stack operations etc.
    Bit manipulation instructions

    Slide: 30

    Registers

    Accumulators
    General & special purpose registers
    Address registers
    Other registers
        Stack pointer
        Program counter
        Loop registers

    Slide: 31

    Parallel Move Support

    Operand-related parallel moves
        MPY (R0), (R4)
        Accesses are limited to those required by the arithmetic operation
    Operand-unrelated parallel moves
        MPY X0, Y0, A    X: (R0)+, X0    Y1, Y: (R4)+
        Memory accesses unrelated to the operands of the ALU operation

    Slide: 32

    Orthogonality

    Indicates the extent to which the processor instruction set is consistent
    Depends upon
        Consistency & completeness of the instruction set
        Degree to which operands and addressing modes are uniformly available with different operations

    Slide: 33

    Assembly Language Format

    Traditional opcode-operand variety    C-like syntax

    MPY X0, Y0                            P = X0 * Y0
    ADD P, A                              A = P + A
    MOV (R0), X0                          X0 = *R0
    JMP LOOP                              GOTO LOOP

    Slide: 34

    Execution Control

    Slide: 35

    Looping

    Hardware looping
        RPT #16
        MAC (R0)+, (R4)+, A

    Software looping
        MOVE #16, B
        LOOP: MAC (R0)+, (R4)+, A
        DEC B
        JNE LOOP

    Slide: 36

    Considerations in Looping

    Sometimes a loop repetition count of 0 causes the processor to repeat the loop the maximum number of times

    Consider loop effects on interrupt latency
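
    One way the zero-count behaviour can arise is sketched below. The model is an assumption for illustration only: a hypothetical 16-bit loop counter that is decremented after each pass and tested for zero afterwards. Real processors differ, so the device manual is the authority.

```python
# Sketch of a hypothetical hardware loop counter that decrements, then tests.
COUNTER_BITS = 16                      # assumed counter width
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def hardware_loop_iterations(count: int) -> int:
    """Return how many times the loop body runs for a given repeat count."""
    counter = count & COUNTER_MASK
    iterations = 0
    while True:
        iterations += 1                           # loop body executes once
        counter = (counter - 1) & COUNTER_MASK    # decrement wraps below zero
        if counter == 0:
            break
    return iterations


print(hardware_loop_iterations(16), hardware_loop_iterations(0))
```

    In this model a count of 0 wraps to the largest counter value, so the body runs 65536 times instead of not at all.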

    Slide: 37

    Nesting

    Directly nestable
        Hardware loop instruction placed within the outer loop
    Partially nestable
        Single-instruction loop inside a multi-instruction loop
    Software nestable
        Multi-instruction hardware loops are nested by saving various registers such as loop index, loop start & loop count

    Slide: 38

    Interrupts

    Interrupt sources
        On-chip peripherals, external interrupt lines and software interrupts
    Interrupt vectors
        Associate each interrupt with a different memory address
        Typically one or two words long and located in low memory
        Usually contain a branch or subroutine call to an interrupt handler routine

    Slide: 39

    Interrupt latency

    Time between the assertion of an external interrupt line and the execution of the first word of the interrupt vector
    The following add up to the interrupt latency
        Setup time: the interrupt line must be asserted before the start of the instruction cycle in which the interrupt is said to have occurred
        Time to pass through synchronization stages
        Wait until the processor reaches an interruptible state
        Wait until all instructions in the pipeline are finished
        If the interrupt vector holds only the address of the interrupt routine, the time required to branch to that location

    Slide: 40

    Stacks

    Typically one of three kinds of stack support is provided
        Shadow registers
        Hardware stack
        Software stack

    Slide: 41

    Pipelining

    Slide: 42

    Pipelining and Performance

    Technique for increasing the performance of a processor
        Breaks a sequence of operations into smaller pieces
        Executes the pieces in parallel whenever possible
    Hypothetical processor
        Fetch an instruction word from memory
        Decode the instruction
        Read/write data operands from/to memory
        Execute the ALU or MAC operation of the instruction

    Slide: 43

    Pipelining and Performance

    Clock Cycle         1    2    3    4    5    6    7
    Instruction Fetch   I1   I2   I3   I4   I5   I6   I7
    Decode                   I1   I2   I3   I4   I5   I6
    Data Read/Write               I1   I2   I3   I4   I5
    Execute                            I1   I2   I3   I4

    (The four stages make up the pipeline depth.)

    Perfect Overlap

    100% utilization of processor execution stages

    Ideal scenario
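
    The perfect-overlap case can be generated with a short sketch. It models the four-stage hypothetical processor of the previous slide; the stage labels and function names are illustrative assumptions.

```python
# Sketch of an ideal (stall-free) pipeline schedule.
STAGES = ["Fetch", "Decode", "Data R/W", "Execute"]


def total_cycles(num_instructions: int, depth: int = len(STAGES)) -> int:
    """Cycles needed to run N instructions through an ideal pipeline."""
    return num_instructions + depth - 1


def pipeline_schedule(num_instructions: int, stages=STAGES):
    """Return {stage: [instruction label or None for each clock cycle]}."""
    cycles = total_cycles(num_instructions, len(stages))
    schedule = {stage: [None] * cycles for stage in stages}
    for i in range(num_instructions):
        for s, stage in enumerate(stages):
            # Instruction i enters stage s exactly s cycles after its fetch.
            schedule[stage][i + s] = f"I{i + 1}"
    return schedule


schedule = pipeline_schedule(4)
for stage, row in schedule.items():
    print(f"{stage:<9}", " ".join(f"{slot or '--':<3}" for slot in row))
```

    Once the pipeline is full, one instruction completes per cycle, so N instructions take N + depth - 1 cycles rather than N x depth.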

    Slide: 44

    Conflicting Instruction

    Clock Cycle         1    2    3    4    5      6    7
    Instruction Fetch   I1   I2   I3   I4   I5     I6   I7
    Decode                   I1   I2   I3   I4     I5   I6
    Data Read/Write               I1   I2   I2/I3  I4   I5
    Execute                            I1   I2     I3   I4

    I2 tries to write to memory while I3 tries to read memory

    Solution to this problem is interlocking

    Interlocking is delaying the conflicting instruction in pipeline

    Slide: 45

    Interlocking

    Clock Cycle         1    2    3    4    5    6    7
    Instruction Fetch   I1   I2   I3   I4   I4   I5   I6
    Decode                   I1   I2   I3   I3   I4   I5
    Data Read/Write               I1   I2   I2   I3   I4
    Execute                            I1   I2   NOP  I3

    Interlocking resolves the resource conflict
        Pipeline sequencer holds instruction I3 at the decode stage
        I4 is held at the fetch stage
        One instruction cycle penalty occurs

    Slide: 46

    Multicycle Branching Effects

    Clock Cycle         1    2    3    4    5    6    7    8
    Instruction Fetch   BR   I2   ---  ---  I4   I5   I6   I7
    Decode                   BR   ---  ---  ---  I4   I5   I6
    Data Read/Write               BR   ---  ---  ---  I4   I5
    Execute                            BR   NOP  NOP  NOP  I4

    When a branch instruction reaches the decode stage, one instruction has already been fetched and has to be flushed from the pipeline
    NOPs are executed for the invalidated pipeline slots
    A multicycle branch typically executes for as many cycles as the pipeline depth

    Slide: 47

    Delayed Branching Effects

    Clock Cycle         1    2    3    4    5    6    7    8
    Instruction Fetch   BR   N2   N3   N4   I4   I5   I6   I7
    Decode                   BR   N2   N3   N4   I4   I5   I6
    Data Read/Write               BR   N2   N3   N4   I4   I5
    Execute                            BR   N2   N3   N4   I4

    An alternative to the multicycle branch that does not flush the pipeline
    Instructions that are to execute before the branch takes effect must be located immediately after the branch instruction in memory
    Increases efficiency, but the code can look confusing on casual inspection

    Slide: 48

    Interrupt Effects

    Clock Cycle         3    4     5     6     7    8    9    10
    Instruction Fetch   I6   ---   ---   ---   V1   V2   V3   V4
    Decode              I5   INTR  ---   ---   ---  V1   V2   V3
    Data Read/Write     I4   I5    INTR  ---   ---  ---  V1   V2
    Execute             I3   I4    I5    INTR  NOP  NOP  NOP  V1

    Processor inserts the INTR instruction in the pipeline
    INTR is a special branch instruction that flushes the pipeline and jumps to the appropriate interrupt vector location
    Causes a 4-cycle delay before the first word of the interrupt vector is executed
    I6 is flushed but is refetched on returning from the interrupt

    Slide: 49

    Fast Interrupt Processing

    Clock Cycle         1    2    3    4    5    6    7    8
    Instruction Fetch   I3   I4   V1   V2   I5   I6   I7   I8
    Decode              I2   I3   I4   V1   V2   I5   I6   I7
    Execute             I1   I2   I3   I4   V1   V2   I5   I6

    Interrupt handler is stored at the interrupt vector location
    In this case V1 & V2 are the two instructions in the interrupt vector
    This is called a fast interrupt because it does not insert any delay in the pipeline

    Slide: 50

    Peripherals

    Slide: 51

    Serial Ports

    Serial interface transmits and receives data one bit at a time
    Requires far fewer interface pins than a parallel interface
    Used for a variety of applications
        Sending/receiving data to/from A/D and D/A converters
        Sending/receiving data to/from other processors or DSPs
        Communicating with other external peripherals

    Slide: 52

    Serial Ports

    Synchronous
        Transmits a bit clock signal in addition to the serial data bits
        Receiver uses that clock for sampling the received data
    Asynchronous
        Does not transmit a separate clock signal
        Receiver deduces the clock signal from the serial data itself
        More complex

    Slide: 53

    Data and Clock

    [Figure: timing diagram of the BIT CLOCK, FRAME SYNC and DATA signals, with data bits numbered 0, 1, ...]

    Most DSPs allow changing the clock polarity, data polarity and shift direction
    Frame sync signal indicates the position of the first bit of a data word on the serial data line
    Common frame sync formats are bit length and word length
    Can also have multiple words per frame

    Slide: 54

    Serial Clock Generation

    Provide circuitry for clock generation
    Usually called serial clock generation support
    Normally done by scaling the master clock in the DSP
    Usually contains a prescaler and a down counter

    Slide: 55

    Time Division Multiplex

    [Figure: four DSPs connected to shared CLOCK, FRAME SYNC and DATA lines.]

    One processor (or external circuitry) generates the clock and frame sync signals
    Frame sync indicates the start of a new set of time slots
    Transmitted data word might contain some number of bits to indicate the destination DSP; the other bits are used for data
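
    A sketch of how a transmitted word could carry both a destination and data. The field widths (2 address bits, 14 data bits in a 16-bit word) and the function names are assumptions for illustration; the slides do not specify a layout.

```python
# Sketch: packing a destination DSP number and data into one serial word.
ADDRESS_BITS = 2    # assumed: enough to select one of four DSPs
DATA_BITS = 14      # assumed: the rest of a 16-bit word


def pack_word(destination: int, data: int) -> int:
    """Place the destination in the top bits and the data in the low bits."""
    if not 0 <= destination < (1 << ADDRESS_BITS):
        raise ValueError("destination out of range")
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("data out of range")
    return (destination << DATA_BITS) | data


def unpack_word(word: int):
    """Split a received word back into (destination, data)."""
    return word >> DATA_BITS, word & ((1 << DATA_BITS) - 1)


word = pack_word(2, 0x1234)
print(hex(word), unpack_word(word))
```

    Each receiving DSP would compare the destination field with its own number and ignore words addressed elsewhere.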

    Slide: 56

    Timers

    Programmable timers are often a source of periodic interrupts
    May also be used as a software-controlled square wave generator

    [Figure: timer block diagram with Clock Source, Prescale Preload Value and Counter Preload Value.]

    Slide: 57

    Parallel Ports

    Transmit/receive multiple data bits at a time
    Faster than serial ports but require more pins
    External data bus may be used as a parallel port
    Can also have separate parallel ports
    Bit I/O ports
        Individual pins can be made input or output on a bit-by-bit basis
    Host ports
        Specialized 8/16-bit bidirectional parallel ports used for data transfer between the DSP and a host microprocessor
        May be used to control the DSP
    Communication ports
        Special parallel ports intended for multiprocessor communication
