09 Accelerators(1).ppt


  • 8/10/2019 09 Accelerators(1).ppt

    1/47

    Digital Design: An Embedded Systems Approach Using Verilog

    Chapter 9

    Accelerators

    Portions of this work are from the book, Digital Design: An Embedded Systems Approach Using Verilog, by Peter J. Ashenden, published by Morgan Kaufmann Publishers, Copyright 2007 Elsevier Inc. All rights reserved.


    Performance and Parallelism

    A processor core performs steps in sequence
        Performance limited by the instruction rate
    Accelerating performance
        Perform steps in parallel
        Takes less time overall to complete an operation
    Instruction-level parallelism
        Within a processor core
        Pipelining, multiple-issue
    Accelerators
        Custom hardware for parallel operations


    Achievable Parallelism

    How many steps can be performed at once?
    Regularly structured data
        Independent processing steps
        Examples
            Video and image pixel processing
            Audio or sensor signal processing
    Constrained by data dependencies
        Operations that depend on results of previous steps


    Algorithm Kernels

    Algorithm: specification of the required processing steps
        Often expressed in a programming language
    Kernel: the part that involves the most intensive, repetitive processing
        "10% of operations take 90% of the time"
    Accelerating a kernel with parallel hardware gives the best payback


    Amdahl's Law

    Time for an algorithm is t
        Fraction f is spent on a kernel
        $t = f t + (1 - f) t$
    Accelerator speeds up kernel by a factor s
        $t' = \frac{f}{s} t + (1 - f) t$
    Overall speedup factor s'
        $s' = \frac{t}{t'} = \frac{1}{\frac{f}{s} + (1 - f)}$
        For large f, $s' \to s$
        For small f, $s' \to 1$


    Amdahl's Law Example

    An algorithm with two kernels
        Kernel 1: 80% of time, can be sped up 10 times
        Kernel 2: 15% of time, can be sped up 100 times
        Which speedup gives best overall improvement?
    For kernel 1:
        $s' = \frac{1}{\frac{0.8}{10} + (1 - 0.8)} = \frac{1}{0.08 + 0.2} = 3.57$
    For kernel 2:
        $s' = \frac{1}{\frac{0.15}{100} + (1 - 0.15)} = \frac{1}{0.0015 + 0.85} = 1.17$
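
    Kernel 1 wins despite its smaller speedup factor because Amdahl's Law caps the overall speedup at 1/(1 - f), however large s becomes. The worked limits below use only the figures from the example. The combined case is an added illustration and assumes the two kernels do not overlap in time.

```latex
\begin{align*}
\lim_{s \to \infty} s' &= \frac{1}{1 - f} \\
\text{Kernel 1 } (f = 0.8):\quad s'_{\max} &= \frac{1}{0.2} = 5 \\
\text{Kernel 2 } (f = 0.15):\quad s'_{\max} &= \frac{1}{0.85} \approx 1.18 \\
\text{Both accelerated:}\quad s' &= \frac{1}{\frac{0.8}{10} + \frac{0.15}{100} + 0.05}
  = \frac{1}{0.1315} \approx 7.60
\end{align*}
```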


    Parallel Architectures

    An architecture for an accelerator specifies
        Processing blocks
        Data flow between them
    Parallelism through replication
        Multiple identical blocks operating on different data elements
        Works well when elements can be processed independently
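
    Replication can be sketched in Verilog with a generate loop. The module below is an illustration only, not from the slides: the name replicated_add, the parameters LANES and WIDTH, and the choice of an adder as the processing block are all assumptions.

```verilog
// Sketch: LANES identical adders, each working on its own data element.
module replicated_add #(
  parameter LANES = 4,   // number of replicated blocks (assumed)
  parameter WIDTH = 8    // bits per data element (assumed)
) (
  input  wire [LANES*WIDTH-1:0]     a,    // LANES packed operands
  input  wire [LANES*WIDTH-1:0]     b,
  output wire [LANES*(WIDTH+1)-1:0] sum   // one (WIDTH+1)-bit result per lane
);
  genvar i;
  generate
    for (i = 0; i < LANES; i = i + 1) begin : lane
      // No lane reads another lane's result, so all lanes operate in parallel.
      assign sum[i*(WIDTH+1) +: (WIDTH+1)] =
             a[i*WIDTH +: WIDTH] + b[i*WIDTH +: WIDTH];
    end
  endgenerate
endmodule
```

    Because the lanes share no signals, adding lanes raises throughput without lengthening the critical path of any one block.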


    Parallel Architectures

    Parallelism through pipelining
        Break a computation into steps, perform them in assembly-line fashion
        Latency (time to complete a single operation) is not reduced
        Throughput (rate of completion of operations) is increased
            Ideally by a factor equal to the number of pipeline stages
    [Figure: data in → step 1 → step 2 → step 3 → data out]
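
    A minimal sketch of a three-step pipeline, assuming 8-bit data. The module name pipe3 and the three operations are placeholders chosen for illustration; the slides do not say what each step computes.

```verilog
// Sketch: three registered stages. A new input can enter every clock cycle,
// and each result appears three cycles after its input.
module pipe3 (
  input  wire       clk,
  input  wire [7:0] data_in,
  output reg  [7:0] data_out
);
  reg [7:0] s1, s2;   // pipeline registers between the steps

  always @(posedge clk) begin
    s1       <= data_in + 8'd1;   // step 1 (placeholder operation)
    s2       <= s1 << 1;          // step 2 (placeholder operation)
    data_out <= s2 ^ 8'h55;       // step 3 (placeholder operation)
  end
endmodule
```

    The clock period only has to cover the slowest single step, so up to three operations are in flight at once while each one still takes three cycles to complete.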


    Direct Memory Access (DMA)

    Input/output data for accelerators must be transferred at high speed
        Using the processor would be too slow
    Direct memory access
        I/O controller and accelerator transfer data to and from memory autonomously
        Program supplies starting address and length
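
    The "starting address and length" idea can be sketched as a pair of counters. This is a hypothetical illustration, not the design from the slides: the module name dma_counter, its ports, and the widths AW and LW are assumptions, and the bus handshake is reduced to a single xfer_done pulse.

```verilog
// Sketch: DMA address/length bookkeeping. Software supplies start_addr and
// length, pulses start, and the counters then advance without the processor.
module dma_counter #(
  parameter AW = 20,   // address width in words (assumed)
  parameter LW = 16    // length counter width (assumed)
) (
  input  wire          clk,
  input  wire          rst,
  input  wire          start,       // begin a transfer
  input  wire [AW-1:0] start_addr,  // first word address
  input  wire [LW-1:0] length,      // number of words to move
  input  wire          xfer_done,   // one bus transfer has completed
  output reg  [AW-1:0] addr,        // address for the next transfer
  output wire          busy
);
  reg [LW-1:0] remaining;

  assign busy = (remaining != 0);

  always @(posedge clk)
    if (rst) begin
      addr      <= 0;
      remaining <= 0;
    end else if (start && !busy) begin
      addr      <= start_addr;
      remaining <= length;
    end else if (busy && xfer_done) begin
      addr      <= addr + 1'b1;       // step to the next word
      remaining <= remaining - 1'b1;  // busy drops when this reaches zero
    end
endmodule
```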


    Bus Arbitration

    Bus masters take turns to use the bus to access slaves
        Controlled by a bus arbiter
    Arbitration policies
        Priority, round-robin, ...
    [Figure: processor, accelerator, and controller each exchange request and grant signals with the arbiter; all share the memory bus to memory]


    Block-Processing Accelerator

    Data arranged in regular groups of contiguous memory locations
        Accelerator works block by block
        E.g., images in blocks of 8 × 8 16-bit pixels
    Datapath comprises
        Memory access: address generation, counters
        Computation section
        Control section: finite-state machine(s)


    Stream-Processing Accelerator

    Streams of data from an input source
        E.g., high-speed sensors
    Digital signal processing (DSP)
        Analog sensor signal converted to stream of digital sample values
        Filtering, gain/attenuation, frequency-domain conversion (Fourier transform)


    Processor/Accelerator Interface

    Embedded software controls an accelerator
        Providing control parameters
        Synchronizing operations
    Input/output registers and interrupts
        Interact with the control sequencer


    Case Study: Edge Detection

    Illustration of accelerator design
    Edge detection in video processing
        Identify where image intensity changes abruptly
        Typically at the boundary of objects
        First step in identifying objects in a scene
    Application areas
        Video surveillance, computer vision, ...
    For this case study
        Monochrome images of 640 × 480 8-bit pixels
        Stored row-by-row in memory
        Pixel values: 0 (black) to 255 (white)


    Sobel Edge Detection

    Compute derivatives of intensity in x and y directions
    Look for minima and maxima (where intensity changes most rapidly)


    The Sobel Algorithm

    Use convolution to approximate partial derivatives $D_x$ and $D_y$ at each position
        Weighted sum of value of a pixel and its eight nearest neighbors
        Coefficients represented using a 3 × 3 convolution mask
    Sobel masks for x and y derivatives
        $G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}$
        $G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$
        $D_x(i,j) = O(i,j) \ast G_x$, $D_y(i,j) = O(i,j) \ast G_y$


    The Sobel Algorithm

    Combine partial derivatives
        $|D| = \sqrt{D_x^2 + D_y^2}$
    Since we just want maxima and minima in magnitude, approximate as:
        $|D| = |D_x| + |D_y|$
    Edge pixels don't have eight neighbors
        Skip computation of |D| for edges
        Just set them to 0 using software


    The Algorithm in Pseudocode

    for (row = 1; row


    Data Formats and Rates

    Pixel values: 0 to 255 (8 bits)
    Coefficients are 0, ±1 and ±2
    Partial products: -510 to +510 (10 bits)
    D_x and D_y: -1020 to +1020 (11 bits)
    |D|: 0 to 2040 (11 bits)
    Final pixel value: scale back to 8 bits
    Video rate: 30 frames/sec
        640 × 480 = 307,200 pixels
        307,200 × 30 ≈ 10 million pixels/sec
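
    The bit widths above can be checked against a small combinational sketch of the per-pixel computation. This is an illustration under stated assumptions, not the pipelined datapath used in the case study: the module name sobel_pixel and the port names p00 to p22 are hypothetical, and "scale back to 8 bits" is assumed here to mean dropping the three least-significant bits.

```verilog
// Sketch: |D| = |Dx| + |Dy| for one pixel from its eight neighbours.
// pRC is the neighbour in row R (0 = previous, 2 = next), column C (0 = left).
module sobel_pixel (
  input  wire [7:0] p00, p01, p02,
  input  wire [7:0] p10,      p12,   // centre pixel has zero weight in both masks
  input  wire [7:0] p20, p21, p22,
  output wire [7:0] d_out
);
  // Zero-extend the unsigned pixels into 11-bit signed values.
  wire signed [10:0] s00 = {3'b000, p00}, s01 = {3'b000, p01}, s02 = {3'b000, p02};
  wire signed [10:0] s10 = {3'b000, p10},                      s12 = {3'b000, p12};
  wire signed [10:0] s20 = {3'b000, p20}, s21 = {3'b000, p21}, s22 = {3'b000, p22};

  // Coefficients are 0, +/-1, +/-2, so each multiply is a negate and/or a shift.
  wire signed [10:0] dx = (s02 - s00) + ((s12 - s10) <<< 1) + (s22 - s20);
  wire signed [10:0] dy = (s00 - s20) + ((s01 - s21) <<< 1) + (s02 - s22);

  wire [10:0] abs_dx = dx[10] ? -dx : dx;   // 0 to 1020
  wire [10:0] abs_dy = dy[10] ? -dy : dy;   // 0 to 1020
  wire [10:0] mag    = abs_dx + abs_dy;     // 0 to 2040, fits in 11 bits

  assign d_out = mag[10:3];                 // assumed scaling: divide by 8
endmodule
```

    With a sharp vertical edge (left column 0, right column 255) the sketch gives dx = 1020 and dy = 0, so d_out is 127 under the assumed scaling.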


    Data Dependencies

    Pixels can be computed independently
    For each pixel:


    System Architecture

    Data dependencies suggest a pipeline
    Coefficient multiplies are simple shift/negate, so merge with adder stage


    Memory Bandwidth

    Assume memory read/write takes 20 ns (2 cycles of 100 MHz clock)
        Memory is 32 bits wide, byte addressable
        Bandwidth = 50M operations/sec
    Camera produces 10 Mpixels/sec
        Accelerator needs to process at this rate
        (8 reads + 1 write) × 10 Mpixels/sec = 90M operations/sec
        Greater than memory bandwidth


    Memory Bandwidth

    Read 4 pixels at once from each of previous, current, and next rows
        Store in accelerator to compute multiple derivative image pixels
    Produce derivative pixels row-by-row, left-to-right
        Read 3 × 32-bit words for every 4th derivative pixel computed
        Write 4 pixels at a time
        (3 reads + 1 write) / 4 × 10 Mpixels/sec = 10M operations/sec = 20% of available memory bandwidth
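
    The bandwidth figures on these two slides follow from a few lines of arithmetic, collected here using only the numbers already given (20 ns per memory operation, 10 Mpixels/sec):

```latex
\begin{align*}
\text{Available bandwidth} &= \frac{1}{20\ \text{ns}} = 50 \times 10^{6}\ \text{operations/s} \\
\text{One pixel per access} &= (8 + 1) \times 10 \times 10^{6} = 90 \times 10^{6}\ \text{operations/s} \\
\text{Four pixels per word} &= \frac{3 + 1}{4} \times 10 \times 10^{6} = 10 \times 10^{6}\ \text{operations/s} \\
\text{Utilization} &= \frac{10 \times 10^{6}}{50 \times 10^{6}} = 20\%
\end{align*}
```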


    Sobel Accelerator Architecture


    Accelerator Sequence

    Steady state
        Write 4 result pixels
        Read 4 pixels for previous, current, next rows
        Compute for 4 cycles
        Repeat
    Start of row
        Omit writes until pipeline full
    End of row
        Omit reads to drain pipeline


    Memory Operation Timing

    Steady state


    Pixel Datapath

    // Computation datapath signals
    reg [31:0] prev_row, curr_row, next_row;
    reg [7:0] O [-1:+1][-1:+1];
    reg signed [10:0] Dx, Dy, D;
    reg [7:0] abs_D;
    reg [31:0] result_row;
    ...
    // Computational datapath
    always @(posedge clk_i) // Previous row register
      if (prev_row_load) prev_row


    Pixel Datapath

    O[-1][-1]


    Address Generation

    Given an image in memory at base address B
        Address for pixel in row r, column c is B + r × 640 + c
    Base address (B) is fixed
    Offset (r × 640 + c) increments by 4 for each group of 4 pixels read/written
    Use word-aligned addresses
        Two least-significant bits always 00
        Increment word address by 1
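
    A short worked example of the address formula, using a hypothetical base address B = 4096 (0x1000) and the pixel at row r = 2, column c = 8. These values are chosen only for illustration.

```latex
\begin{align*}
\text{byte address} &= B + r \times 640 + c = 4096 + 2 \times 640 + 8 = 5384 \\
\text{word address} &= 5384 / 4 = 1346 \\
\text{next group of 4 pixels:}\quad & \text{byte address } 5388,\ \text{word address } 1347
\end{align*}
```

    Since both the base and the column are multiples of 4, the byte address ends in binary 00, and stepping to the next group of four pixels adds 1 to the word address.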


    Address Generation


    Address Generation

    always @(posedge clk_i) // O base address register
      if (O_base_ce) O_base


    Address Generation

    assign O_prev_addr = O_base + O_offset;
    assign O_curr_addr = O_prev_addr + 640/4;
    assign O_next_addr = O_prev_addr + 1280/4;
    assign D_addr = D_base + D_offset;
    assign adr_o[21:2] = prev_row_load ? O_prev_addr :
                         curr_row_load ? O_curr_addr :
                         next_row_load ? O_next_addr :
                         D_addr;
    assign adr_o[1:0] = 2'b00;


    Control/Status Registers

    Register  Offset  Read/Write  Purpose
    Int_en    0       Write-only  Interrupt enable (bit 0).
    Start     4       Write-only  Write causes image processing to start (value ignored).
    O_base    8       Write-only  Original image base address.
    D_base    12      Write-only  Derivative image base address + 640.
    Status    0       Read-only   Processing done (bit 0). Reading clears interrupt.


    Slave Bus Interface

    assign start = cyc_i && stb_i && we_i && adr_i == 2'b01;
    assign O_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b10;
    assign D_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b11;
    always @(posedge clk_i) // Interrupt enable register
      if (rst_i)
        int_en


    Control Sequencing

    Use a finite-state machine
    Counters keep track of rows (0 to 477) and columns (0 to 159)
    See textbook for details of FSM output functions


    State Transition Diagram


    Accelerator Verification

    Simulation-based verification of each section of the accelerator
        Slave bus operations
        Computation sequencing
        Master bus operations
        Address generation
        Pixel computation
    Testbench including the accelerator
        Bus functional processor model
        Simplified memory and bus arbiter models


    Sobel Verification Testbench

    [Figure: testbench block diagram with Processor BFM, Sobel Accelerator, Memory Model, and Arbiter, connected by a multiplexed bus (muxes and connections)]


    Processor Bus Functional Model

    initial begin // Processor bus-functional model
      cpu_adr_o


    Processor Bus Functional Model

    cpu_cyc_o = 1'b0; cpu_stb_o = 1'b0; cpu_we_o = 1'b0;
    begin: loop
      forever begin
        #10000;
        @(posedge clk);
        // Read status register
        cpu_adr_o


    Memory Bus Functional Model

    always begin // Memory bus-functional model
      mem_ack_o


    Bus Arbiter

    Uses sobel_cyc_o and cpu_cyc_o as request inputs
        If both request at the same time, give accelerator priority

    Mealy FSM
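
    One way such an arbiter could look is sketched below. It is a hedged illustration of the policy just described, not the arbiter from the textbook: the request names sobel_cyc_o and cpu_cyc_o come from the slide, while the module name arbiter_sketch, the grant outputs sobel_gnt and cpu_gnt, and the two-state encoding are assumptions.

```verilog
// Sketch: two-master Mealy arbiter. Grants depend on the current state and
// on the request inputs in the same cycle.
module arbiter_sketch (
  input  wire clk,
  input  wire rst,
  input  wire sobel_cyc_o,   // accelerator request
  input  wire cpu_cyc_o,     // processor request
  output reg  sobel_gnt,     // hypothetical grant outputs
  output reg  cpu_gnt
);
  localparam SOBEL = 1'b0, CPU = 1'b1;
  reg current_state, next_state;

  always @(posedge clk)      // FSM state register
    if (rst) current_state <= SOBEL;
    else     current_state <= next_state;

  always @* begin            // next-state and Mealy output logic
    sobel_gnt  = 1'b0;
    cpu_gnt    = 1'b0;
    next_state = SOBEL;      // an idle bus parks on the accelerator
    case (current_state)
      SOBEL:
        if (sobel_cyc_o)    sobel_gnt = 1'b1;
        else if (cpu_cyc_o) begin cpu_gnt = 1'b1; next_state = CPU; end
      CPU:
        if (cpu_cyc_o)        begin cpu_gnt = 1'b1; next_state = CPU; end
        else if (sobel_cyc_o) sobel_gnt = 1'b1;
    endcase
  end
endmodule
```

    Parking in the SOBEL state whenever the processor is not using the bus is what gives the accelerator priority when both masters raise their requests in the same cycle. A processor cycle already in progress is allowed to finish before the bus changes hands.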


    Bus Arbiter

    always @(posedge clk) // Arbiter FSM register
      if (rst) arbiter_current_state


    Simulation Results

    See waveforms in textbook
        Demonstrates sequencing and address generation
    But what about
        Data values computed correctly
        Interactions between processor and accelerator
    Need to use more sophisticated verification techniques
        Due to complexity of the design


    Summary

    Accelerators boost performance using parallel hardware
        Replication, pipelining, ...
    Amdahl's Law
        Best payback from accelerating a kernel
    DMA avoids processor overhead
    Verification requires advanced techniques