09 Accelerators(1).ppt

8/10/2019 09 Accelerators(1).ppt


Digital Design: An Embedded Systems Approach Using Verilog

Chapter 9

Accelerators

Portions of this work are from the book, Digital Design: An Embedded Systems Approach Using Verilog, by Peter J. Ashenden, published by Morgan Kaufmann Publishers, Copyright 2007 Elsevier Inc. All rights reserved.



Performance and Parallelism

A processor core performs steps in sequence
    Performance limited by the instruction rate
Accelerating performance
    Perform steps in parallel
    Takes less time overall to complete an operation
Instruction-level parallelism
    Within a processor core
    Pipelining, multiple-issue
Accelerators
    Custom hardware for parallel operations




Achievable Parallelism

How many steps can be performed at once?
Regularly structured data
    Independent processing steps
    Examples
        Video and image pixel processing
        Audio or sensor signal processing
Constrained by data dependencies
    Operations that depend on results of previous steps




Algorithm Kernels

Algorithm: specification of the required processing steps
    Often expressed in a programming language
Kernel: the part that involves the most intensive, repetitive processing
    10% of operations take 90% of the time
Accelerating a kernel with parallel hardware gives the best payback




Amdahl's Law

Time for an algorithm is t
    Fraction f is spent on a kernel

    $t = f t + (1 - f) t$

Accelerator speeds up kernel by a factor s

    $t' = \frac{f t}{s} + (1 - f) t$

Overall speedup factor s'

    $s' = \frac{t}{t'} = \frac{1}{f/s + (1 - f)}$

    For large f, s' ≈ s
    For small f, s' ≈ 1




Amdahl's Law Example

An algorithm with two kernels
    Kernel 1: 80% of time, can be sped up 10 times
    Kernel 2: 15% of time, can be sped up 100 times
    Which speedup gives best overall improvement?

For kernel 1:

    $s' = \frac{1}{0.8/10 + (1 - 0.8)} = \frac{1}{0.08 + 0.2} = 3.57$

For kernel 2:

    $s' = \frac{1}{0.15/100 + (1 - 0.15)} = \frac{1}{0.0015 + 0.85} = 1.17$
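
One way to see why kernel 1 wins despite its smaller speedup factor is to let s grow without bound in the same formula. The remaining fraction (1 - f) then sets a ceiling on the overall speedup. This is a supplementary worked step, not part of the original slide:

```latex
\[
\lim_{s \to \infty} s' = \frac{1}{1 - f}, \qquad
\text{kernel 1: } \frac{1}{1 - 0.8} = 5, \qquad
\text{kernel 2: } \frac{1}{1 - 0.15} \approx 1.18
\]
```

Even an infinitely fast accelerator for kernel 2 could not beat the 3.57 obtained from kernel 1.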




Parallel Architectures

An architecture for an accelerator specifies
    Processing blocks
    Data flow between them
Parallelism through replication
    Multiple identical blocks operating on different data elements
    Works well when elements can be processed independently
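
Replication can be sketched in Verilog with a generate loop. The module below is a hypothetical illustration, not code from the book: the module name, port names, and the threshold comparison are all assumptions, chosen only because each lane is independent of the others.

```verilog
// Sketch of parallelism through replication: N identical comparators,
// each working on its own 8-bit element of a packed input word.
// Names (replicate_lanes, pixels_in, threshold, above) are hypothetical.
module replicate_lanes #(parameter N = 4) (
  input  wire [8*N-1:0] pixels_in,  // N packed 8-bit elements
  input  wire [7:0]     threshold,
  output wire [N-1:0]   above       // one result bit per element
);
  genvar i;
  generate
    for (i = 0; i < N; i = i + 1) begin : lane
      // Each lane sees only its own element, so all lanes operate at once
      assign above[i] = pixels_in[8*i +: 8] > threshold;
    end
  endgenerate
endmodule
```

Doubling N doubles the hardware and the number of elements handled per cycle, which is the trade replication makes.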




Parallel Architectures

Parallelism through pipelining
    Break a computation into steps, perform them in assembly-line fashion
    Latency (time to complete a single operation) is not reduced
    Throughput (rate of completion of operations) is increased
        Ideally by a factor equal to the number of pipeline stages

[Figure: data in -> step 1 -> step 2 -> step 3 -> data out]
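
A minimal three-step pipeline might look like the following sketch. The arithmetic, (a + b + c) / 2, is only a stand-in for real processing steps, and every name here is hypothetical.

```verilog
// Three-step pipeline sketch: y = (a + b + c) / 2.
// A new set of inputs can enter every clock; each result appears
// three clock edges after its inputs were sampled.
module pipe3 (
  input  wire       clk,
  input  wire [7:0] a, b, c,
  output reg  [8:0] y
);
  reg [8:0] sum_ab;   // step 1 result
  reg [7:0] c_d1;     // c delayed one stage to stay aligned with sum_ab
  reg [9:0] sum_abc;  // step 2 result

  always @(posedge clk) begin
    sum_ab  <= a + b;          // step 1
    c_d1    <= c;
    sum_abc <= sum_ab + c_d1;  // step 2
    y       <= sum_abc[9:1];   // step 3: divide by 2
  end
endmodule
```

The pipeline registers between steps are what let three operations be in flight at the same time.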




Direct Memory Access (DMA)

Input/output data for accelerators must be transferred at high speed
    Using the processor would be too slow
Direct memory access
    I/O controller and accelerator transfer data to and from memory autonomously
    Program supplies starting address and length




Bus Arbitration

Bus masters take turns to use bus to access slaves
    Controlled by a bus arbiter
    Arbitration policies
        Priority, round-robin, ...

[Figure: processor, accelerator, and controller each exchange request and grant signals with the arbiter and share the memory bus to memory]




Block-Processing Accelerator

Data arranged in regular groups of contiguous memory locations
    Accelerator works block by block
    E.g., images in blocks of 8 × 8 16-bit pixels
Datapath comprises
    Memory access: address generation, counters
    Computation section
    Control section: finite-state machine(s)




Stream-Processing Accelerator

Streams of data from an input source
    E.g., high-speed sensors
Digital signal processing (DSP)
    Analog sensor signal converted to stream of digital sample values
    Filtering, gain/attenuation, frequency-domain conversion (Fourier transform)




Processor/Accelerator Interface

Embedded software controls an accelerator
    Providing control parameters
    Synchronizing operations
Input/output registers and interrupts
    Interact with the control sequencer




Case Study: Edge Detection

Illustration of accelerator design
Edge detection in video processing
    Identify where image intensity changes abruptly
    Typically at the boundary of objects
    First step in identifying objects in a scene
Application areas
    Video surveillance, computer vision, ...
For this case study
    Monochrome images of 640 × 480 8-bit pixels
    Stored row-by-row in memory
    Pixel values: 0 (black) to 255 (white)



Sobel Edge Detection

Compute derivatives of intensity in x and y directions
Look for minima and maxima (where intensity changes most rapidly)



The Sobel Algorithm

Use convolution to approximate partial derivatives D_x and D_y at each position
    Weighted sum of value of a pixel and its eight nearest neighbors
    Coefficients represented using a 3 × 3 convolution mask
Sobel masks for x and y derivatives

    $G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}$

    $G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$

    $D_x(i,j) = O(i,j) \ast G_x$

    $D_y(i,j) = O(i,j) \ast G_y$



The Sobel Algorithm

Combine partial derivatives

    $|D| = \sqrt{D_x^2 + D_y^2}$

Since we just want maxima and minima in magnitude, approximate as:

    $|D| \approx |D_x| + |D_y|$

Edge pixels don't have eight neighbors
    Skip computation of |D| for edges
    Just set them to 0 using software
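
The per-pixel arithmetic can be sketched as a purely combinational Verilog module. This is an illustrative sketch under stated assumptions, not the book's pipelined datapath: the module and port names are hypothetical, the masks are applied element by element to the 3 × 3 neighborhood without flipping (which at most changes the sign of a derivative and so does not affect |D|), and the standard Sobel x mask with bottom row -1, 0, +1 is assumed.

```verilog
// Combinational sketch of |D| = |Dx| + |Dy| for one pixel.
// pRC is the neighbor in row R, column C of the 3x3 window
// (row 0 = row above, column 0 = left). The center pixel is not
// needed because its coefficient is 0 in both masks.
module sobel_pixel (
  input  wire [7:0]  p00, p01, p02,
  input  wire [7:0]  p10,      p12,
  input  wire [7:0]  p20, p21, p22,
  output wire [10:0] abs_D
);
  // Zero-extend an 8-bit pixel to an 11-bit signed value
  function signed [10:0] ext;
    input [7:0] p;
    ext = {3'b000, p};
  endfunction

  // Coefficients of magnitude 2 are single-bit left shifts
  wire signed [10:0] Dx = (ext(p02) + (ext(p12) <<< 1) + ext(p22))
                        - (ext(p00) + (ext(p10) <<< 1) + ext(p20));
  wire signed [10:0] Dy = (ext(p00) + (ext(p01) <<< 1) + ext(p02))
                        - (ext(p20) + (ext(p21) <<< 1) + ext(p22));

  // Absolute values, selected on the sign bit
  wire [10:0] abs_Dx = Dx[10] ? -Dx : Dx;
  wire [10:0] abs_Dy = Dy[10] ? -Dy : Dy;

  assign abs_D = abs_Dx + abs_Dy;
endmodule
```

No multipliers appear, because every coefficient is 0, ±1 or ±2.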



The Algorithm in Pseudocode

for (row = 1; row




Data Formats and Rates

Pixel values: 0 to 255 (8 bits)
Coefficients are 0, ±1 and ±2
    Partial products: −510 to +510 (10 bits)
    D_x and D_y: −1020 to +1020 (11 bits)
    |D|: 0 to 2040 (11 bits)
Final pixel value: scale back to 8 bits
Video rate: 30 frames/sec
    640 × 480 = 307,200 pixels
    307,200 × 30 ≈ 10 million pixels/sec
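
The ranges and bit widths above follow from the largest pixel value and the mask coefficients. The derivation below is a supplementary worked step, assuming two's-complement representation for the signed quantities:

```latex
\[
\begin{aligned}
|\text{partial product}| &\le 2 \times 255 = 510 \le 2^{9} - 1 = 511 && \Rightarrow \text{10 bits, signed} \\
|D_x|,\ |D_y| &\le (1 + 2 + 1) \times 255 = 1020 \le 2^{10} - 1 = 1023 && \Rightarrow \text{11 bits, signed} \\
|D| &\le 1020 + 1020 = 2040 \le 2^{11} - 1 = 2047 && \Rightarrow \text{11 bits, unsigned} \\
640 \times 480 \times 30 &= 9{,}216{,}000 \approx 10^{7}\ \text{pixels/sec}
\end{aligned}
\]
```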



Data Dependencies

Pixels can be computed independently
For each pixel:



System Architecture

Data dependencies suggest a pipeline
Coefficient multiplies are simple shift/negate, so merge with adder stage



Memory Bandwidth

Assume memory read/write takes 20 ns (2 cycles of 100 MHz clock)
    Memory is 32 bits wide, byte addressable
    Bandwidth = 50M operations/sec
Camera produces 10 Mpixels/sec
    Accelerator needs to process at this rate
    (8 reads + 1 write) × 10 Mpixels/sec = 90M operations/sec
    Greater than memory bandwidth



Memory Bandwidth

Read 4 pixels at once from each of previous, current, and next rows
    Store in accelerator to compute multiple derivative image pixels
Produce derivative pixels row-by-row, left-to-right
    Read 3 × 32-bit words for every 4th derivative pixel computed
    Write 4 pixels at a time
(3 reads + 1 write) / 4 × 10 Mpixels/sec = 10M operations/sec
    = 20% of available memory bandwidth



Sobel Accelerator Architecture



Accelerator Sequence

Steady state
    Write 4 result pixels
    Read 4 pixels for previous, current, next rows
    Compute for 4 cycles
    Repeat
Start of row
    Omit writes until pipeline full
End of row
    Omit reads to drain pipeline



Memory Operation Timing

Steady state



Pixel Datapath

// Computation datapath signals
reg [31:0] prev_row, curr_row, next_row;
reg [7:0] O [-1:+1][-1:+1];
reg signed [10:0] Dx, Dy, D;
reg [7:0] abs_D;
reg [31:0] result_row;
...
// Computational datapath
always @(posedge clk_i) // Previous row register
  if (prev_row_load) prev_row




Pixel Datapath

O[-1][-1]




Address Generation

Given an image in memory at base address B
    Address for pixel in row r, column c is B + r × 640 + c
    Base address (B) is fixed
    Offset (r × 640 + c) increments by 4 for each group of 4 pixels read/written
Use word-aligned addresses
    Two least-significant bits always 00
    Increment word address by 1
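
A worked instance of this address arithmetic, assuming purely for illustration a base address B = 0 and the group of 4 pixels that starts at row 2, column 8 (both values are hypothetical):

```latex
\[
\begin{aligned}
\text{byte address} &= B + r \times 640 + c = 0 + 2 \times 640 + 8 = 1288 = \texttt{0x508} \\
\text{word address} &= 1288 / 4 = 322 \\
\text{next group in the row} &= 322 + 1 = 323 \\
\text{same column, next row} &= 322 + 640 / 4 = 482
\end{aligned}
\]
```

The byte address 0x508 ends in binary 00, as expected for a word-aligned access, and a row step of 640 / 4 = 160 words matches the 160 four-pixel groups per row.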



Address Generation



Address Generation

always @(posedge clk_i) // O base address register
  if (O_base_ce) O_base




Address Generation

assign O_prev_addr = O_base + O_offset;
assign O_curr_addr = O_prev_addr + 640/4;
assign O_next_addr = O_prev_addr + 1280/4;
assign D_addr = D_base + D_offset;
assign adr_o[21:2] = prev_row_load ? O_prev_addr :
                     curr_row_load ? O_curr_addr :
                     next_row_load ? O_next_addr :
                                     D_addr;
assign adr_o[1:0] = 2'b00;



Control/Status Registers

Register  Offset  Read/Write  Purpose
Int_en    0       Write-only  Interrupt enable (bit 0).
Start     4       Write-only  Write causes image processing to start (value ignored).
O_base    8       Write-only  Original image base address.
D_base    12      Write-only  Derivative image base address + 640.
Status    0       Read-only   Processing done (bit 0). Reading clears interrupt.



Slave Bus Interface

assign start     = cyc_i && stb_i && we_i && adr_i == 2'b01;
assign O_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b10;
assign D_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b11;

always @(posedge clk_i) // Interrupt enable register
  if (rst_i)
    int_en




Control Sequencing

Use a finite-state machine
Counters keep track of rows (0 to 477) and columns (0 to 159)
See textbook for details of FSM output functions



State Transition Diagram



Accelerator Verification

Simulation-based verification of each section of the accelerator
    Slave bus operations
    Computation sequencing
    Master bus operations
    Address generation
    Pixel computation
Testbench including the accelerator
    Bus functional processor model
    Simplified memory and bus arbiter models



Sobel Verification Testbench

[Figure: testbench block diagram with Processor BFM, Sobel Accelerator, Memory Model, and Arbiter, joined by a multiplexed bus (muxes and connections)]



Processor Bus Functional Model

initial begin // Processor bus-functional model
  cpu_adr_o




Processor Bus Functional Model

  cpu_cyc_o = 1'b0; cpu_stb_o = 1'b0; cpu_we_o = 1'b0;
  begin: loop
    forever begin
      #10000;
      @(posedge clk);
      // Read status register
      cpu_adr_o




Memory Bus Functional Model

always begin // Memory bus-functional model
  mem_ack_o




Bus Arbiter

Uses sobel_cyc_o and cpu_cyc_o as request inputs
    If both request at the same time, give accelerator priority
Mealy FSM
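
A two-master arbiter with this behavior could be sketched as below. This is an independent illustration, not the book's arbiter: only the request names sobel_cyc_o and cpu_cyc_o come from the slide, while the module name, grant outputs, state names, and the choice to let the current owner keep the bus until it drops its request are all assumptions.

```verilog
// Mealy arbiter sketch: grants are functions of the state and of the
// current request inputs. The accelerator wins simultaneous requests.
module arbiter2 (
  input  wire clk,
  input  wire rst,
  input  wire sobel_cyc_o,   // accelerator request
  input  wire cpu_cyc_o,     // processor request
  output reg  sobel_gnt,     // hypothetical grant outputs
  output reg  cpu_gnt
);
  localparam S_SOBEL = 1'b0, // accelerator owns the bus, or bus idle
             S_CPU   = 1'b1; // processor owns the bus

  reg state, next_state;

  always @(posedge clk)      // state register
    if (rst) state <= S_SOBEL;
    else     state <= next_state;

  always @* begin            // next-state and Mealy output logic
    sobel_gnt  = 1'b0;
    cpu_gnt    = 1'b0;
    next_state = state;
    case (state)
      S_SOBEL:
        if (sobel_cyc_o)
          sobel_gnt = 1'b1;            // accelerator has priority
        else if (cpu_cyc_o) begin
          cpu_gnt    = 1'b1;
          next_state = S_CPU;
        end
      S_CPU:
        if (cpu_cyc_o)
          cpu_gnt = 1'b1;              // processor finishes its cycle
        else if (sobel_cyc_o) begin
          sobel_gnt  = 1'b1;
          next_state = S_SOBEL;
        end
        else
          next_state = S_SOBEL;        // return to idle
    endcase
  end
endmodule
```

Because the grants are combinational, a master sees its grant in the same cycle it raises its request when the bus is free.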



Bus Arbiter

always @(posedge clk) // Arbiter FSM register
  if (rst) arbiter_current_state




Simulation Results

See waveforms in textbook
    Demonstrates sequencing and address generation
But what about
    Data values computed correctly
    Interactions between processor and accelerator
Need to use more sophisticated verification techniques
    Due to complexity of the design


Summary

Accelerators boost performance using parallel hardware
    Replication, pipelining, ...
Amdahl's Law
    Best payback from accelerating a kernel
DMA avoids processor overhead
Verification requires advanced techniques
