8/10/2019 09 Accelerators(1).ppt
Digital Design: An Embedded Systems Approach Using Verilog
Chapter 9
Accelerators
Portions of this work are from the book, Digital Design: An Embedded
Systems Approach Using Verilog, by Peter J. Ashenden, published by Morgan Kaufmann Publishers, Copyright 2007 Elsevier Inc. All rights reserved.
Performance and Parallelism
A processor core performs steps in sequence
    Performance limited by the instruction rate
Accelerating performance
    Perform steps in parallel
    Takes less time overall to complete an operation
Instruction-level parallelism
    Within a processor core
    Pipelining, multiple-issue
Accelerators
    Custom hardware for parallel operations
Achievable Parallelism
How many steps can be performed at once?
Regularly structured data
    Independent processing steps
    Examples
        Video and image pixel processing
        Audio or sensor signal processing
Constrained by data dependencies
    Operations that depend on results of previous steps
Algorithm Kernels
Algorithm: specification of the required processing steps
    Often expressed in a programming language
Kernel: the part that involves the most intensive, repetitive processing
    10% of operations take 90% of the time
Accelerating a kernel with parallel hardware gives the best payback
Amdahl's Law
Time for an algorithm is t
    Fraction f is spent on a kernel
    $t = f t + (1 - f) t$
Accelerator speeds up kernel by a factor s
    $t' = \frac{f t}{s} + (1 - f) t$
Overall speedup factor s'
    $s' = \frac{t}{t'} = \frac{1}{\frac{f}{s} + (1 - f)}$
    For large f, $s' \approx s$
    For small f, $s' \approx 1$
Amdahl's Law Example
An algorithm with two kernels
    Kernel 1: 80% of time, can be sped up 10 times
    Kernel 2: 15% of time, can be sped up 100 times
    Which speedup gives best overall improvement?
For kernel 1:
    $s' = \frac{1}{\frac{0.8}{10} + (1 - 0.8)} = \frac{1}{0.08 + 0.2} = 3.57$
For kernel 2:
    $s' = \frac{1}{\frac{0.15}{100} + (1 - 0.15)} = \frac{1}{0.0015 + 0.85} = 1.17$
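As a further check on this example (an extension that is not on the original slide), two more quantities follow from the same formula. Letting the kernel speedup s grow without bound gives a ceiling on the overall speedup. Accelerating both kernels at once can be estimated with a two-kernel form of the law, which assumes the kernels do not overlap in time.

$$
s'_{\max} = \lim_{s \to \infty} \frac{1}{\frac{f}{s} + (1 - f)} = \frac{1}{1 - f}
\qquad
\text{kernel 1: } \frac{1}{1 - 0.8} = 5,
\qquad
\text{kernel 2: } \frac{1}{1 - 0.15} \approx 1.18
$$

$$
s'_{\text{both}} = \frac{1}{\frac{0.8}{10} + \frac{0.15}{100} + (1 - 0.8 - 0.15)}
= \frac{1}{0.08 + 0.0015 + 0.05} = \frac{1}{0.1315} \approx 7.60
$$

The 100-times speedup of kernel 2 is already close to its ceiling of 1.18, which is why the more modest speedup of kernel 1 gives the better payback.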
Parallel Architectures
An architecture for an accelerator specifies
    Processing blocks
    Data flow between them
Parallelism through replication
    Multiple identical blocks operating on different data elements
    Works well when elements can be processed independently
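Replication can be sketched in Verilog with a generate loop. The module below is a hypothetical illustration, not code from the book: the module name, the parameter N, and the choice of operation (doubling with saturation) are all assumptions made for the example.

```verilog
// Hypothetical example, not from the slides: N identical blocks, each
// doubling one 8-bit element of a packed vector with saturation at 255.
module replicated_gain #(parameter N = 4) (
  input  wire [8*N-1:0] data_in,    // N packed 8-bit elements
  output wire [8*N-1:0] data_out    // N results, computed in parallel
);
  genvar i;
  generate
    for (i = 0; i < N; i = i + 1) begin : block
      // Element i of the input vector
      wire [7:0] x = data_in[8*i +: 8];
      // Doubling is a one-bit left shift; saturate when the top bit is set.
      assign data_out[8*i +: 8] = x[7] ? 8'hFF : {x[6:0], 1'b0};
    end
  endgenerate
endmodule
```

Because no block uses another block's result, all N elements are processed in the same cycle, which is the independence condition the slide refers to.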
Parallel Architectures
Parallelism through pipelining
    Break a computation into steps, perform them in assembly-line fashion
    Latency (time to complete a single operation) is not increased
    Throughput (rate of completion of operations) is increased
        Ideally by a factor equal to the number of pipeline stages
[Figure: data in -> step 1 -> step 2 -> step 3 -> data out]
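A three-stage pipeline of this shape might look as follows in Verilog. This is a minimal sketch with an invented computation, y = 2(a + b) + c; the module and signal names are hypothetical and do not come from the book.

```verilog
// Hypothetical three-stage pipeline computing y = 2*(a + b) + c.
// A new set of operands can enter every clock cycle; each result
// appears three clock edges after its operands were presented.
module pipe3 (
  input  wire        clk,
  input  wire [7:0]  a, b, c,
  output reg  [10:0] y            // maximum value 2*(255+255)+255 = 1275
);
  reg [8:0] sum_s1;               // stage 1 result: a + b
  reg [7:0] c_s1;                 // c delayed to stay aligned with sum_s1
  reg [9:0] dbl_s2;               // stage 2 result: 2 * (a + b)
  reg [7:0] c_s2;                 // c delayed a second time

  always @(posedge clk) begin
    // Stage 1: add
    sum_s1 <= a + b;
    c_s1   <= c;
    // Stage 2: double by shifting left one bit
    dbl_s2 <= {sum_s1, 1'b0};
    c_s2   <= c_s1;
    // Stage 3: final add
    y      <= dbl_s2 + c_s2;
  end
endmodule
```

Note that c has to be carried through its own registers so that it reaches stage 3 together with the partial result it belongs to.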
Direct Memory Access (DMA)
Input/output data for accelerators must be transferred at high speed
    Using the processor would be too slow
Direct memory access
    I/O controller and accelerator transfer data to and from memory autonomously
    Program supplies starting address and length
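The "starting address and length" idea can be sketched as a small address generator. The module below is an assumption-laden illustration rather than the book's design: the port names, the address width AW, and the single-cycle start pulse are all hypothetical.

```verilog
// Hypothetical DMA address generator: software supplies start_addr and
// length, pulses start, and the block then steps through memory unaided.
module dma_addr_gen #(parameter AW = 22) (
  input  wire          clk, rst,
  input  wire          start,        // one-cycle pulse from the processor
  input  wire [AW-1:0] start_addr,   // first word address
  input  wire [AW-1:0] length,       // number of words to transfer
  input  wire          ack,          // memory completed the current transfer
  output reg  [AW-1:0] addr,         // address of the current transfer
  output wire          busy          // high while words remain
);
  reg [AW-1:0] remaining;
  assign busy = (remaining != 0);

  always @(posedge clk)
    if (rst) begin
      addr      <= 0;
      remaining <= 0;
    end else if (start && !busy) begin
      addr      <= start_addr;
      remaining <= length;
    end else if (busy && ack) begin
      addr      <= addr + 1;
      remaining <= remaining - 1;
    end
endmodule
```

The processor is involved only in the initial setup; every subsequent address is produced by the counter.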
Bus Arbitration
Bus masters take turns to use bus to access slaves
    Controlled by a bus arbiter
Arbitration policies
    Priority, round-robin, ...
[Figure: bus arbitration. Blocks: processor, accelerator, controller, arbiter, memory. Each bus master has request and grant signals to the arbiter; all share the memory bus.]
Block-Processing Accelerator
Data arranged in regular groups of contiguous memory locations
    Accelerator works block by block
    E.g., images in blocks of 8 × 8 16-bit pixels
Datapath comprises
    Memory access: address generation, counters
    Computation section
    Control section: finite-state machine(s)
Stream-Processing Accelerator
Streams of data from an input source
    E.g., high-speed sensors
Digital signal processing (DSP)
    Analog sensor signal converted to stream of digital sample values
    Filtering, gain/attenuation, frequency-domain conversion (Fourier transform)
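As a small illustration of stream filtering, the following sketch averages the four most recent samples of an 8-bit stream. It is a hypothetical example, not taken from the book; the module name, the sample_valid handshake, and the choice of a four-sample moving average are assumptions.

```verilog
// Hypothetical stream filter: 4-sample moving average of an 8-bit stream.
module moving_avg4 (
  input  wire       clk,
  input  wire       sample_valid,   // high for one cycle per new sample
  input  wire [7:0] sample_in,
  output wire [7:0] avg_out
);
  reg [7:0] x0, x1, x2, x3;         // the four most recent samples

  // Shift the sample history each time a new sample arrives.
  always @(posedge clk)
    if (sample_valid) begin
      x0 <= sample_in;
      x1 <= x0;
      x2 <= x1;
      x3 <= x2;
    end

  wire [9:0] sum = x0 + x1 + x2 + x3;   // at most 1020, fits in 10 bits
  assign avg_out = sum[9:2];            // divide by 4 by dropping two bits
endmodule
```

The history registers are not reset here, so the first three outputs after power-up depend on their initial contents.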
Processor/Accelerator Interface
Embedded software controls an accelerator
    Providing control parameters
    Synchronizing operations
Input/output registers and interrupts
    Interact with the control sequencer
Case Study: Edge Detection
Illustration of accelerator design
Edge detection in video processing
    Identify where image intensity changes abruptly
    Typically at the boundary of objects
    First step in identifying objects in a scene
Application areas
    Video surveillance, computer vision, ...
For this case study
    Monochrome images of 640 × 480 8-bit pixels
    Stored row-by-row in memory
    Pixel values: 0 (black) to 255 (white)
Sobel Edge Detection
Compute derivatives of intensity in x and y directions
Look for minima and maxima (where intensity changes most rapidly)
The Sobel Algorithm
Use convolution to approximate partial derivatives D_x and D_y at each position
    Weighted sum of value of a pixel and its eight nearest neighbors
    Coefficients represented using a 3 × 3 convolution mask
Sobel masks for x and y derivatives
    $G_x = \begin{pmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{pmatrix}$
    $G_y = \begin{pmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{pmatrix}$
    $D_x(i,j) = O(i,j) * G_x$
    $D_y(i,j) = O(i,j) * G_y$
The Sobel Algorithm
Combine partial derivatives
    $|D| = \sqrt{D_x^2 + D_y^2}$
Since we just want maxima and minima in magnitude, approximate as:
    $|D| = |D_x| + |D_y|$
Edge pixels don't have eight neighbors
    Skip computation of |D| for edges
    Just set them to 0 using software
The Algorithm in Pseudocode
for (row = 1; row
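The pseudocode on this slide did not survive extraction. As a stand-in, here is a behavioral Verilog reference model of the algorithm as described on the preceding slides (Sobel masks, |D| = |D_x| + |D_y|, border pixels skipped). It is a sketch for simulation only, not the book's pseudocode: the module and task names are hypothetical, and scaling by a right shift of 3 is an assumed way of bringing the 11-bit result back to 8 bits.

```verilog
// Behavioral reference model (simulation only); names are hypothetical.
module sobel_reference;
  reg [7:0] O [0:479][0:639];     // original image, row by row
  reg [7:0] D [0:479][0:639];     // derivative image
  integer row, col, Dx, Dy, absD;

  task compute_derivative_image;
    begin
      // Border rows and columns are skipped; they lack eight neighbors.
      for (row = 1; row <= 478; row = row + 1)
        for (col = 1; col <= 638; col = col + 1) begin
          // Expressions are evaluated at 32 bits, so the two's-complement
          // bit pattern stored in each integer is the correct signed value.
          Dx = - O[row-1][col-1] + O[row-1][col+1]
               - 2*O[row][col-1] + 2*O[row][col+1]
               - O[row+1][col-1] + O[row+1][col+1];
          Dy =   O[row-1][col-1] + 2*O[row-1][col] + O[row-1][col+1]
               - O[row+1][col-1] - 2*O[row+1][col] - O[row+1][col+1];
          absD = (Dx < 0 ? -Dx : Dx) + (Dy < 0 ? -Dy : Dy);
          // Assumed scaling: 0..2040 divided by 8 gives 0..255.
          D[row][col] = absD >> 3;
        end
    end
  endtask
endmodule
```

A model like this is mainly useful as a golden reference against which the accelerator's output image can be compared in a testbench.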
Data Formats and Rates
Pixel values: 0 to 255 (8 bits)
Coefficients are 0, ±1 and ±2
    Partial products: −510 to +510 (10 bits)
    D_x and D_y: −1020 to +1020 (11 bits)
|D|: 0 to 2040 (11 bits)
    Final pixel value: scale back to 8 bits
Video rate: 30 frames/sec
    640 × 480 = 307,200 pixels
    307,200 × 30 ≈ 10 million pixels/sec
Data Dependencies
Pixels can be computed independently
For each pixel:
[Figure: data dependencies for one pixel]
System Architecture
Data dependencies suggest a pipeline
Coefficient multiplies are simple shift/negate, so merge with adder stage
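The shift/negate observation can be illustrated with a purely combinational sketch of one pixel's computation. This is not the book's datapath: the module and port names are hypothetical, the pipeline registers are left out, and the eight neighbors are assumed to be available at the same time.

```verilog
// Hypothetical combinational sketch of |D| = |Dx| + |Dy| for one pixel.
// Multiplying by 2 is a left shift; multiplying by -1 is a subtraction.
module sobel_pixel_sketch (
  input  wire [7:0]  o_nw, o_n, o_ne,   // previous row: left, centre, right
  input  wire [7:0]  o_w,       o_e,    // current row (centre pixel unused)
  input  wire [7:0]  o_sw, o_s, o_se,   // next row
  output wire [10:0] abs_D              // 0 to 2040
);
  // Zero-extend each pixel to a signed 11-bit value.
  wire signed [10:0] nw = {3'b000, o_nw};
  wire signed [10:0] n  = {3'b000, o_n};
  wire signed [10:0] ne = {3'b000, o_ne};
  wire signed [10:0] w  = {3'b000, o_w};
  wire signed [10:0] e  = {3'b000, o_e};
  wire signed [10:0] sw = {3'b000, o_sw};
  wire signed [10:0] s  = {3'b000, o_s};
  wire signed [10:0] se = {3'b000, o_se};

  // Partial derivatives, each in the range -1020 to +1020.
  wire signed [10:0] Dx = -nw + ne - (w <<< 1) + (e <<< 1) - sw + se;
  wire signed [10:0] Dy =  nw + (n <<< 1) + ne - sw - (s <<< 1) - se;

  // Absolute values and their sum.
  wire [10:0] abs_Dx = (Dx < 0) ? -Dx : Dx;
  wire [10:0] abs_Dy = (Dy < 0) ? -Dy : Dy;
  assign abs_D = abs_Dx + abs_Dy;
endmodule
```

No multiplier is needed anywhere, which is why the multiply and add stages can be merged as the slide suggests.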
Memory Bandwidth
Assume memory read/write takes 20 ns (2 cycles of 100 MHz clock)
    Memory is 32 bits wide, byte addressable
    Bandwidth = 50M operations/sec
Camera produces 10 Mpixels/sec
    Accelerator needs to process at this rate
    (8 reads + 1 write) × 10 Mpixel/sec = 90M operations/sec
    Greater than memory bandwidth
Memory Bandwidth
Read 4 pixels at once from each of previous, current, and next rows
    Store in accelerator to compute multiple derivative image pixels
Produce derivative pixels row-by-row, left-to-right
    Read 3 × 32-bit words for every 4th derivative pixel computed
    Write 4 pixels at a time
(3 reads + 1 write) / 4 × 10 Mpixel/sec = 10M operations/sec
    = 20% of available memory bandwidth
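The bandwidth figures on the two Memory Bandwidth slides can be checked step by step. The derivation below uses only the numbers given there; the final line, which uses the unrounded pixel rate of 640 × 480 × 30, is an added observation rather than something stated on the slides.

$$
\text{memory bandwidth} = \frac{1}{20\ \text{ns}} = 50 \times 10^{6}\ \text{operations/sec}
$$

$$
\text{pixel-at-a-time: } (8 + 1) \times 10 \times 10^{6} = 90 \times 10^{6}\ \text{operations/sec} > 50 \times 10^{6}
$$

$$
\text{word-at-a-time: } \frac{3 + 1}{4} \times 10 \times 10^{6} = 10 \times 10^{6}\ \text{operations/sec},
\qquad \frac{10 \times 10^{6}}{50 \times 10^{6}} = 20\%
$$

$$
\text{unrounded rate: } \frac{3 + 1}{4} \times 9.216 \times 10^{6} = 9.216 \times 10^{6}\ \text{operations/sec} \approx 18.4\%
$$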
Sobel Accelerator Architecture
Accelerator Sequence
Steady state
    Write 4 result pixels
    Read 4 pixels for previous, current, next rows
    Compute for 4 cycles
    Repeat
Start of row
    Omit writes until pipeline full
End of row
    Omit reads to drain pipeline
Memory Operation Timing
Steady state
Pixel Datapath
// Computation datapath signals
reg [31:0] prev_row, curr_row, next_row;
reg [7:0] O [-1:+1][-1:+1];
reg signed [10:0] Dx, Dy, D;
reg [7:0] abs_D;
reg [31:0] result_row;
...
// Computational datapath
always @(posedge clk_i) // Previous row register
  if (prev_row_load) prev_row
Pixel Datapath
O[-1][-1]
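The rest of the datapath code was lost in extraction. As a rough idea of how a 32-bit row register could feed one pixel per cycle into the 3 × 3 window during the four compute cycles, here is a hypothetical sketch. It is not the book's code: the module name, the dat_i and shift signals, and the assumption that the leftmost pixel sits in the most-significant byte are all illustrative.

```verilog
// Hypothetical row register: loads four pixels from memory, then
// presents them one at a time, leftmost first.
module row_shift_sketch (
  input  wire        clk,
  input  wire        load,        // capture a 32-bit word from memory
  input  wire        shift,       // advance to the next pixel
  input  wire [31:0] dat_i,       // memory read data (assumed name)
  output wire [7:0]  pixel_out    // pixel entering the window this cycle
);
  reg [31:0] row;

  always @(posedge clk)
    if (load)
      row <= dat_i;
    else if (shift)
      row <= {row[23:0], 8'h00};  // move the next pixel into the top byte

  // Assumption: the leftmost pixel is stored in the most-significant byte.
  assign pixel_out = row[31:24];
endmodule
```

Three such registers, one per image row, would supply the rightmost column of the window on each compute cycle.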
Address Generation
Given an image in memory at base address B
    Address for pixel in row r, column c is B + r × 640 + c
    Base address (B) is fixed
    Offset (r × 640 + c) increments by 4 for each group of 4 pixels read/written
Use word-aligned addresses
    Two least-significant bits always 00
    Increment word address by 1
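The byte and word address arithmetic can be written out as follows. The symbols A (byte address) and W (word address) are introduced here for illustration, and the derivation assumes that B is word-aligned and that c is a multiple of 4, as the slide's word-aligned scheme implies.

$$
A(r, c) = B + 640\,r + c,
\qquad
W(r, c) = \frac{A(r, c)}{4} = \frac{B}{4} + 160\,r + \frac{c}{4}
$$

$$
W(r + 1, c) - W(r, c) = \frac{640}{4} = 160,
\qquad
W(r + 2, c) - W(r, c) = \frac{1280}{4} = 320
$$

$$
\text{Example: } r = 2,\ c = 8:\quad A - B = 2 \times 640 + 8 = 1288\ \text{bytes},\quad W - \frac{B}{4} = \frac{1288}{4} = 322\ \text{words}
$$

The row-to-row word offsets of 160 and 320 are the constants 640/4 and 1280/4 that appear in the address-generation code.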
Address Generation
Address Generation
always @(posedge clk_i) // O base address register
  if (O_base_ce) O_base
Address Generation
assign O_prev_addr = O_base + O_offset;
assign O_curr_addr = O_prev_addr + 640/4;
assign O_next_addr = O_prev_addr + 1280/4;
assign D_addr = D_base + D_offset;
assign adr_o[21:2] = prev_row_load ? O_prev_addr :
                     curr_row_load ? O_curr_addr :
                     next_row_load ? O_next_addr :
                                     D_addr;
assign adr_o[1:0] = 2'b00;
Control/Status Registers
Register  Offset  Read/Write  Purpose
Int_en    0       Write-only  Interrupt enable (bit 0).
Start     4       Write-only  Write causes image processing to start (value ignored).
O_base    8       Write-only  Original image base address.
D_base    12      Write-only  Derivative image base address + 640.
Status    0       Read-only   Processing done (bit 0). Reading clears interrupt.
Slave Bus Interface
assign start     = cyc_i && stb_i && we_i && adr_i == 2'b01;
assign O_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b10;
assign D_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b11;
always @(posedge clk_i) // Interrupt enable register
  if (rst_i)
    int_en
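The remainder of this code was lost in extraction. The following self-contained sketch shows one way the write side of the control registers could be decoded, based on the register table and the assign statements above. It is an assumption-based illustration, not the book's code: the module name, the dat_i port, and the reset and bit-slice choices are hypothetical.

```verilog
// Hypothetical write decode for the control registers.
// adr_i is the word offset: byte offsets 0, 4, 8, 12 map to 0, 1, 2, 3.
module csr_write_sketch (
  input  wire        clk_i, rst_i,
  input  wire        cyc_i, stb_i, we_i,
  input  wire [1:0]  adr_i,
  input  wire [31:0] dat_i,          // write data (assumed name)
  output reg         int_en,         // interrupt enable, bit 0 of offset 0
  output reg  [21:2] O_base,         // original image base (word address)
  output wire        start           // pulses on any write to offset 4
);
  wire wr = cyc_i && stb_i && we_i;  // a write cycle addressed to this slave

  assign start = wr && adr_i == 2'b01;

  always @(posedge clk_i)            // Interrupt enable register
    if (rst_i)
      int_en <= 1'b0;
    else if (wr && adr_i == 2'b00)
      int_en <= dat_i[0];

  always @(posedge clk_i)            // O base address register
    if (wr && adr_i == 2'b10)
      O_base <= dat_i[21:2];
endmodule
```

The D_base register and the read-only Status register would follow the same pattern and are omitted to keep the sketch short.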
Control Sequencing
Use a finite-state machine
Counters keep track of rows (0 to 477) and columns (0 to 159)
See textbook for details of FSM output functions
State Transition Diagram
Accelerator Verification
Simulation-based verification of each section of the accelerator
    Slave bus operations
    Computation sequencing
    Master bus operations
    Address generation
    Pixel computation
Testbench including the accelerator
    Bus functional processor model
    Simplified memory and bus arbiter models
Sobel Verification Testbench
[Figure: testbench block diagram. Processor BFM, Sobel accelerator, memory model, and arbiter, connected through a multiplexed bus (muxes and connections).]
Processor Bus Functional Model
initial begin // Processor bus-functional model
  cpu_adr_o
Processor Bus Functional Model
cpu_cyc_o = 1'b0; cpu_stb_o = 1'b0; cpu_we_o = 1'b0;
begin: loop
  forever begin
    #10000;
    @(posedge clk);
    // Read status register
    cpu_adr_o
Memory Bus Functional Model
always begin // Memory bus-functional model
  mem_ack_o
Bus Arbiter
Uses sobel_cyc_o and cpu_cyc_o as request inputs
    If both request at the same time, give accelerator priority
Mealy FSM
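A Mealy-style arbiter with this priority rule might be sketched as follows. This is an illustration under stated assumptions rather than the book's arbiter: the grant output names, the single state bit, and the rule that a master keeps the bus until it drops its request are hypothetical choices.

```verilog
// Hypothetical two-master arbiter. Grants depend on both the stored
// state and the current requests, which makes this a Mealy machine.
module arbiter_sketch (
  input  wire clk, rst,
  input  wire sobel_cyc_o, cpu_cyc_o,   // request inputs
  output reg  sobel_gnt, cpu_gnt        // grant outputs (assumed names)
);
  reg cpu_owns;   // state: the processor was granted the bus last cycle

  always @(posedge clk)
    if (rst) cpu_owns <= 1'b0;
    else     cpu_owns <= cpu_gnt;

  always @* begin
    sobel_gnt = 1'b0;
    cpu_gnt   = 1'b0;
    if (cpu_owns && cpu_cyc_o)
      cpu_gnt = 1'b1;       // processor is mid-transfer: do not preempt it
    else if (sobel_cyc_o)
      sobel_gnt = 1'b1;     // accelerator wins, even if both are requesting
    else if (cpu_cyc_o)
      cpu_gnt = 1'b1;       // processor gets the bus only when it is free
  end
endmodule
```

At most one grant is ever asserted, and the accelerator waits only for a processor transfer that is already in progress.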
Bus Arbiter
always @(posedge clk) // Arbiter FSM register
  if (rst) arbiter_current_state
Simulation Results
See waveforms in textbook
    Demonstrates sequencing and address generation
But what about
    Data values computed correctly
    Interactions between processor and accelerator
Need to use more sophisticated verification techniques
    Due to complexity of the design
Summary
Accelerators boost performance using parallel hardware
    Replication, pipelining, ...
    Amdahl's Law
    Best payback from accelerating a kernel
DMA avoids processor overhead
Verification requires advanced techniques