52
Program Optimization Methodology Program Optimization Methodology E. E. Dokladalova Dokladalova ESIEE, ESIEE, November November 2011 2011

Program Optimization Methodology

  • Upload
    others

  • View
    11

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Program Optimization Methodology

Program Optimization MethodologyProgram Optimization Methodology

E. E. DokladalovaDokladalova

ESIEE, ESIEE, NovemberNovember 20112011

Page 2: Program Optimization Methodology

2

OutlineOutline

►► ObjectivesObjectives

►► Basic notionsBasic notions

►► Practical example 1: Practical example 1: DericheDeriche filterfilter

►► Practical example 2: Hough transformPractical example 2: Hough transform

►► Project introductionProject introduction

►► ConclusionsConclusions

Page 3: Program Optimization Methodology

3

ObjectivesObjectives

►► Learn and experience software optimization methodology Learn and experience software optimization methodology

Donald E. Knuth (1974): …We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil…” [1]

►► Apply previously acquired notions of program optimization Apply previously acquired notions of program optimization techniquestechniques

Page 4: Program Optimization Methodology

4

Basic notionsBasic notions

►► Software optimizationSoftware optimizationApplication of a collection of methods/techniques allowing to imApplication of a collection of methods/techniques allowing to improve prove

the software performances in terms of the software performances in terms of ►► Execution timeExecution time

►► Memory occupation (data, code)Memory occupation (data, code)

►► Power budgetPower budget

►► ……

►► Software optimization methodology Software optimization methodology Study of methods (and their relations) that have been applied wStudy of methods (and their relations) that have been applied within ithin

the software optimization domainthe software optimization domain; ; ►► Defines general guidelines for the software optimization Defines general guidelines for the software optimization

Page 5: Program Optimization Methodology

5

MethodologyMethodology

►► Algorithm Algorithm –– Architecture Matching (Architecture Matching (AdAdééquationquation))

�� Aims to study simultaneously both Aims to study simultaneously both algorithmic and architectural algorithmic and architectural issuesissues

�� Takes into account multiple implementation constraints, as well Takes into account multiple implementation constraints, as well as algorithm and architecture optimizations, that couldnas algorithm and architecture optimizations, that couldn’’t be t be achieved otherwise if considered separately. achieved otherwise if considered separately.

Page 6: Program Optimization Methodology

6

Methodology (II)Methodology (II)

►► Algorithm Algorithm –– Architecture Matching (Architecture Matching (AdAdééquationquation))�� Allows solving the problem of optimized software implementation Allows solving the problem of optimized software implementation on existing on existing

hardware (RISC, DSP, VLIW, hardware (RISC, DSP, VLIW, ……))

�� Proposes improved and automated design flows for specialized arcProposes improved and automated design flows for specialized architectures, hitectures, optimized for given application field optimized for given application field

Applicationwith constraints

X

Page 7: Program Optimization Methodology

7

Levels of software optimizationLevels of software optimization

►► Algorithm design levelAlgorithm design level�� Choice of algorithm Choice of algorithm

►► Source code levelSource code level�� Produce the good quality of codeProduce the good quality of code

(Critical parts of code in assembler)(Critical parts of code in assembler)

►► Compiler levelCompiler level�� Automatic optimization using the compiler capabilitiesAutomatic optimization using the compiler capabilities

with respect to the with respect to the features of hardware features of hardware architecture resourcesarchitecture resources

Page 8: Program Optimization Methodology

8

Source program and compiler levelSource program and compiler level

►► Source program levelSource program level►► RegisterRegister rotationrotation

►► LoopLoop unrollingunrolling

►► Software pipelineSoftware pipeline

►► Data Data localitylocality exploitation exploitation

►► Respect Respect ofof thethe hardware architecture hardware architecture featuresfeatures

�� RISCRISC

�� SIMD, VLIWSIMD, VLIW

�� DMADMA

�� HiHiéérarchie mrarchie méémoiremoire

►► Compiler levelCompiler level►► Automatic loop unrolling, optimization of memory accesses, localAutomatic loop unrolling, optimization of memory accesses, local code code optimization, pipeline optimization optimization, pipeline optimization ……

Used in practical session on winDLX

Page 9: Program Optimization Methodology

9

Practical considerationsPractical considerations

►► Contribution of each optimization level to the final gain of perContribution of each optimization level to the final gain of performancesformances

►► Algorithm Algorithm

►► Source codeSource code

►► CompilerCompiler

Page 10: Program Optimization Methodology

10

Algorithm design level optimizationAlgorithm design level optimization

►► Choice of algorithmChoice of algorithm

�� Example: some of N integersExample: some of N integers

►► Data typeData type

int i, sum;

sum = 0;

for (i = 1; i <= N; ++i)

sum += i;

int i sum;

sum = N * (1 + N) / 2;

nono38 38 clkclkFloat DivFloat Div

116 6 clkclkFloat Multiply and addFloat Multiply and add

1110 10 clkclkInteger MultiplyInteger Multiply

331 1 clkclkInteger AddInteger Add

Pentium III/IVPentium III/IVLatency Pipelined ?Latency Pipelined ?

Processor/instructionProcessor/instruction

(32 bits)(32 bits)

Page 11: Program Optimization Methodology

11

Algorithm design level optimizationAlgorithm design level optimization

►► Design of algorithm defines:Design of algorithm defines:�� Data dependency Data dependency

►► Locally dependent data, globally dependent dataLocally dependent data, globally dependent data

�� Data structures and amount of occupied memoryData structures and amount of occupied memory►► Tables, lists, unions, trees Tables, lists, unions, trees ……

�� Data accessData access►► Random access, streaming, parallel accessRandom access, streaming, parallel access……

�� Data types/precisionData types/precision►► Integer, floating point, double, Integer, floating point, double, ……

�� Number of operationsNumber of operations►► Data load/store operations, arithmetic operations, Data load/store operations, arithmetic operations, ……

►► TradeTrade--off: optimization of one or two parameters at expense of some off: optimization of one or two parameters at expense of some othersothers�� Some examples :Some examples :

►► Execution time x amount of occupied memoryExecution time x amount of occupied memory►► Execution time x numerical precisionExecution time x numerical precision►► Execution time x energy budgetExecution time x energy budget

Page 12: Program Optimization Methodology

12

Software optimization processSoftware optimization process

1.1. Choice of algorithms and first Choice of algorithms and first ““basicbasic”” implementationimplementation

2.2. Identification of bottlenecksIdentification of bottlenecks�� Profiling techniques Profiling techniques -- collection of tools for estimation of collection of tools for estimation of metricsmetrics used for used for

optimizationoptimization►► Execution time Execution time

►► Memory accessMemory access

►► Number and type of instructionsNumber and type of instructions

►► Numbers of function callsNumbers of function calls

►► ……

�� Available toolsAvailable tools►► GprofGprof, , ValgrindValgrind, , ……

►► CCstudioCCstudio, , VTuneVTune

3.3. AlgorithmAlgorithm andand Source code Source code optimizationoptimization�� Optimization starts by the most demanding task (80Optimization starts by the most demanding task (80--20 rule !)20 rule !)

Repeat until the given implementation

constraints are satisfied

Page 13: Program Optimization Methodology

13

Software optimization process (II)Software optimization process (II)

►► GuidelinesGuidelines (time and memory optimization)(time and memory optimization)

1.1. For given algorithm, estimate the required performances with resFor given algorithm, estimate the required performances with respect to the pect to the given implementation constraintsgiven implementation constraints►► Number of operations per second and memory bandwidthNumber of operations per second and memory bandwidth

2.2. For present implementation, estimate manually or by profiling toFor present implementation, estimate manually or by profiling toolsols►► Number of operations per second (Number of operations per second (MOpsMOps), ), ►► Number of floating points operations (Flops)Number of floating points operations (Flops)►► Memory access bandwidth (Bytes per second)Memory access bandwidth (Bytes per second)

3.3. Considering the results of 1. and 2., Considering the results of 1. and 2., analyseanalyse the efficiency of hardware the efficiency of hardware resources utilizationresources utilization►► Specialized computing units: FPU, SIMD, Specialized computing units: FPU, SIMD, ……►► Memory organization and hierarchy, DMA accessMemory organization and hierarchy, DMA access►► Load balance of different computing unitsLoad balance of different computing units

4.4. Modify the algorithm or source code in order to exploit better tModify the algorithm or source code in order to exploit better the available he available hardwarehardware►► Reduce number of operations, modify data dependencyReduce number of operations, modify data dependency►► Use DMA data transfer (increases the available memory bandwidth)Use DMA data transfer (increases the available memory bandwidth)►► Exploit data and instruction parallelismExploit data and instruction parallelism

Page 14: Program Optimization Methodology

14

Application exampleApplication example

►► Automotive security: lane departure warning systemAutomotive security: lane departure warning system

�� Strict real time constraintsStrict real time constraints

�� Embedded system affected also by space occupation and power budgEmbedded system affected also by space occupation and power budget et constraintsconstraints

►► Lane detection by Hough transformLane detection by Hough transform

�� Prototyping methodologyPrototyping methodology

SimulinkSimulink prototype (demonstration)prototype (demonstration)

Application design in Application design in MatlabMatlab (demonstration and profiling example)(demonstration and profiling example)

Application transfer on DSP platform (objective of your project)Application transfer on DSP platform (objective of your project)

Optimization of the execution (objective of your project)Optimization of the execution (objective of your project)

Page 15: Program Optimization Methodology

15

Application profiling exampleApplication profiling example

►► MatlabMatlab implementation of line detection by Hough transformimplementation of line detection by Hough transform

Input

Garcia-Lorca versionof Deriche filter

Edge extraction

Houghtransform

Local maxima detection andline drawing

Output

0,087 s0,087 s<11Edge

extraction

1,7

1,9

3

75

100

%

0.122 s0.493 s1imread

0.134 s0.555 s1Local max and

lines

0.929 s0.929 s1Garcia-Lorca

2.978 s22.074 s1Hough

0.684 s29.046 s1If4arch

Self Time*Total TimeCallsFunction Name

Page 16: Program Optimization Methodology

16

EdgeEdge detectiondetection

►► PrinciplePrinciple ofof edgeedge detectiondetection operatoroperator chainchain

Original image

Gradient image Binary edges Edge closing

Page 17: Program Optimization Methodology

17

EdgeEdge detectiondetection problemproblem

Page 18: Program Optimization Methodology

18

Algorithm optimization levelAlgorithm optimization level

►► Edge detection Edge detection –– gradient gradient

►► Digital image (2D)Digital image (2D)

RP →Ζ2:

Page 19: Program Optimization Methodology

19

SobelSobel gradientgradient

►► 2 Masques :2 Masques :

VerticalVertical HorizontalHorizontal

►► Finite impulse response filterFinite impulse response filter

�� Other examples: Prewitt, RobertsOther examples: Prewitt, Roberts

►► Horizontal filterHorizontal filter

�� p(kp(k) denotes pixel p at position k, N is image width in pixels) denotes pixel p at position k, N is image width in pixels

gghh(k(k) = p(k) = p(k--N+1) N+1) -- p(kp(k--NN--1) + 2p(k+1) 1) + 2p(k+1) -- 2p(k2p(k--1) + p(k+N+1) 1) + p(k+N+1) -- p(k+Np(k+N--1)1)

►► Transfer functionTransfer function

SH(zSH(z) = ) = Gh(zGh(z) / (Z) / (Z--N+1N+1--z z --NN--11+2z +2z -- 2z2z--11+z +z N+1N+1--z z NN--1) 1) ) = 1 / [() = 1 / [(zz--zz --11) (z ) (z --NN+2+z+2+zNN)])]

10-1

20-2

10-1

121

000

-1-2-1

Band pass

Horizontal derivator

Low pass filter

Vertical smoothing

Band pass

Vertical derivator

Low pass filter

Horizontal smoothing

Page 20: Program Optimization Methodology

20

Gradient Gradient

►► Example: Example:

Original image Gradient image

Page 21: Program Optimization Methodology

21

DericheDeriche optimal edge detectoroptimal edge detector

►► Recursive filter (Infinite Impulse Response)Recursive filter (Infinite Impulse Response)

�� Any filter width obtained in constant time !Any filter width obtained in constant time !

►► Parameter Parameter αα defining the defining the «« widthwidth »» of filter, of filter,

�� tradetrade--off between the quality of detection and the precision of edge off between the quality of detection and the precision of edge

localizationlocalization

►► For larger For larger αα we obtain more edgeswe obtain more edges

►► For smaller For smaller αα we delete the less significant detailswe delete the less significant details

Original image alpha = 0.5alpha = 1

Page 22: Program Optimization Methodology

22

DericheDeriche edgeedge detectordetector (II)(II)

►► DericheDeriche filterfilter in Z in Z transformtransform�� SmootherSmoother

�� DerivatorDerivator

►► Standard Standard computingcomputing schemeschemeH – horizontal direction

V – vertical direction

IMx – allocated image

Page 23: Program Optimization Methodology

23

ParallelParallel organizationorganization ofof derivatorderivator

►► 2 2 polespoles �� 2 2 processingprocessing directions: causal et directions: causal et anticausalanticausal

►► AnticausalAnticausal pixel pixel readingreading requiresrequires entireentire image image lineline in in memorymemory --> 4 > 4 memorymemory lineslines usedused in in «« pingping pongpong »»

Dg Dd

Page 24: Program Optimization Methodology

24

DerivatorDerivator algorithm evaluationalgorithm evaluation

►► AlgorithmAlgorithm (1 ligne, 1 direction)(1 ligne, 1 direction)

2 +

2 x

3 +

3 x

5 ADD, 5 MUL

Number of operations

per pixel

Page 25: Program Optimization Methodology

25

7 ADD, 8 MUL

SmoothingSmoothing operatoroperator evaluationevaluation

►► SmootherSmoother hashas thethe samesame resolutionresolution as as derivatorderivator

�� 2 2 polespoles: causal en : causal en anticausalanticausal processingprocessing directiondirection

►► AlgorithmeAlgorithme

Page 26: Program Optimization Methodology

26

►► Transfer functionsTransfer functions�� GH(zGH(z)=D(z)L(z)=D(z)L(zNN)P(z))P(z)

�� GV(z)=GV(z)=D(zD(zNN)L(z)P(z)L(z)P(z))

�� order of operator computing has no importanceorder of operator computing has no importance

�� Intermediate storage has to ensure the data validity : Intermediate storage has to ensure the data validity : i.e. finish D before Li.e. finish D before L

►► Number of operations per pixel: 26 MUL et 24 ADDNumber of operations per pixel: 26 MUL et 24 ADD

►► Memory occupationMemory occupation�� 4 image memories + 16 memory lines4 image memories + 16 memory lines

►► Data typeData type�� Computing in floating point because of alphaComputing in floating point because of alpha

DericheDeriche edge detector in 2Dedge detector in 2D

Page 27: Program Optimization Methodology

27

Optimization of Optimization of DericheDeriche edge detectoredge detector

►► CalledCalled alsoalso GarciaGarcia--LorcaLorca optimizationoptimization

►► Objective: Objective: reductionreduction ofof occupiedoccupied memorymemory andand numbernumber ofof operationsoperations

►► Cascade Cascade ofof basic 1basic 1--D D operatorsoperators

�� PreservesPreserves complexitycomplexity but but reducesreduces memorymemory occupationoccupation

Page 28: Program Optimization Methodology

28

Optimisation GarciaOptimisation Garcia--LorcaLorca

►► New New derivatorderivator formulation formulation

Page 29: Program Optimization Methodology

29

Garcia Garcia LorcaLorca optimizationoptimization

►► DerivatorDerivator

Page 30: Program Optimization Methodology

30

Garcia Lorca Garcia Lorca optimizationoptimization

►► New smoother formulation (New smoother formulation (traptrapèèzeszes method)method)

►► DerivatorDerivator non recursive part = non recursive part = SobelSobel derivatorderivator

►► Smoother non recursive part = Smoother non recursive part = SobelSobel smoothersmoother

►► Factorization of non recursive partsFactorization of non recursive parts

Recursive part is equivalent to derivator

Page 31: Program Optimization Methodology

31

Garcia Lorca : optimisationGarcia Lorca : optimisation

►► New smoother and New smoother and derivatorderivator of Garcia of Garcia LorcaLorca

LGL(z)

Page 32: Program Optimization Methodology

32

►► GH(z)=Dgl(z)Lgl(zGH(z)=Dgl(z)Lgl(zNN)P(z))P(z)

►► GV(zGV(z)=Dgl(z)=Dgl(zNN)Lgl(z)P(z))Lgl(z)P(z)

►► Impose :Impose :

►► Result :Result :

►► Impose :Impose :

►► RhRh et Rv = local filters with maskset Rv = local filters with masks

-1 1

-1 1

-1 –1

1 1

Garcia Garcia LorcaLorca : gradient 2D: gradient 2D

Page 33: Program Optimization Methodology

33

Equivalent complexity of D and L; equivalent to LGL

-> gain of 50%

Garcia Garcia LorcaLorca : new organization of gradient 2D: new organization of gradient 2D

Page 34: Program Optimization Methodology

34

Garcia Garcia LorcaLorca smoother filter organizationsmoother filter organization

2nd order filter

Image

memory

L1

Input

Outp

ut

Horizontal direction

Vertical direction

Causal Anti-causal

)2(2)1((2)(.)1( 2)( +−++−= nynynxny γγγ)2(.)1(.2)(.)1()(22

−−−+−= nynynxny γγγ

Image

transposition

2nd order filter

2nd order filter L1

Causal Anti-causal

2nd order filter

Page 35: Program Optimization Methodology

35

Garcia Lorca 2D edge detector evaluationGarcia Lorca 2D edge detector evaluation

Local derivation: 4 +

9 MUL (if normalization constant applied)

12 ADD

Horizontal smoothing : 4 +, 4 x Vertical smoothing : 4 +, 5 x

Page 36: Program Optimization Methodology

36

Comparaison Comparaison

►► Original Original DericheDeriche filterfilter

�� Number of operations per pixel: Number of operations per pixel: 26 MUL et 24 ADD26 MUL et 24 ADD

�� Number of operations for 25 Number of operations for 25 frames per second (fps): frames per second (fps): ►► 25 fps 640x480 (VGA) => 384 25 fps 640x480 (VGA) => 384

MOPSMOPS

►► 25 fps 1920x1080 (HD) => 2,6 25 fps 1920x1080 (HD) => 2,6 GOPSGOPS

�� Memory occupationMemory occupation►► 4 image memories + 16 memory 4 image memories + 16 memory

lineslines

�� Data typeData type►► Computing in floating point because Computing in floating point because

of alphaof alpha

►► Garcia Lorca Garcia Lorca optimizationoptimization

�� Number of operations per pixel: 9 Number of operations per pixel: 9 MUL et 12 ADDMUL et 12 ADD

�� Number of operations for 25 Number of operations for 25 frames per second (fps): frames per second (fps):

► 25 fps 640x480 (VGA) => 161 MOPS

► 25 fps 1920x1080 (HD) => 1,1 GOPS

�� Memory occupationMemory occupation►► 2 image memories + 4 memory lines2 image memories + 4 memory lines

�� Data typeData type►► Computing in floating point because of Computing in floating point because of

gammagamma

Page 37: Program Optimization Methodology

37

DericheDeriche filterfilter: : precisionprecision studystudy

►►Unique constant : Unique constant : γγ

�� CodingCoding ofof γγ definesdefines thethe numbernumber ofof possible possible filtersfilters

►►i.e. 3 bits : i.e. 3 bits : γγ = = γγ1122--11 + + γγ22--22 + + γγ33--33 resultsresults 7 7

differentdifferent filtersfilters

�� ComputingComputing precisionprecision

►►To To ensureensure thethe stabilitystability ofof systemsystem: :

21 bits for 21 bits for computingcomputing, , storagestorage in 12 bitsin 12 bits

Page 38: Program Optimization Methodology

38

HoughHough transformtransform

►► Object recognition technique (1962) Object recognition technique (1962)

►► Parametric description : lines, circles, ellipsesParametric description : lines, circles, ellipses

►► Good noise robustnessGood noise robustness

►► Allows detecting the partially overlapping objectsAllows detecting the partially overlapping objects

Page 39: Program Optimization Methodology

39

Hough transform principleHough transform principle

►► Lines detectionLines detection

x

y

“image” space “parameters” space

a

b

y = ax + b

Page 40: Program Optimization Methodology

40

Hough transform: space (Hough transform: space (ρρ, , θθ))

x

y

θ

ρ

θ

ρ

x = ρ cos(θ)y = ρ sin(θ)

x cos(θ) + y sin(θ) – ρ = 0

Sinusoids, corresponding to one point on the line, have an intersection at the parameters representing the line.

Page 41: Program Optimization Methodology

41

Hough transformHough transform

Original image

3D view of Hough accumulator2D view of parameters space

Hough accumulator

Page 42: Program Optimization Methodology

42

Standard algorithm: initializationStandard algorithm: initialization

►► Allocate and Allocate and intializeintialize to 0 Hough accumulator space to 0 Hough accumulator space A(A(ρρ, , θθ))�� Define the precision of parameters Define the precision of parameters ρρ, , θθ

�� Define the intervals of parameters Define the intervals of parameters

►► Angle :Angle :

�� θθ ∈∈ [0[0°°, 180, 180°°]]

►► Distance to origin Distance to origin

�� ρρ ∈∈ [0, d], d = length of image diagonal[0, d], d = length of image diagonal

Page 43: Program Optimization Methodology

43

Standard Standard algorithmalgorithm

►► Input: binary image P, containing object Input: binary image P, containing object contourscontours

►► For all points For all points p(x,yp(x,y)>0 :)>0 :�� For all For all θθ ∈∈ [0, 180] compute [0, 180] compute ρρ ::

►► ρρ = x = x cos(cos(θθ) + y ) + y sin(sin(θθ))►► A(A(ρρ, , θθ) = ) = A(A(ρρ, , θθ) + 1) + 1

�� End forEnd for

►► End forEnd for

Ib = P; [m,n]=size (Ib);m2 = round(m/2);n2 = round(n/2);maxrho = (ceil(sqrt(m^2 + n^2)));% Accumulator initializationA = zeros (180,maxrho);% Compute Hough transform for every Ib(x,y)>0for i=2:m-1,

for j=2:n-1,if Ib(i,j) > 0,

for angle = 1:180,rho = (j)*cosd(angle) + (i)*sind(angle);rho_idx = round(rho/2 + maxrho/2);if rho_idx > 0, A(angle,rho_idx) = A(angle,rho_idx) + 1;end;

end;end;

end;end;

Page 44: Program Optimization Methodology

44

Standard algorithm evaluationStandard algorithm evaluation

►► Arithmetic complexityArithmetic complexity�� Assumption : Assumption : θθ considered with precision considered with precision ∆∆ = 1= 1

→ → for for ∀ ∀ p(x,yp(x,y) ) > 0> 0 compute 180 times compute 180 times ρρ = x = x cos(cos(θθ) + y ) + y sin(sin(θθ))

→→ if 20% pixels > 0 if 20% pixels > 0

25 fps 640*480 => 829 MOPS + 276x10^6 25 fps 640*480 => 829 MOPS + 276x10^6 sin,cossin,cos

25 fps 1920*1080 => 5,6 GOPS + 1,7x10^9 25 fps 1920*1080 => 5,6 GOPS + 1,7x10^9 sin,cossin,cos

�� Remarks : Remarks :

►► 180 random access per pixel180 random access per pixel

►► 180 sine and cosine function call per pixel180 sine and cosine function call per pixel

Page 45: Program Optimization Methodology

45

O'Gorman et O'Gorman et ClowesClowes OptimizationOptimization

►► Principle : use local gradient information to reduce the Principle : use local gradient information to reduce the number of angles to evaluatenumber of angles to evaluate�� Local gradient direction gives a good estimation of Local gradient direction gives a good estimation of θθ

x

y g

gx

gy

θ

Page 46: Program Optimization Methodology

46

►► For For allall p(x,y) p(x,y) > 0> 0►► θθ = = arctanarctan ((ggyy//ggxx) ) ►► ρρ = x cos(= x cos(θθ) + y sin() + y sin(θθ))►► A(A(ρρ, , θθ) = A() = A(ρρ, , θθ) + 1) + 1

►► ArithmeticArithmetic complexitycomplexity�� 1 computation 1 computation ofof sinesine andand cosinecosine perper pixelpixel�� TradeTrade--offoff: 1 division : 1 division andand 1 1 arctanarctan

�� si 20% pixels de lsi 20% pixels de l’’image image àà calculer : calculer : 25 25 fpsfps 640*480 : 6,1 MOPS + 1,5x10^6 sin,cos,640*480 : 6,1 MOPS + 1,5x10^6 sin,cos,arctanarctan25 25 fpsfps 1920*1080 : 5,6 GOPS + 10,4x10^6 sin,cos,1920*1080 : 5,6 GOPS + 10,4x10^6 sin,cos,arctanarctan

►► RemarksRemarks ::�� 1 1 memorymemory accessaccess perper pixelpixel�� BetterBetter identification identification ofof HoughHough «« peakspeaks »»

O'Gorman et O'Gorman et ClowesClowes OptimizationOptimization

Page 47: Program Optimization Methodology

47

SomeSome complementscomplements

►► Trigonometric functions Trigonometric functions �� Mathematical librariesMathematical libraries

►► Not always optimum in the terms of execution timeNot always optimum in the terms of execution time

►► Not always matching with required precisionNot always matching with required precision

�� Some other possibilitiesSome other possibilities

►► LUTLUT

►► CORDICCORDIC

Page 48: Program Optimization Methodology

48

LUTLUT

►► LUT is a table of correspondence associating one output LUT is a table of correspondence associating one output with one inputwith one input

►► LUT allows replacing complex computation by only 1 LUT allows replacing complex computation by only 1 memory accessmemory access

►► PrinciplePrinciple�� PrePre--compute the function values with given precision for given compute the function values with given precision for given

interval of valuesinterval of values

�� Create a memory structure containing these valuesCreate a memory structure containing these values

�� Could be done automatically at the beginning of the programCould be done automatically at the beginning of the program

►► TradeTrade--off between LUT size and precisionoff between LUT size and precision

Page 49: Program Optimization Methodology

49

LUT LUT

►► DiscretizeDiscretize given interval 0 <= x <= 180 with defined precisiongiven interval 0 <= x <= 180 with defined precision

�� If precision If precision ∆∆ = 1, the values of degrees can represent the table index= 1, the values of degrees can represent the table index

►► Define values precision (Define values precision (intint, float, , float, ……))

►► Allocate LUT memory [180/ Allocate LUT memory [180/ ∆∆]]

►► Initialize the tableInitialize the table

�� Pour i=0, i<=360, i=i+ Pour i=0, i<=360, i=i+ ∆∆ fairefaire

►► LUT[iLUT[i] = ] = sin(isin(i))

Page 50: Program Optimization Methodology

50

CORDICCORDIC

►► CORDIC = CORDIC = COordinateCOordinate Rotation Rotation DIgitalDIgital Computer (1971)Computer (1971)�� Popular in FPGA implementationPopular in FPGA implementation

►► Principle : recursive computing, the number of iterations determPrinciple : recursive computing, the number of iterations determines result ines result precisionprecision�� By successive rotations of vector v,By successive rotations of vector v,

we search its coordinates we search its coordinates x,yx,yon unitary circleon unitary circle

►► Advantages : minimizes memory occupationAdvantages : minimizes memory occupation

Page 51: Program Optimization Methodology

51

CORDICCORDIC

►► ExampleExample (C)(C)

Page 52: Program Optimization Methodology

52

ReferencesReferences

1.1. Donald E. Knuth, Structured Programming with go to Statement, CoDonald E. Knuth, Structured Programming with go to Statement, Computing Surveys, Vol. 6, No. 4, mputing Surveys, Vol. 6, No. 4, December 1974 (December 1974 (http://pplab.snu.ac.kr/courses/adv_pl05/papers/p261http://pplab.snu.ac.kr/courses/adv_pl05/papers/p261--knuth.pdfknuth.pdf))

2.2. R. R. DericheDeriche, Recursively implementing the Gaussian and its derivatives, Tec, Recursively implementing the Gaussian and its derivatives, Technical report, INRIA, Nhnical report, INRIA, N°°RRRR--1893, 1993 (1893, 1993 (http://hal.inria.fr/docs/00/07/47/78/PDF/RRhttp://hal.inria.fr/docs/00/07/47/78/PDF/RR--1893.pdf1893.pdf))

3.3. T. Ea, L. T. Ea, L. LacassagneLacassagne, P. , P. GardaGarda, Execution temps reel des , Execution temps reel des detecteursdetecteurs de contours de de contours de DericheDeriche par des par des processeursprocesseurs RISC, RISC, CongrCongrèèss AdAdééquationquation AlgorithmeAlgorithme Architecture, 1998, France (Architecture, 1998, France (http://www.ief.uhttp://www.ief.u--psud.fr/~lacas/Publications/AAA98.pdfpsud.fr/~lacas/Publications/AAA98.pdf))

4.4. F. F. LohierLohier, L. , L. LacassagneLacassagne, P. , P. GardaGarda, Programming techniques for real time software implementation , Programming techniques for real time software implementation of optimal edge detectors: a comparison between state of the Artof optimal edge detectors: a comparison between state of the Art DSP and RISC architectures", DSP DSP and RISC architectures", DSP World, 1998 (World, 1998 (http://www.ief.uhttp://www.ief.u--psud.fr/~lacas/Publications/DSPWorld98.pdfpsud.fr/~lacas/Publications/DSPWorld98.pdf))

5.5. D. D. DemignyDemigny, F. G. Lorca, L Kemal et J.P , F. G. Lorca, L Kemal et J.P CocquerezCocquerez, Conception nouvelle du d, Conception nouvelle du déétecteur de contours tecteur de contours de de DericheDeriche, Symposium on signal , Symposium on signal andand Image Image ProcessingProcessing, GRETSI, 1995 , GRETSI, 1995 ((http://documents.irevues.inist.fr/bitstream/handle/2042/2030/006http://documents.irevues.inist.fr/bitstream/handle/2042/2030/006.PDF%20TEXTE.pdf?sequence=1.PDF%20TEXTE.pdf?sequence=1))

6.6. Gen, M.; Cheng, R. (2002), Genetic Algorithms and Engineering OpGen, M.; Cheng, R. (2002), Genetic Algorithms and Engineering Optimization, New York: Wiley timization, New York: Wiley 7.7. http://users.ecs.soton.ac.uk/msn/book/new_demo/hough/http://users.ecs.soton.ac.uk/msn/book/new_demo/hough/8.8. http://homepages.inf.ed.ac.uk/rbf/HIPR2/hough.htmhttp://homepages.inf.ed.ac.uk/rbf/HIPR2/hough.htm9.9. FengFeng Zhou , Peter Zhou , Peter KornerupKornerup, A High Speed Hough Transform Using CORDIC (1995), In Internati, A High Speed Hough Transform Using CORDIC (1995), In International onal

Conference on Digital Signal Processing (DSPConference on Digital Signal Processing (DSP’’95) 95) ((http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.55.3209http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.55.3209))

10.10. F O'Gorman, M F O'Gorman, M ClowesClowes, Finding picture edges through , Finding picture edges through collinearitycollinearity of feature points (1973), Third of feature points (1973), Third International Joint Conference on Artificial Intelligence International Joint Conference on Artificial Intelligence ((http://www.ijcai.org/Past%20Proceedings/IJCAIhttp://www.ijcai.org/Past%20Proceedings/IJCAI--73/PDF/058.pdf73/PDF/058.pdf))