60
Carnegie Mellon 12/5/2002 IBM-Thomas T. J. Watson Res. Center SPIRAL: SPIRAL: Tuning DSP Transforms to Tuning DSP Transforms to Computing Platforms Computing Platforms Jeremy Johnson (Drexel) Robert Johnson (MathStar Inc.) David Padua (UIUC) Viktor Prasanna (USC) Markus Püschel (CMU) Manuela Veloso (CMU) Franz Franchetti (TU Vienna) Gavin Haentjens (CMU) Pinit Kumhom (Drexel) Neungsoo Park (USC) David Sepiashvili (CMU) Bryan Singer (CMU) Yevgen Voronenko (Drexel) Jianxin Xiong (UIUC) Faculty Students http://www.ece.cmu.edu/~spiral José M. F. Moura

SPIRAL: Tuning DSP Transforms to

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

SPIRAL: SPIRAL: Tuning DSP Transforms to Tuning DSP Transforms to Computing PlatformsComputing Platforms

• Jeremy Johnson (Drexel)• Robert Johnson (MathStar Inc.)• David Padua (UIUC)• Viktor Prasanna (USC)• Markus Püschel (CMU)• Manuela Veloso (CMU)

• Franz Franchetti (TU Vienna)• Gavin Haentjens (CMU)• Pinit Kumhom (Drexel)• Neungsoo Park (USC)• David Sepiashvili (CMU)• Bryan Singer (CMU)• Yevgen Voronenko (Drexel)• Jianxin Xiong (UIUC)

Faculty Students

http://www.ece.cmu.edu/~spiral

José M. F. Moura

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

SponsorSponsor

Work supported by DARPA (DSO), Applied & Computational

Mathematics Program, OPAL, through grant managed by

research grant DABT63-98-1-0004 administered by the Army

Directorate of Contracting.

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

Moore’s Moore’s Law and Law and High(estHigh(est)) Performance Performance Scientific ComputingScientific Computing

arithmetic cost model not accurate for predicting runtime (one cache miss = 10 floating point ops)better performance models hard to getbest code is machine dependent (registers/caches size, structure)hand-tuned code becomes obsolete as fast as it is writtencompiler limitationsfull performance requires (in part) assembly coding

Moore’s Law: processor-memory bottleneckshort life cycles of computersvery complex architectures

• vendor specific• special instructions (MMX, SSE, FMA, …)• undocumented features

(single processor, off-the-shelf)

Effects on software/algorithms:

Portable performance requires automation

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

AutomaticAutomatic Performance Tuning: ResearchPerformance Tuning: Research

Linear Algebra:ATLAS (J. Dongarra et al.) LAPACKPhiPACK (J. Demmel et al.)

Signal Processing: FFTW (M. Frigo and S. Johnson)SPIRAL

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

SPIRALSPIRALAutomates

cuts development costscode less error-prone

takes advantage of architecture specific featuresporting without loss of performance

systematic exploration of alternatives both at algorithmic and code level

are performance critical

Implementation

Platform-Adaptation

Optimization

of DSP algorithms

A library generator for highly optimized signal processing algorithms

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

SPIRALSPIRAL ApproachApproach

DSP Transform(DFT, DCT, Wavelets etc.)

Computing Platform

given

given

PossibleImplementations

PerformanceEvaluation

Inte

llig

ent

Sea

rchPossible

Algorithms

SP

IRA

L S

earc

h S

pac

e

(Pentium III, Pentium 4, Athlon, SUN, PowerPC, Alpha, … )

adaptedimplementation

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

OrganizationOrganization

Mathematical Framework

Formula Generator

SPL and SPL Compiler

Search Engine

SPIRAL system

Conclusions

Transforms, Rules, and Formulas

Transform → Algorithm

Algorithm → Implementation

How to find the best implementation

Everything taken together

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

DSPDSP Algorithms: Example 4Algorithms: Example 4--point DFTpoint DFT

Cooley/Tukey FFT (size 4):

product of structured sparse matricesmathematical notation

−−

=

−−−−−−

1000

0010

0100

0001

1100

1100

0011

0011

000

0100

0010

0001

1010

0101

1010

0101

11

1111

11

1111

iii

ii

4222

42224 )()( LDFTITIDFTDFT ⋅⊗⋅⋅⊗=

Fourier transform

Identity Permutation

Diagonal matrix (twiddles)

Kronecker product

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

DSPDSP Algorithms: TerminologyAlgorithms: Terminology

Transform

Rule

Formula

( ) ( ) PDFTIDIDFTDFT mnmnnm ⋅⊗⋅⋅⊗→

( ) ( )( ) PFIIDIFDFT ⋅⊗⊗⋅⋅⊗= 222428

parameterized matrix

• a breakdown strategy • product of sparse matrices

• recursive application of rules• uniquely defines an algorithm• efficient representation• easy manipulation

nDFT

Ruletree 8DFT

2DFT 4DFT

2DFT 2DFT

• few constructs and primitives• uniquely defines an algorithm• can be translated into code

DFT is symmetric ⇒ transpose the rule:

CT rule transposed

Commuting tensor product factorsA and B square size m and n

Commutation property ⇒ further variations

Different patterns for access, storage and flow of data

( )mn mn

n mB A L A B L⊗ = ⊗

( ) ( )( ) ( )

RS RS RS RS

N S S R R S R S R

RS RS

N R S S S S R

F L I F L T I F L

F F I T L F I

= ⊗ ⊗

= ⊗ ⊗

( ) ( )RS RS

RS S R S S R SF L I F T F I= ⊗ ⊗

( ) ( )0 0

1 12 2 2 2

2 2

3 3

1 0 1 0 1 1 0 00 1 0 1 1 1 0 01 0 1 0 0 0 1 10 1 0 1 0 0 1 1

x xx xxF I I Fx xx x

−−

− −

= =⊗ ⊗

MoreMore CooleyCooley--Tukey RulesTukey Rules

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

DSPDSP TransformsTransforms

( )[ ]nkliDFT n /2exp π=

222 DFTDFTWHT k ⊗⊗=

( )( )[ ]nlkDCT nII /2/1cos)( π+=

( )( )[ ]nlkDCT nIV /)2/1(2/1cos)( π++=

( )[ ]nklDST nI /sin)( π=

( )[ ]nlnkMDCT nn /)2/1)(2/)1((cos2 π+++=×

Others: filtering, Haar, Hartley, …

discrete Fourier transform

Walsh-Hadamard transform

discrete cosine and sineTransforms (16 types)

modified discrete cosine transform

TT ⊗two-dimensional transform

( ) ( )H

Wl

n

n 22

2

2222 1-n1-1-n1-nn ILIDTWTDTWT −⊗⊕=discrete wavelet transform

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

RulesRules = Breakdown Strategies= Breakdown Strategies

( ) ( )Qn

IVn

IIn

IIn FIDCTDCTPDCT 22/

)(2/

)(2/

)( ⊗⋅⊕⋅→

DDCTSDCT IIn

IVn ⋅⋅→ )()(

( ) 2)(

2 2/1,1 FdiagDCT II ⋅→

rIV

n MMDCT 1)( →

1 1 12 2 2 21

( )n n n n n ni i i t

n

i

WHT I WHT I+ + + +− +

=→ ⊗ ⊗∏ … …

nnn SinDFTjCosDFTDFT ⋅+→

)(4/2/

IInnn DCTCosDFTCosDFT →

)(4/2/

IInnn DCTSinDFTSinDFT →

( ) ( ) PDFTIDIDFTDFT mnmnnm ⋅⊗⋅⋅⊗→

CDSTDCTBDFT In

Inn ⋅⊕⋅→ )( )(

2/)(2/

PDCTSMDCT IVnnn ⋅⋅→×

)(2

base case

recursive

translation

iterative

recursive

recursive

recursive

recursive

recursive

translation

iterative/recursive

( ) ( )H

Wl

n

n 22

2

2222 1-n1-1-n1-nn ILIDTWTDTWT −⊗⊕=

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

Partition TreesPartition Trees

4

1 1 1 14

1 3

1 11

4

1 3

1 2

11iterative

mixed

recursive

Each WHT algorithm can be represented by a tree,

where a node labeled by n corresponds to nWHT2

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

AlgorithmsAlgorithms = Ruletrees = Formulas= Ruletrees = Formulas)(

8IIDCT

)(4

IIDCT)(

4IVDCT

R1)()( 2/2

)(2/

)(2/

)(n

IVn

IIn

IIn IFDCTDCTPDCT ⊗⋅⊕⋅→

2FR3 R6

2FR4

R3

R1

R6

2F

2FR4

2)(

22

1FDCT II ⋅→

)(2

IIDCT

)(2

IIDST

)(2

IVDCT

)(2

IIDST

)(2

IIDCT

R1R6 SDCTPDCT II

nIV

n ⋅⋅→ )()(

)(4

IIDCT)(2

IVDCT

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

Mathematical Mathematical Framework for Filters Framework for Filters and Discreteand Discrete--Time Wavelet TransformsTime Wavelet Transforms

Overlapped direct sum of matrices A (m x n) and B (r x s)

Overlapped tensor product

column overlap

row overlap

A

Bv=⊕ BA v

( ) ( )Svn IIBA ⊕⊕=

AB

v

=⊕ BA v ( )( )BAII ⊕⊕= rv

m

foldk

vvvvk

⊕⊕⊕=⊗ BBBBI

foldk

vvvvk

⊕⊕⊕=⊗ BBBBI

column overlap

row overlap

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

PropertiesProperties of Overlapped Direct Sums of Overlapped Direct Sums and Tensor Productsand Tensor Products

( ) ( ) x x x 1 1

I B B I I B I Ikk

k v m n m n v n k m n k v nl l= =

⊗ = ⊕ ⊕ = ⊗ ⊗ i i

Column and row overlap direct sum

Column and row overlapped tensor product

AB

v

=⊕ BA v

T

=AT

BTv ( )A B

TT Tv= ⊕

( )Tvn s n v s⊕ = ⊕I I I I

( ) ( )

( ) ( )

x x x 1 11

x

I B I B I I B

I I I B

T Tk k kT Tvk m n v m m n k v m m n

l ll

vk m k m n

= ==

⊗ = ⊕ ⊕ = ⊗ ⊕

= ⊗ ⊗

i i

i

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

FiltersFiltersLinear convolution in matrix form

Block convolutionsOverlap-save rule

Overlap-add rule

Circular convolution rule for circulant K

( ) [ ]T110

1I −−⊗= l

lnn hhhhC

( ) ( ) ( )( )hChC bbnlbl

n/bn ⊗⊗= −+−

/11 III

( ) ( )( )( ) 1,111T

/ III −−−+−⊗⊗= lllblm/bbbmn EhChC m=n+l-1

( ) ( ) nnn hhK DFTDFTdiagDFT 1 ⋅⋅⋅= −h-first column of K

n-length of h

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

DiscreteDiscrete--TimeTime Wavelet TransformWavelet Transform

DTWT scaling and wavelet expansion coefficients

Recursive algorithm (Mallat)

( )( ) 1,...,02

2

1,

,0

−=−=

−=

−−∑

∑Jjnkhxd

nkhxc

jJjJ

nnkj

JJ

nnk

scaling coefficients

wavelet coefficients

h(-n)

h1(-n)

2

2

h(-n)

h1(-n)

2

2cJ,k

cJ-1,k

dJ-1,k

cJ-2,k

dJ-2,k

VJ-1

VJ-2

VJ WJ-1

WJ-2

1

, 2 1, , 2 1,j k m k j m j k m k j mm m

c h c d h c− + − += =∑ ∑

Haar wavelets = square waves

First stage: V2 ⇒V1⊕W1

The process is repeated for the upper half of the output

11 1

2 2[ 1, 1], [ 1, 1]h h= = −

( )4 4

2 2 2 21

21

1 1 0 0 1 1 0 0

0 0 1 1 1 1 0 0

1 1 0 0 0 0 1 1

0 0 1 1 0 0 1 1

H L L I Fc

H cd

−−

− −

= = ⊗

= =

( ) ( )1 1 1 1

2

2 2 22 2 2 22,

n

n n n nn

H

HTHT I L I F HT F− − − −= ⊕ ⊗ =

HaarHaar Wavelets Wavelets –– ExampleExample

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

DiscreteDiscrete--TimeTime Wavelet TransformWavelet TransformDiscrete-Time Wavelet Transform (DTWT) rule

Scaling (lowpass) and wavelet (highpass) filter coefficients

DTWT - convolution rule

( ) ( )H

Wl

n

n 22

2

2222 1-n1-1-n1-nn ILIDTWTDTWT −⊗⊕=

[ ]( ) ( ) ( )( )([ ]( ) ( ) ( )( ) )

⋅⋅⊕⋅⊗

⊕⋅⊕⋅⊗=

nn

mmm

nmmm

heChoC

leCloCH

I1

1LI11

LI11

2T

2/T

2/2/

2T

2/T

2/2/

′′′

=−

110

110

l

l

hhh

hhhW

lo-lowpass odd coeffs., le-lowpass even coeffs.

ho-highpass odd coeffs., he-highpass even coeffs.

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

DTWTDTWT RuletreeRuletree –– ExampleExample

DTWTn

DTWTn/2 C(lo) Cm/2(le) C(ho) C(he)

DTWT-convolution rule

DTWTn/4filters

Overlap-add rule

Cb(le)

Circulant rule

K(le0)

Circular convolution rule

1DFT−s sDFT

m=n+l-1

le0=le padded with b-1 zeroes

s=l/2+b-1

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

FormulaFormula for a DCT, size 16for a DCT, size 16

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

HelpfulHelpful ConceptConcept

DSP Transforms Formal Languages

DSP transform (of size)

Rule

Formula/Algorithm

Non-terminal symbol (with attribute)

Rule (production)

Element in Language (only terminals)

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

MathematicalMathematical Framework: SummaryFramework: Summary

fast algorithms represented as ruletrees (easy generation/manipulation)and as formulas (can be translated into code)

formulas built from few constructs and primitives

many different algorithms/formulas generated from few rules (combinatorial explosion)

these algorithms are (essentially) equal in arithmetic cost, but differ in data flow

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

OrganizationOrganization

Mathematical Framework

Formula Generator

SPL and SPL Compiler

Search Engine

SPIRAL system

Conclusions

Transforms, Rules, and Formulas

Transform → Algorithm

Algorithm → Implementation

How to find the best implementation

Everything taken together

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

FormulaFormula GenerationGeneration

recursive application

Formula Generator

data base (extensible!)

data type

rules control

transforms formulasruletrees

translation

search engine formula translation

(spl compiler)export

runtime

written in GAP/AREP (computer algebra system)

all computation/manipulation is symbolic

exact arithmetic

easy extensible rule and transform data base

verification of rules and formulas

cut here for otheroptimization problems

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

Formula Generator: SummaryFormula Generator: Summary

Concept:

Implementation:

• easy extensible• formulas as ruletrees (efficient representation, easy manipulation)• interface to spl• allows implementation of search methods

• efficient implementation moves bottleneck to spl compiler• fully symbolic derivation/manipulation of formulas• verification• tools to analyze structure, arithmetic cost of formulas• interfaces with search engine

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

OrganizationOrganization

Mathematical Framework

Formula Generator

SPL and SPL Compiler

Search Engine

SPIRAL system

Conclusions

Transforms, Rules, and Formulas

Transform → Algorithm

Algorithm → Implementation

How to find the best implementation

Everything taken together

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

FormulasFormulas in SPLin SPL

( compose( diagonal ( 2*cos(1/16*pi) 2*cos(3/16*pi) 2*cos(5/16*pi) 2*cos(7/16*pi) ) )( permutation ( 1 3 4 2 ) )( tensor

( I 2 )( F 2 )

)( permutation ( 1 4 2 3 ) )( direct_sum

( compose( F 2 )( diagonal ( 1 sqrt(1/2) ) )

)( compose

( matrix( 1 1 0 )( 0 (-1) 1 )

)( diagonal ( cos(13/8*pi)-sin(13/8*pi) sin(13/8*pi) cos(13/8*pi)+sin(13/8*pi) ) )( matrix( 1 0 )( 1 1 )( 0 1 )

)( permutation ( 2 1 ) )

• • • •

• • • •

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

SPLSPL Syntax (Subset)Syntax (Subset)matrix operations:

(compose formula formula ...)(tensor formula formula ...)(direct_sum formula formula ...)

direct matrix description:(matrix (a11 a12 ...) (a21 a22 ...) ...)(diagonal (d1 d2 ...))(permutation (p1 p2 ...))

parameterized matrices:(I n)(F n)

scalars:1.5, 2/7, cos(..), w(3), pi, 1.2e-04

definition of new symbols:(define name formula)(template formula (i-code-list)

directives for code generation#codetype real/complex#unroll on/off

allows extension of SPL

controls loop unrolling

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

SPLSPL Compiler, 4Compiler, 4--point FFTpoint FFT

(compose (tensor (F 2) (I 2)) (T 4 2)(tensor (I 2) (F 2)) (L 4 2))

f0 = x(1) + x(3)f1 = x(1) - x(3)f2 = x(2) + x(4)f3 = x(2) - x(4)f4 = (0.00d0,-1.00d0)*f(3)y(1) = f0 + f2y(2) = f0 - f2y(3) = f1 + f4y(4) = f1 - f4

r0 = x(1) + x(5)r1 = x(1) - x(5)r2 = x(2) + x(6)r3 = x(2) - x(6)r4 = x(3) + x(7)r5 = x(3) - x(7)r6 = x(4) + x(8)r7 = x(4) - x(8)y(1) = r0 + r4y(2) = r1 + r5y(3) = r0 - r4y(4) = r1 - r5y(5) = r2 + r7y(6) = r3 - r6y(7) = r2 - r7y(8) = r3 + r6

fast algorithmas

formulaas

SPL program#codetype

complex real

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

SPLSPL Compiler: SummaryCompiler: Summary

Parsing

Intermediate Code Generation

Intermediate Code Restructuring

Target Code Generation

Symbol Table Abstract Syntax Tree

I-Code

I-Code

C, FORTRAN function

Template Table

SPL Formula Template DefinitionSymbol Definition

OptimizationI-Code

SPL Program

Built-in optimizations:

• single static assignment code• no reuse of temporary vars• only scalar temporary vars• constants precomputed

Extensible through templates

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

TemplatesTemplates

(template(F n)[ n >= 1 ]( do i=0,n-1

y(i)=0do j=0,n-1y(i)=y(i)+W(n,i*j)*x(j)

endend ))

Pattern

I-code

Condition

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

Code Generation and Template MatchingCode Generation and Template Matching

(F 2) matches pattern (F n) and assigns 2 to n.Because n=2 satisfies the condition n>=1,the following i-code is generated from the template:

do i = 0,1y(i) = 0do j = 0,1

y(i) = y(i)+W(2,i*j)*x(j)end

end

Y(0)=x(0)+x(1)y(1)=x(0)-x(1)

Unrolling & Optimization

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

SIMDSIMD Short Vector ExtensionsShort Vector Extensions

+ x

vector length = 4(4-way)

Extension to instruction set architectureAvailable on most current architectures (SSE on Pentium, AltiVec on Motorola G4)Originally for multimedia (like MMX for integers)Requires fine grain parallelismLarge potential speed-up

SIMD instructions are architecture specificNo common API (usually assembly hand coding)Performance very sensitive to memory accessAutomatic vectorization very limited

Problems:

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

Vector Vector Code Generation from SPL FormulasCode Generation from SPL Formulas

Naturally vectorizable construct

A

x y

4IA ⊗vector length

iiii

k

ii QEIADP )(

1υ⊗∏

=

Pi, Qi permutationsDi, Ei diagonalsAi arbitrary formulasν SIMD vector length

(Current) generic construct completely vectorizable:

Vectorization in two steps:

1. Formula manipulation using manipulation rules2. Code generation (vector code + C code)

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

SPL SPL Compiler: Vector Code GenerationCompiler: Vector Code Generation

( ) ( )( )⋅′⋅⊗⋅⊗=16444

84416 TIDFTLIDFT

( ) ( ) 16444

1644416 LDFTITIDFTDFT ⋅⊗⋅⋅⊗=

( )( )( ) ( ) ( )( )82444

8442

164

824 LIIDFTLIILLI ⊗⋅⊗⋅⊗⊗⊗

Symbolic vectorization(automatic formula manipulation)

Mapping to C code + vector API

LOAD_VECT(xl0, x + 0);LOAD_VECT(xl4, x + 16);f0 = SIMD_SUB(xl0, xl4);LOAD_VECT(xl1, x + 4);LOAD_VECT(xl5, x + 20);f1 = SIMD_SUB(xl1, xl5);...yl7 = SIMD_SUB(f1, f4);STORE_L_8_4(yl6, yl7, y + 24);yl2 = SIMD_SUB(f0, f5);yl3 = SIMD_ADD(f1, f4);STORE_L_8_4(yl2, yl3, y + 8);

SSESSE2AltiVec…

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

OrganizationOrganization

Mathematical Framework

Formula Generator

SPL and SPL Compiler

Search Engine

SPIRAL system

Conclusions

Transforms, Rules, and Formulas

Transform → Algorithm

Algorithm → Implementation

How to find the best implementation

Everything taken together

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

WhyWhy Search?Search?Toy problem

DCT IV- size 24

~31000 formulas

Search in algorithm space in SPIRAL:Exhaustive & Random Search, DP, Hill climbing, Genetic AlgorithmsBeyond search: design of optimal tree

DCT IV- size 24

~31000 formulas

scheduled

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

WhyWhy Search?Search?

DCT, type IV, size 16

• maaaany different formulas• large spread in runtimes, even for modest size• precisely equal arithmetic cost• best formula is platform-dependent

~31000 formulas

Toy problem:scheduled

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

NumberNumber of Formulas/Algorithmsof Formulas/Algorithms

k

123456789

# DFT, size 2^k

16

40296

27744162570361280~1.01 • 10^27~2.31 • 10^61

~2.86 • 10^133

# DCT-IV, size 2^k

110

12631242

19244433627343815121631354242

~1.07 • 10^38~2.30 • 10^76

~1.06 • 10^153

differ in data flow not in arithmetic cost exponential search space

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

SearchSearch Methods Available in SPIRALMethods Available in SPIRAL

Exhaustive SearchDynamic Programming (DP)Random SearchHill ClimbingSTEER (similar to a genetic algorithm)

Good100s-1000sAllHill Climbing

(very) good100s-1000sAllSTEER

fair/goodUser decidedAllRandom

(very) good10s-100sAll DP

BestAllVery smallExhaust

ResultsTimedSizesFormulasPossible

Search over • algorithm space and • implementation options (degree of unrolling)

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

STEERSTEERPopulation n:

Population n+1:

……

……

Mutation

Cross-Breeding expanddifferently

swapexpansions

Survival of Fittest

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

LearningLearning to Generate Fast Algorithmsto Generate Fast Algorithms

• Learns from given dataset (formulas+runtimes) how to design a fast algorithm (breakdown strategy)• Learns from a transform of one size, generates the best algorithm for many sizes• Tested for DFT and WHT

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

Some Experimental ResultsSome Experimental Results

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

DCTDCT Type IV Size 16Type IV Size 16

Fastest Found Formulas Number of Formulas Timed

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

Experimental Experimental ResultsResults

high performance code(compared with FFTW)

different transforms

search methods(applicable to all transforms)

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

Some Experimental ResultsSome Experimental Results

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

GeneratedGenerated DFT Code: Pentium 4, SSEDFT Code: Pentium 4, SSE(P

seu

do

) g

flo

ps

DFT 2n single precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0

n

speedups (to C code) up to factor of 3.1

0

1

2

3

4

5

6

7

4 5 6 7 8 9 10 11 12 13

Spiral SSEIntel MKL interl.FFTW 2.1.3Spiral CSpiral C vectSIMD-FFT

hand-tuned vendor assembly code

compiler vectorization

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

GeneratedGenerated DFT Code: Pentium 4, SSE2DFT Code: Pentium 4, SSE2g

flo

ps

DFT 2n double precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0

n

speedups (to C code) up to factor of 1.8

0

0.5

1

1.5

2

2.5

3

3.5

4 5 6 7 8 9 10 11 12

Intel MKL splitSpiral SSE2FFTW 2.1.3Spiral CSpiral C vectSpiral SSE2 exh.

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

GeneratedGenerated DFT Code: Pentium III, SSEDFT Code: Pentium III, SSEg

flo

ps

DFT 2n single precision, Pentium III, 1 GHz, using Intel C compiler 6.0n

speedups (to C code) up to factor of 2.1

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

4 5 6 7 8 9 10 11 12 13

Intel MKLSpiral SSESpiral C vectSpiral CFFTW 2.1.3

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

Other Other transformstransforms

0

0.5

1

1.5

2

2.5

3

3.5

4 5 6 7 8 9 10 11 12 13 14

Spiral floatSpiral C vectSpiral C SSE

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

4x4 8x8 16x16 32x32 64x64 128x128

Spiral SSESpiral C vectSpiral C float

gfl

op

s

transform size transform size

2-dim DCT 2n x 2n

Pentium 4, 2.53 GHz, SSEWHT 2n

Pentium 4, 2.53 GHz, SSE

speedups (to C code) up to factor of 3

WHT has only additionsvery simple transform

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

Best Best DFT Trees, size DFT Trees, size 210 = 1024= 1024

scalar

C vect

SIMD

Pentium 4float

Pentium 4double

Pentium IIIfloat

AthlonXPfloat

10

6

4

4

10

87

5

10

8

6

4

2

2

2

2 2

2 2

2 2

2

21

2

2 3

10

8

5

2

2

2 3

10

97

5

12

22 3

10

6

2

4

2 2

2 2

4

10

4

2

6

2 4

2 2

2

10

5

2

5

2 3 3

10

6

3

4

2 2 3

10

8

5

2

2

2 3

10

6

4

4

2 2

2 2

2

10

5

2

5

2 3 3

trees platform/datatype dependent

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

CrosstimingCrosstiming of best trees on Pentium 4of best trees on Pentium 4S

low

dow

n f

acto

r w

.r.t

. bes

t

DFT 2n single precision, runtime of best found of other platforms

n

software adaptation is necessary

1.00

1.50

2.00

2.50

3.00

3.50

4.00

4.50

5.00

4 5 6 7 8 9 10 11 12 13

Pentium 4 SSEPentium 4 SSE2AthlonXP SSEPentiumIII SSEPentium 4 float

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

OrganizationOrganization

Mathematical Framework

Formula Generator

SPL and SPL Compiler

Search Engine

SPIRAL system

Conclusions

Transforms, Rules, and Formulas

Transform → Algorithm

Algorithm → Implementation

How to find the best implementation

Everything taken together

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

SPIRAL SPIRAL SystemSystem

DSP transformspecifies

user

goes for a coffee

Formula Generator

SPL Compiler Sea

rch

En

gin

e

runtime (performance)

controlsimplementation options

controlsalgorithm generation

fast algorithmas SPL formula

C/Fortran/SIMDcodeS P

I R

A L

(or an

espresso

for sm

all transfo

rms)

platform-adaptedimplementation

comes back

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

SPIRALSPIRAL SystemSystem

Available for download (v3.1): www.ece.cmu.edu/~spiral

Easy installation (Unix: configure/make; Windows: install shield)

Unix/Linux and Windows 98/ME/NT/2000/XP

Current transforms: DFT, DHT, WHT, RHT, DCT/DST type I – IV,

MDCT, Filters, Wavelets, Toeplitz, Circulants

Extensible

New version (4.0) in preparation

~ 30 Publications in the areas of Signal Processing, High

Performance Computing, Compilers, Machine Learning,

Mathematics

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

SPIRALSPIRAL System: SummarySystem: Summary

Available for download: www.ece.cmu.edu/~spiral

Easy installation (Unix: configure/make; Windows: install shield)

Unix/Linux and Windows 98/ME/NT/2000/XP

Current transforms: DFT, DHT, WHT, DCT/DST type I – IV,

MDCT, Filters, Wavelets, Toeplitz, Circulants

Extensible

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

OrganizationOrganization

Mathematical Framework

Formula Generator

SPL and SPL Compiler

Search Engine

SPIRAL system

Conclusions

Transforms, Rules, and Formulas

Transform → Algorithm

Algorithm → Implementation

How to find the best implementation

Everything taken together

Car

negi

e M

ello

n

12/5/2002 IBM-Thomas T. J. Watson Res. Center

ConclusionsConclusions

Mathematical computer representation of algorithmsAutomatic translation of algorithms into code

SPIRAL closes the gap between math domain (algorithms) andimplementation domain (programs)

High level: Mathematical manipulation of algorithmsLow level: Coding degrees of freedom

SPIRAL does automatic optimization by intelligent search/learning in the space of alternatives

a new paradigm of software development