25
Performance Analysis Performance Analysis O O f f Generics Generics I I n Scientific Computing n Scientific Computing Laurentiu Dragan Stephen M. Laurentiu Dragan Stephen M. Watt Watt Ontario Research Centre for Computer Ontario Research Centre for Computer Algebra Algebra University of Western Ontario University of Western Ontario SYNASC 2005 SYNASC 2005

Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

Embed Size (px)

Citation preview

Page 1: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

Performance Analysis Performance Analysis OOf f Generics Generics

IIn Scientific Computingn Scientific Computing

Laurentiu Dragan Stephen M. WattLaurentiu Dragan Stephen M. Watt

Ontario Research Centre for Computer AlgebraOntario Research Centre for Computer Algebra

University of Western OntarioUniversity of Western Ontario

SYNASC 2005SYNASC 2005

Page 2: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

OverviewOverview

MotivationMotivation

Parametric Polymorphism ImplementationParametric Polymorphism Implementation

Generalizing A Numeric BenchmarkGeneralizing A Numeric Benchmark

Language IssuesLanguage Issues

ResultsResults

Potential OptimizationsPotential Optimizations

ConclusionConclusion

Page 3: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

MotivationMotivation

Increasing demand for generic codeIncreasing demand for generic code

Scientific code requires high-performance making Scientific code requires high-performance making optimizations very importantoptimizations very important

Generic code – not as fast as specialized codeGeneric code – not as fast as specialized code

No tools to measure performance of generic codeNo tools to measure performance of generic code

Benchmarks – tool to measure the performance Benchmarks – tool to measure the performance

SciGMark – benchmark for generic codeSciGMark – benchmark for generic code

Compilers – optimize the generic code – performance Compilers – optimize the generic code – performance close to hand specialized codeclose to hand specialized code

Page 4: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

Parametric Polymophism Parametric Polymophism ImplementationImplementation

Some languages with support for GenericsSome languages with support for Generics– Aldor, C++Aldor, C++– Java, C#Java, C#

Some types can be given as parametersSome types can be given as parameters

ImplementationsImplementations– Homogeneous: Java, C#Homogeneous: Java, C#

Share the generic codeShare the generic code

Example: Example: Vector<Integer>Vector<Integer> → → Vector Vector with elements of type with elements of type ObjectObject

– Heterogeneous: C++, C#Heterogeneous: C++, C#Specialize the generic codeSpecialize the generic code

Example: Example: std::vector<int>std::vector<int> → new specialized class → new specialized class

Page 5: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

Generalizing A Numeric Generalizing A Numeric BenchmarkBenchmark

SciMark 2SciMark 2

Polynomial MultiplicationPolynomial Multiplication

Implemented in Aldor, C++, C#, JavaImplemented in Aldor, C++, C#, Java

Page 6: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

SciMark 2SciMark 2

Fast Fourier transform – 1024Fast Fourier transform – 1024– Complex arithmetic, shuffling, non-constant memory Complex arithmetic, shuffling, non-constant memory

reference, trigonometric functionsreference, trigonometric functions

Jacobi successive over-relaxation – 100x100Jacobi successive over-relaxation – 100x100– Typical access patterns in finite difference applicationsTypical access patterns in finite difference applications

Monte Carlo integrationMonte Carlo integration– Random number generator, function inliningRandom number generator, function inlining

Sparse matrix multiplication – 1000, 5000 non-zeroSparse matrix multiplication – 1000, 5000 non-zero– Indirection addressing, non-regular memory referencesIndirection addressing, non-regular memory references

Dense LU factorization – 100x100Dense LU factorization – 100x100– Dense matrix operationsDense matrix operations

Page 7: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

From SciMark 2 to SciGMarkFrom SciMark 2 to SciGMark

SciMark – double hardcodedSciMark – double hardcoded– Arrays are of type doubleArrays are of type double– Any change – extensive modifications to the codeAny change – extensive modifications to the code

SciGMark – classes are parametricSciGMark – classes are parametric– Change representation – minimal code changesChange representation – minimal code changes– Double becomes parameter RDouble becomes parameter R

+

R a(R o)

void ae(R o)

doubleR

DoubleRing

Class SOR {Class SOR { double[] array;double[] array;}}

Class SOR < R extends IRing<R> > {Class SOR < R extends IRing<R> > { R [ ] array;R [ ] array;}}

Page 8: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

Basic Generic TypesBasic Generic Types

IRing IRing – Provides operations for addition, subtraction, multiplication, Provides operations for addition, subtraction, multiplication,

division – mutable, non-mutabledivision – mutable, non-mutable– Conversions to and from Conversions to and from intint and and doubledouble– Factories to produce new elements of these typeFactories to produce new elements of these type

DoubleRing – wrapper for doubleDoubleRing – wrapper for double– Implements IRingImplements IRing

ComplexComplex– Implements IComplex (simple extension to IRing)Implements IComplex (simple extension to IRing)– Complex<R extends IRing<R>> Complex<R extends IRing<R>>

implements IComplex<Complex<R>,R> implements IComplex<Complex<R>,R>

Page 9: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

Generic Tests Generic Tests

GenFFTGenFFT– Uses R: Complex<DoubleRing>Uses R: Complex<DoubleRing>– Complex numbers – two consecutive entries in the arrayComplex numbers – two consecutive entries in the array

Depending on the application – different representation (e.g. Depending on the application – different representation (e.g. Hermitian matrix)Hermitian matrix)

GenMat, GenLUGenMat, GenLU– Use R: DoubleRingUse R: DoubleRing– The classes contain more methods – the whole class The classes contain more methods – the whole class

contains a type parametercontains a type parameter

GenSOR, GenMonteCarloGenSOR, GenMonteCarlo– Use R: DoubleRingUse R: DoubleRing– Have single static method with a type parameterHave single static method with a type parameter

Page 10: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

Polynomial MultiplicationPolynomial Multiplication

40 coefficients40 coefficients

Dense representation unidimensional arrayDense representation unidimensional array

Regular memory access, temporary objects creation Regular memory access, temporary objects creation (memory allocation)(memory allocation)

ImplementationImplementation– DensePolynomialDensePolynomial

DensePolynomialG <E extends IRing<E> > implements DensePolynomialG <E extends IRing<E> > implements IRing<DensePolynomialG<E> >IRing<DensePolynomialG<E> >

– SmallPrimeFieldSmallPrimeFieldRepresented by an intRepresented by an int

SmallPrimeFieldG implements IRing<SmallPrimeFieldG>SmallPrimeFieldG implements IRing<SmallPrimeFieldG>

Page 11: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

Specializing Polynomial Specializing Polynomial MultiplicationMultiplication

The code was initially implemented using genericsThe code was initially implemented using generics

Inlined all the calls to Inlined all the calls to SmallPrimeFieldSmallPrimeField

Replaced all the instances of Replaced all the instances of SmallPrimeFieldSmallPrimeField with with intint

Essentially the inverse of the operation performed to Essentially the inverse of the operation performed to “generalize” the SciMark“generalize” the SciMark

No changes to the algorithm – all changes could be No changes to the algorithm – all changes could be performed automaticallyperformed automatically

Page 12: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

Language IssuesLanguage Issues

JavaJava– No operator overloadingNo operator overloading– Homogeneous – erasure technique – subclassingHomogeneous – erasure technique – subclassing– Implemented at language level – no virtual machine support Implemented at language level – no virtual machine support

– limitations – require object factory– limitations – require object factory– Type inference for generics is invariant – Pass the type as Type inference for generics is invariant – Pass the type as

argument argument Complex <R extends IRing<R>> implements IComplex<Complex<R>,R>Complex <R extends IRing<R>> implements IComplex<Complex<R>,R>

C#C#– Reference types (homogeneous) – Java; primitive types Reference types (homogeneous) – Java; primitive types

(heterogeneous) – C++(heterogeneous) – C++– Structures instead of classes – structures in collections are Structures instead of classes – structures in collections are

boxedboxed

Page 13: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

Language IssuesLanguage Issues

C++C++– HeterogeneousHeterogeneous– Parametric polymorphism (templates) macro processorParametric polymorphism (templates) macro processor– No bounded polymorphismNo bounded polymorphism– No way to test the generic class until is instantiateNo way to test the generic class until is instantiate

AldorAldor– HomogeneousHomogeneous– Supports dependent typesSupports dependent types– Polymorphic types constructed using domain constructing Polymorphic types constructed using domain constructing

functionsfunctions

Page 14: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

SciGMark ResultsSciGMark Results

Results in MFlops Results in MFlops

Testing environment:Testing environment:– Pentium IV – 3.2GHz (1MB cache), 2 GB RAMPentium IV – 3.2GHz (1MB cache), 2 GB RAM– Windows XP SP2Windows XP SP2– Cygwin/GCC 3.4.4Cygwin/GCC 3.4.4– Sun JDK 1.5.0_04 Sun JDK 1.5.0_04 – Microsoft .NET v2.0.50215Microsoft .NET v2.0.50215– Aldor 1.0.2Aldor 1.0.2

Page 15: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

SciGMark ResultsSciGMark Results

N/A35920320244415743471Comp.

401566321282274836562PM

100x10055354031898274780103LU

1000, 500048544773941011173987MM

N/A20390622826226546MC

100x10041715417226816641971SOR

1024340124273212336559FFT

SpeGenSpeGenSpeGenSpeGenSize

AldorC#JavaC++Test

Page 16: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

SciGMark ResultsSciGMark Results

N/A35920320244415743471Comp.

401566321282274836562PM

100x10055354031898274780103LU

1000, 500048544773941011173987MM

N/A20390622826226546MC

100x10041715417226816641971SOR

1024340124273212336559FFT

SpeGenSpeGenSpeGenSpeGenSize

AldorC#JavaC++Test

Page 17: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

SciGMark ResultsSciGMark Results

N/A35920320244415743471Comp.

401566321282274836562PM

100x10055354031898274780103LU

1000, 500048544773941011173987MM

N/A20390622826226546MC

100x10041715417226816641971SOR

1024340124273212336559FFT

SpeGenSpeGenSpeGenSpeGenSize

AldorC#JavaC++Test

Page 18: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

Aldor ResultsAldor Results

Testing environment:Testing environment:– Pentium IV – 3.2GHz (1MB cache), 2 GB RAMPentium IV – 3.2GHz (1MB cache), 2 GB RAM– Linux Fedora Core 3Linux Fedora Core 3– Aldor 1.0.2Aldor 1.0.2

Stanford benchmark Stanford benchmark – Aldor’s performance can be almost as good as C++Aldor’s performance can be almost as good as C++

Page 19: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

Aldor ResultsAldor Results

1.291.43Comp int

1.051.07Comp FP

407190.24268380.38Oscar FFT

203550.49143420.69FP Mat Mult

102.00101.00Tree Sort

190890.53135260.74Bubble Sort

152140.66125380.79Quick Sort

46262.1634842.89Puzzle

491550.20153860.65Mat Mult

219870.45197000.528-Queen

469240.21172970.58Towers

269010.37234000.43Permutations

IterationsTimeIterationsTime

C++AldorTest

Page 20: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

Potential OptimizationsPotential Optimizations

6-18 times performance improvement6-18 times performance improvement

Specialized codeSpecialized code– Same algorithmSame algorithm– Generic types replaced by specialized typesGeneric types replaced by specialized types– Eliminate generic wrapper objects – primitive typesEliminate generic wrapper objects – primitive types

Page 21: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

Test Case AldorTest Case Aldor

Domain producing function:Domain producing function:

PolynomialVect(C: Ring) == add {PolynomialVect(C: Ring) == add { Rep == Vector Polynomial C; Rep == Vector Polynomial C; (f: %) + (g: %): % == { (f: %) + (g: %): % == { res := new(#f); res := new(#f); rf := rep f; rg := rep g; rf := rep f; rg := rep g; for k in 1..#f for i in rf for j in rg repeat for k in 1..#f for i in rf for j in rg repeat res(k) := i + j; res(k) := i + j; per res per res } }}}PC == PolynomialVect(Complex DoubleFloat);PC == PolynomialVect(Complex DoubleFloat);PQ == PolynomialVect(Rational);PQ == PolynomialVect(Rational);

Page 22: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

Test Case AldorTest Case Aldor

PolynomialVect(PolynomialVect(CC: Ring) == add {: Ring) == add { Rep == Vector Polynomial Rep == Vector Polynomial CC;; (f: %) + (g: %): % == { (f: %) + (g: %): % == { res := new(#f); res := new(#f); rf := rep f; rg := rep g; rf := rep f; rg := rep g; for k in 1..#f for i in rf for j in rg repeat for k in 1..#f for i in rf for j in rg repeat res(k) := i res(k) := i ++ j; j; per res per res } }}}PC == PolynomialVect(Complex DoubleFloat);PC == PolynomialVect(Complex DoubleFloat);PQ == PolynomialVect(Rational);PQ == PolynomialVect(Rational);

Domain producing function:Domain producing function:

Page 23: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

Test Case AldorTest Case Aldor

Specialize the domain producing functionSpecialize the domain producing function

PC == add {PC == add { Rep == Vector Polynomial Rep == Vector Polynomial Complex DoubleFloatComplex DoubleFloat;; (f: %) + (g: %): % == { (f: %) + (g: %): % == { res := new(#f); res := new(#f); rf := rep f; rg := rep g; rf := rep f; rg := rep g; for k in 1..#f for i in rf for j in rg repeat for k in 1..#f for i in rf for j in rg repeat res(k) := i res(k) := i ++ j; j; -- ‘+’ from Complex-- ‘+’ from Complex per res per res } }}}

Page 24: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

Optimize Data RepresentationOptimize Data RepresentationScalar product of vector of complex numbersScalar product of vector of complex numbers

dot(u: Vector Complex R, v: Vector Complex R): Complex R == {dot(u: Vector Complex R, v: Vector Complex R): Complex R == { ss: Complex R := 0;: Complex R := 0; for i in 1..n repeat for i in 1..n repeat ss := := ss + u.i*v.i; + u.i*v.i; return s; return s;}}

dot(u: Vector Complex R, v: Vector Complex R): Complex R == {dot(u: Vector Complex R, v: Vector Complex R): Complex R == { xx: R := 0; : R := 0; yy: R := 0;: R := 0; for i in 1..n repeat { for i in 1..n repeat { xx := := xx + real(u.i)*real(v.i) - imag(u.i)*imag(v.i); + real(u.i)*real(v.i) - imag(u.i)*imag(v.i); yy := := yy + real(u.i)*imag(v.i) + imag(u.i)*real(v.i); + real(u.i)*imag(v.i) + imag(u.i)*real(v.i); } } return complex(x,y); return complex(x,y);}}

Page 25: Performance Analysis Of Generics In Scientific Computing Laurentiu Dragan Stephen M. Watt Ontario Research Centre for Computer Algebra University of Western

ConclusionConclusion

Generics important for scientific computing – rich Generics important for scientific computing – rich mathematical models – easy to implement with generic mathematical models – easy to implement with generic codecode

Need a tool to measure the compiler ability to produce Need a tool to measure the compiler ability to produce efficient codeefficient code

We have seen difference of 6-18 times between We have seen difference of 6-18 times between generic and specialized code – room for improvement generic and specialized code – room for improvement in compilers capabilitiesin compilers capabilities

Presented some optimizations ideasPresented some optimizations ideas

http://www.orrca.on.ca/benchmarks/scigmark/1.0/http://www.orrca.on.ca/benchmarks/scigmark/1.0/