The DOE ACTS Collection

Dynamic Selection of Auto-tuned Kernels to the Numerical Libraries in the DOE ACTS Collection

Tony Drummond
Computational Research Division, Lawrence Berkeley National Laboratory

SIAM Parallel Processing for Scientific Computing, Savannah, Georgia, Feb 15, 2012



Dynamic Selection of Auto-tuned Kernels to the Numerical Libraries in the DOE ACTS Collection SIAM PP12, Savannah, GA - February 15, 2012

The DOE ACTS Collection Project

Goal: The Advanced CompuTational Software Collection (ACTS) makes reliable and efficient software tools more widely used, and more effective in solving the nation’s engineering and scientific problems.

Tony Drummond and Osni Marques
Computational Research Division, Lawrence Berkeley National Laboratory


What is the Role of ACTS in the HPC Software Stack?

APPLICATIONS

GENERAL PURPOSE TOOLS

PLATFORM SUPPORT TOOLS AND UTILITIES

HARDWARE


ACTS Plays a Critical Role in the HPC Software Stack

APPLICATIONS

GENERAL PURPOSE TOOLS

PLATFORM SUPPORT TOOLS AND UTILITIES

HARDWARE

Tools in the collection: PETSc, SLEPc, SuperLU, AztecOO, ScaLAPACK, TAO, Hypre, ATLAS, PyACTS, Global Arrays, SUNDIALS, Overture, TAU

ACTS accelerates application code development by maintaining a solid collection with some of the best numerical kernels and support tools for code development, run-time and library optimization.


The DOE ACTS Collection

Category: Numerical
  • AztecOO: Scalable linear and non-linear solvers using iterative schemes.
  • Hypre: A family of scalable preconditioners.
  • PETSc: Scalable linear and non-linear solvers and additional support for PDE related work.
  • OPT++: Object-oriented nonlinear optimization solvers.
  • SUNDIALS: Solvers for systems of ordinary differential equations, nonlinear algebraic equations, and differential-algebraic equations.
  • ScaLAPACK: High performance parallel dense linear algebra.
  • SLEPc: Scalable algorithms for the solution of large sparse eigenvalue problems.
  • SuperLU: Scalable direct solution of large, sparse, nonsymmetric linear systems of equations.
  • TAO: Large-scale optimization software.

Category: Code Development
  • Global Arrays: Supports the development of parallel programs.
  • Overture: Supports the development of computational fluid dynamics codes in complex geometries.

Category: Run Time Support
  • TAU: Portable and scalable performance analysis and tracing tools for C, C++, Fortran and Java programs.

Category: Library Development
  • ATLAS: Automatic generation of optimized numerical dense algebra for scalar processors.


Numerical Functionality in the ACTS Collection

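• Linear systems: $Ax = b$ or $AX = B$
• Linear least squares: $\min_x \|b - Ax\|_2$
• Minimum-norm solutions: $\min_x \|x\|_2$
• Standard eigenvalue problems: $Az = \lambda z$
• Generalized eigenvalue problems: $Az = \lambda Bz$, $ABz = \lambda z$, $BAz = \lambda z$
• Singular value decomposition: $A = U \Sigma V^T$, $A = U \Sigma V^H$
• $Hx = b'$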

Commonalities among ACTS Tools:
• General purpose user interfaces
• Parallel and scalable implementations of numerical algorithms
• Modular design (kernel reusability)
• Parallelism exploited at the MPI_TASK level (newer versions under development to support other levels of concurrency)


Providing Sustainable and Scalable Performance for ACTS Tools in Multicore Systems

TOOL DEVELOPERS

• Hand-tuned codes
• Compiler optimization
• Auto-tuning
• New programming environments
• CS community efforts

APPLICATION DEVELOPERS

APPLICATIONS

The challenge is to avoid code rewrites for performance.


Providing Sustainable and Scalable Performance for ACTS Tools in Multicore Systems

TOOL DEVELOPERS
• Development of new numerical functionalities and implementation
• Integration of auto-tuned kernels (BLAS, LAPACK, TORCH, etc.)
• Adoption of new programming models and paradigms

APPLICATION DEVELOPERS
• Tool and functionality selection
• Choosing functional parameters
• Compile/Link: integrating optimized kernels
• Runtime: dynamic kernel selection
• Verification: robustness and scalability


Providing Sustainable and Scalable Performance for ACTS Tools in Multicore Systems

Q. Can Performance Scalability be passed from libraries/tools to applications and across platforms and configurations?


LU Solve (ScaLAPACK), CRAY - XTE6

Execution time improvement vs. performance scalability

[Figure: % improvement vs. number of MPI_TASKS (np), with process grids np = 4 (2,2), np = 8 (2,4), np = 16 (4,4), np = 24 (4,6); one series per problem size: 11000, 13000, 15000.]

Compiler Optimized vs Highly Tuned

[Figure: % of peak vs. number of cores (4, 8, 16, 24); problem size = 15000x15000.]
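The process grids used for np = 4, 8, 16 and 24 MPI tasks, (2,2), (2,4), (4,4) and (4,6), are what one gets by choosing the most nearly square factorization P x Q of np with P <= Q. The slides do not say how the grids were chosen, so the following is only a minimal sketch of that reading; `process_grid` is a hypothetical helper name, not a ScaLAPACK or ACTS routine.

```python
import math


def process_grid(np_tasks: int) -> tuple[int, int]:
    """Return (P, Q) with P * Q == np_tasks, P <= Q, and P as large as possible.

    Assumption: a near-square grid is wanted; this rule is not stated on the slides.
    """
    if np_tasks < 1:
        raise ValueError("np_tasks must be positive")
    p = math.isqrt(np_tasks)
    # Walk down from floor(sqrt(np)) until we hit a divisor of np.
    while np_tasks % p != 0:
        p -= 1
    return p, np_tasks // p


if __name__ == "__main__":
    for n in (4, 8, 16, 24):
        print(n, process_grid(n))
```

A prime task count degenerates to a 1 x np grid under this rule, which is one reason the grid shape is worth treating as a tunable parameter rather than a fixed choice.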


Portable Performance Is No Longer Straightforward

Doubling Problem Size and Adding 1 Node

[Figure: results over np = 4 (2,2), np = 8 (2,4), np = 16 (4,4), np = 24 (4,6); one series per problem size: 22000, 26000, 30000.]

NP    Threads/MPI_TASK, 1 (24)    Threads/MPI_TASK, 2 (48)
4     6                           12
8     3                           6
16    1                           2
24    1                           2

[Figure: % of peak vs. total MPI tasks (4, 8, 16, 24).]


Portable Performance Is No Longer Straightforward

Thrice the Problem Size and 3 Nodes

[Figure: results over np = 4 (2,2), np = 8 (2,4), np = 16 (4,4), np = 24 (4,6); one series per problem size: 33000, 39000, 45000.]

NP    Threads/MPI_TASK, 1 (24)    Threads/MPI_TASK, 3 (72)
4     6                           18
8     3                           9
16    1                           4
24    1                           3

[Figure: % of peak vs. total MPI tasks (4, 8, 16, 24).]
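Reading the column headers as a total of 24 cores and 72 cores, the Threads/MPI_TASK values for those two configurations are consistent with integer division of the total core count by the number of MPI tasks. The sketch below encodes that reading only; it is not a formula given on the slides, and `threads_per_task` is a hypothetical helper name.

```python
def threads_per_task(total_cores: int, mpi_tasks: int) -> int:
    """Threads available to each MPI task when cores are shared evenly.

    Assumption: leftover cores (total_cores % mpi_tasks) stay idle.
    """
    if mpi_tasks < 1:
        raise ValueError("mpi_tasks must be positive")
    if total_cores < mpi_tasks:
        raise ValueError("fewer cores than MPI tasks")
    return total_cores // mpi_tasks


if __name__ == "__main__":
    for cores in (24, 72):
        row = [threads_per_task(cores, np_tasks) for np_tasks in (4, 8, 16, 24)]
        print(cores, row)
```

With 16 tasks on 24 cores, 8 cores are left unused under this rule, which illustrates why a configuration that works well on one node count may not carry over to another.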


Providing Sustainable and Scalable Performance for ACTS Tools in Multicore Systems

Q. Can performance scalability be passed from libraries/tools to applications and across platforms and configurations?

A. Yes, maybe, with a lot of automatic work
• Fully operational through various parameters and levels of automation
• Application developers are very reluctant to change their code, so preserve the current structure of the APIs as much as possible


ACTS Parametric Research and Collaborations

Use of ACTS parameters to ensure application scalability (pACTS)

Pre-Installation

Library Installation

PLATFORM SUPPORT TOOLS AND UTILITIES

HARDWARE

Run Time

Application compile + link

job submit options

APPLICATIONS

GENERAL PURPOSE TOOLS


ACTS Parametric Research and Integration

Without pACTS
• Auto-tuning produces a single tuned library (n = max-cores/node)
• Hand-tuning algorithmic parameters can be cumbersome
• Some applications won’t scale

With pACTS
• Auto-tuning produces multiple tuned libraries using steering parameters (#cores/node)
• Run-time selection of tuned executables (#cores/node)
• Auto-tune algorithmic parameters (smart-tuning)
• Sustainable and scalable performance for all applications
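Run-time selection among several libraries, each tuned for a different #cores/node, can be pictured as a lookup keyed by core count. This is a minimal sketch under stated assumptions: the catalog contents, the library names and the fallback policy (largest tuned core count that does not exceed the request, else the smallest available) are all hypothetical and are not taken from the slides.

```python
def select_tuned_library(catalog: dict[int, str], cores_per_node: int) -> str:
    """Pick the library tuned for the largest core count <= cores_per_node.

    Falls back to the smallest tuned configuration when nothing fits
    (an assumed policy, chosen here only for illustration).
    """
    if not catalog:
        raise ValueError("catalog is empty")
    fitting = [cores for cores in catalog if cores <= cores_per_node]
    chosen = max(fitting) if fitting else min(catalog)
    return catalog[chosen]


if __name__ == "__main__":
    # Hypothetical labels for libraries produced by an auto-tuner.
    catalog = {2: "libtuned_2cores", 6: "libtuned_6cores", 24: "libtuned_24cores"}
    for cores in (1, 4, 16, 24):
        print(cores, select_tuned_library(catalog, cores))
```

Without a catalog like this, the only option is the single library tuned for the maximum cores/node, which is the "without pACTS" case.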

PLATFORM SUPPORT TOOLS AND UTILITIES

GENERAL PURPOSE TOOLS

APPLICATIONS

HARDWARE


Multi-Level Tuning to Attain Scalable Performance

• Optimized dense BLAS kernels
• Algorithmic optimization
• Minimize computational costs (storage + ops)
• Sustain numerical stability and reliability
• Specialized problem solving techniques

• Software implementations:
    • Specialized data structures
    • Maximize load balancing
    • Minimize latencies, idle time, etc.

Auto-tuning

Auto-tuning

GENERAL PURPOSE TOOLS

APPLICATIONS

Smart-tuning


Tuning at Library Installation Level

APPLICATIONS

GENERAL PURPOSE TOOLS

PLATFORM SUPPORT TOOLS AND UTILITIES

HARDWARE

Software Resources:
• Compiler level optimizations
• Specialized communication libraries and other custom-made support libraries
• Auto-tuners

ACTS PARAMETERS

Performance Tuning Parameters (PT-pACTS) and Software Dependencies (SD-pACTS):
• arithmetic and arithmetic precision
• automatic threading
• compiler
• communication libraries and paradigms
• software requirements


Parametrize Optimized Installation of Libraries+Apps

APPLICATIONS

GENERAL PURPOSE TOOLS

PLATFORM SUPPORT TOOLS AND UTILITIES

HARDWARE

Software Resources:
• Auto-tuners
• Performance monitors
• Functional Performance Parameter Derivation (FP-pACTS)

ACTS PARAMETERS

PT-pACTS, SD-pACTS
• NUMA aware, thread, cache, TLB and local store blocking, padding, register and format selection

FP-pACTS
• Output of performance monitoring
• Optimized library and kernels labeling


Runtime Dynamic Selection of Kernels/Libraries

APPLICATIONS

GENERAL PURPOSE TOOLS

PLATFORM SUPPORT TOOLS AND UTILITIES

HARDWARE

• Smart-tuning tools
• Choice of ACTS tool(s) and functionality
• Choice of calling parameters (RT-pACTS)

Software Resources:
• Runtime scripts (e.g., job submission scripts)
• Runtime parameters impacting application performance (RT-pACTS)
• Functional Performance Parameter Derivation (FP-pACTS)

ACTS PARAMETERS

PT-pACTS, SD-pACTS, FP-pACTS and RT-pACTS
• algorithmic (functional calls)
• application numerical requirements
• problem size
• resource utilization
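The four parameter families named in the talk (PT-pACTS, SD-pACTS, FP-pACTS, RT-pACTS) can be thought of as one configuration record that is filled in progressively, from library installation through to job submission. The sketch below is only an illustration of that idea: every class and field name is hypothetical, and the fields are a small sample of the items listed on the slides, not a pACTS schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PTParams:
    """Performance tuning parameters, fixed at library installation (assumed fields)."""
    cores_per_node: int
    compiler: str
    precision: str = "double"


@dataclass(frozen=True)
class SDParams:
    """Software dependencies (assumed fields)."""
    blas: str
    communication: str = "MPI"


@dataclass(frozen=True)
class FPParams:
    """Functional performance parameters derived from performance monitoring."""
    block_size: int
    process_grid: tuple[int, int]


@dataclass(frozen=True)
class RTParams:
    """Run-time parameters, e.g. taken from the job submission script."""
    nodes: int
    cores_per_node: int
    problem_size: int


@dataclass(frozen=True)
class PactsConfig:
    pt: PTParams
    sd: SDParams
    fp: FPParams
    rt: RTParams

    def total_cores(self) -> int:
        return self.rt.nodes * self.rt.cores_per_node

    def grid_fits(self) -> bool:
        """True when the process grid does not need more tasks than cores."""
        p, q = self.fp.process_grid
        return p * q <= self.total_cores()


if __name__ == "__main__":
    cfg = PactsConfig(
        pt=PTParams(cores_per_node=24, compiler="cc"),
        sd=SDParams(blas="ACML"),
        fp=FPParams(block_size=8, process_grid=(4, 6)),
        rt=RTParams(nodes=3, cores_per_node=24, problem_size=45000),
    )
    print(cfg.total_cores(), cfg.grid_fits())
```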


Simple Example of ScaLAPACK LU

Library Installation Time

SD-pACTS
• PDGETRF and PDGETRS implementations
• BLAS DGEMM, DTRSM implementations

PT-pACTS
• Number of cores/node

ScaLAPACK software hierarchy:
• Global: ScaLAPACK, PBLAS
• Local: LAPACK, BLACS, BLAS, MPI/PVM/...
• Platform specific: BLAS, MPI/PVM/...

OUTPUT: Tuned kernels
• PBLAS_DEFAULT, PBLAS_A01V1, PBLAS_A01V2, PBLAS_A01V3, ..., PBLAS_A01Vn
• BLACS_DEFAULT, BLACS_A01V1, BLACS_A01V2, BLACS_A01V3, ..., BLACS_A01Vn
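For readers unfamiliar with the two routines named here: PDGETRF computes an LU factorization with partial pivoting of a distributed matrix, and PDGETRS uses those factors to solve a linear system. The following is a serial, pure-Python sketch of the same two steps, intended only to show what is being tuned; it is not ScaLAPACK code, it does no blocking or distribution, and `lu_factor` and `lu_solve` are illustrative names.

```python
def lu_factor(a):
    """LU factorization with partial pivoting on a copy of `a`.

    Returns (lu, piv): `lu` holds unit-lower L below the diagonal and U on and
    above it; piv[i] is the original row index now sitting in row i.
    """
    n = len(a)
    lu = [row[:] for row in a]
    piv = list(range(n))
    for k in range(n):
        # Choose the row with the largest magnitude entry in column k.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            piv[k], piv[p] = piv[p], piv[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, piv


def lu_solve(lu, piv, b):
    """Solve A x = b given the factors from lu_factor."""
    n = len(lu)
    # Permute the right-hand side, then forward-substitute with unit-lower L.
    x = [b[piv[i]] for i in range(n)]
    for i in range(n):
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    # Back-substitute with U.
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


if __name__ == "__main__":
    a = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
    b = [7.0, 19.0, 49.0]
    lu, piv = lu_factor(a)
    print(lu_solve(lu, piv, b))
```

In the distributed versions, the inner updates are carried out by BLAS calls such as DGEMM and DTRSM, which is why those kernels appear among the tunable dependencies.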


Simple Example of ScaLAPACK LU

Application Code Link Time

SD-pACTS
• PDGETRF and PDGETRS implementations
• BLAS DGEMM, DTRSM implementations

FP-pACTS
• MPI_TASKS/node
• Blocking factor

APPLICATIONS

GENERAL PURPOSE TOOLS

PLATFORM SUPPORT TOOLS AND UTILITIES

HARDWARE

OUTPUT: Tuned kernels
• PBLAS_DEFAULT, PBLAS_A01V1, PBLAS_A01V2, PBLAS_A01V3, ..., PBLAS_A01Vn
• BLACS_DEFAULT, BLACS_A01V1, BLACS_A01V2, BLACS_A01V3, ..., BLACS_A01Vn

kernelSelector
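The slides name a kernelSelector that chooses among the labeled tuned kernels but do not show how it works. One plausible reading, sketched below, is a lookup over recorded performance measurements: given the MPI_TASKS/node and blocking factor of the run, return the label of the variant that performed best, or the default build when nothing was measured. The measurement records, their numbers and the function signature are all assumptions made for illustration.

```python
from typing import NamedTuple


class Measurement(NamedTuple):
    kernel: str          # label of a tuned variant, e.g. "PBLAS_A01V2"
    tasks_per_node: int  # MPI_TASKS/node used when measuring
    block_size: int      # blocking factor used when measuring
    gflops: float        # recorded performance (made-up values below)


def kernel_selector(measurements, tasks_per_node, block_size,
                    default="PBLAS_DEFAULT"):
    """Return the best-performing kernel label for the given configuration."""
    matches = [
        m for m in measurements
        if m.tasks_per_node == tasks_per_node and m.block_size == block_size
    ]
    if not matches:
        # No data for this configuration: fall back to the default build.
        return default
    return max(matches, key=lambda m: m.gflops).kernel


if __name__ == "__main__":
    table = [
        Measurement("PBLAS_A01V1", 4, 8, 21.0),
        Measurement("PBLAS_A01V2", 4, 8, 25.5),
        Measurement("PBLAS_A01V3", 16, 8, 40.2),
    ]
    print(kernel_selector(table, 4, 8))
    print(kernel_selector(table, 24, 8))
```

Keeping a default variant in the catalog means the application still links and runs on configurations the auto-tuner never saw.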


Example of ScaLAPACK LU

SD-pACTS
• PDGETRF and PDGETRS implementations
• BLAS implementation (ACML)

FP-pACTS
• Matrix 2D blocking
• Process grid

RT-pACTS
• Number of cores and number of nodes
• Problem size
• Matrix blocking
• Process grid

16x16

[Figure: Best in Node Performance, % of peak vs. total MPI tasks, using RT-pACTS vs. without RT-pACTS.]

[Figure: % of peak vs. total MPI tasks (4, 8, 16, 24).]
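The plots report LU performance as a percentage of peak. A common way to obtain such a number, sketched below, is to divide the nominal LU operation count, about (2/3) n^3 floating-point operations for an n x n matrix, by the measured time and by the aggregate peak rate. The slides do not state how their percentages were computed, and the peak rate per core used in the example is a placeholder, not a figure for the machine in the talk.

```python
def lu_percent_of_peak(n: int, seconds: float, cores: int,
                       peak_gflops_per_core: float) -> float:
    """Percentage of peak achieved by an n x n LU factorization.

    Uses the leading-order operation count (2/3) * n**3; lower-order terms
    are ignored.
    """
    if seconds <= 0 or cores < 1 or peak_gflops_per_core <= 0:
        raise ValueError("seconds, cores and peak rate must be positive")
    flops = (2.0 / 3.0) * n ** 3
    achieved_gflops = flops / seconds / 1e9
    return 100.0 * achieved_gflops / (cores * peak_gflops_per_core)


if __name__ == "__main__":
    # Placeholder timing and peak rate, chosen only to make the arithmetic easy.
    print(lu_percent_of_peak(n=3000, seconds=1.0, cores=4,
                             peak_gflops_per_core=9.0))
```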


Example of ScaLAPACK LU

SD-pACTS
• PDGETRF and PDGETRS implementations
• BLAS implementation

FP-pACTS
• Matrix 2D blocking
• Process grid

RT-pACTS
• Number of cores and number of nodes
• Problem size
• Matrix 2D blocking
• Process grid

Selected values: optimized kernels 2-cores; 4 MPI_TASKS/node; 45000; NB=8

[Figure: Best in Node Performance, % of peak vs. total MPI tasks, using RT-pACTS vs. without RT-pACTS.]

[Figure: % of peak vs. total MPI tasks (4, 8, 16, 24).]
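Matrix 2D blocking and the process grid together determine how much of the matrix each MPI task owns under ScaLAPACK's block-cyclic layout. The sketch below transcribes the logic of ScaLAPACK's NUMROC tool routine into Python for one dimension, and adds a small owner lookup. The values n = 45000 and NB = 8 come from the slide; the choice of 2 process rows (4 tasks arranged as 2 x 2) is an assumption made for the example.

```python
def numroc(n: int, nb: int, iproc: int, isrcproc: int, nprocs: int) -> int:
    """Rows (or columns) of an n-long dimension owned by process `iproc`.

    n: global size, nb: block size, isrcproc: process holding the first block,
    nprocs: number of processes along this grid dimension.
    """
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb                      # number of whole blocks
    count = (nblocks // nprocs) * nb       # blocks every process receives
    extra = nblocks % nprocs               # processes that get one more block
    if mydist < extra:
        count += nb
    elif mydist == extra:
        count += n % nb                    # trailing partial block, if any
    return count


def owner(global_index: int, nb: int, nprocs: int, isrcproc: int = 0) -> int:
    """Process coordinate owning a 0-based global row or column index."""
    return (global_index // nb + isrcproc) % nprocs


if __name__ == "__main__":
    local = [numroc(45000, 8, p, 0, 2) for p in range(2)]
    print(local, sum(local))
```

Small blocking factors such as NB = 8 spread the work very evenly but produce many small BLAS calls, which is the trade-off that makes the blocking factor a tuning parameter.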


Example of ScaLAPACK LU

SD-pACTS
• PDGETRF and PDGETRS implementations
• BLAS implementation

FP-pACTS
• Matrix 2D blocking
• Process grid

RT-pACTS
• Number of cores and number of nodes
• Problem size
• Matrix 2D blocking
• Process grid

Selected values: optimized kernels 16-cores; 16 MPI_TASKS/node; 45000; NB=8

[Figure: Best in Node Performance, % of peak vs. total MPI tasks, using RT-pACTS vs. without RT-pACTS.]

[Figure: % of peak vs. total MPI tasks (4, 8, 16, 24).]


Concluding Remarks

• HPC centers vs. installation on your laptop
• Ongoing work in parametric research
    • TORCH kernels
    • Parameter derivation and selection

S. Petiton and C. Calvin

• Current tests used an older version of OSKI, hand-tuned kernels and ACML (BLAS)
• Enlarge the set of auto-tuners
• Incorporate new ACTS tool developments