
Page 1: Automated Empirical Optimization of High Performance Floating Point Kernels (icl.utk.edu/news_pub/submissions/clint_whaley_intv.pdf)

Automated Empirical Optimization of High Performance Floating Point Kernels

R. Clint Whaley, [email protected]

www.cs.utsa.edu/~whaley

University of Texas at San Antonio, Department of Computer Science

February 19, 2013

R. Clint Whaley (UTSA-CS) empirical optimization February 19, 2013 1 / 28


Outline of Talk

1 Personal Background
2 Research philosophy & contributing with open source software
3 Introduction to my research
4 Motivation
5 Computational Optimization Approach
6 Intro to ATLAS
7 Intro to iFKO
8 Outline of research areas
9 Examples of ATLAS-style research
10 Examples of iFKO-style research
11 Current work
12 Future Directions
13 Funding History
14 Further info / Q&A


Personal Background

90-91: undergraduate researcher at ORNL, distributed memory parallelization of nuclear collision models

91-94: GRA, UTK, distributed memory parallelization with ScaLAPACK and BLACS

✓ May 94: M.S. in distributed memory parallelism (advisor: Jack Dongarra)

94-99: Research Associate, UTK: ScaLAPACK, HPL, HPF, clusters → ATLAS

99-01: Senior Research Associate, UTK: ATLAS

→ Weakness: backend compilation

02-04: GRA, FSU: empirical backend compilation (iFKO)

✓ Dec 04: Ph.D. in empirical compilation (advisor: David Whalley)

✓ 05-12: Assistant Professor, UTSA

✓ 12-present: Associate Professor with tenure, UTSA


Research Philosophy
Contributing to Science and Engineering through open source software

I believe computer science should be approached as a science:

Reproducible experiments are a hallmark of science

Papers without associated software prevent true reproduction/verification

⇒ If made usable and stable, software frameworks developed by CS researchers can make strong contributions to most fields in mathematics, science and engineering

My research goal is to maximize real-world impact:

Always start research from a real-world problem

For every successful advance, embody it in open frameworks for all to use

In this way, I have been able to have a large impact both inside and outside my field


An Introduction to my Research
My research is oriented toward increasing the performance of floating point computations

As a student, my research was embodied in widely-used distributed memory parallel linear algebra libraries:

One of the core developers of ScaLAPACK (Scalable Linear Algebra PACKage), still in use today

Main developer of its communication layer, the BLACS

Contributor to the HPL benchmark (Petitet)

As an independent researcher, my main focus is on automated empirical optimization of computational kernels:

ATLAS (Automatically Tuned Linear Algebra Software), used by hundreds of thousands in almost every field world-wide

iFKO (iterative Floating Point Kernel Optimizer): a way to extend my research's impact beyond linear algebra to arbitrary HPC kernels


Motivation
Why focus on computational optimization?

The ultimate goal of this research is to make extensive hand-tuning unnecessary for HPC kernel production:

For many operations, there is no such thing as "enough" compute power

Therefore, we need to extract near-peak performance even as hardware advances according to Moore's Law

Achieving near-optimal performance is tedious, time consuming, and requires expertise in many fields

Such optimization is neither portable nor persistent

⇒ The state-of-the-art approach uses automated empirical optimization


Computational Optimization Approach
Automated Empirical Optimization of Software (AEOS)

Key Idea
Probe the machine empirically, keeping transforms that result in measurable improvements; automate the process so code is auto-tuned without advanced knowledge of the architecture.

Goal
An optimized, portable kernel (with associated library) available in hours rather than months or years (or never).

Package approaches
ATLAS: use search-driven parameterization, multiple implementations, & source generation to tune given kernels to arbitrary architectures

iFKO: use search-driven empirical compilation to tune arbitrary kernels to a handful of select architectures


ATLAS
Automatically Tuned Linear Algebra Software

What is ATLAS?
One of the pioneering packages in empirical tuning for HPC

Provides high performance dense linear algebra routines:

BLAS (matmul, trislv), some LAPACK (linslv, eigen, SVD, etc.)

Automatically adapts itself to differing architectures using empirical techniques

Used in: scientific simulation (physics, chemistry, biology, astronomy, engineering, math), medical imaging, machine learning; almost all supercomputers; built in to many OSes (OS X, most Linux), used by many PSEs (Maple, MAGMA, Mathematica, Matlab, Octave), large numbers of packages (GSL, HPL, SciPy, R, some compilers, etc.), and industry (IBM, Google, mobile computing, oil & gas exploration, financial modeling and analysis, etc.)


iterative Floating Point Kernel Optimizer
iFKO is an iterative and empirical compiler framework

iFKO is composed of:
1 A collection of search drivers,
2 a compiler specialized for empirical floating point kernel optimization (FKO)

Specialized in analysis, HIL, and the type & flexibility of transforms

Key design decisions:

Iterative & empirical for auto-adaptation

Transforms done at the backend, allowing architectural exploitation (e.g. SIMD vectorization, CISC instruction formats)

Kernel annotation markup

Narrow & deep focus

[Diagram: the input routine (HIL + flags) feeds the search drivers; the search drivers pass problem parameters and HIL to the specialized compiler (FKO), which returns analysis results and emits optimized assembly to the timers/testers; the performance/test results feed back into the search drivers. Together these components form iFKO.]


Outline of Research
Types of research arising from ATLAS and iFKO

ATLAS:

systems

numerical analysis/scientific computing

architectural exploitation for algorithmic development

shared-memory parallelism & scalability

iFKO:

autotuning known kernels

compiler transformation research

new algorithms based on arbitrary kernels


Systems research from ATLAS

Anthony M. Castaldo and R. Clint Whaley, "Minimizing Startup Costs for Performance-Critical Threading", in Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS2009), pages 1-8, Rome, Italy, May 25-29, 2009.

Why has our parallel scalability dropped through the floor?

Found OS scheduling destroyed scaling due to thread spawn, even on an O(N³) algorithm!

Presented a thread spawning algorithm that fixes the problem using thread affinity

Performance of GEMM now scales with p; as much as 40% improvement in applications like factorizations (Linux/Windows)

→ This became ATLAS's standard thread spawn strategy


Scientific Computing research from ATLAS
Error control in numerical computation

Anthony M. Castaldo, R. Clint Whaley and Anthony T. Chronopoulos, "Reducing Floating Point Error in Dot Product using the Superblock Family of Algorithms", SIAM Journal on Scientific Computing, Volume 31, Number 2, pp 1156-1174, 2008.

Why can't we invert the triangular matrix as part of TRSM?
⇒ Pollutes the backward error with the condition number (textbook, G. W. Stewart)

But this got us interested in error control

The dot product is the basis of GEMM, and thus controls the error of most linear algebra

Surveyed known error bounds for the dot product; showed the algorithms' actual errors follow the sorting of the bounds

Presented a new class of algorithms, called superblock, subsuming the known ones

Superblock allows an explicit trade-off between storage/performance and error control
→ Changed ATLAS's GEMM for improved error control
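The error-control idea can be illustrated with simple blocked summation. This is a generic sketch inspired by, but much simpler than, the paper's superblock family: giving each block its own accumulator keeps the magnitude of each partial sum small, so the worst-case rounding error grows roughly with the block size plus the number of blocks rather than with n.

```c
#include <stdio.h>

/* Plain running-sum dot product: worst-case error grows with n. */
double dot_flat(const double *x, const double *y, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i] * y[i];
    return s;
}

/* Blocked dot product: each block of nb elements is summed into its
 * own accumulator, and only the block sums are combined.  Worst-case
 * error now grows roughly with nb + n/nb instead of n. */
double dot_blocked(const double *x, const double *y, int n, int nb) {
    double s = 0.0;
    for (int i0 = 0; i0 < n; i0 += nb) {
        double bs = 0.0;                        /* per-block accumulator */
        int end = (i0 + nb < n) ? i0 + nb : n;  /* last block may be short */
        for (int i = i0; i < end; i++) bs += x[i] * y[i];
        s += bs;
    }
    return s;
}
```

The superblock family generalizes this by nesting block levels, which is what lets the paper trade storage/performance against error control explicitly; the flat and blocked forms above are just the two endpoints of that spectrum.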


Parallelism & Scalability in ATLAS
Architectural exploitation for algorithmic development

Anthony M. Castaldo, Siju Samuel and R. Clint Whaley, "Scaling LAPACK Panel Operations Using Parallel Cache Assignment", accepted for publication in ACM Transactions on Mathematical Software (TOMS).

How can we parallelize panel factorization?

Used the cache coherence mechanism to make parallel communication run at hardware speed

Moved parallel overheads into the inner loop to enable cache reuse

Got superlinear speedups on algorithms where speedup is usually ≈2x

Presented a mathematical trick leading to 100x speedups

Documented an allocation algorithm to avoid drastic performance loss on AMD's HT-assist machines

→ This became ATLAS's standard panel factorization method


Tuning the L1BLAS using iFKO
Innermost-loop backend optimization for the x86 family

R. Clint Whaley and David B. Whalley, "Tuning High Performance Kernels through Empirical Compilation", in The 2005 International Conference on Parallel Processing, June 2005.

Can an empirical compiler beat hand-tuning?

7 fundamental (tunable) optimizations (loop unrolling, accumulator expansion, prefetch, non-temporal writes, etc.)

9 repeatable optimizations (register allocation, copy propagation, etc.)

Applied to simple single-loop kernels

Beat hand-tuned (ATLAS & MKL) on all but a few kernels

→ Lost when we could not autovectorize past branches
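Accumulator expansion, listed among the fundamental optimizations above, can be shown on a toy sum kernel. This is a hand-written sketch of the transform that iFKO applies automatically; the unroll factor of 4 is an arbitrary choice here (in iFKO it is a search parameter).

```c
/* Before the transform: one accumulator, so every add must wait for
 * the previous one (a loop-carried dependence through s). */
double sum_1acc(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i];
    return s;
}

/* After unrolling by 4 and expanding to 4 accumulators: the four adds
 * in each iteration are independent, so they can overlap in the FPU
 * pipeline.  (Assumes n % 4 == 0 for brevity; a cleanup loop would
 * handle the remainder.) */
double sum_4acc(const double *x, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Note that reassociating the sum changes the rounding slightly, which is one reason such transforms are applied under empirical (and per-kernel) control rather than by default.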


Ongoing Work
Some of our present ATLAS and iFKO investigations

ATLAS: How do we scale factorizations to O(100) shared memory cores?

Rakib Hasan (PhD candidate): parallelism and scaling

MKL/ACML factorizations scale better than LAPACK/ATLAS

There are many parallelization methods known for factorizations

Which are the important ones in achieving these types of scaling?

→ have a prototype tying vendor libraries for large problems, with wins for small-to-medium

iFKO: How can we vectorize past branches?

M. Sujon (PhD candidate) & Qing Yi: mainstream compiler research

Use a novel technique when the branch is strongly directional

Use known techniques (redundant computation, etc.) when it is not
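The "redundant computation" family of techniques can be sketched as follows (a generic illustration, not iFKO's actual transform): compute both arms of the branch unconditionally and select the result arithmetically, giving a straight-line loop body that maps directly onto SIMD compare/select (blend) instructions.

```c
/* Branchy loop: the conditional control flow blocks vectorization. */
void clip_branchy(double *y, const double *x, int n) {
    for (int i = 0; i < n; i++) {
        if (x[i] > 0.0)
            y[i] = x[i];
        else
            y[i] = 0.0;
    }
}

/* Redundant-computation form: both outcomes exist, and one is chosen
 * arithmetically.  The comparison yields 1.0 or 0.0, so the multiply
 * acts as a select; a vectorizer can turn this into cmp + blend. */
void clip_select(double *y, const double *x, int n) {
    for (int i = 0; i < n; i++) {
        double keep = (double)(x[i] > 0.0);  /* 1.0 if taken, else 0.0 */
        y[i] = keep * x[i];                  /* select via multiply */
    }
}
```

When the branch is strongly directional, always executing both arms wastes work on the common path, which is the case the group's novel technique targets instead.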

ATLAS & iFKO: novel algorithms using arbitrary kernels

Majedul Sujon & Rakib Hasan

Increased performance (panel/Hessenberg) via BLAS blending


Future Directions

short term

ATLAS GEMM & threading redesign

Extensions of PCA & related kernels to enable scaling even on small- and medium-sized problems as well as large

Expansion of loop handling; use iFKO to tune more complex kernels

longer term

Increased handling of scale, including distributed memory & accelerators (grant under review with PLASMA group to explore)

Adapting to ongoing architectural trends (power/ARM, etc.)

Applying iFKO to new areas and applications; new algorithms using new kernels

Release iFKO for general use

Testing the limits of empirical compilation (how many search types are required? How many backends are needed? Can we improve timer generation? etc.)


Funding while at UTSA
Grant summary and active funding

Between 2005-2012, brought in more than $3.3 million:

Almost $2.5 million as a sole investigator

Obtained funding from: NSF, DoD, DoE and industry

My software is used by all these groups, which improves funding chances

Active, sole-PI grants:

NSF CAREER award, OCI-1149303, "Empirical Tuning for Extreme Scale", $538,145, 03/01/12 - 02/28/17.

DoD contract, "ATLAS Research and Development", $984,720, 03/01/11 - 12/31/2015.

SiCortex Research Gift, $10,000 (donated 2007)


Funding while at UTSA
Historical sole-PI funding at UTSA

sole-PI grants:

R. Clint Whaley, NSF REU supplement for CNS-0551504, 04/22/08 - 02/28/09, $12,000.

R. Clint Whaley, "ATLAS Research and Development", DoD H98230-06-C-0443, August 1, 2006 - September 30, 2011, $790,446.

R. Clint Whaley, "CRI, Community Resource Development: ATLAS Support and Development", NSF CRI award CNS-0551504, 03/01/06 - 02/28/09, $100,000.


Funding while at UTSA
Expired or soon-to-be-expired co-PI funding

co-PI grants:

Qing Yi, Daniel J. Quinlan and R. Clint Whaley, "A Multi-Language Environment For Programmable Code Optimization and Empirical Tuning", DOE Office of Science award DE-SC0001770, 09/15/09 - 09/14/12, $360,000.

Qing Yi, Daniel J. Quinlan and R. Clint Whaley, "Programmable Code Optimization and Empirical Tuning For High-end Computing", NSF award CCF-0833203, 09/01/08 - 09/31/10, $462,000.


Further Information and Q&A

Homepage : www.cs.utsa.edu/~whaley

Publications : www.cs.utsa.edu/~whaley/papers.html

ATLAS : math-atlas.sourceforge.net

BLAS : www.netlib.org/blas

LAPACK : www.netlib.org/lapack

BLACS : www.netlib.org/blacs

ScaLAPACK : www.netlib.org/scalapack/scalapack_home.html


Courses Taught

Graduate Courses
1 CS 5513 Computer Architecture
2 CS 6463 Fundamentals of High Performance Optimization
3 CS 6643 Parallel Processing

Undergraduate Courses
1 CS 1073 Intro to Comp Prog for Scientific Apps
2 CS 1713 Intro to Comp Prog II
3 CS 3853 Computer Architecture
4 CS 4823 Parallel Programming


Graduate Students Graduated

Completed Theses and Dissertations Supervised

1 Siju Samuel, "Maintaining High Performance in the QR Factorization While Scaling Both Problem Size and Parallelism", Master's Thesis; defended May, 2011.

2 Anthony M. Castaldo, "Parallelism and Error Reduction in a High Performance Environment", Doctoral Dissertation; defended Nov, 2010.

3 Anthony M. Castaldo, "Error Analysis of Various Forms of Floating Point Dot Products", Master's Thesis; defended Aug, 2007.

Completed Master's Projects Supervised

1 Chad Zalkin, "SSE Code Generation for General Matrix Matrix Multiply", Master's project; successfully defended April, 2011.

Master's Co-advised with Qing Yi

Mike Stiles, Jichi Guo, Faizur Rahman, Ziaul Haque, Akshatha Bhat


Refereed Journal Articles

1 Anthony M. Castaldo, Siju Samuel and R. Clint Whaley, "Scaling LAPACK Panel Operations Using Parallel Cache Assignment", accepted for publication in ACM Transactions on Mathematical Software (TOMS).

2 Anthony M. Castaldo, R. Clint Whaley and Anthony T. Chronopoulos, "Reducing Floating Point Error in Dot Product using the Superblock Family of Algorithms", SIAM Journal on Scientific Computing, Volume 31, Number 2, pp 1156-1174, 2008. (GSC: 4)

3 R. Clint Whaley and Anthony M. Castaldo, "Achieving accurate and context-sensitive timing for code optimization", Software: Practice & Experience, Volume 38, Number 15, pp 1621-1642, April 2008. (GSC: 26)

4 R. Clint Whaley and Antoine Petitet, "Minimizing Development and Maintenance Costs in Supporting Persistently Optimized BLAS", Software: Practice & Experience, Volume 35, Number 2, pp 101-121, February, 2005. (GSC: 162)

5 Jim Demmel, Jack Dongarra, Victor Eijkhout, Erika Fuentes, Antoine Petitet, Rich Vuduc and R. Clint Whaley, "Self Adapting Linear Algebra Algorithms and Software", Proceedings of the IEEE, Volume 93, Number 2, pp 293-312, February, 2005. (GSC: 140)

6 L. Susan Blackford, James Demmel, Jack Dongarra, Iain Duff, Sven Hammarling, Greg Henry, Michael Heroux, Linda Kaufman, Andrew Lumsdaine, Antoine Petitet, Roldan Pozo, Karin Remington and R. Clint Whaley, "An Updated Set of Basic Linear Algebra Subprograms (BLAS)", ACM Transactions on Mathematical Software, 28(2):135-151, June 2002. (GSC: 256)

7 R. Clint Whaley, Antoine Petitet and Jack J. Dongarra, "Automated Empirical Optimization of Software and the ATLAS Project", Parallel Computing, Volume 27, Numbers 1-2, pp 3-25, 2001, ISSN 0167-8191. Also available as University of Tennessee LAPACK Working Note #147, UT-CS-00-448, 2000. (GSC: 989)

8 L.S. Blackford, A. Cleary, J. Demmel, J. Dongarra, I. Dhillon, S. Hammarling, A. Petitet, H. Ren, K. Stanley, and R. C. Whaley, "Practical Experience in the Numerical Dangers of Heterogeneous Computing", ACM Transactions on Mathematical Software, Volume 23, Number 2, pp 133-147, June 1997. (GSC: 24)

9 J. Choi, J. Demmel, J. Dongarra, I. Dhillon, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, "ScaLAPACK: a Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance", Computer Physics Communications, 97 (1996) 1-15. (GSC: 352)

10 J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker, and R. C. Whaley, "The Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorizations", Scientific Programming, Volume 5, pp 173-184, 1996. (GSC: 215)


Refereed Conference Proceedings

1 Anthony M. Castaldo and R. Clint Whaley, "Achieving Scalable Parallelization For The Hessenberg Factorization", in IEEE Cluster 2011, pages 65-73, Austin, TX, September 26-30, 2011. (27.86% acceptance rate)

2 Anthony M. Castaldo and R. Clint Whaley, "Scaling LAPACK Panel Operations Using Parallel Cache Assignment", in Proceedings of the 2010 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 223-231, Bangalore, India, January 9-14, 2010. (16.76% acceptance rate, GSC: 16)

3 Anthony M. Castaldo and R. Clint Whaley, "Minimizing Startup Costs for Performance-Critical Threading", in Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS2009), pages 1-8, Rome, Italy, May 25-29, 2009. (22.73% acceptance rate, GSC: 6)

4 R. Clint Whaley, "Empirically Tuning LAPACK's Blocking Factor for Increased Performance", Proceedings of the International Multiconference on Computer Science and Information Technology (Computer Aspects of Numerical Algorithms), pages 303-310, Wisla, Poland, October 20-22, 2008. (this was a refereed workshop, GSC: 13)

5 Qing Yi and R. Clint Whaley, "Automated Transformation for Performance-Critical Kernels", in ACM SIGPLAN Symposium on Library-Centric Software Design, Montreal, Canada, October, 2007. (GSC: 13)

6 R. Clint Whaley and David B. Whalley, "Tuning High Performance Kernels through Empirical Compilation", in The 2005 International Conference on Parallel Processing, June 2005. (28.63% acceptance rate, GSC: 35)

7 R. Clint Whaley and Jack Dongarra, "Automatically Tuned Linear Algebra Software", Ninth SIAM Conference on Parallel Processing for Scientific Computing, March 22-24, 1999, CD-ROM Proceedings.

8 R. Clint Whaley and Jack Dongarra, "Automatically Tuned Linear Algebra Software", SuperComputing 1998 Conference, Orlando, FL, November 1998. Best paper in the systems category. (GSC: 712)

9 L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, "ScaLAPACK: A Linear Algebra Library for Message-Passing Computers", SIAM Conference on Parallel Processing for Scientific Computing, March 1997. (GSC: 67)

10 J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker and R. C. Whaley, "A Proposal for a Set of Parallel Basic Linear Algebra Subprograms", Second International Workshop on Applied Parallel Computing (PARA'95), Lyngby, Denmark, August 21-24, 1996; Proceedings in Lecture Notes in Computer Science, Volume 1041, pp 107-114, Springer-Verlag, Berlin - Heidelberg - New York, 1996. (GSC: 190)

11 Jack Dongarra, Robert van de Geijn, and R. Clint Whaley, "Two Dimensional Basic Linear Algebra Communication Subprograms", Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, SIAM Publications, pp 347-352, Norfolk, Virginia, March 1993. (GSC: 104)


Selected Other Publications

Invited Papers
1 "ATLAS Version 3.8: Overview and Status", invited paper at the international Workshop on Automatic Performance Tuning (iWAPT07), pages 1-11, Tokyo & Kyoto, Japan, September 16-22, 2007. URL: http://iwapt.org/2007/

Refereed Workshop Papers
1 Josh Magee, Qing Yi, and R. Clint Whaley, "Automated Timer Generation for Empirical Tuning", Fourth Workshop on Statistical and Machine learning approaches to ARchitecture and compilaTion (SMART'10), January 24, 2010, Pisa, Italy. URL: http://ctuning.org/workshop-smart10

Books & Book Chapters
1 L.S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, ScaLAPACK Users' Guide, SIAM Publications, Philadelphia, 1997. ISBN: 0-89871-400-1. (GSC: 1137)

2 R. Clint Whaley, "Automatically Tuned Linear Algebra Software (ATLAS)", in Encyclopedia of Parallel Computing, David Padua ed., available August 16, 2011 from Springer Reference books, ISBN: 978-0-387-09765-7.

3 R. Clint Whaley, "ATLAS version 3.9: Overview and Status", in Software Automatic Tuning: From Concepts to State-of-the-Art Results, K. Naono, K. Teranishi, J. Cavazos, and R. Suda, ed., Springer New York Dordrecht Heidelberg London, 2010. ISBN: 978-1-4419-6934-7.

4 A. Petitet, H. Casanova, J. Dongarra, Y. Robert and R. C. Whaley, "Parallel and Distributed Scientific Computing", in J. Blazewicz, K. Ecker, B. Plateau, D. Trystram, ed., Handbook on Parallel and Distributed Processing, Springer-Verlag Berlin Heidelberg, 2000. ISBN: 3-540-6641-6.


I(b). Traditional Approach: Library Usage

Approach:
1 Isolate time-critical sections of code, define & agree on an API (e.g., BLAS)
2 Get experts in the required fields (type of computation, hardware platform and programming environment) to optimize the kernels

Problems:

Demand for experts far outstrips supply

Even with experts, by the time a library is fully optimized, the target architecture is well on its way toward obsolescence

If an API is not perceived as general enough, it will not be supported


I(c). Shortcomings of Compilation
Traditional compilers are not designed for this level of optimization:

Optimizing general code to this degree would be counterproductive:

Compiler development & compile times would explode

It would make no difference for 90% of code

Many ISAs are far from the hardware (OOE, register renaming, dynamic scheduling, etc.) ⇒ the effects of compiler transformations are unpredictable

Resource allocation is intractable when approached a priori

Modeling the machine & operation at this level is intractable:

Interaction between all PEs and shared resources (all levels of cache & the pipelines of all relevant PEs, along with their usage restrictions)

Differing kernels require specialized models:

How many/what type of structures are accessed

Usage pattern (use/set) of all operands

Context of operation (e.g., in or out of cache)


Panel Factorization and Parallel Cache Assignment (PCA)

Panel factorization:

N × Nb data elements

O(N · Nb²) computations

Ideas behind PCA:

1 Use cache coherence for hardware-speed syncs

2 Move parallel overheads into the inner loop to exploit reuse

3 Copy data to local caches using a standard distribution (e.g., block)

→ fit the entire problem in the collective cache
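The block distribution in step 3 can be sketched as follows. This is a generic illustration with a hypothetical helper name, not ATLAS's PCA code: each of the workers copies its contiguous block of rows of the N × Nb panel into a private buffer, so all subsequent panel updates by that worker hit its own local cache.

```c
#include <string.h>

/* Block-distribute an N x Nb column panel (column-major, leading
 * dimension lda) over nthr workers: worker `rank` copies rows
 * [r0, r0 + nrows) of every column into its private buffer `local`,
 * packed contiguously.  Hypothetical helper for illustration only. */
void pca_copy_block(double *local, const double *panel,
                    int N, int Nb, int lda, int rank, int nthr) {
    int nrows = (N + nthr - 1) / nthr;       /* rows per worker (block) */
    int r0 = rank * nrows;                   /* this worker's first row */
    if (r0 >= N) return;                     /* nothing assigned */
    if (r0 + nrows > N) nrows = N - r0;      /* last worker may get fewer */
    for (int j = 0; j < Nb; j++)             /* copy column by column */
        memcpy(local + (size_t)j * nrows,
               panel + (size_t)j * lda + r0,
               (size_t)nrows * sizeof(double));
}
```

Because each worker's block is sized to fit in its own cache, the collective caches hold the whole panel, and the per-iteration communication reduces to cache-coherence traffic on the shared pivot data.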

[Diagram: first step of blocked LU — the panel's diagonal block is factored into L1,1 and U1,1, with the remaining column panel LT below, the row panel UT to the right, and the trailing matrix AT.]
