52
ALTANAL Abstraction Layer for Task bAsed NumericAl Libraries Rabab Alomairy Department of Computer, Electrical and Mathematical Sciences & Engineering King Abdullah University of Science and Technology [email protected] IXPUG Middle East, 2018 Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 1 / 52

ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

ALTANALAbstraction Layer for Task bAsed NumericAl Libraries

Rabab Alomairy

Department of Computer, Electrical and Mathematical Sciences & EngineeringKing Abdullah University of Science and Technology

[email protected]

IXPUG Middle East, 2018

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 1 / 52

Page 2: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Outline

1 MotivationComputing Hardware and Software EvolutionTask-based Runtime SystemsOverview of Task-based ECRC ProjectsALTANAL

2 Literature ReviewOverview of Existing Asynchronous Task-based Runtime SystemsDARMA

3 ALTANALALTANAL LayerALTANAL Interface

4 Test Cases and ExperimentsDense Cholesky FactorizationTile Low Rank Cholesky Factorization

5 Conclusion and Future Work

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 2 / 52

Page 3: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Table of Contents

1 MotivationComputing Hardware and Software EvolutionTask-based Runtime SystemsOverview of Task-based ECRC ProjectsALTANAL

2 Literature ReviewOverview of Existing Asynchronous Task-based Runtime SystemsDARMA

3 ALTANALALTANAL LayerALTANAL Interface

4 Test Cases and ExperimentsDense Cholesky FactorizationTile Low Rank Cholesky Factorization

5 Conclusion and Future Work

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 3 / 52

Page 4: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Table of Contents

1 MotivationComputing Hardware and Software EvolutionTask-based Runtime SystemsOverview of Task-based ECRC ProjectsALTANAL

2 Literature ReviewOverview of Existing Asynchronous Task-based Runtime SystemsDARMA

3 ALTANALALTANAL LayerALTANAL Interface

4 Test Cases and ExperimentsDense Cholesky FactorizationTile Low Rank Cholesky Factorization

5 Conclusion and Future Work

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 4 / 52

Page 5: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Computing Hardware Evolution

Frequency barrier

Processing units cannot run faster forreasons of energy efficiency

Concurrency makes up for frequencySeveral processing units run togethersuch as:

Multicore and manycoreVector processing extensionsAcceleratorsHeterogeneous architectures

Deep memory hierarchy

More capabilities, more complexity

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 5 / 52

Page 6: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Software Evolution

With this architectural evolution, the burden to design intrinsicalgorithmic parallelism rests on software developers

Parallel programming languages such as MPI+X, which requireprogrammers to

expose parallelism from algorithmmanage computational recourses and communication

Asynchronous task-based runtime systems, which:

relieve developers from managing low-level resources and let themfocus on developing parallel applicationsenhance user productivity

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 6 / 52

Page 7: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Table of Contents

1 MotivationComputing Hardware and Software EvolutionTask-based Runtime SystemsOverview of Task-based ECRC ProjectsALTANAL

2 Literature ReviewOverview of Existing Asynchronous Task-based Runtime SystemsDARMA

3 ALTANALALTANAL LayerALTANAL Interface

4 Test Cases and ExperimentsDense Cholesky FactorizationTile Low Rank Cholesky Factorization

5 Conclusion and Future Work

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 7 / 52

Page 8: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Task-based Runtime Systems

Task-based runtime systems conceptuallysimilar to out-of-order processor scheduling

They provides an automatic parallelization bytracking data dependencies and resolving datahazards at runtime

Task-based runtimes logically operate with adirected acyclic graph (DAG) that:

captures data dependencies betweenapplication taskscaptures tasks read/write data

Task-based runtimes provide additionalinformation such as task priority, loadbalancing, and data locality

Profiling and tracing

Main challenge of these recent runtime systemsis language expressivity

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 8 / 52

Page 9: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Table of Contents

1 MotivationComputing Hardware and Software EvolutionTask-based Runtime SystemsOverview of Task-based ECRC ProjectsALTANAL

2 Literature ReviewOverview of Existing Asynchronous Task-based Runtime SystemsDARMA

3 ALTANALALTANAL LayerALTANAL Interface

4 Test Cases and ExperimentsDense Cholesky FactorizationTile Low Rank Cholesky Factorization

5 Conclusion and Future Work

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 9 / 52

Page 10: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

HiCMA: Hierarchical Computations on ManycoreArchitectures

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 10 / 52

Page 11: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

ExaGeoStat: Exascale GeoStatistics

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 11 / 52

Page 12: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

MOAO: Multi-Object Adaptive Optics systems

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 12 / 52

Page 13: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

KSVD: KAUST Singular Value Decomposition

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 13 / 52

Page 14: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Async-AMG: Asynchronous Algebraic Multigrid

Asynchronous task-based parallelization of additive algebraicmulti-grid for solving Ax=b

It exploits the parallelism between levels of the grid hierarchy inadditive AMG

Hybrid MPI+OmpSs (Barcelona Supercomputer Center)

Cray XC40 supercomputer

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 14 / 52

Page 15: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Table of Contents

1 MotivationComputing Hardware and Software EvolutionTask-based Runtime SystemsOverview of Task-based ECRC ProjectsALTANAL

2 Literature ReviewOverview of Existing Asynchronous Task-based Runtime SystemsDARMA

3 ALTANALALTANAL LayerALTANAL Interface

4 Test Cases and ExperimentsDense Cholesky FactorizationTile Low Rank Cholesky Factorization

5 Conclusion and Future Work

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 15 / 52

Page 16: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Ringing Bell

API Standardization

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 16 / 52

Page 17: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

ALTANAL

ALTANAL: Abstraction Layer for Task bAsed NumericAl Libraries

Although the task-based model is quite active, lack of APIstandardization makes it difficult for application or library developersto switch runtimes

A thin layer of abstraction, making the user experience oblivious tothe underneath run-time systems

ALTANAL Goals:

Providing a set of abstractions to facilitates the expression of taskingthat map to a variety of run-time systems such as StarPU, Quark,OmpSs, PaRSEC, OpenMP, and KokkosEnabling exploration of a variety of underlying runtime systemtechnologies and architectures without changing the application codeEnhancing user productivity

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 17 / 52

Page 18: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Table of Contents

1 MotivationComputing Hardware and Software EvolutionTask-based Runtime SystemsOverview of Task-based ECRC ProjectsALTANAL

2 Literature ReviewOverview of Existing Asynchronous Task-based Runtime SystemsDARMA

3 ALTANALALTANAL LayerALTANAL Interface

4 Test Cases and ExperimentsDense Cholesky FactorizationTile Low Rank Cholesky Factorization

5 Conclusion and Future Work

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 18 / 52

Page 19: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Table of Contents

1 MotivationComputing Hardware and Software EvolutionTask-based Runtime SystemsOverview of Task-based ECRC ProjectsALTANAL

2 Literature ReviewOverview of Existing Asynchronous Task-based Runtime SystemsDARMA

3 ALTANALALTANAL LayerALTANAL Interface

4 Test Cases and ExperimentsDense Cholesky FactorizationTile Low Rank Cholesky Factorization

5 Conclusion and Future Work

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 19 / 52

Page 20: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Quark: QUeuing And Runtime for Kernels

ICL, University of TennesseeKnoxville.

Enable dynamic asynchronousexecution of tasks

Targets multi-core, multi-socketshared memory systems

It is highly optimized forPLASMA library

Data locality, priority, andregion access.

Data dependencies: INPUT,

INOUT, OUTPUT

Algorithm 1: Quark STF Tile Cholesky

QUARK New(NUM THREADS);

QUARK Sequence Create (quark);

for k = 0; k < nt; k++ doQUARK Insert Task(quark, POTRF, flg, ...);

for m = k + 1; m < nt; m++doQUARK Insert Task(quark, TRSM, flg, ...);

for n = k + 1; n < nt; n++doQUARK Insert Task(quark, SYRK, flg, ...);

for m = n + 1; m < nt; m++doQUARK Insert Task(quark, GEMM, flg, ...);

QUARK Barrier( quark );

QUARK Sequence Wait (quark, seq);QUAR Delete(quark);

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 20 / 52

Page 21: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Quark: QUeuing And Runtime for Kernels

QUARK Insert Task(quark, POTRF , flags,sizeof(int), &uplo, VALUE,sizeof(int), &n, VALUE,sizeof(double)*nb*nb, &A, INOUT,sizeof(int), &lda, VALUE,sizeof(int), &iinfo, VALUE,,0);

void POTRF (Quark* quark) {quark unpack args 5(quark, &uplo,&n, &A, &lda, &iinfo );potrf(uplo, n, A, lda, iinfo);

}

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 21 / 52

Page 22: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

StarPU

INRIA Bordeaux Sud-Ouest

Enable dynamic task-basedimplementation.

Multicore, Manycoreshared/distributed memory, andheterogeneous architecture

StarPU provides task scheduling andmemory management mechanisms

Based on three principle:

Registering its buffers, to get onehandle per bufferDefining codelets to CPU and GPUimplementationsApplying codelets on some handles

Data dependencies: STARPU R,

STARPU W, STARPU RW

Algorithm 2: StarPU STF Tile Cholesky

starpu init(NULL);

starpu data handle t handle;

starpu matrix data register(&handle, ..)

for k = 0; k < nt; k++ dostarpu insert task(&POTRF, ...);

for m = k + 1; m < nt; m++dostarpu insert task(&TRSM, ...);

for n = k + 1; n < nt; n++dostarpu insert task(&SYRK, ...);for m = n + 1; m < nt; m++do

starpu insert task(&GEMM, ...);

starpu task wait for all();

starpu data unregister (handle);

starpu shutdown();

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 22 / 52

Page 23: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

StarPU

starpu insert task(&dpotrf starpu ,STARPU VALUE, &uplo, sizeof(int),STARPU VALUE, &n, sizeof(int),STARPU RW, &Ahandle,STARPU VALUE, &lda, sizeof(int),STARPU VALUE, &iinfo, sizeof(int),,0);

void POTRF(void *descr[], void *cl args) {A =

STARPU MATRIX GET PTR(descr[0]);starpu codelet unpack args(cl args,&uplo,&n, &lda, &iinfo );potrf(uplo, n, A, lda, iinfo);}

struct starpu codeletdpotrf starpu = {.type = STARPU SEQ,.cpu funcs = {POTRF},.nbuffers = 1,.modes ={ STARPU RW} };

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 23 / 52

Page 24: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

PaRSEC:Parallel Runtime Scheduling and ExecutionController

ICL, University of Tennessee

Dynamic Task Discovery andParametrized Task Graph

Multicore, ManycoreShared/distributed memory, andheterogeneous architecture

Data dependencies: INPUT,

OUTPUT, INOUT

PTG has symbolic DAG and no needto build and store it in memory.

DTD inserts tasks sequentially, andthe DAG is built dynamically duringrun-time and stores it in memory.

Algorithm 3: PaRSEC STF Tile Cholesky

parsec=setup parsec(argc, argv, iparam);dtd tp = parsec dtd taskpool new ();parsec dtd data collection init(&A);parsec enqueue( parsec, dtd tp );parsec context start(parsec);

for k = 0; k < nt; k++ doparsec dtd taskpool insert task(&POTRF, ...);

for m = k + 1; m < nt; m++ doparsec dtd taskpool insert task(&TRSM, ...);

for n = k + 1; n < nt; n++ doparsec dtd taskpool insert task(&SYRK, ...);for m = n + 1; m < nt; m++ do

parsec dtd taskpool insert task(&GEMM, ...);

parsec dtd taskpool wait( parsec, dtd tp );parsec context wait( parsec );parsec taskpool free( dtd tp );

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 24 / 52

Page 25: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

PaRSEC:Parallel Runtime Scheduling and ExecutionController

parsec dtd taskpool insert task(dtd tp, POTRF , priority, ”potrf”sizeof(int), &uplo, VALUE,sizeof(int), &n, VALUE,sizeof(double)*nb*nb, &A, INOUT,sizeof(int), &lda, VALUE,sizeof(int), &iinfo, VALUE,,PARSEC DTD ARG END);

void POTRF(parsec execution stream t *es, parsec task t *this task){parsec dtd unpack args(this task, &uplo,&n, &A, &lda, &iinfo );potrf(uplo, n, A, lda, iinfo);

}

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 25 / 52

Page 26: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

OpenMP: Open Multi-Processing

OpenMP ArchitectureReview Board (or OpenMPARB)

OpenMP 1.x (1997-98),OpenMP 2.x (2000-02)

Thread-based fork-joinprogramming model

OpenMP 3.x (2008-11)

Independent tasks

OpenMP 4.x (2013-15)

Task with dependencies

Multicore, Manycore sharedmemory systems

Data dependencies: in,

out, inout

Algorithm 4: OpenMP STF Tile Cholesky

#pragma omp parallel

#pragma omp master{for k = 0; k < nt; k++ do#pragma omp task depend(inout:A[0:nb])

POTRF(A[k][k]);

for m = k + 1; m < nt; m++ do#pragma omp task depend(in:A[0:nb]) depend(inout:A[0:nb])

TRSM( A[k][k], A[m][k]);

for n = k + 1; n < nt; n++ do#pragma omp task depend(in:A[0:nb]) depend(inout:A[0:nb])

SYRK(A[n][k], A[n][n]);

for m = n + 1; m < nt; m++ do#pragma omp task depend(in:A[0:nb], A[0:nb]) depend(inout:A[0:nb])

GEMM(A[n][k], A[m][k], A[n][m]);

}

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 26 / 52

Page 27: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

OmpSs

Bercelona SupercomputingCenter (BSC)

Name originally comes from:OpenMP and StarSs

OmpSs-2 is secondgeneration of the OmpSs

Multicore, Manycoreshared/distributed memory,and heterogeneousarchitecture

It allows nesting of tasks

Data dependencies: in,

out, inout

Algorithm 5: OmpSs STF Tile Cholesky

for k = 0; k < nt; k++ do#pragma omp inout(A[0:nb])

POTRF(A[k][k]);

for m = k + 1; m < nt; m++ do#pragma omp in(A[0:nb]) inout(A[0:nb])

TRSM( A[k][k], A[m][k]);

for n = k + 1; n < nt; n++ do#pragma omp task in(A[0:nb]) inout(A[0:nb])

SYRK(A[n][k], A[n][n]);

for m = n + 1; m < nt; m++ do#pragma omp task in(A[0:nb], A[0:nb]) inout(A[0:nb])

GEMM(A[n][k], A[m][k], A[n][m]);

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 27 / 52

Page 28: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Summary Table of Existing Task-based Runtime Systems

Task-basedRuntime

Developergroup

DistributedMemory

GPU Scheduler Features Task Dependency ProgrammingInterface

OpenMP OpenMP ARB standard in GNU,pragma directive

STF, implicitDAG, fork-joinmodel

C, C++,Fortran

OmpSs/OmpSs-2

BercelonaSupercomputingCenter

X X Breadth First, WorkFirst, Socket-awarescheduler, Bottomlevel-aware scheduler

STF, implicit DAG C, C++,Fortran

StarPU INRIABordeauxSud-Ouest

X X prio, dm , dmda,eager scheduler, workstealing, priority

STF, implicit DAG C

PaRSEC UTK X X PBQ , ePBQschedulers, workstealing, priority

STF, implicitDAG, PTG,explicit DAG

C, For-tran JDFcompiler

Quark UTK priority and localityhinting, accumulatorand gatherv tasks

STF, implicit DAG C

Kokkos SandiaNationalLaboratories

X execution policy andpattern,memory space andlayout

STF, implicit DAG C++

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 28 / 52

Page 29: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Summary Table of Existing Task-based Runtime Systems

Task-basedRuntime

Developergroup

DistributedMemory

GPU Scheduler Features Task Dependency ProgrammingInterface

Legion StanfordUniversity

X X logical regions,spawning child task,mapping interface

STF, implicit DAG C++,Regentcompiler

HPX STEllAR Group X X work-queuing model,message drivencomputation usingtasked based

explicitparallelism,fork-join model

C, C++

Charm++ University ofIllinois

X X chares, entrymethods,migratability,asynchrony ,structure dagger

explicit DAG C++, cus-tom

OCR Universities\Laboratories

asynchronousevent-driven,work-stealing,priority, work-sharing

explicit, DAG C

Cilk Intel spawn and sync,reducers andhyperobjects, andwork-stealing

Divide andconquer (recursivetasks), fork-joinmodel

C, C++

TBB Intel nested parallelism,work-stealing, rangesandpartitioners,memoryallocators

parallelalgorithms,explicit flowgraphs

C++

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 29 / 52

Page 30: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Table of Contents

1 MotivationComputing Hardware and Software EvolutionTask-based Runtime SystemsOverview of Task-based ECRC ProjectsALTANAL

2 Literature ReviewOverview of Existing Asynchronous Task-based Runtime SystemsDARMA

3 ALTANALALTANAL LayerALTANAL Interface

4 Test Cases and ExperimentsDense Cholesky FactorizationTile Low Rank Cholesky Factorization

5 Conclusion and Future Work

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 30 / 52

Page 31: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

DARMA: Distributed Asynchronous Resilient Models andApplications

DARMA is a C++ abstraction layer forasynchronous many-task (AMT)runtimes

Sandia National Laboratories.

Executing applications with multipledifferent run-time systems to takeadvantage of the strengths andweaknesses of each without modifyinguser code

DARMA software provides two levels:front-end API, and back-end-API

DARMA currently supports OpenMPand Kokkos runtime systems in thebackend

Application

DARMA

Runtime

OS/Hardware

Common APIacross runtimes

Common APIacross runtimes

Front End API(Application User)

Translation Layer

Back End API(Specification for Runtime)

Glue Code(Specific to each runtime)

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 31 / 52

Page 32: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

DARMA: Distributed Asynchronous Resilient Models andApplications

DARMA follows the SequentialTask Flow (STF) model toexpress the main building blocks

Data interactions occurs usingspecial handle called anAccessHandle

Asynchrous task can be definedusing create work() function

create work() can be calledwith either a C++11 lambdaor a functor

AccessHandle has well-defineddeterministic permissions on theunderlying data: Modify, Read,None.

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 32 / 52

Page 33: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Table of Contents

1 MotivationComputing Hardware and Software EvolutionTask-based Runtime SystemsOverview of Task-based ECRC ProjectsALTANAL

2 Literature ReviewOverview of Existing Asynchronous Task-based Runtime SystemsDARMA

3 ALTANALALTANAL LayerALTANAL Interface

4 Test Cases and ExperimentsDense Cholesky FactorizationTile Low Rank Cholesky Factorization

5 Conclusion and Future Work

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 33 / 52

Page 34: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Table of Contents

1 MotivationComputing Hardware and Software EvolutionTask-based Runtime SystemsOverview of Task-based ECRC ProjectsALTANAL

2 Literature ReviewOverview of Existing Asynchronous Task-based Runtime SystemsDARMA

3 ALTANALALTANAL LayerALTANAL Interface

4 Test Cases and ExperimentsDense Cholesky FactorizationTile Low Rank Cholesky Factorization

5 Conclusion and Future Work

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 34 / 52

Page 35: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

ALTANAL Layer

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 35 / 52

Page 36: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

ALTANAL Layer

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 36 / 52

Page 37: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Table of Contents

1 MotivationComputing Hardware and Software EvolutionTask-based Runtime SystemsOverview of Task-based ECRC ProjectsALTANAL

2 Literature ReviewOverview of Existing Asynchronous Task-based Runtime SystemsDARMA

3 ALTANALALTANAL LayerALTANAL Interface

4 Test Cases and ExperimentsDense Cholesky FactorizationTile Low Rank Cholesky Factorization

5 Conclusion and Future Work

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 37 / 52

Page 38: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

ALTANAL Interface

High-Level API

ALTANAL Init

ALTANAL Finalize

ALTANAL Enable

ALTANAL Disable

ALTANAL Sequence Create

ALTANAL Sequence Destroy

ALTANAL Insert Task

ALTANAL Unpack Arg

ALTANAL Options Init

Low-Level API

ALTANAL Runtime init

ALTANAL Runtime finalize

ALTANAL Runtime enable

ALTANAL Runtime disable

ALTANAL Runtime sequence create

ALTANAL Runtime sequence destroy

ALTANAL Runtime insert task

ALTANAL Runtime unpack arg

ALTANAL Runtime options init

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 38 / 52

Page 39: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

ALTANAL Init

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 39 / 52

Page 40: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

ALTANAL Finalize

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 40 / 52

Page 41: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

ALTANAL Main API

Algorithm 6: ALTANAL STF Tile Cholesky

ALTANAL Init(ncpus, ngpus);ALTANAL context t *altanal;ALTANAL sequence t *sequence;ALTANAL request t *request;ALTANAL Sequence Create(altanal, &sequence);

for k = 0; k < nt; k++ doALTANAL Insert Task(&ALTANAL CODELETS NAME(POTRF), options, ...);

for m = k + 1; m < nt; m++doALTANAL Insert Task(&ALTANAL CODELETS NAME(TRSM), options, ...);

for n = k + 1; n < nt; n++doALTANAL Insert Task(&ALTANAL CODELETS NAME(SYRK), options, ...);

for m = n + 1; m < nt; m++doALTANAL Insert Task(&ALTANAL CODELETS NAME(GEMM), options, ...);

ALTANAL Sequence Wait(altanal, sequence);ALTANAL Sequence Destroy(altanal, sequence);ALTANAL Finalize();

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 41 / 52

Page 42: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

ALTANAL Main API

ALTANAL CODELETS(POTRF, POTRF CPU)

ALTANAL Insert Task(ALTANAL CODELETS NAME(POTRF), options,ALTANAL VALUE, &uplo, ˙ sizeof(int),ALTANAL VALUE, &n, sizeof(int),ALTANAL INOUT, &A, sizeof(double)*nb*nbALTANAL VALUE, &lda, sizeof(int),ALTANAL VALUE, &iinfo, sizeof(int),,ALTANAL PARAM END);

void POTRF CPU(ALTANAL altanal) {ALTANAL Unpack Args(altanal, &uplo,&n, &lda, &iinfo );potrf(uplo, n, A, lda, iinfo);

}

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 42 / 52

Page 43: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Table of Contents

1 MotivationComputing Hardware and Software EvolutionTask-based Runtime SystemsOverview of Task-based ECRC ProjectsALTANAL

2 Literature ReviewOverview of Existing Asynchronous Task-based Runtime SystemsDARMA

3 ALTANALALTANAL LayerALTANAL Interface

4 Test Cases and ExperimentsDense Cholesky FactorizationTile Low Rank Cholesky Factorization

5 Conclusion and Future Work

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 43 / 52

Page 44: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Table of Contents

1 MotivationComputing Hardware and Software EvolutionTask-based Runtime SystemsOverview of Task-based ECRC ProjectsALTANAL

2 Literature ReviewOverview of Existing Asynchronous Task-based Runtime SystemsDARMA

3 ALTANALALTANAL LayerALTANAL Interface

4 Test Cases and ExperimentsDense Cholesky FactorizationTile Low Rank Cholesky Factorization

5 Conclusion and Future Work

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 44 / 52

Page 45: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Dense Cholesky Factorization

Figure: Dual-socket 18-core Intel(R) Xeon(R) Haswell CPU E5-2699 v3 @ 2.3 GHz with256GB of main memory.

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 45 / 52

Page 46: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Dense Cholesky Factorization

Figure: Dual-socket 28-core Intel(R) Xeon(R) Platinum 8176 CPU @ 2.10GHz 38.5MBwith 256GB of main memory.

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 46 / 52

Page 47: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Table of Contents

1 MotivationComputing Hardware and Software EvolutionTask-based Runtime SystemsOverview of Task-based ECRC ProjectsALTANAL

2 Literature ReviewOverview of Existing Asynchronous Task-based Runtime SystemsDARMA

3 ALTANALALTANAL LayerALTANAL Interface

4 Test Cases and ExperimentsDense Cholesky FactorizationTile Low Rank Cholesky Factorization

5 Conclusion and Future Work

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 47 / 52

Page 48: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Tile Low Rank Cholesky Factorization

Figure: Dual-socket 18-core Intel(R) Xeon(R) Haswell CPU E5-2699 v3 @ 2.3 GHz with256GB of main memory, shared memory system

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 48 / 52

Page 49: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Tile Low Rank Cholesky Factorization

Figure: Dual-socket 28-core Intel(R) Xeon(R) Platinum 8176 CPU @ 2.10GHz 38.5MBwith 256GB of main memory, shared memory system

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 49 / 52

Page 50: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Table of Contents

1 MotivationComputing Hardware and Software EvolutionTask-based Runtime SystemsOverview of Task-based ECRC ProjectsALTANAL

2 Literature ReviewOverview of Existing Asynchronous Task-based Runtime SystemsDARMA

3 ALTANALALTANAL LayerALTANAL Interface

4 Test Cases and ExperimentsDense Cholesky FactorizationTile Low Rank Cholesky Factorization

5 Conclusion and Future Work

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 50 / 52

Page 51: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Conclusion and Future Work

ALTANAL is designed to provide run-time oblivious interface that canbe mapped to many backend run-times (OpenMP, StarPU, Quark,OmpSs, Kokkos, PaRSEC)

It tackles many challenges such as the ability of studying differentruntimes for best standards and practices

ALTANAL API is used in the interface of compute-bound andmemory-bound workloads and enable them to switch between Quark,StarPU, and PaRSEC

ALTANAL currently abstracts Quark , StarPU, and PaRSEC

This research will support many of today's scientific applicationsbased on leading asynchronous dynamic runtime systems

We are targeting other run-time systems such as OpenMP, OmpSs,Kokkos, and TBB

ALTANAL will be extended to benefit from C++ abstractionmechanism

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 51 / 52

Page 52: ALTANAL - Abstraction Layer for Task bAsed NumericAl …...A thin layer of abstraction, making the user experience oblivious to the underneath run-time systems ALTANAL Goals: Providing

Thanks

Rabab Alomairy (KAUST) ALTANAL IXPUG Middle East, 2018 52 / 52