Introducing Tpetra and Kokkos
Chris Baker/ORNL, TUG 2009, November 3-5 @ CSRI

Page 1:

Introducing Tpetra and Kokkos

Chris Baker/ORNL

TUG 2009

November 3-5 @ CSRI

Page 2:

Managed by UT-Battelle for the U.S. Department of Energy

Introducing Tpetra and Kokkos

• Tpetra provides a next-generation implementation of the Petra Object Model.
– This is a framework for distributed linear algebra objects.
– Tpetra is a successor to Epetra.

• Kokkos is an API for programming to a generic parallel node.
– The Kokkos memory model allows code to be targeted to traditional (CPU) and non-traditional (accelerated) nodes.
– The Kokkos compute model provides a set of constructs for parallel computing operations.

Page 3:

Tpetra Organization

• Tpetra follows the Petra Object Model currently implemented in Epetra:
– Map describes the distribution of object data across nodes.
– Teuchos::Comm abstracts internode communication.
– Import, Export, and Distributor utility classes facilitate efficient data transfer.
– Operator, RowMatrix, and RowGraph provide abstract interfaces.
– Vector, MultiVector, CrsGraph, and CrsMatrix are concrete implementations that are the workhorses of Tpetra-centered codes.

• Any class with significant data is templated.

• Any class with significant computation uses Kokkos.

Page 4:

Tpetra vs. Epetra

• Most of the functionality of Epetra is present in Tpetra.

• Some differences prohibit a “find-replace” migration:

Epetra:

    Epetra_MpiComm comm(...);
    Epetra_Map map(numGlobal, 0, comm);
    Epetra_CrsMatrix A(Copy, map, &nnz, true);
    Epetra_Vector x(map), y(map);
    A.Apply(x, y);

Tpetra:

    RCP<Comm> comm = rcp(...);
    Map<int> map(numGlobal, 0, comm);
    CrsMatrix<double,int> A(rcpFromRef(map), nnz, StaticProfile);
    Vector<double,int> x(rcpFromRef(map)), y(rcpFromRef(map));
    A.apply(x, y);

– Minor interface changes
– Dependency on the Kokkos package
– Introduction of templated classes

Page 5:

Tpetra Templated Classes

• A limitation of Epetra is that the implementation is tied to double and int.
– Deployment of Epetra discourages significant modifications.
– The published interface limits the possible implementation changes.

• Clean slate and compiler availability allow Tpetra to address this via template parameters to classes.

• This provides numerous capability extensions:
– No 4GB limit: surpassing int enables arbitrarily large problems.
– Arbitrary scalar types: float, complex, matrix<5,3>, qd_real
– Greater efficiency.
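Since the templated classes themselves live in Trilinos, the following standalone sketch (a hypothetical dot function, not Tpetra source) illustrates how templating on a Scalar type lets a single implementation serve float, double, and complex alike:

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Standalone sketch (not Tpetra source): one dot-product implementation
// serves every Scalar type that supports + and *.
template <class Scalar>
Scalar dot(const std::vector<Scalar> &x, const std::vector<Scalar> &y) {
  Scalar sum = Scalar(0);
  for (std::size_t i = 0; i < x.size(); ++i) {
    sum = sum + x[i] * y[i];  // only ring operations required of Scalar
  }
  return sum;
}
```

The same template would instantiate for qd_real or a small matrix type, provided the required traits (in Tpetra's case, Teuchos::ScalarTraits) exist for it.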

Page 6:

Tpetra Basic Template Parameters

• Three primary template arguments:
– LocalOrdinal, GlobalOrdinal, Scalar

• Scalar enables the description of numerical objects over different fields.
– Any mathematically well-defined type is supported.
– Additionally, the type must be supported by Teuchos::ScalarTraits and Teuchos::SerializationTraits.

• LocalOrdinal describes local element indices.
– Intended to enable efficiency; should be chosen as small as possible.

• GlobalOrdinal describes global element indices.
– Intended to enable larger problem sizes.
– Decoupling the two is necessary when the number of nodes is large.

Page 7:

Tpetra Template Examples

Map<LocalOrdinal, GlobalOrdinal>

• global_size_t getGlobalNumElements()

• size_t getNodeNumElements()

• LocalOrdinal getLocalElement(GlobalOrdinal gid)

• GlobalOrdinal getGlobalElement(LocalOrdinal lid)

CrsMatrix<Scalar, LocalOrdinal, GlobalOrdinal>

• global_size_t getGlobalNumEntries()

• size_t getNodeNumEntries()

• void getGlobalRowView(GlobalOrdinal gid, ArrayRCP<GlobalOrdinal> &inds, ArrayRCP<Scalar> &vals)
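As a concrete toy illustration of the Map accessors above, here is a hypothetical contiguous map; ToyMap, its constructor, and the block layout are inventions for this sketch, not Tpetra's Map:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

typedef std::uint64_t global_size_t;  // stand-in for Tpetra's global_size_t

// Toy sketch of a contiguous Map (illustrative only): this process owns
// global indices [myStart, myStart + myCount).
template <class LocalOrdinal, class GlobalOrdinal>
class ToyMap {
public:
  ToyMap(global_size_t numGlobal, GlobalOrdinal myStart, std::size_t myCount)
    : numGlobal_(numGlobal), myStart_(myStart), myCount_(myCount) {}
  global_size_t getGlobalNumElements() const { return numGlobal_; }
  std::size_t getNodeNumElements() const { return myCount_; }
  LocalOrdinal getLocalElement(GlobalOrdinal gid) const {
    return static_cast<LocalOrdinal>(gid - myStart_);
  }
  GlobalOrdinal getGlobalElement(LocalOrdinal lid) const {
    return myStart_ + static_cast<GlobalOrdinal>(lid);
  }
private:
  global_size_t numGlobal_;
  GlobalOrdinal myStart_;
  std::size_t myCount_;
};
```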

Page 8:

Tpetra Advanced Template Parameters

• Other template arguments exist to provide additional flexibility in the Tpetra object implementation:
– The Node template argument specifies a Kokkos node.
– Local data structures and implementations are also flexible.

Example: CrsMatrix<Scalar, LO, GO, Node, LclMatVec, LclMatSolve>

Parameter    Default                         Description
Scalar       (required)                      Field for matrix values
LO           int                             Type of local indices
GO           LO                              Type of global indices
Node         Kokkos::DefaultNode             Kokkos node for local operations
LclMatVec    Kokkos::DefaultSparseMultiply   Implementation of local sparse mat-vec
LclMatSolve  Kokkos::DefaultSparseSolve      Implementation of local sparse solve

Page 9:

Kokkos Parallel Node API

• Want: minimize the effort needed to port Tpetra

• The goal of Kokkos is to allow code, once written, to be run on any parallel node, regardless of architecture.

• The difficulties are many:

• Difficulty #1: Many different memory architectures
– A node may have multiple, disjoint memory spaces.
– Optimal performance may require special memory placement.

• Difficulty #2: Kernels must be tailored to the architecture
– The implementation of an optimal kernel will vary between architectures.
– No universal binary: separate compilation paths are needed.

Page 10:

Kokkos Node API

• Kokkos provides two components:

– The Kokkos memory model addresses Difficulty #1:
• Allocation, deallocation, and efficient access of memory
• Compute buffer: a special memory allocation used exclusively for parallel computation

– The Kokkos compute model addresses Difficulty #2:
• Description of kernels for parallel execution on a node
• Provides stubs for common parallel work constructs:
– Parallel for loop
– Parallel reduction

• Supporting a new platform is only a matter of implementing these models, i.e., implementing a new Node object.

Page 11:

Kokkos Memory Model

• A generic node model must at least:
– support the scenario involving distinct memory regions
– allow efficient memory access under traditional scenarios

• Node provides the following memory handling routines:

    ArrayRCP<T> Node::allocBuffer<T>(size_t sz);
    void Node::copyToBuffer<T>(ArrayView<T> src, ArrayRCP<T> dest);
    void Node::copyFromBuffer<T>(ArrayRCP<T> src, ArrayView<T> dest);
    ArrayRCP<T> Node::viewBuffer<T>(ArrayRCP<T> buff);
    void Node::readyBuffer<T>(ArrayRCP<T> buff);
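A minimal host-only sketch of these routines follows; ToyHostNode and the use of std::shared_ptr<std::vector<T>> in place of Teuchos::ArrayRCP are assumptions for illustration, not Kokkos code. On a GPU node the same interface would allocate and copy to device memory:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Sketch of the memory-model routines for a host-only node.
struct ToyHostNode {
  template <class T>
  static std::shared_ptr<std::vector<T>> allocBuffer(std::size_t sz) {
    return std::make_shared<std::vector<T>>(sz);  // device alloc on a GPU node
  }
  template <class T>
  static void copyToBuffer(const std::vector<T> &src,
                           std::shared_ptr<std::vector<T>> dest) {
    std::copy(src.begin(), src.end(), dest->begin());  // host -> "device"
  }
  template <class T>
  static void copyFromBuffer(std::shared_ptr<std::vector<T>> src,
                             std::vector<T> &dest) {
    std::copy(src->begin(), src->end(), dest.begin());  // "device" -> host
  }
};

// Round trip: host data -> compute buffer -> host again.
inline double roundTripSum() {
  std::vector<double> host = {1.0, 2.0, 3.0};
  auto buf = ToyHostNode::allocBuffer<double>(host.size());
  ToyHostNode::copyToBuffer(host, buf);
  std::vector<double> back(host.size());
  ToyHostNode::copyFromBuffer(buf, back);
  return back[0] + back[1] + back[2];
}
```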

Page 12:

Kokkos Compute Model

• Have to find the correct level for programming the node:

– Too low: code dot(x,y) for each node.
• Too much work to move to a new platform.
• The effort of writing dot() duplicates that of norm1().

– Too high: code dot(x,y) for all nodes.
• Can’t exploit hardware features.
• The API becomes a programming language without a compiler.

• Somewhere in the middle:
– Parallel reduction is the intersection of dot() and norm1().
– Parallel for loop is the intersection of axpy() and mat-vec.
– We need a way of fusing kernels with these basic constructs.

Without the constructs: m kernels * n node types = m*n implementations.

With the constructs: m kernels + 2 constructs * n node types = m + 2*n implementations.

Page 13:

Kokkos Compute Model

    template <class WDP>
    void Node::parallel_for(int beg, int end, WDP workdata);

    template <class WDP>
    typename WDP::ReductionType
    Node::parallel_reduce(int beg, int end, WDP workdata);

    template <class T, class NODE>
    struct AxpyOp {
      const T * x;
      T * y;
      T alpha, beta;
      void execute(int i) { y[i] = alpha*x[i] + beta*y[i]; }
    };

    template <class T, class NODE>
    struct DotOp {
      typedef T ReductionType;
      const T * x, * y;
      T generate(int i) { return x[i]*y[i]; }
      T reduce(T x, T y) { return x + y; }
    };

• Template meta-programming is the answer.
– This is the same approach that Intel TBB takes.

• Node provides generic parallel constructs:
– Node::parallel_for, Node::parallel_reduce

• User fills the holes in the generic construct.
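The compute model can be sketched end to end with a serial stand-in node. ToySerialNode, demoDot, and the simplified kernel structs below are hypothetical, but they follow the parallel_for/parallel_reduce shape shown above; a TBB or CUDA node would expose the same interface with a parallel body:

```cpp
#include <cassert>
#include <vector>

// Minimal serial sketch of the work-data-pair (WDP) pattern: the Node
// supplies generic parallel_for / parallel_reduce, and user kernels
// (AxpyOp, DotOp) fill in the per-element work.
struct ToySerialNode {
  template <class WDP>
  static void parallel_for(int beg, int end, WDP wd) {
    for (int i = beg; i < end; ++i) wd.execute(i);
  }
  template <class WDP>
  static typename WDP::ReductionType parallel_reduce(int beg, int end, WDP wd) {
    typename WDP::ReductionType acc = wd.generate(beg);  // assumes end > beg
    for (int i = beg + 1; i < end; ++i) acc = wd.reduce(acc, wd.generate(i));
    return acc;
  }
};

template <class T>
struct AxpyOp {               // y = alpha*x + beta*y
  const T *x; T *y; T alpha, beta;
  void execute(int i) { y[i] = alpha * x[i] + beta * y[i]; }
};

template <class T>
struct DotOp {                // sum of x[i]*y[i]
  typedef T ReductionType;
  const T *x, *y;
  T generate(int i) { return x[i] * y[i]; }
  T reduce(T a, T b) { return a + b; }
};

inline double demoDot() {
  std::vector<double> x = {1, 2, 3}, y = {4, 5, 6};
  AxpyOp<double> axpy{ x.data(), y.data(), 2.0, 1.0 };
  ToySerialNode::parallel_for(0, 3, axpy);          // y = 2x + y = {6, 9, 12}
  DotOp<double> dot{ x.data(), y.data() };
  return ToySerialNode::parallel_reduce(0, 3, dot); // 1*6 + 2*9 + 3*12
}
```

Porting this sketch to a new platform would mean rewriting only the two Node constructs; the kernels are reused unchanged.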

Page 14:

Nodes and Kernels: How It Comes Together

• Kokkos developer/vendor/hero develops nodes:
– TBBNode, TPINode, RoadRunnerNode, CUDANode, SerialNode, YourNodeHere

• User develops kernels for parallel constructs.

• Template meta-programming does the rest:
– TBBNode< DotOp<double> >::parallel_reduce
– CUDANode< ComputePotentials<3D,LJ> >::parallel_for

• Composition is compile-time:
– OpenMPNode + AxpyOp is equivalent to hand-coded OpenMP axpy.
– May not always be able to achieve this feat.

Page 15:

Kokkos Linear Algebra Library

• A subpackage of Kokkos providing a set of data structures and kernels for local parallel linear algebra objects.

• Coded to the Kokkos Parallel Node API

• Tpetra (global) objects consist of a Comm and a corresponding (local) Kokkos object.

• Implementing a new Node ports all of Tpetra, without any changes to Tpetra itself.

    T Tpetra::Vector<T>::dot(Tpetra::Vector<T> v) {
      T lcl = this->lclVec_->dot( v.lclVec_ );
      return comm_->reduceAll<T>(SUM, lcl);
    }
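The local/global split in the dot() snippet above can be illustrated with an in-process simulation; localDot and simulatedGlobalDot are hypothetical helpers, and the real all-reduce goes through Teuchos::Comm rather than a simple sum:

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Each "node" computes a local dot with its local (Kokkos) object;
// the communicator then sums the local results. Simulated in-process.
inline double localDot(const std::vector<double> &x,
                       const std::vector<double> &y) {
  return std::inner_product(x.begin(), x.end(), y.begin(), 0.0);
}

inline double simulatedGlobalDot() {
  // Global vectors x = {1,2,3,4}, y = {1,1,1,1} split across two "nodes".
  double lcl0 = localDot({1.0, 2.0}, {1.0, 1.0});  // node 0: first half
  double lcl1 = localDot({3.0, 4.0}, {1.0, 1.0});  // node 1: second half
  return lcl0 + lcl1;  // analogue of comm_->reduceAll(SUM, lcl)
}
```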

Page 16:

Teuchos Memory Management Suite: A User Perspective

Chris Baker/ORNL

TUG 2009

November 3-5 @ CSRI

Page 17:

Teuchos Memory Management

• The Teuchos utility package provides a number of memory management classes:
– RCP: reference-counted pointer
– ArrayRCP: reference-counted array
– ArrayView: encapsulates the length of and pointer to an array
– Array: dynamically sized array

• Tpetra/Kokkos utilize these classes in place of raw pointers for:
– writing bug-free code
– writing simple code with simple interfaces

Page 18:

Teuchos::RCP

• RCP is a reference-counted smart pointer.
– Provides runtime protection against null dereference
– Provides automatic garbage collection
– Necessary in the context of exceptions

• Semantics are those of a C pointer.

• Tpetra use:
– Tracking the ownership of dynamically created objects
– Tpetra::Map objects are always passed by RCP.
– Dynamically created objects are always encapsulated in RCP:
• RCP<Vector> Vector::getSubView(...)

• Non-persisting situations allow the more efficient Teuchos::Ptr.
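Teuchos::RCP is not standard C++, so this sketch uses std::shared_ptr to demonstrate the same reference-counting guarantee; Tracker and countAfterScope are illustrative inventions, not Teuchos code:

```cpp
#include <cassert>
#include <memory>

// The object lives while any owner remains and is freed automatically,
// even when an exception unwinds the stack.
struct Tracker {
  static int liveCount;          // number of Tracker objects currently alive
  Tracker() { ++liveCount; }
  ~Tracker() { --liveCount; }
};
int Tracker::liveCount = 0;

inline int countAfterScope() {
  std::shared_ptr<Tracker> outer;
  {
    std::shared_ptr<Tracker> inner(new Tracker);
    outer = inner;               // refcount is 2; inner dies at scope end
  }
  return Tracker::liveCount;     // still 1: outer keeps the object alive
}
```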

Page 19:

Teuchos::ArrayRCP

• ArrayRCP is a reference-counted smart array.
– T* holds double duty in C: pointer to an object and pointer to an array.
– RCP is for the former; ArrayRCP is for the latter.

• Semantics are those of a C array/pointer:
– access operators: [] * ->
– arithmetic operators: + - ++ -- += -=
– All operations are bounds-checked in debug mode.
– Iterators are available for optimal release performance.

• Tpetra/Kokkos use:
– Allocated arrays are always encapsulated in an ArrayRCP before return.
– Used heavily in Kokkos for compute buffers and their views.

Page 20:

Example: ARCP and Kokkos Buffers

• The use of Teuchos::ArrayRCP greatly simplifies the management of compute buffers in the Kokkos memory model.

• In the absence of a smart pointer, the Node would need to provide a deleteBuffer() method as well.
– It would need to be manually called by the user.
– This requires the ability to identify when the buffer can be freed.
– ArrayRCP allows the Node to register a custom, Node-appropriate deallocator and additional bookkeeping data.

    ArrayRCP<T> Node::allocBuffer<T>(size_t sz);

Page 21:

Example: ARCP and Kokkos Buffers

• In the absence of ArrayRCP, this method requires that the user “release” the view to enable any necessary write-back to device memory.
– This requires manually tracking when the view has expired.
– Instead, the Node can register a custom deallocator for the ArrayRCP that will perform the write-back or other necessary bookkeeping.

• This is especially helpful in the context of Tpetra.
– Tpetra::MultiVector::get1dView() returns a host view of class data encapsulated in an ArrayRCP with an appropriate deallocator.
– As a result, the Tpetra user isn’t exposed to the Kokkos Node and doesn’t have to manually release the view.

    ArrayRCP<T> Node::viewBuffer<T>(ArrayRCP<T> buff);
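The write-back idea can be sketched with a custom deleter on std::shared_ptr; makeView and demoWriteBack are hypothetical helpers, and the real mechanism is a deallocator registered on the ArrayRCP returned by viewBuffer():

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>

// Hand out a host view whose deleter copies the (possibly modified) data
// back to the "device" buffer when the last reference goes away.
inline std::shared_ptr<double> makeView(double *device, std::size_t n) {
  double *host = new double[n];
  std::memcpy(host, device, n * sizeof(double));   // device -> host
  auto writeBack = [device, n](double *h) {
    std::memcpy(device, h, n * sizeof(double));    // host -> device
    delete[] h;
  };
  return std::shared_ptr<double>(host, writeBack);
}

inline double demoWriteBack() {
  double device[3] = {1.0, 2.0, 3.0};
  {
    std::shared_ptr<double> view = makeView(device, 3);
    view.get()[0] = 10.0;        // user edits the host view...
  }                              // ...view expires, deleter writes back
  return device[0];
}
```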

Page 22:

Teuchos::ArrayView

• RCP is sometimes overkill; non-persisting relationships can get away with Ptr.

• Non-persisting relationships with array data similarly utilize the ArrayView class.
– This class basically encapsulates a pointer and a size.
– Supports a subset of C array semantics.

• An optimized build results in very fast code:
– No garbage-collection overhead.
– Iterators become C pointers.

• Well integrated with the other classes:
– Easily returned by ArrayRCP and Array.

Page 23:

Teuchos::Array

• Array is a replacement for std::vector.

• The benefit of Array is integration with other Teuchos memory classes.

Raw pointers:

    vector<int> data(...);
    int * myalloc = NULL;
    myalloc = func2( &data[offset], size );

    int * func2(int A[], int length) {
      int sum = accumulate( A, A+length, 0 );
      return new int[sum];
    }

Teuchos classes:

    Array<int> data(...);
    ArrayRCP<int> myalloc;
    myalloc = func2( data(offset, size) );

    ArrayRCP<int> func2(ArrayView<int> A) {
      int sum = accumulate( A.begin(), A.end(), 0 );
      return arcp<int>(sum);
    }

Page 24:

Benefits of use

• The initial release of Tpetra contained no raw pointers:
– Replaced by RCP, ArrayRCP, or an appropriate iterator
– Zero memory overhead w.r.t. Epetra
– Almost made me a lazier developer

• Debugging abilities are excellent:
– Extends beyond normal bounds checking; can put additional constraints on memory access.
– A release build results in code that is as fast as C.

• These memory utilities are unique to Trilinos:
– Research-level capability
– Production-level quality