
Performance Evaluation of Concurrent Collections on High-Performance Multicore Computing Systems

Aparna Chandramowlishwaran†, Kathleen Knobe⋆, Richard Vuduc†

†College of Computing, Georgia Institute of Technology, Atlanta, GA
⋆Software Solutions Group, Intel Corporation, Hudson, MA

Abstract

This paper is the first extensive performance study of a recently proposed parallel programming model, called Concurrent Collections (CnC). In CnC, the programmer expresses her computation in terms of application-specific operations, partially ordered by semantic scheduling constraints. The CnC model is well-suited to expressing asynchronous-parallel algorithms, so we evaluate CnC using two dense linear algebra algorithms in this style for execution on state-of-the-art multicore systems: (i) a recently proposed asynchronous-parallel Cholesky factorization algorithm, and (ii) a novel and non-trivial "higher-level" partly-asynchronous generalized eigensolver for dense symmetric matrices.

Given a well-tuned sequential BLAS, our implementations match or exceed competing multithreaded vendor-tuned codes by up to 2.6×. Our evaluation compares with alternative models, including ScaLAPACK with a shared memory MPI, OpenMP, Cilk++, and PLASMA 2.0, on Intel Harpertown, Nehalem, and AMD Barcelona systems. Looking forward, we identify new opportunities to improve the CnC language and run-time scheduling and execution.

1 Introduction and Scope

We study the use of a novel general-purpose parallel programming model, called Concurrent Collections (CnC) [4, 12, 13]. In CnC, the programmer expresses her computation in terms of high-level application-specific components, partially ordered only by minimal semantic scheduling (data- and control-flow) constraints (Section 2). This model encourages the programmer to focus on expressing the computation at a high level, without unnecessary serialization, and gives the run-time system flexibility in scheduling operations.

Although there are several papers about various aspects of the CnC model, to date there have been no performance demonstrations or evaluations to assess its viability, particularly for high-performance computing (HPC) applications. This paper is the first such performance study. In particular, we ask whether CnC delivers competitive performance on computations with well-defined performance targets and challenging algorithmic characteristics. For our evaluation, we select dense linear algebra computations written for multicore systems in an asynchronous-parallel style, by which we mean that bulk-synchronous parallel behavior is replaced by more fine-grained task-level parallelism and localized synchronization [5, 7, 15]. This approach (a) is naturally suited to cores with relatively smaller cache or local-store memories, and (b) reduces the degree of synchronization, whose cost may reasonably be expected to increase with increasing core counts. There are numerous successful demonstrations of this approach for dense linear algebra on current multicore systems [5, 7, 14], meaning there are clear and rigorous performance targets.

Contributions and findings. Our central contribution is the first extensive study of the performance potential for HPC applications using the CnC model, based on Intel's v0.3 Linux CnC implementation for shared memory multicore systems [11]. We discuss which aspects of the CnC language and run-time could be improved. Such a study is essential to establishing the potential of CnC for achieving high performance.

We express and analyze a prior asynchronous-parallel variant of dense Cholesky factorization when written using CnC. When coupled with a well-tuned BLAS, CnC can closely match or exceed the performance and scalability of the vendor-tuned Intel Math Kernel Library (MKL). Our CnC-based code also compares favorably to PLASMA 2.0, a state-of-the-art domain-specific library-based approach (Section 3). Both MKL and PLASMA use an asynchronous-parallel approach, and so constitute the current state of the art.



For additional comparison, we provide results in alternative programming models, including the "off-the-shelf" solution of ScaLAPACK with a shared memory implementation of MPI (MPICH2+nemesis); OpenMP; and Cilk++. The principal difference between CnC and these models is CnC's natural support for asynchronous execution.¹ Our findings quantify the gap between asynchronous-parallel and bulk-synchronous execution (Section 4).

Finally, we develop a complete CnC implementation of a novel and partly asynchronous-parallel generalized eigensolver for dense symmetric matrices, of which Cholesky is a small component (Section 3). This non-trivial computation is, as far as we know, the first of its kind. As such, we show that it is feasible to express a complex algorithm within CnC. Our implementation outperforms the Intel MKL equivalent by 1.1–2.6× (Section 4).

Scope. Importantly, this study is about the performance potential of CnC. Such studies are essential for any new parallel programming model to show value for HPC. Our positive findings show that CnC has real potential as far as performance is concerned.

As an evaluation of CnC for HPC, our use of dense linear algebra limits the findings' generalizability to one class of computations. Still, this class has challenging properties (e.g., our eigensolver), and so we believe that the basic CnC approach could still be an appropriate starting point for similarly asynchronous-parallel algorithms in other areas.

Equally important to questions of performance are those of productivity. We argue, qualitatively, that CnC is suitable for these computations. However, we stress that a true assessment is a human-factors question, requiring a separate and carefully controlled experiment, and as such is beyond the scope of the present study.

2 Overview of CnC

This section provides a cursory overview of the basic CnC concepts relevant to our implementation and experimental results. Portions of this material are taken from existing detailed summaries [4, 11, 12].

The CnC model separates the specification of the computation from the expression of its parallelism. This design can separate the tasks of a domain expert, who is responsible for expressing the computation, from the tasks of a parallelization and tuning expert (possibly the same person, a different person, or some software/compiler), who identifies the parallelism and performs scheduling/distribution and manages communication/synchronization.

¹We do not use the most advanced features of OpenMP 3.0 and Cilk++, nor do we compare directly to the FLAME/SuperMatrix and library-based SMPSs approaches [7, 15]. Nevertheless, for Cholesky, we would expect at best comparable performance to PLASMA.

Figure 1. CnC graphical representation of the outer product operation, Z ← x · y^T. The step collection (Outer Product) is prescribed by the tag collection <i, j>; the item collections x, y, and Z carry tags <i>, <j>, and <i, j>, respectively.

CnC combines ideas from earlier language work on tuple-spaces, streaming, and data flow models [6, 9, 16] (see Section 5).

Computation specification. We summarize the basic CnC model by an example. Consider the dense outer product computation, Z ← x · y^T, where x and y are two column vectors and Z is a matrix of appropriate dimension. Algorithmically, we compute Z_{i,j} ← x_i · y_j for all pairs (x_i, y_j).

The domain expert specifies the computation in a form that can be represented by a graph, as shown in Figure 1 for the outer product. This graph has 3 kinds of nodes: computational steps, data items, and control tags. Directed edges show producer-consumer relationships among these nodes.

A step is the basic unit of execution, which for the outer product is pairwise element multiplication.² The blue oval in Figure 1 is a step collection, which statically represents the set of dynamic instances of these multiplications.

Data is represented using item collections. Here, x, y, and Z are the three item collections, shown by boxes in Figure 1. Each item collection comprises item instances, which in this case are the elements of the x, y, and Z objects. These items serve as the basic unit of storage, communication, and synchronization.

Steps may consume items (item → step) or produce them (step → item), shown by directed edges in Figure 1.

Each instance of a step or item has a unique application-specific identifier, or tag, which is a tuple of tag components. For the outer product, it is natural to use element indices as tags. In Figure 1, we denote the tag for x by <i>, the tag for y by <j>, and the tag for Z by <i, j>.

²We consider this very fine granularity for example only, as in practice one might wish to choose a larger grain, such as a block.


Tag collections (also called tag spaces) specify exactly which instances of a step will execute. A step collection is associated with exactly one tag collection/space; a step instance executes only if a matching tag instance exists. For the outer product, we show the tag space by a triangle and denote it by <i, j>. For instance, only if the tag collection has <3, 10> does the corresponding pairwise multiply for Z_{3,10} ← x_3 · y_10 execute. We say that a tag collection prescribes a step collection, and show that visually with a dashed undirected edge connecting the tag collection to the step collection. Multiple step collections may be prescribed by the same tag collection. Importantly, tags indicate whether a step will execute, but nothing about when it executes. This distinction shows in part how CnC separates scheduling decisions from the computation's specification.
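To make the tag/step relationship concrete, the following is a minimal, CnC-free C++ sketch of the outer-product example: each tag <i, j> identifies one dynamic step instance, and the "environment" enumerates the tags that prescribe the step. The names (Tag, multiply_step, run_outer_product) are illustrative and not part of the CnC API; a real CnC runtime is free to execute the prescribed instances in any order and in parallel.

#include <cstddef>
#include <vector>

struct Tag { std::size_t i, j; };  // application-specific identifier <i, j>

// The "step": purely sequential code for one dynamic instance Z(i,j) = x(i) * y(j).
void multiply_step(const Tag& t, const std::vector<double>& x,
                   const std::vector<double>& y,
                   std::vector<double>& Z, std::size_t ncols) {
  Z[t.i * ncols + t.j] = x[t.i] * y[t.j];
}

// The "environment": produces the control tags that prescribe the step.
// Here the instances are simply run in loop order; a runtime could reorder them.
void run_outer_product(const std::vector<double>& x, const std::vector<double>& y,
                       std::vector<double>& Z) {
  const std::size_t m = x.size(), n = y.size();
  Z.assign(m * n, 0.0);
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t j = 0; j < n; ++j)
      multiply_step(Tag{i, j}, x, y, Z, n);
}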

Though not shown here, a step may produce tags. In this way, a step may control what other steps execute. This facility is part of what makes CnC a more flexible and general model than, say, a pure streaming language.

Lastly, Figure 1 contains "squiggly" lines that are missing either a source or a sink. These lines mean that the item or tag comes from or goes to the environment, which is the external code that invokes this computation. For the outer product, the environment provides the data items and control tags. (There are other designs; for instance, we could have added an additional step that consumes data containing, say, the dimensions of the x and y vectors, and then produces the control tags that prescribe the pairwise multiply step.)

Textual notation. There is a formal textual representation of this graph. We illustrate this representation in Section 3, when we describe the CnC implementations of our target dense linear algebra computations. In the current implementation, a translator converts this specification into C++ code, generating subroutine stubs corresponding to the steps. The programmer must implement these stubs (presumably as purely sequential code).

When the run-time calls the sequential step code, it provides the tag and data item instances. The step code calls an API to get its input items and, if it produces tags, to put them back "into" the graph. We refer the interested reader to the CnC documentation [11].

Semantics and execution. If a step executes and produces an item or a tag, that item or tag becomes available. If a tag collection prescribes a step collection and a particular tag becomes available, then the step is prescribed. If all items for a particular step are available, the step becomes inputs-available. If a step is both inputs-available and prescribed, then it is enabled and may execute. The program terminates when no step is executing and no unexecuted step is enabled. This termination is valid if all prescribed steps have been executed.

The CnC model permits many run-time system designs, including those for distributed memory systems using MPI as well as shared memory versions [4]. We use Intel CnC 0.3, which is based on Intel Threading Building Blocks (TBB) [11]. The TBB run-time system is based on a Cilk-style work-stealing scheduler, with work queues implemented to use last-in first-out (LIFO) order.

In the Intel CnC, there are four types of events: start, complete, idle, and requeue. Start signals the beginning of execution of a step, while complete signals its successful completion. An idle event is the time spent between the end of one step and the start of the next, when the thread is waiting to be scheduled or waiting for data to become available.

The requeue event is specific to the Intel CnC. The run-time may start executing a step as soon as the prescribing tag is available. Thus, if some of the step's inputs are not yet available, the step may be requeued and tried again later. We revisit requeuing in Section 4.
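The following toy, single-threaded C++ sketch illustrates the requeue idea described above. It is not the Intel runtime (which is TBB-based, multithreaded, and work-stealing); the Step struct and schedule function are hypothetical names introduced only for illustration.

#include <cstddef>
#include <cstdio>
#include <deque>
#include <functional>

struct Step {
  const char* name;
  std::function<bool()> inputs_available;  // would all item Gets succeed?
  std::function<void()> run;               // the sequential step body
};

// Pop prescribed steps; run them if their inputs are ready, otherwise requeue.
void schedule(std::deque<Step> prescribed) {
  std::size_t consecutive_requeues = 0;
  while (!prescribed.empty()) {
    Step s = prescribed.front();
    prescribed.pop_front();
    if (!s.inputs_available()) {
      std::printf("requeue %s\n", s.name);   // inputs not yet produced
      prescribed.push_back(s);
      if (++consecutive_requeues > prescribed.size()) return;  // no step can make progress
      continue;
    }
    consecutive_requeues = 0;
    std::printf("start %s\n", s.name);       // "start" event
    s.run();                                 // running may make other steps' inputs available
    std::printf("complete %s\n", s.name);    // "complete" event
  }
}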

3 Dense Linear Algebra in CnC

Algorithm 1: Tiled Cholesky factorization algorithm of Buttari, et al. [5].

Input: Input matrix B of size n×n, where n = p·b for some tile size b
Output: Lower triangular matrix L

1  for k = 1 to p do
2      ConventionalCholesky(Bkk, Lkk);
3      for j = k + 1 to p do
4          TriangularSolve(Lkk, Bjk, Ljk);
5          for i = k + 1 to j do
6              SymmetricRank-kUpdate(Ljk, Lik, Bij);

In this section, we discuss the CnC implementations of the asynchronous-parallel variant of dense Cholesky factorization and a novel asynchronous-parallel dense generalized eigensolver.

3.1 Asynchronous-Parallel Cholesky

A Cholesky factorization of a symmetric positive definite matrix B is the product L · L^T, where L is a (lower) triangular matrix. We specifically consider Algorithm 1, which is the tiled Cholesky algorithm of Buttari, et al. [5]. This algorithm is based on decomposing B into blocks (or tiles), and computes L in an asynchronous-parallel manner suitable for multicore hierarchical memory platforms.
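For reference, here is a minimal sequential C++ sketch of the tiled structure of Algorithm 1, assuming column-major storage, only the lower triangle maintained, and n an exact multiple of the tile size b. The naive element-wise tile kernels below stand in for the tuned sequential BLAS/LAPACK calls (dpotrf, dtrsm, dsyrk/dgemm) that the actual implementation uses; the function and helper names are ours, not the paper's code.

#include <cmath>
#include <cstddef>
#include <vector>

// Element (r, c) of an n-by-n column-major matrix.
static inline double& at(std::vector<double>& A, std::size_t n,
                         std::size_t r, std::size_t c) {
  return A[c * n + r];
}

// In-place tiled Cholesky: on return the lower triangle of A holds L (n = p * b).
void tiled_cholesky(std::vector<double>& A, std::size_t n, std::size_t b) {
  const std::size_t p = n / b;
  for (std::size_t k = 0; k < p; ++k) {
    const std::size_t k0 = k * b, k1 = k0 + b;

    // Line 2: ConventionalCholesky on the diagonal tile (k, k), unblocked.
    for (std::size_t c = k0; c < k1; ++c) {
      for (std::size_t d = k0; d < c; ++d)
        at(A, n, c, c) -= at(A, n, c, d) * at(A, n, c, d);
      at(A, n, c, c) = std::sqrt(at(A, n, c, c));
      for (std::size_t r = c + 1; r < k1; ++r) {
        for (std::size_t d = k0; d < c; ++d)
          at(A, n, r, c) -= at(A, n, r, d) * at(A, n, c, d);
        at(A, n, r, c) /= at(A, n, c, c);
      }
    }

    // Line 4: TriangularSolve on each tile (J, k) below the diagonal,
    // i.e., overwrite it with B(J,k) * L(k,k)^{-T}.
    for (std::size_t J = k + 1; J < p; ++J)
      for (std::size_t r = J * b; r < (J + 1) * b; ++r)
        for (std::size_t c = k0; c < k1; ++c) {
          for (std::size_t d = k0; d < c; ++d)
            at(A, n, r, c) -= at(A, n, r, d) * at(A, n, c, d);
          at(A, n, r, c) /= at(A, n, c, c);
        }

    // Line 6: symmetric rank-b update of the trailing tiles in the lower triangle:
    // A(I, J) -= L(I, k) * L(J, k)^T for k < J <= I.
    for (std::size_t J = k + 1; J < p; ++J)
      for (std::size_t I = J; I < p; ++I)
        for (std::size_t c = J * b; c < (J + 1) * b; ++c)
          for (std::size_t r = (I == J ? c : I * b); r < (I + 1) * b; ++r)
            for (std::size_t d = k0; d < k1; ++d)
              at(A, n, r, c) -= at(A, n, r, d) * at(A, n, c, d);
  }
}

In the CnC version, each iteration of these loops becomes an independently schedulable step instance rather than a statement executed in program order.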

Figure 2. An illustration of asynchronous Cholesky factorization on a 4×4 tiling of B: (a) initial factorization of the (1,1) tile; (b) triangular solve using the (1,1) tile; (c) symmetric rank-k update using the (2:4, 1) tiles.

Figure 2 illustrates how asynchronous-parallelism arises in Algorithm 1. We first factor B11 = L11 · L11^T (line 2 of Algorithm 1 for k = 1), using a conventional sequential Cholesky algorithm. Then, lines 3–4, which operate on blocks B21, B31, and B41, can execute in parallel. Moreover, lines 5–6 suggest that once we complete operating on the B21 block, we can do a symmetric rank-k update of block B22 in one thread while another thread is still, say, performing the triangular solve on block B31. Hence, there is a lot of task- and data-level parallelism in Cholesky.

3.2 Cholesky in CnC

This asynchronous-parallel behavior maps naturally to the CnC constructs seen in Section 2. Lines 2, 4, and 6 in Algorithm 1 map to steps in CnC. The index iteration variables of Algorithm 1 constitute a natural choice for tags, which helps distinguish between different data items (tiles in this case).
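As a concrete illustration (not the paper's code), the loop nest of Algorithm 1 directly enumerates the three tag spaces used here. The small sketch below prints the step instances and their tags in program order, whereas the CnC runtime is free to execute them in any dependence-respecting order.

#include <cstdio>

// Enumerate the step instances of Algorithm 1 and the tags that identify them:
// <k> prescribes ConvChol, <k,j> prescribes Trisolve, <k,j,i> prescribes Update.
void enumerate_cholesky_tags(int p) {
  for (int k = 1; k <= p; ++k) {
    std::printf("ConvChol <%d>\n", k);                  // line 2
    for (int j = k + 1; j <= p; ++j) {
      std::printf("Trisolve <%d,%d>\n", k, j);          // line 4
      for (int i = k + 1; i <= j; ++i)
        std::printf("Update   <%d,%d,%d>\n", k, j, i);  // line 6
    }
  }
}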

Figure 3 shows the graphical computation specification of tiled Cholesky in CnC. For simplicity, we omit the data items from this graph. Below the graph, we show the textual representation of the graph that the programmer might write. CnC translates the textual representation into C++ code containing stubs for the programmer to fill in, as illustrated in Figure 4. That is, at this point, all the programmer does is input the appropriate tags and data items along with the serial logic of the step.

For the serial step implementation, we call tuned sequential vendor BLAS routines. This allows us to couple CnC with an optimized serial library to obtain an efficient parallel implementation with minimal coding effort. Figure 4 shows the actual step code with the call to dtrsm, which performs the triangular solve. The API calls (Get and Put) before and after the BLAS call read in the input tile(s) identified by the tag and output the resulting tile(s) with the corresponding tag identifier. The inputs, outputs, and computation performed vary across steps, but the basic principle is the same.

Note that the choice of tags is important, as it determines the amount of parallelism exposed. Tag choice is largely natural for dense linear algebra, but a poor choice of tags could impede performance.

Once the data is available and the step consuming the data is prescribed by a valid tag identifier, the CnC run-time schedules that particular step instance for execution. Given the freedom to schedule steps, CnC schedules them in a way that exposes asynchronous-parallel execution.

As with related approaches, this CnC-based implementation contains a tuning parameter, the block size, which we have assumed the domain expert introduces and selects. This parameter is critical to performance in that it implicitly controls the degree of available asynchronous-parallelism.

3.3 Generalized Symmetric Eigensolver

Though illustrative, the Cholesky example is fairly compact. To better understand and assess CnC, we also developed a tiled and partly asynchronous-parallel symmetric generalized eigensolver, which is considerably larger than Cholesky. This is one of the first efforts at designing and implementing an asynchronous eigensolver; most importantly, a model like CnC allows us to express the asynchronous-parallel behavior naturally and with relatively modest effort.

Algorithm. To compute the eigenvalues λ, we solve the generalized eigenvalue problem Az = λBz, where A is symmetric and B is a symmetric positive definite matrix. Although this implementation is based on the LAPACK routine dsygvx, we note that, unlike the original algorithm, our implementation in CnC is in fact asynchronous-parallel.

The basic algorithm implemented by dsygvx has four components. First, we compute the Cholesky factorization B → LL^T. Next, we reduce the real symmetric-definite generalized eigenproblem to so-called standard form, (L^{-1} · A · L^{-T}) z = λz. Methods exist for computing the eigenvalues directly from the generalized form. The third component is reduction of the symmetric matrix to symmetric tridiagonal form, using an orthogonal similarity transformation, T = Q' · A · Q. This step can be decomposed into a number of kernels, including matrix-vector multiplication, symmetric matrix-vector multiplication, and dot product, among others. Finally, we extract the eigenvalues from the tridiagonal matrix using a modified QR algorithm. This step is not compute intensive and may be computed by a single thread.
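For completeness, the reduction to standard form preserves the spectrum: substituting B = LL^T into Az = λBz and writing y = L^T z gives A L^{-T} y = λ L y, and multiplying on the left by L^{-1} yields (L^{-1} A L^{-T}) y = λ y. The standard symmetric problem therefore has exactly the eigenvalues of the generalized problem, with eigenvectors related by y = L^T z.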


// Item: Matrix L, tagged by <k, j, i>
// Suppose 'BlockedMatrix<T>' is a C++ class
// that encapsulates the matrix data,
// dimension 'n' and block size 'b'.
[BlockedMatrix<double>* L: int, int, int];

// Tag declarations
<C_tag : int>;            // <k>
<TS_tag: int, int>;       // <k, j>
<U_tag : int, int, int>;  // <k, j, i>

// Step Prescriptions
<C_tag> :: (ConvChol);
<TS_tag> :: (Trisolve);
<U_tag> :: (Update);

// Input from the environment
env -> [L], <C_tag>;

// Step executions
[L] -> (ConvChol);
(ConvChol) -> [L], <TS_tag>;

[L] -> (Trisolve);
(Trisolve) -> [L], <U_tag>;

[L] -> (Update);
(Update) -> [L];

// Output to environment
[L] -> env;

Figure 3. Top: CnC graphical notation of Cholesky factorization, in which the tag collections <k>, <k,j>, and <k,j,i> prescribe the ConvChol, Trisolve, and Update step collections, respectively. The red oval is the conventional Cholesky step; the green oval is the triangular solve step; and the grey oval is the symmetric rank-k update. Bottom: Textual notation of Cholesky factorization, with one statement for each relation in the graph.


Asynchronous-parallelism. Figure 5 is a directed acyclic graph (DAG) of the first two steps of the eigensolver for a matrix partitioned into a 3×3 grid of blocks. Nodes represent computation and edges represent dependencies among them. Each block is labeled by the appropriate submatrix block coordinates. Note that all nodes at any level of the DAG (highlighted by a grey oval) have no dependencies among themselves. Although the computation is initially sequential until the root node finishes execution, there is abundant parallelism thereafter. As is evident from the figure, different tasks, denoted by different colors, can execute concurrently.

In the eigensolver, we not only execute steps in Cholesky asynchronously, but also interleave them with steps of the reduction to standard form (e.g., right triangular solve, symmetric matrix multiplication). The CnC code easily expresses this concurrency, and the run-time exploits that concurrency naturally through its scheduling, as discussed in Section 4.

Although it is possible to extract parallelism using CnC from the third component of the eigensolver (reduction to tridiagonal form), the inherent dependencies inhibit asynchronous-parallel execution. We would need a different algorithmic approach altogether.

Parallelization. Once the dependencies between the steps are laid out, it is possible to extract parallelism more efficiently. To that end, we parallelize all sub-kernels within the first three phases. Unfortunately, one of the most compute-intensive kernels is the symmetric matrix-vector multiply (dsymv), which is not efficiently parallelized in the BLAS. Hence, for this kernel alone, we manually parallelize the computation in CnC. This trade-off is worthwhile, as we observe a dramatic increase in performance for a slight increase in programmer effort.
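The essential idea behind that manual parallelization is that y = Ax for symmetric A (stored as a lower triangle) can be split into independent row-block tasks with no write conflicts. The following is a small C++ sketch of that decomposition using std::thread; it is only an illustration of the partitioning, not the paper's CnC step code, and all names here are ours.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// y = A * x, where A is symmetric and only its lower triangle is stored
// (column-major, n-by-n). Rows [r0, r1) are computed independently.
static void symv_rows(const std::vector<double>& A, const std::vector<double>& x,
                      std::vector<double>& y, std::size_t n,
                      std::size_t r0, std::size_t r1) {
  for (std::size_t r = r0; r < r1; ++r) {
    double acc = 0.0;
    for (std::size_t c = 0; c < n; ++c) {
      // Read A(r,c) from the lower triangle: A(r,c) = A(c,r) when c > r.
      const double a = (c <= r) ? A[c * n + r] : A[r * n + c];
      acc += a * x[c];
    }
    y[r] = acc;
  }
}

void parallel_symv(const std::vector<double>& A, const std::vector<double>& x,
                   std::vector<double>& y, std::size_t n, std::size_t nthreads) {
  y.assign(n, 0.0);
  std::vector<std::thread> workers;
  const std::size_t chunk = (n + nthreads - 1) / nthreads;
  for (std::size_t t = 0; t < nthreads; ++t) {
    const std::size_t r0 = t * chunk, r1 = std::min(n, r0 + chunk);
    if (r0 >= r1) break;
    workers.emplace_back(symv_rows, std::cref(A), std::cref(x),
                         std::ref(y), n, r0, r1);
  }
  for (auto& w : workers) w.join();
}

This row partitioning reads only the lower triangle and writes disjoint ranges of y, so the tasks need no synchronization beyond the final join.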

4 Results and Discussion

In this section, we first evaluate our CnC-based Cholesky and symmetric generalized eigensolver implementations on the three state-of-the-art multicore platforms shown in Table 1. We then compare their performance to six other (double-precision) implementations and to execution-time bounds based on critical path length. For the non-CnC implementations, we make a "best effort" to do some tuning, and always use sequential MKL when possible. Finally, we examine CnC's scheduling compared to bulk-synchronous strategies.


Vendor                          AMD            Intel          Intel
Proc. Model                     Opteron 8350   Xeon E5405     Xeon X5560
Proc. Name                      Barcelona      Harpertown     Nehalem
Clock (GHz)                     2              2              2.8
# Sockets                       4              2              2
Cores (Threads)/Socket          4 (4)          4 (4)          4 (8)
L1 Data Cache                   64 KB/core     32 KB/core     32 KB/core
L2 Data Cache                   512 KB/core    6 MB/2 cores   256 KB/core
Shared L3 Cache                 2 MB/socket    –              8 MB/socket
DRAM Capacity                   32 GB          4 GB           12 GB
DRAM Bandwidth (GB/s)           21.3           21.3           51.2
DP Peak Performance (GFlop/s)   128            64             89.6

Table 1. Evaluation platforms for our experiments.
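As a sanity check on the last row, each double-precision peak in Table 1 is consistent with 4 flops per cycle per core (packed SSE multiply and add): Barcelona 4 sockets × 4 cores × 2.0 GHz × 4 = 128 GFlop/s, Harpertown 2 × 4 × 2.0 GHz × 4 = 64 GFlop/s, and Nehalem 2 × 4 × 2.8 GHz × 4 = 89.6 GFlop/s.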

StepReturnValue_t Trisolve(cholesky_graph_t& graph, const Tag_t& TS_Tag) {
  char uplo = 'l', side = 'r';
  char transa = 't', diag = 'n';
  double alpha = 1;

  const int k = (TS_Tag[0]);
  const int j = (TS_Tag[1]);

  // For each input item in this step,
  // retrieve the item using the proper tag

  // User code to create item tag here
  BlockedMatrix<double>* A_block =
      graph.L.Get(Tag_t(j, k, k));
  BlockedMatrix<double>* Li_block =
      graph.L.Get(Tag_t(k, k, k+1));

  // Get block size
  int b = A_block->getBlockSize();

  // Step implementation logic goes here
  dtrsm(&side, &uplo, &transa, &diag, &b, &b,
        &alpha, Li_block, &b, A_block, &b);

  // For each output item for this step,
  // put the new item using the proper tag

  // User code to create item tag here
  graph.L.Put(Tag_t(j, k, k+1), A_block);

  return CNC_Success;
}

Figure 4. CnC code for the triangular solve step of the Cholesky algorithm. The black and gray text in this code snippet denote the stubs that are generated automatically using the inputs and outputs defined in the graph. The code fragments filled in by the user are indicated in bold (blue text). Note: we call tuned BLAS for the sequential step implementation (dtrsm in this example).

Figure 5. DAG representation of the eigensolver: the first two components (Cholesky factorization and reduction to standard form) for a matrix partitioned into a 3×3 grid of blocks. Node types include unblocked Cholesky factorization, triangular solve (Cholesky), symmetric rank-k update, unblocked reduction to standard form, right triangular solve, symmetric matrix multiplication, and symmetric rank-2k update.


We compare the following implementations.

Baseline – Sequential MKL: The Intel Math Kernel Library (MKL) implementation of Cholesky factorization, dpotrf, run in sequential mode with the input matrix in column-major storage. This baseline is highly tuned by Intel. We also measure multithreaded MKL (see below). [In the plots, we use the sequential MKL implementation as the baseline and show it with hollow circles.]

Blocked iterative OpenMP + sequential MKL: We implemented the tiled Cholesky Algorithm 1, where we (1) distribute the loop in line 3 to get two 'j' loop nests, i.e., one for the triangular solve and one for the symmetric rank-k update; and then (2) use OpenMP to parallelize the 'j' loops. We then use the highly-tuned sequential MKL for each block operation. We tune the block size by exhaustive search for each input size, and report the best performance. [plus signs]

Cilk++ 1.0.3 block recursive + sequential MKL: We implemented a blocked recursive version in Cilk++. In particular, the entire algorithm, including the triangular solve and rank-k update steps, is performed recursively [2], so as to be able to easily use the Cilk++ thread spawn keyword. The recursive form of each step partitions the matrix into roughly half in each dimension. We stop recursion at a tunable block size, determined by exhaustive search for each input matrix dimension, and report the best performance. We use sequential MKL for the leaf kernels. [crosses]

Multithreaded MKL: We use the multithreaded MKL implementation of Cholesky factorization, dpotrf. We report performance on the number of cores that delivers the highest performance, up to the maximum available cores. The input matrix is in column-major storage. [hollow diamonds]

ScaLAPACK + shared memory MPI: We use ScaLAPACK 1.8.0 with an MPI "tuned" for shared memory. In particular, we use MPICH2 1.0.8 compiled with the Nemesis device. We tune the processor grid, trying all valid configurations for a given number of MPI tasks and all numbers of tasks, and report the best performance. [up-pointing triangles]

PLASMA + sequential MKL: We use the Cholesky implementation that is part of the freely available PLASMA 2.0.0 package. There is a block size parameter, which we tune for each problem size. Since PLASMA currently does not solve eigenvalue problems, we compare only against our Cholesky. [down-pointing triangles]

CnC + sequential MKL: The CnC implementation of Cholesky using sequential MKL for the steps. The data is stored in a blocked data layout [2, 5, 7, 10]. The block size used in the layout is determined by an exhaustive search over all possible values for a given input matrix size, and the block size that achieves the highest performance is chosen. [filled squares]

DGEMM Peak: The peak performance (GFlop/s) of double-precision dense matrix multiplication, measured using all the cores on the system. Rather than benchmarking all sizes, we show a representative GFlop/s number for n = 10,000, which gave the best DGEMM performance on all three platforms among all values of n that we considered for Cholesky and the eigensolver. [dashed lines]

Theoretical Peak: The theoretical machine-specific upper bound on achievable double-precision GFlop/s. [solid lines]

Compilers: For our CnC Cholesky factorization, eigensolver, OpenMP, and ScaLAPACK implementations, we use the Intel v11.0 and v10.1 compilers on the Intel and AMD platforms, respectively. We use MKL v10.0.3.020 on Barcelona, v10.2 on Nehalem, and v10.1.0.019 on Harpertown. For our Cilk++ implementation, we use the gcc 4.2 compiler that ships with it.

4.1 Cholesky factorization

Figure 6 presents the performance scalability by architecture as a function of matrix size for double-precision Cholesky factorization. The baseline performance results correspond to sequential MKL values. On all machines, we use the number of cores that delivers the highest performance for each matrix. We use the theoretical flop count of n³/3 when reporting performance.
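For example, at the largest size reported below, n = 10,000, this convention corresponds to 10,000³/3 ≈ 3.3 × 10¹¹ floating-point operations (≈ 333 Gflop), so the reported GFlop/s is simply that count divided by the measured running time in seconds (×10⁹).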

On Nehalem, our asynchronous-parallel CnC Cholesky compares favorably to PLASMA and MKL. The sequential MKL baseline runs at over 10 GFlop/s. By contrast, OpenMP with sequential MKL is 2.8× faster than the baseline; recursive Cilk++ with sequential MKL and ScaLAPACK using shared memory MPI provide only an additional 10% over that. We observe that our fully asynchronous-parallel CnC implementation using sequential MKL delivers very good scalability (a speedup of nearly 7.3× in comparison to the baseline) up to 8 threads. Beyond this, HyperThreading yields no added benefit for any of the three competing implementations (MKL, PLASMA, and CnC). Nevertheless, CnC Cholesky on Nehalem achieves more than 85% of the theoretical peak performance for the largest matrix size (n = 10,000), where DGEMM is at 92%.

We make similar observations on Harpertown, whose core architecture is similar to Nehalem's. Unlike Nehalem, there is no simultaneous multithreading (SMT) on Harpertown. Once again, our CnC implementation achieves near-perfect scaling, a speedup of 7.5× on 8 cores, competing well with the MKL and PLASMA implementations. Moreover, we achieve more than 80% of the theoretical peak performance.

The data on Barcelona also follow similar trends except that, interestingly, Cholesky factorization achieves only half the theoretical peak performance. Nevertheless, our CnC implementation delivers performance on par with the state-of-the-art PLASMA and exceeds multithreaded MKL for large problem sizes. Barcelona performance also shows good scalability, nearly 11× on 16 cores. (Note: we did check AMD's BLAS, which was slower than MKL.)

In summary, these results show the potential of CnC to exploit the available parallelism, achieving competitive performance with reasonable programming effort.³


Figure 6. Performance summary of double-precision Cholesky factorization: performance in GFlop/s (left y-axis) and percentage of theoretical peak (right y-axis) as a function of matrix size, comparing the seven implementations discussed (Baseline, ScaLAPACK+MPICH2/nemesis, OpenMP+MKL(seq), Cilk++ rec+MKL(seq), Multithreaded MKL, PLASMA+MKL(seq), CnC+MKL(seq)), with DGEMM peak and theoretical peak marked. Panels: (a) 16-core (4x4) AMD Barcelona; (b) 8-core (2x4) Intel Nehalem; (c) 8-core (2x4) Intel Harpertown.

Figure 7. Performance summary of the double-precision eigensolver: performance (GFlop/s) as a function of matrix size for the three implementations discussed (Baseline, Multithreaded MKL, CnC+MKL(seq)). The flop count used is measured using PAPI performance counters. Panels: (a) 16-core (4x4) AMD Barcelona; (b) 8-core (2x4) Intel Nehalem; (c) 8-core (2x4) Intel Harpertown.



4.2 Eigensolver performance

Figure 7 compares the double-precision eigensolver performance on our three machines. For all platforms, we compare: (i) the baseline sequential MKL implementation; (ii) the multithreaded MKL implementation; and (iii) our CnC code. Though all three codes can compute both eigenvalues and eigenvectors, we compute just the eigenvalues, since it is generally recognized that dsygvx is best suited to that case.

We observe that CnC delivers significantly higher performance than multithreaded MKL on all three systems, with best-case speedups of 1.9×, 2.6×, and 1.5× for Nehalem, Barcelona, and Harpertown, respectively.

Three factors contribute to this improvement. First, the critical path of the asynchronous-parallel CnC implementation is shorter than that of multithreaded MKL, due to a reduced number of synchronizations. Secondly, the symmetric matrix-vector multiply kernel is not parallelized in MKL, as we confirmed by testing. Finally, on Barcelona and Nehalem, NUMA effects likely play an additional role as well, causing the MKL eigensolver implementation not to scale beyond 1 socket. Hence we compare against the 4-thread and 8-thread runs only on Barcelona and Nehalem, respectively, which were the best results we could obtain when searching over all numbers of threads up to 16 (to match the 16 cores on Barcelona and the 8 hyperthreaded cores on Nehalem).

Unlike MKL, in our CnC implementation we manually parallelized the symmetric matrix-vector multiply routine, thereby enabling the computation to scale up to the maximum number of cores/threads.

Interestingly, we do not see scalability issues on Harpertown: the MKL eigensolver scales up to 8 threads even though the symmetric matrix-vector multiply is sequential. Our partly asynchronous-parallel CnC implementation is still up to 1.5× faster than multithreaded MKL. Even though the eigensolver implementation is non-trivial and much more complex than Cholesky factorization, our CnC implementation achieved a high level of performance, showing that at least the basic model and run-time system have good potential. We refer interested readers to [8], which contains additional details on the comparison of CnC against other asynchronous approaches and detailed scalability results, which we omit in this paper due to space constraints.

³Although we do not use the most recent version of MKL on Barcelona and Harpertown, we believe the comparisons made are fair in that we use the same MKL for all implementations on a given platform.

4.3 Scheduling

The current run-time system is characterized by a static grain size, a dynamic schedule, and dynamic distribution. It is built on top of TBB [1], which controls the scheduling and execution of steps in a CnC program. TBB implements a Cilk-like work-stealing scheduler that supports fine-grained task parallelism [3].

Tag generation. In CnC, we have the option either to pre-generate tags or to generate them on-the-fly. Figure 8 depicts an execution timeline for the Cholesky factorization of a matrix of size 1000, for two approaches to tag generation. Figure 8(a) shows the approach in which we pre-generate all tags. Owing to the run-time's LIFO queuing (Section 2), the last tag generated will be scheduled first; for Cholesky, first-in-first-out (FIFO) order is preferable. We typically want the first tag value in tag space <k> to be scheduled first, since all other steps are stalled until this step finishes execution. One solution is to generate tags along with the data on-the-fly, as shown in Figure 8(b).

We lay out the execution profile of each thread along the y-axis (one "row" per thread); the x-axis shows execution time, normalized in both charts to the time taken by the longest-executing thread. The different color-coded regions represent the different step instances.

We make a number of observations. First, the dynamic tag approach takes only 75% of the time taken by the pre-generated tags approach. When pre-generating tags, 30.6% of the overall execution time is spent on requeue events for the matrix size 1000 (sum of green regions). The computation is also not load balanced, with thread 8 completing much earlier than the other threads. When we explicitly generate tags only when data becomes available, we observe a marked decrease in the number of requeue events, thereby yielding the 25% improvement in time. Moreover, the computation is better load balanced.

By reducing requeue events, we increase the number of Get and Put operations, but the overhead due to these operations is much less than the requeue delays, as shown by the decrease in the overall running time.

However, we also observed that for larger n, requeue delays were actually not that significant. In particular, recall from Figure 8 that the time spent on the factorization of the first block B11 is about 20% of the entire execution time; hence, all other threads waiting for the input from the first step are requeued. When n = 6000, less than 1% of the time was spent on factoring the first block, and so the time spent on requeue events was only 0.56% of the overall execution time. Thus, the on-the-fly approach does not pay off in all instances and, in fact, becomes a "tunable" parameter.


Figure 8. Scheduling timelines for Cholesky factorization (matrix size = 1000), one row per thread and the x-axis showing normalized execution time, with regions for unblocked Cholesky, triangular solve, symmetric rank-k update, idle, and requeue events: (a) tags pre-generated (requeue time = 30.6%); (b) tags generated when data becomes available (with the lower bound on execution time marked).

Figure 9. Scheduling timeline for Cholesky factorization using Cilk++ (one row per thread, x-axis showing normalized execution time, with the lower bound on execution time marked).

Comparison to bulk-synchronous approaches. It is well-established that asynchronous scheduling can eliminate the idle times present in bulk-synchronous approaches. For example, Figure 9 shows the scheduling of Cholesky factorization in our Cilk++-based recursive implementation. There is a synchronization stage at the end of every task. Thus, the idle time increases because fast threads are waiting for the slower ones to complete execution. This behavior is a consequence of our choice of a recursive implementation, which is encouraged by the Cilk++ model, as well as the model's nested DAG parallelism.

In the CnC implementation, there is no synchronization of tasks, and execution driven by tags imposes only one condition: preserving the data dependences between steps. This eliminates idle time, as is evident when comparing Figures 8(b) and 9. Also, these figures show a vertical dotted line, which is the estimated lower bound on execution time. Given a weighted DAG (with node weights measured as the time spent in the corresponding step and edge weights set to zero), the lower bound is computed by finding the longest path from start to end, i.e., the sequential steps along the critical path. We observe that CnC performs extremely well, within 10% of its lower bound; by contrast, the bulk-synchronous code performs well below its potential.
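The lower bound itself is a standard longest-path computation over the weighted task DAG. The following small C++ sketch (our own illustration, not the paper's tooling) computes it by dynamic programming over a topological order, assuming node weights are the measured per-step times and edges carry zero weight.

#include <algorithm>
#include <cstddef>
#include <queue>
#include <vector>

// Critical-path length of a DAG: nodes 0..n-1, weight[v] = time spent in step v,
// edges[u] = successors of u (dependences u -> v), edge weights zero.
double critical_path(const std::vector<double>& weight,
                     const std::vector<std::vector<std::size_t>>& edges) {
  const std::size_t n = weight.size();
  if (n == 0) return 0.0;

  std::vector<std::size_t> indeg(n, 0);
  for (std::size_t u = 0; u < n; ++u)
    for (std::size_t v : edges[u]) ++indeg[v];

  // dist[v] = longest weighted path ending at v (including v's own weight).
  std::vector<double> dist(weight);
  std::queue<std::size_t> ready;
  for (std::size_t v = 0; v < n; ++v)
    if (indeg[v] == 0) ready.push(v);

  while (!ready.empty()) {
    const std::size_t u = ready.front();
    ready.pop();
    for (std::size_t v : edges[u]) {
      dist[v] = std::max(dist[v], dist[u] + weight[v]);
      if (--indeg[v] == 0) ready.push(v);
    }
  }
  return *std::max_element(dist.begin(), dist.end());
}

This critical-path time is what the vertical dotted lines in Figures 8(b) and 9 mark.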

Figure 10 shows the scheduling of the eigensolver on Harpertown. The figure shows only the asynchronous-parallel portion of the eigensolver, which comprises the Cholesky and reduction-to-standard-form components. Figure 10(a) shows how the scheduling unfolds when all the tags have been generated before the start of computation. In this part of the computation, more than 80% of the execution time is spent on requeue events.


This behavior is due to the LIFO scheduling, in which tags are scheduled last-in-first-out. Since the number of tag spaces, and the number of tags in each tag space, are much larger than for Cholesky factorization, the amount of requeuing is significantly higher at the start. However, the requeue events over the entire computation account for only 33.3% of the execution time. Figure 10(b) shows an 85% reduction in the overall execution time of the zoomed portion when tags are generated only as their input becomes available.

Figure 10. Scheduling timelines for the eigensolver on Harpertown (matrix size = 1000), showing the asynchronous-parallel portion only: (a) pre-generated tags result in 33.3% of time spent on requeue events; (b) tags generated on-the-fly.

In short, we can achieve high performance in CnC, but there is still scope for additional improvements and tuning in the run-time system with respect to the scheduling of steps, locality, and data movement.

5 Related Work

Existing work on asynchronous-parallel algorithms for dense linear algebra covers Cholesky, LU, and QR factorization, as well as the so-called "two-sided" transformations: Hessenberg, tridiagonal, and bidiagonal reduction [5, 7, 14]. The implementations are based on some combination of schedulers and APIs based on domain-specific abstraction (e.g., SuperMatrix [7]), or hand-coded or pragma-directed schemes (e.g., SMPSs [14, 15]). The present study contributes experience working in a novel model, as well as a novel asynchronous-parallel implementation of a different algorithm (the eigensolver).

The CnC programming model itself has rich influences from the long history of concurrent programming models, including tuple-spaces, streaming languages, and dataflow languages [6, 9, 16]. CnC's key distinction is its treatment of both control and dataflow, thereby making CnC more general than pure streaming or classical dataflow approaches. Also, item collections in CnC allow for more general indexing than dataflow arrays. While both CnC and tuple-space languages like Linda specify computation using tags, they differ in a number of aspects: in CnC there is a clear separation between tags and values, whereas there is no distinction between the two in Linda; moreover, CnC items are accessed by value rather than by location, and adhere to dynamic single-assignment form, as noted by Budimlic, et al. [4].

6 Conclusions and Future Work

This study constitutes the first performance evaluation of the CnC model, with compelling results on a challenging pair of computations from parallel dense linear algebra. CnC complements existing approaches for expressing and scheduling asynchronous-parallel computations by providing novel abstractions that enable a variety of control flow and dataflow constructs to be expressed in a way that enables effective parallelization. For our target computations, we can both (a) match or exceed a highly-tuned vendor library for Cholesky and (b) extend these results to significant speedups (1.1–2.6×) on a complicated eigensolver. Indeed, the CnC model enabled our novel asynchronous-parallel eigensolver implementation.

Our experience reveals ways in which to improve CnC further. First, additional work queue scheduling policies (besides LIFO) are needed. Secondly, we can avoid run-time inefficiencies by exploiting additional dependence information available in the specification itself (the textual notation). In our case, when the same tag collection prescribes multiple dependent step collections, we can reduce requeuing by not scheduling those collections. Thirdly, there are a number of ways in which tag management could be tuned, perhaps automatically. Finally, there are ways to enhance the textual notation and API; we are currently looking at adding new abstractions as well as new syntax for easily composing CnC components.

Acknowledgments

This work was supported in part by the National Science Foundation (NSF) under award number 0833136, and by grants from the Defense Advanced Research Projects Agency (DARPA) and Intel Corporation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of NSF, DARPA, or Intel. We also wish to thank Geoff Lowney and Frank Schlimbach of Intel Corporation for helping us collect the runtime traces for the scheduling timelines.

References

[1] Intel® Threading Building Blocks, 2009. www.threadingbuildingblocks.org.
[2] Bjarne S. Andersen, Fred Gustavson, Alexander Karaivanov, Minka Marinova, Jerzy Wasniewski, and Plamen Yalamov. LAWRA: Linear algebra with recursive algorithms. In Appl. Par. Comput., volume LNCS 1947, pages 38–51. Springer, 2001.
[3] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proc. PPoPP, pages 207–216, 1995.
[4] Zoran Budimlic, Aparna Chandramowlishwaran, Kathleen Knobe, Geoff Lowney, Vivek Sarkar, and Leo Treggiari. Multi-core implementations of the Concurrent Collections programming model. In Proc. Wkshp. Compilers for Par. Comput. (CPC), 2009.
[5] Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. A class of parallel tiled linear algebra algorithms for multicore architectures. Technical Report UT-CS-07-600, Univ. of Tenn. Knoxville, Sep. 2007.
[6] Nicholas Carriero and David Gelernter. Linda in context. Comm. ACM, 32(4):444–458, 1989.
[7] Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In Proc. SPAA, pages 116–125, 2007.
[8] Aparna Chandramowlishwaran, Kathleen Knobe, and Richard Vuduc. Performance evaluation of Concurrent Collections on high-performance multicore computing systems. Technical Report GT-CSE-10-01, Georgia Institute of Technology, Atlanta, GA, USA, February 2010.




[9] Jack B. Dennis. First version of a data flow procedure language. In Programming Symp.: Proc., volume LNCS 19, pages 362–376. Springer, 1974.
[10] Fred G. Gustavson. New generalized data structures for matrices lead to a variety of high performance dense linear algebra algorithms. In Appl. Par. Comput., volume LNCS 3732, pages 11–20, 2006.
[11] Intel® Concurrent Collections for C/C++: User's Guide, v0.3, 2009.
[12] Kathleen Knobe. Ease of use with Concurrent Collections (CnC). In Proc. USENIX HotPar, 2009.
[13] Kathleen Knobe and Carl D. Offner. TStreams: A model of parallel computation. Technical Report HPL-2004-78R1, HP Labs, 2004.
[14] Hatem Ltaief, Jakub Kurzak, and Jack Dongarra. Scheduling two-sided transformations using algorithms-by-tiles on multicore architectures. Technical Report UT-CS-09-637, Univ. of Tenn. Knoxville, 2009.
[15] Josep M. Perez, Rosa M. Badia, and Jesus Labarta. A dependency-aware task-based programming environment for multicore architectures. In Proc. IEEE CLUSTER, pages 142–151, 2008.
[16] William Thies, Michal Karczmarek, and Saman Amarasinghe. StreamIt: A language for streaming applications. In Proc. Int'l. Conf. Compiler Construction (CC), volume LNCS 2304, pages 49–84. Springer, 2002.
