
Submitted 10 September 2019
Accepted 20 November 2019
Published 13 January 2020

Corresponding author
Bérenger Bramas, [email protected]

Academic editor
Gang Mei

Additional Information and Declarations can be found on page 22

DOI 10.7717/peerj-cs.247

Copyright
2020 Bramas and Ketterlin

Distributed under
Creative Commons CC-BY 4.0

OPEN ACCESS

Improving parallel executions by increasing task granularity in task-based runtime systems using acyclic DAG clustering

Bérenger Bramas 1,2 and Alain Ketterlin 1,2,3

1 CAMUS, Inria Nancy - Grand Est, Nancy, France
2 ICPS Team, ICube, Illkirch-Graffenstaden, France
3 Université de Strasbourg, Strasbourg, France

ABSTRACT
The task-based approach is a parallelization paradigm in which an algorithm is transformed into a directed acyclic graph of tasks: the vertices are computational elements extracted from the original algorithm and the edges are dependencies between them. During the execution, the management of the dependencies adds an overhead that can become significant when the computational cost of the tasks is low. A possibility to reduce the makespan is to aggregate the tasks to make them heavier, while having fewer of them, with the objective of mitigating the importance of the overhead. In this paper, we study an existing clustering/partitioning strategy to speed up the parallel execution of a task-based application. We provide two additional heuristics for this algorithm and perform an in-depth study on a large graph set. In addition, we propose a new model to estimate the execution duration and use it to choose the proper granularity. We show that this strategy allows speeding up a real numerical application by a factor of 7 on a multi-core system.

Subjects Algorithms and Analysis of Algorithms, Distributed and Parallel Computing, Scientific Computing and Simulation
Keywords Task-based, Graph, DAG, Clustering, Partitioning

INTRODUCTION
The task-based (TB) approach has become a popular method to parallelize scientific applications in the high-performance computing (HPC) community. Compared to classical approaches, such as the fork-join and spawn-sync paradigms, it offers several advantages: it allows describing the intrinsic parallelism of any algorithm and running parallel executions without global synchronizations. Behind the scenes, most of the runtime systems that manage the tasks use a directed acyclic graph where the nodes represent the tasks and the edges represent the dependencies. In this model, a task becomes ready when all its predecessors in the graph are completed, which requires a local synchronization mechanism inside the runtime system to manage the dependencies. There are now many task-based runtime systems (Danalis et al., 2014; Perez, Badia & Labarta, 2008; Gautier et al., 2013; Bauer et al., 2012; Tillenius, 2015; Augonnet et al., 2011; Bramas, 2019b) and each of them has its own specificity, capabilities and interface. Moreover, the well-known and


widely used OpenMP standard (OpenMP Architecture Review Board, 2013) has also supported the tasks-and-dependencies paradigm since version 4. The advantage of the method to achieve high performance and to facilitate the use of heterogeneous computing nodes has been demonstrated by the development of many applications in various fields (Sukkari et al., 2018; Moustafa et al., 2018; Carpaye, Roman & Brenner, 2018; Agullo et al., 2016; Agullo et al., 2017; Agullo et al., 2015; Myllykoski & Mikkelsen, 2019).
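As a hedged illustration of the paradigm, the following minimal OpenMP sketch builds a small DAG through depend clauses; the runtime derives the dependencies from the accessed data (produce and consume are placeholder kernels of our own choosing, not from the cited works):

#include <stddef.h>

void produce(double *x);   /* placeholder kernels */
void consume(double x);

void example(double *a, size_t n) {
    #pragma omp parallel
    #pragma omp single
    {
        for (size_t i = 0; i < n; i++) {
            /* Each task writes a[i]: the runtime records an "out" dependency. */
            #pragma omp task depend(out: a[i])
            produce(&a[i]);
        }
        /* This task reads a[0]: it becomes ready, and can be picked by a
           worker, only once the producer of a[0] has completed. */
        #pragma omp task depend(in: a[0])
        consume(a[0]);
    }
}

Every task creation and dependency release goes through the runtime system, and this is precisely the overhead that clustering aims to amortize.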

However, multiple challenges remain open to bring the task-based approach to non-HPC experts and to support performance portability. In our opinion, the two main problems on a single computing node concern scheduling and granularity. Scheduling is the distribution of the tasks over the processing units, i.e., the selection of a task among the ready ones and the choice of a processing unit. This is a difficult problem, especially when using heterogeneous computing nodes, as it cannot be solved optimally in general. Much research is continuously conducted by the HPC and scheduling communities to provide better generic schedulers (Bramas, 2019a). The granularity issue is related to the size of the tasks. When the granularity is too small, the overhead of task management, and the potential data movements, becomes dominant and can dramatically increase the execution time due to the use of synchronizations (Tagliavini, Cesarini & Marongiu, 2018). On the other hand, when the granularity is too large, it reduces the degree of parallelism and leaves some processing units idle. Managing the granularity can be done at different levels. In some cases, it is possible to let the developer adapt the original algorithms, computational kernels, and data structures, but this could require significant programming effort and expertise (Bramas, 2016). An alternative, which we study in this paper, is to investigate how to cluster the nodes of task graphs to increase the granularity of the tasks and thus obtain faster executions by mitigating the overhead from the management of the dependencies. An important asset of this approach is that working at the graph level allows creating a generic method independent from the application and what is done at the user level, but also independent of the task-based runtime system that will be used underneath.

While graph partitioning/clustering is a well-studied problem, it is important to note that the obtained meta-DAG (directed acyclic graph) must remain acyclic, i.e., the dependencies between the clusters of nodes must allow the result to be executed as a graph of tasks, while keeping a large degree of parallelism. Hence, the usual graph partitioning methods do not work because they do not take into account the direction of the edges (Hendrickson & Kolda, 2000). Moreover, the DAGs of tasks we target can have several million nodes, and we need an algorithm capable of processing them.

In the current study, we use a generic algorithm that has been proposed to solve this problem (Rossignon et al., 2013), and we refer to it as the general DAG clustering algorithm (GDCA).

The contributions of the paper are the following:

• We provide two variants of the GDCA, which change how the nodes are aggregated and allow clusters smaller than the targeted size;



• We provide a new model to simulate the execution of a DAG, by considering that there are overheads in the execution of each task, but also while releasing or picking a task, and we use this model to find the best clustering size;
• We evaluate and compare GDCA and our approach on a large graph set using emulated executions;
• We evaluate and compare GDCA and our approach on Chukrut (Conservative Hyperbolic Upwind Kinetic Resolution of Unstructured Tokamaks) (Coulette et al., 2019), which computes the transport equation on a 2D unstructured mesh.

The paper is organized as follows. In 'Related Work', we summarize the related work and explain why most existing algorithms do not solve the DAG of tasks clustering problem. Then, in 'Problem Statement and Notations', we describe in detail the DAG of tasks clustering problem. We introduce the modifications of the GDCA in 'DAG of Tasks Clustering'. Finally, we evaluate our approach in 'Experimental Study'.

The source code of the presented method and all the material needed to reproduce the results of this paper are publicly available.¹

¹ An implementation of our method is publicly available at https://gitlab.inria.fr/bramas/dagpar. Besides, we provide the test cases used in this paper at https://figshare.com/projects/Dagpar/71198.

RELATED WORK
Partitioning or clustering usually refers to dividing a graph into subsets so that the sum of the costs of the edges between nodes in different subsets is minimal. However, our objective here is not related to the costs of the edges, which we consider null, but to the execution time of the resulting graph in parallel, considering a given number of threads and the runtime overhead. Hence, while it is generally implicit that partitioning/clustering is related to the edge cut, we emphasize that it should be seen as a graph symbolic transformation and that the measure of quality and the final objective differ depending on the problem to solve.

Partitioning tends to refer to finding a given number of subgraphs, which is usually much lower than the number of nodes. In fact, once a graph is partitioned, it is usually dispatched over different processes and thus there must be as many subgraphs as there are processes, whereas clustering is more about finding subgraphs of a given approximate size or bounded by a given size limit, where nodes are grouped together if it appears that they have a certain affinity. This is a reason why the term clustering is also used to describe algorithms that cluster undirected graphs by finding hot spots (Schaeffer, 2007; Xu & Tian, 2015; Shun et al., 2016). Moreover, the size of a cluster is expected to be much lower than the number of nodes.

The granularity problem of the DAG of tasks with a focus on parallel execution has been studied previously. Sarkar and Hennessy designed a method to execute functional programs at a coarse granularity because working at fine granularity, i.e., at the instruction level, was inefficient on general-purpose multiprocessors (Sarkar & Hennessy, 1986; Sarkar, 1989). They proposed a compile-time clustering approach to achieve the trade-off between parallelism and the overhead of exploiting parallelism, and worked on graphs obtained directly from the source code. As we do in the current paper, they focused on the performance, i.e., the best execution time, as a measure of the quality of the clustering, and their estimation of execution times was based on the number of processors, communication



and scheduling overhead. However, their clustering algorithm is different from ours. It starts by considering each node as a cluster and successively merges them until it obtains a single subgraph, while keeping track of the best configuration found so far to be used at the end. By doing so, their algorithm has above-quadratic complexity and thus is unusable for processing very large graphs. Also, in our case, we do not take communication into account, and we consider that some parts of the scheduling overhead are blocking: no thread can pick a task while another thread is already interacting with the scheduler.

More recently, Rossignon et al. (2013) proposed GDCA to manage DAGs of fine-grain tasks on multi-core architectures. Their first solution is composed of three main algorithms called sequential, front and depth front. The sequential algorithm puts together a task that has only one predecessor with its parent. The front algorithm reduces the width of the graph at each level. The depth front performs a breadth-first traversal of the descendants of a task to aggregate up to a given number of tasks together. They extended this last algorithm (Rossignon, 2015) by proposing a generic method, GDCA, that we use in the current study.² The authors also provided a new execution model to simulate the execution of a DAG where they use an overhead per task and a benefit coefficient for aggregation.

² The thesis that includes this final version is written in French.

In a more general context, the community has focused on undirected graph partitioning. A classical approach, called two-way partitioning, consists in splitting a graph into two blocks of roughly equal size while minimizing the edge cut between the two blocks (Kernighan & Lin, 1970; Fiduccia & Mattheyses, 1982). The method can be applied recursively multiple times until the desired number of subgraphs is reached. Later, multi-way methods were proposed (Hendrickson & Leland, 1995; Karypis & Kumar, 1998; Karypis et al., 1999), and most of them were developed in the context of very-large-scale integration (VLSI) in integrated circuits. The motivation is to partition large VLSI networks into smaller blocks of roughly equal size to minimize the interconnections between the blocks. Multi-way partitioning has been improved by taking into account the direction of the edges in the context of Boolean networks (Cong, Li & Bagrodia, 1994). The authors showed that considering the direction of the edges is very helpful, if not mandatory, in the design in order to obtain an acyclic partitioning.

The problem of acyclic DAG partitioning has also been studied by solving the edge-cut problem, i.e., by minimizing the number or the weight of the edges having endpoints in different parts, and not by focusing on the execution time (Herrmann et al., 2017). We argue that the execution time is the only criterion that should be evaluated and that measuring the edge-cut coefficient is not accurate for estimating the benefit of the clustering.

Other studies have focused on partitioning with special constraints, such as finding a minimum-cost partition of the graph into subsets of size less than or equal to a given bound (Kernighan, 1971), which can be seen as dividing a program into pages of fixed size to minimize the frequency of inter-page transitions. The problem also exists with FPGAs, where a complete program does not fit in the field and thus should be divided into sub-parts with the objective of minimizing the re-programming (Purna & Bhatia, 1999). In linear algebra, the LU factorization can be represented as a tree graph that can be partitioned in linear time (Pothen & Alvarado, 1992).


The real application we used to assess our method solves the transport equation on unstructured meshes. Task-based implementations to solve the transport equation on a grid (i.e., a structured and regular mesh) have already been proposed by Moustafa et al. (2018). The authors have created a version on top of the ParSEC runtime system where they partitioned the mesh and avoided working on the graph of tasks. Working on the mesh is another way to partition the graph, but this was possible in their case because the dependencies on the mesh were known and regular. The dependencies were not impacted by the clustering because an inter-partition dependency would simply be transformed into the exchange of a message. In other words, a process could work on its sub-graph even if some of its nodes are pending for input that will be sent by other processes. This is quite different from our approach, as we consider that a sub-graph is transformed into a macro-task, and hence all input dependencies must be satisfied before a thread starts to work on it.

PROBLEM STATEMENT AND NOTATIONS
Consider a DAG G(V, E) where the vertices V are tasks and the edges E are the dependencies between them. The clustering problem of a DAG of tasks consists in finding the best clusters to obtain the minimal makespan possible when the DAG is executed on a specific hardware or execution model. Implicitly, the hardware or execution model should have some overheads, which could come from the management of the tasks for example, or the minimal execution time will always be obtained without clustering: on unrealistic hardware without overhead, any clustering of tasks reduces the degree of parallelism without offering any advantage. Finding the optimal solution is NP-complete (Johnson & Garey, 1979) because it requires testing all the possible combinations of clusters. Moreover, evaluating a solution is performed by emulating a parallel execution, which has a complexity of O(V log(W) + E), where W is the number of workers and is usually considered constant.

In this paper, we solve a sub-problem that we call the clustering problem of a DAG of tasks with no communications, since we consider that the weights of the edges are insignificant and the edges are only here to represent the dependencies. This problem is met when moving data has no cost or is negligible, which is the case if the workers are threads and the NUMA effects are negligible, or if we use processes but have a way to hide communication with computation.

Classical partitioning algorithms for undirected graphs cannot be used because they will not produce an acyclic macro-graph. Formally, a macro-graph remains acyclic if for any edge a → b the corresponding clusters satisfy C(a) ≤ C(b) (note that C(a) = C(b) means that a and b are in the same cluster). This is also known as the convexity constraint (Sarkar & Hennessy, 1986), where we say that a subgraph H of a graph G is convex if any path P(a,b), with a, b ∈ H, is completely contained in H. Consequently, one way to solve the problem would be to find a valid topological order of the nodes and divide it into clusters.
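As a minimal sketch of this baseline (assuming the nodes are already available in some valid topological order; the names are our own), slicing the order into chunks of M nodes yields a convex clustering by construction, since every edge a → b satisfies position(a) < position(b) and therefore C(a) ≤ C(b):

/* Cluster by slicing a topological order into chunks of at most M nodes.
 * topo_order[i] is the id of the i-th node in a valid topological order;
 * on return, clusters[u] is the cluster id of node u. */
void cluster_by_topological_order(int n_nodes, const int *topo_order,
                                  int M, int *clusters) {
    for (int i = 0; i < n_nodes; i++)
        clusters[topo_order[i]] = i / M;   /* chunk index = cluster id */
}

This baseline ignores the shape of the graph, however, and can needlessly serialize independent branches; the heuristics studied below aim at choosing better groups.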

We write M for the desired size of the clusters, which should be seen as an upper limit such that no cluster will have more than M nodes.
Parallel efficiency and edge cut. There is a relation between the edge cut and the parallel execution when the edges represent communication costs between cluster owners.


Figure 1 Example of clustering a DAG of 7 nodes targeting clusters of M = 2 nodes. If each edge has a weight of 1, the cut cost is 7 for (A) and 8 for (B). If each node is a task of cost 1 and edges are not taken into account, the parallel execution duration is 7 units for (A) and 5 units for (B). If each node is a task of cost 1 and edges are considered as communication of 1 unit sent sequentially after completion of the cluster, the parallel execution duration is 11 units for (A) and 9 units for (B).

Full-size DOI: 10.7717/peerjcs.247/fig-1


Figure 2 Example of clustering a DAG of 7 nodes targeting clusters of M = 2 nodes. If each edge has a weight of 1, the cut cost is 4 for (A) and 5 for (B). If each node is a task of cost 1 and edges are not taken into account, the parallel execution duration is 7 units for (A) and 5 units for (B). If each node is a task of cost 1 and edges are considered as communication of 1 unit sent sequentially after completion of the cluster, the parallel execution duration is 10 units for (A) and 6 units for (B).

Full-size DOI: 10.7717/peerjcs.247/fig-2

However, looking only at the edge cut is not relevant when the final and only objective is parallel efficiency. Moreover, this is even truer in our case because the weights of the edges are neglected. To illustrate the differences, we provide in Figs. 1 and 2 examples that demonstrate that, when attempting to obtain a faster execution, the edge cut is not the most important feature. In both examples, the configuration with the lowest edge cut is slower when executed in parallel, whether communications are taken into account or not.
Clustering and delay in releasing dependencies. Traditionally, graph partitioning is used to distribute the workload on different processes while trying to minimize communications. In our case, however, a cluster is a macro-task and is managed like a task: it has to wait


Figure 3 Example of clustering three DAGs of eight nodes into two clusters of M = 4 nodes. The obtained meta-DAG is the same despite the different original dependencies between the nodes, and the cluster on the right will have to wait until the cluster on the left is fully executed before becoming ready. (A) Graph with a dependency between the two first nodes. (B) Graph with dependencies between first and last nodes. (C) Graph with multiple dependencies.

Full-size DOI: 10.7717/peerjcs.247/fig-3

for all its dependencies to be released to become ready, and it releases its own dependencies once it is entirely completed, not after the completion of each task that composes it. This means the number of dependencies that exist originally between the tasks of two macro-tasks is not relevant: whether there is one or many, the two macro-tasks are linked, as illustrated by Fig. 3. A side effect is that creating macro-tasks delays the release of the dependencies. In fact, the release of the dependencies can be delayed by the complete duration of the macro-task compared to an execution without clustering, and this delay also implies a reduction of the degree of parallelism. However, if the degree of parallelism at a global level remains high, the execution could still be faster because it is expected that the clustering will reduce the overhead.

DAG OF TASKS CLUSTERING
Modification of the GDCA
Before entering into the details of our approach, we first give a general description of the original algorithm. The algorithm continuously maintains a list of ready tasks by processing the graph while respecting the dependencies. It works on the ready tasks only to build a cluster. By doing so, all the predecessors of the ready tasks are already assigned to a cluster, and all the ready tasks and their successors are not assigned to any cluster. This strategy ensures that no cycle will be created while building the clusters. To create a cluster, the algorithm first picks one of the ready tasks, based on a heuristic that we call the initial-selection. Then it iteratively aggregates some ready nodes to it until the cluster reaches M nodes, using what we call the aggregate-selection. Every time a node is put in a cluster, the algorithm releases its dependencies and potentially adds new ready tasks to the list.


Figure 4 Illustration of the GDCA. Ready tasks are in blue, tasks assigned to the new cluster are in red. (A) Nodes x, y and z are ready, and here y is selected as the first node for the current cluster p to create (initial-selection). (B) Nodes x, v and z are ready, and we have to select the next nodes to put in p (aggregate-selection). If the criterion to decide is the number of predecessors in the cluster, then v is selected. (C) Nodes x and z are ready, and both nodes have zero predecessors in p. If we look at the successors they have in common with p, then u is selected. (D) Node z is ready, and it might have no predecessors or successors in common with nodes in p. If we use a strict cluster size, then z should be added to p; otherwise, the cluster p is done.

Full-size DOI: 10.7717/peerjcs.247/fig-4

The initial-selection and the aggregate-selection are the two heuristics that decide which node to select to start a cluster, and which nodes to add to the ongoing cluster. The initial-selection picks the node with the lowest depth, where the depth of a node is the longest distance to the roots. The aggregate-selection picks the task that has the largest number of predecessors in the new cluster. For both selections, the lowest id is selected in case of equality, to enforce a strict order of the nodes.
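As a sketch of how the depth used by the initial-selection can be computed (a hypothetical helper of our own, assuming adjacency lists and a precomputed topological order):

/* depth[u] = longest distance from any root to u. Scanning the nodes in
 * topological order guarantees depth[u] is final before u is expanded. */
void compute_depths(int n_nodes, const int *topo_order,
                    int *const *successors, const int *n_succ, int *depth) {
    for (int u = 0; u < n_nodes; u++)
        depth[u] = 0;                      /* roots keep depth 0 */
    for (int i = 0; i < n_nodes; i++) {
        int u = topo_order[i];
        for (int k = 0; k < n_succ[u]; k++) {
            int v = successors[u][k];
            if (depth[u] + 1 > depth[v])   /* longest-path relaxation */
                depth[v] = depth[u] + 1;
        }
    }
}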

In Figure 4, we provide an example of clustering that illustrates why we potentially need additional heuristics for the aggregate-selection. In Fig. 4B, both nodes x and z are ready and could be selected, but node z has no connection with the new cluster. The original algorithm does not have a mechanism to detect this situation. Second, in Fig. 4D, since z has no connection with the new cluster, it could be disadvantageous to add it. If we imagine the case where a graph is composed of two independent parts, connecting them is like putting a synchronization on their progress. On the other hand, if we need clusters of a fixed size, as is the case in the original algorithm, there is no choice and z must be put in the cluster.

In the Appendix, we provide two graphs that were clustered using GDCA, see Figs. A1 and A2.
Change in the initial-selection. We propose to select the node using the depth (as in the original algorithm), but also to use the number of predecessors. Our objective is to privilege the nodes with the highest number of predecessors, which we consider more critical.
Change in the aggregate-selection. To select the next node to add to a cluster, we choose the node with the highest number of predecessors in the cluster, but in case of equality, we compare the number of nodes in common between the successors of the cluster and the successors of the candidates. For instance, in Fig. 4A, the node x has one common successor with the new cluster (node u), but as the number of predecessors in the cluster is more significant, v is selected. Then, in Fig. 4B, x has one common successor and z has none; therefore, with this heuristic, x is selected.


Flexible cluster size. If no node in the ready list has a predecessor in the cluster or a common successor, then we can decide to stop the construction of the current cluster and start a new one, which means that some clusters will have fewer than M nodes. This heuristic would stop the construction of the new cluster in Fig. 4D.
Full algorithm. The complete solution is provided in Algorithm 1. The GDCA algorithm is annotated with our modifications, and we dissociate GDCAv2, which includes the changes in the node selections, from GDCAws, which includes the stop when no ready node has a connection with the new cluster.

The main loop, line 6, ensures that the algorithm continues until there are no more ready nodes in the list. The initial-selection is done at line 12, where the algorithm compares the depth, the number of predecessors and the ids of each node. This can be implemented using a sort or simply by selecting the best candidate, as the list will be rearranged and updated later in the algorithm. At line 16, the dependencies of the master node are released. If some of its successors become ready (line 18), they are put in the list and their counters of predecessors in the new cluster are incremented. Otherwise (line 21), they are put in a set that includes all the successors of the new cluster. This set is used at line 27 to count the common successors between the cluster and each ready node. In an optimized implementation, this could be done only if needed, i.e., if the best nodes have an equal number of predecessors in the cluster during the aggregate-selection. At line 33, the ready nodes are sorted using their number of predecessors in the cluster, their number of common successors with the cluster and their ids. If we have a flexible cluster size, we can stop the construction of the new cluster (line 35) if we consider that no node is appropriate. Otherwise, the next node is added to the cluster, its dependencies are released and the counters updated (lines 38 to 55).

In the Appendix, we provide an example of an emulated execution of a DAG using this method, see Fig. A3.

Emulating the execution of a DAG of tasks
Iterating on a DAG to emulate an execution with a limited number of processing units is a classical algorithm. However, how the overhead should be defined and included in the model is still evolving (Kestor, Gioiosa & Chavarra-Miranda, 2015). We propose to take into account three different overheads: one overhead per task execution, and two overheads for the release and the selection of a ready task.

Using an overhead per task is classically admitted in the majority of models. In our case, this overhead is a constant per task, regardless of the task's size, and only impacts the worker that executes the task. For example, if the overhead per task is Ot and a worker starts the execution of a task of duration d at time t, the worker will become available at t + d + Ot. The accumulated cost implied by this overhead decreases proportionally with the number of tasks (i.e., the more tasks per cluster, the lower the total overhead).
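For instance, with Ot = 0.5, a task of duration d = 2 started at t = 10 makes its worker available again at 10 + 2 + 0.5 = 12.5, whereas a macro-task grouping ten such tasks pays Ot once instead of ten times.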

Second, our model includes an overhead every time a task is released or assigned to a worker. When a task is released, it has to be pushed into the ready task list, and this implies either some synchronization or lock-free mechanisms with a non-negligible cost. The same happens when a task is popped from the list and assigned to a worker. Moreover, the


pushes and pops compete to modify the ready list, and in both cases, this can be seen as a lock with only one worker at a time accessing it. As a result, in our model, we increment the global timestamp variable every time the list is modified.
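To make the contention point concrete, the following hedged C11 sketch shows a dependency release followed by a push into a shared ready array, in the spirit of the lock-free mechanisms we use in the real application later on (the types, the capacity and the field names are placeholders of our own):

#include <stdatomic.h>

#define MAX_READY 4096                  /* placeholder capacity */

typedef struct { atomic_int deps_remaining; int id; } task_t;

static int ready[MAX_READY];            /* shared ready list */
static atomic_int ready_count;          /* atomic insertion index */

/* Called when a predecessor of t completes: the last predecessor
 * to finish pushes t into the ready list. */
void release_dependency(task_t *t) {
    if (atomic_fetch_sub_explicit(&t->deps_remaining, 1,
                                  memory_order_acq_rel) == 1) {
        int slot = atomic_fetch_add_explicit(&ready_count, 1,
                                             memory_order_relaxed);
        ready[slot] = t->id;            /* each push claims a distinct slot */
    }
}

Even without a mutex, every update of ready_count is a serialization point between workers, which is what the push and pop overheads of the model account for.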

We provide our solution in Algorithm 2, which is an extension of the DAG execution algorithm with a priority queue of workers. The workers are stored in the queue based on their next availability, and we traverse the graph while maintaining the dependencies and a list of ready tasks. At line 5, we initialize the current time variable using the number of ready tasks and the push overhead, considering that all these tasks are pushed into the list and this makes it impossible to access it. Then, at line 8, we assign tasks to the available workers and store them in the priority queue using the cost of each task and the overhead per task, see line 12. Then, in the core loop, line 16, we pick the next available worker, release the dependencies of the task it has computed, and assign tasks to idle workers (until there are no more tasks or idle workers). Finally, we wait for the last workers to finish at line 34.

Our execution model is used to emulate the execution of a DAG on a given architecture, but we also use it to find the best cluster granularity. To do so, we emulate executions starting with a size M = 2, then increase M while keeping track of the best granularity found so far, B. We stop after granularity 2 × B. The idea is that we will have a speedup as we increase the granularity until we constrain the degree of parallelism too much. But in order not to stop at the first local minimum (the first time an increase of the granularity results in an increase of the execution time), we continue to test until the granularity equals two times the best granularity we have found.
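A minimal sketch of this search loop (graph_t and cluster_and_emulate, which clusters with size M and returns the emulated makespan, are stand-ins for the routines described above):

typedef struct graph graph_t;                 /* opaque DAG, assumed defined elsewhere */
double cluster_and_emulate(const graph_t *g, int M, int n_workers);

/* Grow M from 2 and keep the best size found so far; the search horizon
 * extends to 2 * best_M, so it does not stop at the first local minimum
 * of the emulated execution time. */
int find_best_granularity(const graph_t *g, int n_workers) {
    int best_M = 2;
    double best_time = cluster_and_emulate(g, 2, n_workers);
    for (int M = 3; M <= 2 * best_M; M++) {
        double t = cluster_and_emulate(g, M, n_workers);
        if (t < best_time) {
            best_time = t;
            best_M = M;                       /* extends the horizon to 2 * M */
        }
    }
    return best_M;
}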

EXPERIMENTAL STUDY
We evaluate our approach on emulated executions and on a real numerical application.

Emulated executions
Graph data-set. We use graphs with different properties, which we summarize in Table 1. They were generated from the Polybench collection (Grauer-Gray et al., 2012), the daggen tool (Suter, 2013) or by ourselves. The graphs from daggen are complex in the sense that their nodes have a large number of predecessors/successors and that the costs of the tasks are significant and span large intervals. The graphs with names starting with Chukrut are the test cases for the real application.


ALGORITHM 1: GDCA algorithm, where M is the desired cluster size. GDCAv2 includes the lines in black underlined-bold. GDCA-ws includes the lines in gray.

1  function cluster(G = (V, E), M)
2      ready ← Get_roots(G)                      // Gets the roots
3      depths ← Distance_from_roots(G, ready)    // Gets the distance from the roots
4      count_deps_release = ∅                    // # of released dependencies
5      cpt_cluster = 0
6      while ready is not empty do
7          count_pred_master = ∅                 // # predecessors in the new cluster
8          count_next_common = ∅                 // # common successors
9          // Sort by, first, increasing depths; second, decreasing number
10         // of predecessors; third, increasing ids (to ensure a strict
11         // order)
12         ready.sort()
13         master = ready.pop_front()
14         clusters[master] = cpt_cluster
15         master_boundary = ∅
16         for u ∈ successors[master] do
17             count_deps_release[u] += 1
18             if count_deps_release[u] equals |predecessors[u]| then
19                 ready.insert(u)
20                 count_pred_master[u] = 1
21             else
22                 master_boundary.insert(u)
23             end
24         end
25         cluster_size = 1
26         while cluster_size < M do
27             for u ∈ ready do
28                 count_next_common[u] = |successors[u] ∩ master_boundary|
29             end
30             // Sort by, first, decreasing count_pred_master; second,
31             // increasing depths; third, decreasing count_next_common;
32             // fourth, increasing ids (to ensure a strict order)
33             ready.sort()
34             next = ready.front()
35             if count_pred_master[next] is 0 AND count_next_common[next] is 0 then
36                 break
37             end
38             ready.pop_front()
39             cluster_size += 1
40             clusters[next] = cpt_cluster
41             for u ∈ successors[next] do
42                 count_deps_release[u] += 1
43                 if count_deps_release[u] equals |predecessors[u]| then
44                     ready.insert(u)
45                     count_pred_master[u] = 0
46                     for v ∈ predecessors[u] do
47                         if clusters[v] == clusters[master] then
48                             count_pred_master[u] += 1
49                         end
50                     end
51                     master_boundary.erase(u)
52                 else
53                     master_boundary.insert(u)
54                 end
55             end
56         end
57         cpt_cluster += 1
58     end

Hardware. We consider four imaginary systems in total, combining two numbers of threads with two types of overhead, low (L) and high (H), with the following properties:


ALGORITHM 2: Emulate an execution of G using W workers. The overheads are push_overhead, pop_overhead and task_overhead.

1  function Emulate_execution(G = (V, E), W, push_overhead, pop_overhead, task_overhead)
2      idle_worker ← list(0, W-1)
3      current_time ← 0
4      ready ← Get_roots(G)
5      current_time ← push_overhead × ready.size()
6      nb_computed_task ← 0
7      workers ← empty_priority_queue()
8      while ready is not empty AND idle_worker is not empty do
9          task ← ready.pop()
10         worker ← idle_worker.pop()
11         current_time ← current_time + pop_overhead
12         workers.enqueue(worker, task, current_time, costs[task] + task_overhead)
13         nb_computed_task ← nb_computed_task + 1
14     end
15     deps ← 0
16     while nb_computed_task ≠ |tasks| do
17         [task, worker, end] ← workers.dequeue()
18         current_time ← max(current_time, end)
19         idle_worker.push(worker)
20         for v ∈ successors[task] do
21             deps[v] ← deps[v] + 1
22             if deps[v] = |predecessors[v]| then
23                 ready.push(v)
24             end
25         end
26         while ready is not empty AND idle_worker is not empty do
27             task ← ready.pop()
28             worker ← idle_worker.pop()
29             current_time ← current_time + pop_overhead
30             workers.enqueue(worker, task, current_time, costs[task] + task_overhead)
31             nb_computed_task ← nb_computed_task + 1
32         end
33     end
34     while workers is not empty do
35         [task, worker, end] ← workers.dequeue()
36         current_time ← max(current_time, end)
37     end
38     return current_time

• Config-40-L: System with 40 threads and overhead per task 0.1, per push 0.2 and per pop 0.2.
• Config-40-H: System with 40 threads and overhead per task 2, per push 1 and per pop 1.
• Config-512-L: System with 512 threads and overhead per task 0.1, per push 0.2 and per pop 0.2.
• Config-512-H: System with 512 threads and overhead per task 4, per push 2 and per pop 2.

The given overheads are expressed as proportions of the total execution time of a graph. Consequently, if D is the total duration of a graph (the sum of the durations of the N tasks), and if the overhead is Ot, then the overhead per task is given by D × Ot / N. For instance, with D = 1000, N = 100 and Ot = 0.1, each task pays an overhead of 1 time unit.
Increasing granularity. In Figure 5, we show the duration of the emulated executions as the granularity increases for twelve of the graphs. As expected, the execution time decreases as the granularity increases in most cases, since the impact of the overhead is mitigated. Using 512 threads (Config-512-L/red line) instead of 40 (Config-40-L/green line) does not speed up the execution when the overhead is low, and this is explained by the fact that the average degree of parallelism is lower than 512 for most graphs. In Figs. 5H and 5K, the


execution time shows a significant variation depending on the clustering size. This means that for some cluster sizes several workers are not efficiently used and stay idle (waiting for another worker to finish its task and release the dependencies), while for some other sizes the workload per worker is well balanced and the execution more efficient. Note that we increase the granularity up to two times the best granularity we have found, and this appears to be a good heuristic to catch the best cluster size without stopping at the first local minimum.
Details. In Table 2, we provide the best speedups we have obtained by taking the execution times without clustering as reference. We provide the results for the GDCA and the updated method (GDCAv2), and also show the best granularity for each case. GDCAv2 provides a speedup over GDCA in many cases. For instance, for the daggen graphs, GDCAv2 is faster in most cases. In addition, there are cases where GDCAv2 provides a significant speedup, such as the graphs polybench - kernel trmm, polybench - kernel jacobi 2d imper and polybench - kernel gemver. For the Config-512-H configuration and the polybench - kernel gemver graph, GDCA has a speedup of 24.73, and GDCAv2 of 83.47. However, there are still many cases where GDCA is faster, which means that to cluster a graph from a specific application it is required to try and compare both methods. In addition, this demonstrates that while our modifications of GDCA seem natural when we look at the graph at a low level, they do not necessarily provide an improvement at a global level due to corner cases. Moreover, the ids of the nodes are more important in GDCA than in GDCAv2, and this information is usually not random and includes the order of construction of the tasks.

Concerning GDCAws, the method is not competitive for most of the graphs. However, it provides a significant speedup for the Chukrut graphs, which are the ones used in our real application.

Real application
Hardware configuration

We use the following computing nodes:

• Intel-24t: 4 × Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50 GHz, with caches L1 32K, L2 256K and L3 15360K (24 threads in total).
• Intel-32t: 2 × Intel(R) Xeon(R) Gold 6142 CPU @ 2.60 GHz V4, with caches L1 32K, L2 1024K and L3 22528K (32 threads in total).

Software configuration. We use the GNU C compiler 7.3.0. We parallelize using OpenMP threads and we do not use any mutex during the execution. We only use lock-free mechanisms implemented with C11 atomic variables to manage the dependencies and the list of ready tasks, which is actually an array updated using atomics.
Test cases. The two test cases represent a 2D mesh that has the shape of a disk, with sizes 40 and 60. The details of the two corresponding graphs are provided in Table 1 under the names Chukrut - disque40 and Chukrut - disque60. The execution of a single task takes around 1.5 × 10⁻⁵ s. To estimate the overhead, we take the execution time in sequential, T1, and the execution time using a third of the available threads, Tx, and compute


Table 1 Details of the studied graphs. The degree of parallelism is obtained by iterating on the graph, while respecting the dependencies, and measuring the size of the ready task list (the average size and the largest size found during the execution).

Name | #vertices | #edges | #Pred. avg | #Pred. max | Total cost | Cost min | Cost avg | Cost max | Par. avg | Par. max

createdag - agraph-2dgrid-200 | 39999 | 79202 | 1.980 | 2 | 39999 | 1 | 1 | 1 | 100.5 | 398
createdag - agraph-deptree-200 | 10100 | 19900 | 1.970 | 2 | 10100 | 1 | 1 | 1 | 50.5 | 100
createdag - agraph-doubletree-200 | 10100 | 19900 | 1.970 | 2 | 10100 | 1 | 1 | 1 | 50.5 | 100
createdag - agraph-tree-200 | 20100 | 39800 | 1.980 | 2 | 20100 | 1 | 1 | 1 | 100.5 | 200
Chukrut - disque40 | 19200 | 38160 | 1.99 | 2 | 19200 | 1 | 1 | 1 | 80.3347 | 120
Chukrut - disque60 | 43200 | 86040 | 1.99 | 2 | 43200 | 1 | 1 | 1 | 120.334 | 180
daggen - 1000-0 5-0 2-4-0 8 | 1000 | 6819 | 6.819 | 42 | 2.2e+14 | 3.1e+08 | 2.2e+11 | 1.3e+12 | 27.7 | 52
daggen - 128000-0 5-0 2-2-0 8 | 128000 | 11374241 | 88.861 | 549 | 3.0e+16 | 2.6e+08 | 2.3e+11 | 1.4e+12 | 362.6 | 641
daggen - 16000-0 5-0 8-4-0 8 | 16000 | 508092 | 31.756 | 97 | 3.7e+15 | 2.7e+08 | 2.3e+11 | 1.4e+12 | 124.03 | 155
daggen - 4000-0 2-0 8-4-0 8 | 4000 | 6540 | 1.635 | 7 | 9.0e+14 | 2.6e+08 | 2.2e+11 | 1.4e+12 | 7.8 | 15
daggen - 64000-0 2-0 2-4-0 8 | 64000 | 169418 | 2.647 | 31 | 1.5e+16 | 2.6e+08 | 2.3e+11 | 1.4e+12 | 11.84 | 23
polybench - kernel 2 mm | 14600 | 22000 | 1.507 | 40 | 14600 | 1 | 1 | 1 | 286.275 | 600
polybench - kernel 3 mm | 55400 | 70000 | 1.264 | 40 | 55400 | 1 | 1 | 1 | 780.282 | 1400
polybench - kernel adi | 97440 | 553177 | 5.677 | 7 | 97440 | 1 | 1 | 1 | 43.4806 | 86
polybench - kernel atax | 97040 | 144900 | 1.493 | 230 | 97040 | 1 | 1 | 1 | 220.045 | 440
polybench - kernel covariance | 98850 | 276025 | 2.792 | 70 | 98850 | 1 | 1 | 1 | 686.458 | 3500
polybench - kernel doitgen | 36000 | 62700 | 1.742 | 2 | 36000 | 1 | 1 | 1 | 3000 | 3000
polybench - kernel durbin | 94372 | 280870 | 2.976 | 250 | 94372 | 1 | 1 | 1 | 2.96088 | 250
polybench - kernel fdtd 2d | 70020 | 220535 | 3.150 | 4 | 70020 | 1 | 1 | 1 | 1167 | 1200
polybench - kernel gemm | 340200 | 336000 | 0.988 | 1 | 340200 | 1 | 1 | 1 | 4200 | 4200
polybench - kernel gemver | 43320 | 71880 | 1.659 | 120 | 43320 | 1 | 1 | 1 | 179.008 | 14400
polybench - kernel gesummv | 125750 | 125500 | 0.998 | 1 | 125750 | 1 | 1 | 1 | 499.008 | 500
polybench - kernel jacobi 1d imper | 79600 | 237208 | 2.980 | 3 | 79600 | 1 | 1 | 1 | 398 | 398
polybench - kernel jacobi 2d imper | 31360 | 148512 | 4.736 | 5 | 31360 | 1 | 1 | 1 | 784 | 784
polybench - kernel lu | 170640 | 496120 | 2.907 | 79 | 170640 | 1 | 1 | 1 | 1080 | 6241
polybench - kernel ludcmp | 186920 | 537521 | 2.876 | 80 | 186920 | 1 | 1 | 1 | 53.7126 | 6480
polybench - kernel mvt | 80000 | 79600 | 0.995 | 1 | 80000 | 1 | 1 | 1 | 400 | 400
polybench - kernel seidel 2d | 12960 | 94010 | 7.254 | 8 | 12960 | 1 | 1 | 1 | 62.3077 | 81
polybench - kernel symm | 96000 | 93600 | 0.975 | 1 | 96000 | 1 | 1 | 1 | 2400 | 2400
polybench - kernel syr2k | 148230 | 146400 | 0.988 | 1 | 148230 | 1 | 1 | 1 | 1830 | 1830
polybench - kernel syrk | 148230 | 146400 | 0.988 | 1 | 148230 | 1 | 1 | 1 | 1830 | 1830
polybench - kernel trisolv | 80600 | 160000 | 1.985 | 399 | 80600 | 1 | 1 | 1 | 100.75 | 400
polybench - kernel trmm | 144000 | 412460 | 2.864 | 80 | 144000 | 1 | 1 | 1 | 894.41 | 1410


[Each panel of Figure 5 plots the emulated execution time (s) against the granularity; the legend lists the configurations Config 40/0.1/0.2/0.2, Config 40/2.0/1.0/1.0, Config 512/0.1/0.2/0.2 and Config 512/4.0/2.0/2.0, and the strategies GDCA and GDCAv2.]

Figure 5 Emulated execution times against cluster granularity G for different test cases, different machine configurations (colors) and different strategies (node markers). (A) disque60. (B) generated-dag-4000-0.2-0.2-4-0.8. (C) kernel atax. (D) kernel durbin. (E) kernel trmm. (F) kernel seidel. (G) kernel jacobi 2D imper. (H) kernel gemver. (I) agraph 2dgrid-200. (J) kernel covariance. (K) kernel fdtd 2D. (L) Legend.

Full-size DOI: 10.7717/peerjcs.247/fig-5

O = (Tx × x − T1) / T1, where x is the number of threads used. We obtained an overhead of 3 on Intel-24t and of 3.5 on Intel-32t. We dispatch this overhead as one half for the overhead per task, and one quarter each for the push/pop overheads. The execution times are obtained from the average of 6 executions.
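For instance, the overhead of 3 measured on Intel-24t is dispatched as a per-task overhead of 1.5 and push/pop overheads of 0.75 each.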

We increase the granularity up to M = 100 in order to show whether our best-granularity selection heuristic would miss the best size.
Results. In Fig. 6, we provide the results for the two test cases on the two computing nodes.


Table 2 Speedup obtained by clustering the graphs on emulated executions for GDCA and GDCAv2. The best granularity for the different graphs is provided, and the speedup is in bold when one of the two clustering strategies appears more efficient than the other for a given hardware. GDCAws is slower in most cases, except for the graphs from createdag and Chukrut. For the configuration Config-512-H, GDCAws gets a speedup of 34 for Chukrut - disque40 and of 44 for Chukrut - disque60.

Name | Config-40-L | Config-40-H | Config-512-L | Config-512-H

Each row lists, for each configuration in this order, the pair (G, Sp.) for GDCA followed by the pair (G, Sp.) for GDCAv2, where G is the best granularity and Sp. the speedup.

createdag - agraph-2dgrid-200 9 5.775 16 5.821 36 14.79 36 13.86 9 5.786 16 6.078 36 21.76 36 22.57

createdag - agraph-deptree-200 6 4.158 4 3.798 16 12.75 16 12.75 6 4.158 4 3.801 25 16.6 25 16.21

createdag - agraph-doubletree-200 6 4.158 4 3.798 16 12.75 16 12.75 6 4.158 4 3.801 25 16.6 25 16.21

createdag - agraph-tree-200 16 7.905 16 7.905 36 23.68 36 24.06 16 8.978 16 8.976 64 37.5 64 37.47

Chukrut - disque40 16 7.62 16 7.62 25 21.91 25 21.91 16 7.629 16 7.626 25 23.91 25 23.91

Chukrut - disque60 16 10.9 16 10.9 36 30.6 36 30.6 16 11.35 16 11.34 36 34.22 36 34.22

daggen - 1000-0 5-0 2-4-0 8 2 1.425 2 1.575 3 2.34 4 2.221 2 1.425 2 1.575 6 2.704 5 2.858

daggen - 128000-0 5-0 2-2-0 8 2 1.835 2 1.872 4 2.88 5 2.994 2 1.839 2 1.875 7 3.517 6 3.689

daggen - 16000-0 5-0 8-4-0 8 2 1.787 2 1.806 3 2.528 3 2.646 2 1.786 2 1.805 4 2.955 4 3.1

daggen - 4000-0 2-0 8-4-0 8 2 1.066 2 1.08 3 1.95 3 1.97 2 1.066 2 1.08 4000 3.998 4000 3.998

daggen - 64000-0 2-0 2-4-0 8 2 1.214 2 1.203 3 2.085 3 2.115 2 1.214 2 1.203 4 2.476 4 2.555

polybench - kernel 2 mm 18 11.71 18 11.71 62 41.6 57 35.88 62 22.3 31 20.52 124 76.65 130 59.72

polybench - kernel 3 mm 20 13.05 20 13.05 102 48.3 52 43.05 51 30.82 52 28.54 153 97.71 154 77.7




polybench - kernel adi 10 6.045 10 6.161 23 15.39 24 16.19 10 6.045 10 6.161 30 20.81 34 21.43

polybench - kernel atax 35 13.84 35 13.84 105 55.41 105 55.41 63 42.83 63 42.83 420 150.3 420 150.3

polybench - kernel covariance 17 13.6 17 13.6 68 46.93 143 60.2 142 48.29 142 48.29 284 178.3 211 144.7

polybench - kernel doitgen 960 14.77 960 14.77 960 69.37 960 69.37 120 59.98 120 59.98 360 188.5 360 188.5

polybench - kernel durbin 6 1.756 12 1.806 24 4.954 24 5.322 6 1.752 12 1.809 30 7.983 40 8.594

polybench - kernel fdtd 2d 8 7.116 10 8.339 15 12.39 20 16.74 8 7.116 10 8.34 17 15.26 26 21.49

polybench - kernel gemver 9 8.349 27 12.14 19 17.86 117 45.44 9 8.398 39 24.19 27 24.73 117 83.47

polybench - kernel gesummv 31 14.44 31 14.44 114 61.21 114 61.21 503 83.4 503 83.4 503 333.8 503 333.8

polybench - kernel jacobi 1d imper 15 7.114 12 8.295 32 19.84 32 20.1 15 7.114 12 8.295 32 28.17 60 28.7

polybench - kernel jacobi 2d imper 5 4.853 7 6.046 11 8.184 18 11.85 5 4.853 7 6.046 11 9.841 24 18.02

polybench - kernel lu 16 11.47 17 9.404 39 24.65 39 22.85 22 13.69 16 11.02 86 38.65 89 33.95

polybench - kernel ludcmp 17 6.668 13 6.379 48 16.48 32 16.96 17 7.366 17 6.895 48 24.71 60 25.38

polybench - kernel mvt 400 15.62 400 15.62 400 70.99 400 70.99 200 88.87 200 88.87 600 280.7 600 280.7

polybench - kernel seidel 2d 4 2.319 4 2.659 6 4.219 8 4.856 4 2.319 4 2.659 8 5.705 12 6.772

polybench - kernel symm 2400 15.89 2400 15.89 2400 77.36 2400 77.36 200 97.94 200 97.94 600 308.7 600 308.7

polybench - kernel syr2k 27 14.24 27 14.24 3726 77.85 3726 77.85 324 116.9 324 116.9 810 383.5 810 383.5

polybench - kernel syrk 27 14.24 27 14.24 3726 77.85 3726 77.85 324 116.9 324 116.9 810 383.5 810 383.5

polybench - kernel trisolv 8 4.732 8 4.732 25 15.14 25 15.14 9 4.87 9 4.87 25 18.61 25 18.61

polybench - kernel trmm 11 8.255 14 10.6 25 15.94 29 21.6 11 8.622 14 11.44 32 20.68 37 29


[Each panel of Figure 6 plots the speedup against the granularity M for Intel-24t-4s (sizes 40 and 60) and Intel-32t-2s (sizes 40 and 60); the legend lists GDCA, GDCAws and GDCAv2, each for the emulation with overhead and for the real execution.]

Figure 6 Speedup obtained for the two test cases with different clustering strategies. We show the speedup obtained from the emulated executions (dashed lines) and from the real executions (plain lines). (A) Intel-24t-4s Size 40. (B) Intel-24t-4s Size 60. (C) Intel-32t-2s Size 40. (D) Intel-32t-2s Size 60.

Full-size DOI: 10.7717/peerjcs.247/fig-6

In the four configurations, the emulation of GDCAws (blue dashed line) is too optimistic compared to the real execution (blue line): on Intel-24t, Figs. 6B and 6C, GDCAws performed poorly (blue line), but on Intel-32t, Figs. 6D and 6E, GDCAws is efficient for large granularities. However, even if it is efficient on average on Intel-32t, it never provides the best execution. This means that having a flexible cluster size is not the best approach for these graphs and that having fewer clusters but of a fixed size (even if it adds more dependencies) seems more appropriate. Concerning the emulation, our model does not capture the impact of having clusters of different sizes.

The emulation of GDCA is more accurate (dashed green line) when we compare it with the real execution (green line), even if the real execution is always underperforming. However, the global shape is correct and, importantly, the performance peaks that happen


in the emulation also happen in the real execution. This means that we can use emulated executions to find the best clustering size for the GDCA. In terms of performance, GDCA provides the best execution on the Intel-32t for M between 10 and 20.

The emulation of GDCAv2 is accurate for the Intel-24t (Figs. 6B and 6C), with a superimposition of the plots (dashed/plain purple lines). However, it is less accurate for the Intel-32t (Figs. 6D and 6E), where the real execution underperforms compared to the emulation. As for GDCA, the peaks of the emulation of GDCAv2 coincide with the peaks of the real executions. GDCAv2 provides the best performance on the Intel-24t, for M between 15 and 20.

GDCA and GDCAv2 have the same peaks; therefore, for some cluster sizes the degree of parallelism is much better and the filling of the workers more efficient. However, GDCAv2 is always better on the Intel-32t, while on the Intel-24t both are very similar, except that GDCA is faster at the peaks. While the difference is not necessarily significant, this means that the choice between GDCA and GDCAv2 is machine-dependent.

CONCLUSION
The management of the granularity is one of the main challenges to achieve high performance using the tasks-and-dependencies paradigm. GDCA allows controlling the granularity of the tasks by grouping them to obtain a macro-DAG. In this paper, we have presented GDCAv2 and GDCAws, two modified versions of GDCA. We evaluated the benefit of GDCA and GDCAv2 on emulated executions. We have demonstrated that our modifications allow obtaining significant speedups in several cases, but that it remains unknown which of the two methods will give the best results. We evaluated the three algorithms on a real application and demonstrated that clustering the DAG yields a speedup of 7 compared to executions without clustering. We were able to find the best granularity using emulated executions based on a model that incorporates different overheads.

As a perspective, we would like to plug our algorithm directly inside an existing task-based runtime system to cluster the DAG on the fly during the execution. This would require a significant programming effort but would open the study of more applications and certainly lead to improving not only the selection of the parameters but also the estimation of the costs and the overheads. In addition, we would like to adapt the aggregate selection during the clustering process in order to always use the best of GDCA and GDCAv2.
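As a sketch of that last perspective (hypothetical code: `clusterGDCA` and `clusterGDCAv2` are stand-ins for the two clustering variants, not an existing API), one could emulate both variants over a range of cluster sizes and keep whichever configuration minimizes the emulated makespan, reusing the `emulate` helper sketched above:

```cpp
#include <limits>

// Hypothetical entry points for the two clustering variants described
// in the paper; declared here only so the sketch is self-contained.
Dag clusterGDCA(const Dag& dag, int M);
Dag clusterGDCAv2(const Dag& dag, int M);

struct Choice {
    int M;            // selected cluster size
    bool useV2;       // true if GDCAv2 won
    double makespan;  // emulated duration
};

// Try both clustering variants for M in [2, maxM] and return the best
// configuration according to the emulator sketched above.
Choice pickGranularity(const Dag& dag, int workers, int maxM,
                       double pushCost, double popCost) {
    Choice best{0, false, std::numeric_limits<double>::max()};
    for (int M = 2; M <= maxM; ++M) {
        for (bool v2 : {false, true}) {
            Dag macro = v2 ? clusterGDCAv2(dag, M) : clusterGDCA(dag, M);
            double t = emulate(macro, workers, pushCost, popCost);
            if (t < best.makespan) best = {M, v2, t};
        }
    }
    return best;
}
```

Doing this selection inside the runtime system itself, as suggested above, would amount to calling such a sweep on the submitted DAG before execution starts.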

ACKNOWLEDGEMENTS
The experiments presented in this paper were carried out using the PlaFRIM experimental testbed, supported by Inria, CNRS (LABRI and IMB), Université de Bordeaux, Bordeaux INP and Conseil Régional d'Aquitaine (see https://www.plafrim.fr/).

APPENDIX
Figures A1 and A2 are examples of graph clustering with our method. Figure A3 shows an example of a graph clustering and an emulated execution.


[Figure A1: node-by-node rendering of the clustered DAG; each label shows the node id followed by [cost/cluster id], e.g., 240 [1/7].]

Figure A1 Clustering of a graph of 256 nodes generated by propagation of the dependencies on a 2D grid from one corner to the opposite one. The cluster size is M = 6. There are 43 clusters.

Full-size DOI: 10.7717/peerjcs.247/fig-7


[Figure A2: node-by-node rendering of the clustered DAG; each label shows the node id followed by [cost/cluster id].]

Figure A2 Clustering of a graph of 1,202 nodes generated by the transport equation on a disk. The cluster size is M = 16. There are 76 clusters.

Full-size DOI: 10.7717/peerjcs.247/fig-8


[Figure A3: (A) node-by-node rendering of the clustered graph, each label showing the node id followed by [cost/cluster id]; (B) execution trace of the original graph (total time = 393.1 s) and (C) execution trace of the clustered graph (total time = 351.3 s), each showing the Gantt chart of Workers 0–7 together with the numbers of submitted and ready tasks.]

Figure A3 Example of the Polybench Jacobi 2D clustering and execution. The graph was generated with parameters iter = 10 and N = 10. The execution time obtained with clustering (C) is only moderately better than without clustering (B), because the overhead is limited and the dependencies make it difficult to find a meta-graph where the parallelism is not constrained. (A) Clustered graph with M = 4. The original graph has 1,280 nodes and an estimated degree of parallelism of 56; the clustered one has 320 nodes and an estimated degree of parallelism of 4.2. (B) Emulation of the execution of the original graph with 8 threads in 393.1 units of time. Each original task has a cost of 1; the overheads are 0 per task, 0.1 per push and 0.2 per pop. (C) Emulation of the execution of the clustered graph with 8 threads in 351.3 units of time. Each original task has a cost of 1; the overheads are 0 per task, 0.1 per push and 0.2 per pop.

Full-size DOI: 10.7717/peerjcs.247/fig-9
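
As a rough back-of-envelope check of these numbers (our own estimate, not taken from the paper): with n tasks of unit cost c = 1, a per-task scheduling overhead o = 0.1 + 0.2 = 0.3 (one push plus one pop) and p = 8 workers, the work-based lower bound on the emulated duration is

\[
T \;\ge\; \frac{n\,(c + o)}{p} \;=\; \frac{1280 \times 1.3}{8} \;=\; 208 \text{ units.}
\]

The original graph runs in 393.1 units, well above this bound, so its duration is dominated by dependency constraints rather than by the overhead itself; clustering divides the number of push/pop operations by four (320 nodes instead of 1,280) but also lowers the estimated degree of parallelism to 4.2, which explains why the gain (351.3 units) remains modest.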

ADDITIONAL INFORMATION AND DECLARATIONS

Funding
The authors received no funding for this work.


Competing Interests
The authors declare there are no competing interests.

Author Contributions
• Bérenger Bramas conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, prepared figures and/or tables, performed the computation work, authored or reviewed drafts of the paper, approved the final draft.
• Alain Ketterlin analyzed the data, authored or reviewed drafts of the paper, approved the final draft.

Data Availability
The following information was supplied regarding data availability:

An implementation of our method is publicly available at https://gitlab.inria.fr/bramas/dagpar. The graphs used in this project are available at https://figshare.com/projects/Dagpar/71198.

Supplemental Information
Supplemental information for this article can be found online at http://dx.doi.org/10.7717/peerj-cs.247#supplemental-information.
