
Multithreaded sparse QR


DESCRIPTION

The talk I gave at the Sparse Days, 2011 edition


Page 1: Multithreaded sparse QR

On the memory behavior of a multifrontal QR software for multicore systems

Alfredo Buttari, CNRS-IRIT

Toulouse, September 6-7, 2011

Page 2: Multithreaded sparse QR

The multifrontal QR method

Page 3: Multithreaded sparse QR

The Multifrontal QR for newbies

The multifrontal QR factorization is guided by a graph called the elimination tree:

• at each node of the tree, k pivots are eliminated

• each node of the tree is associated with a relatively small dense matrix called a frontal matrix (or, simply, a front), which contains the k columns related to the pivots and all the other coefficients involved in their elimination


Page 4: Multithreaded sparse QR

The Multifrontal QR for newbies

The tree is traversed in topological order (i.e., bottom-up) and, at each node, two operations are performed (see the sketch below):

• assembly: a set of coefficients from the original matrix associated with the pivots and a number of contribution blocks produced by the treatment of the child nodes are assembled together to form the frontal matrix

• factorization: the k pivots are eliminated through a complete QR factorization of the frontal matrix. As a result we get:
◦ k rows of the global R factor
◦ a set of Householder vectors
◦ a triangular contribution block that will be assembled into the father's front
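As a rough illustration only (this is not the software's actual code), the following C sketch shows the two per-node operations inside a bottom-up traversal of the elimination tree; the types and helpers (node_t, front_t, assemble_front, qr_factorize_front) are hypothetical.

```c
/* Rough sketch (hypothetical names): bottom-up traversal of the elimination
 * tree. Each node's front is assembled from original-matrix coefficients and
 * the children's contribution blocks, then factorized with a dense QR that
 * yields k rows of R, Householder vectors and a contribution block. */
typedef struct front front_t;       /* dense frontal matrix (opaque here)   */

typedef struct node {
    int            nchildren;
    struct node  **children;
    front_t       *front;
} node_t;

void assemble_front(node_t *n);     /* original coefficients + children CBs */
void qr_factorize_front(front_t *f);

void multifrontal_qr(node_t *n)
{
    /* children first: topological (bottom-up) order */
    for (int i = 0; i < n->nchildren; i++)
        multifrontal_qr(n->children[i]);

    assemble_front(n);              /* assembly step      */
    qr_factorize_front(n->front);   /* factorization step */
}
```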




Page 7: Multithreaded sparse QR

Multifrontal QR, parallelism

Page 8: Multithreaded sparse QR

The Multifrontal QR: parallelism

Two sources of parallelism are available in any multifrontal method:

Tree parallelism

• fronts associated with nodes in different branches are independent and can, thus, be factorized in parallel

Front parallelism

• if the size of a front is big enough, multiple processes may be used to factorize it


Page 9: Multithreaded sparse QR

Parallelism: classical approach

The classical approach (Puglisi, Matstom, Davis)

• Tree parallelism:
◦ a front assembly + factorization corresponds to a task
◦ computational tasks are added to a task pool
◦ threads fetch tasks from the pool repeatedly until all the fronts are done

• Front parallelism:
◦ multithreaded BLAS for the front factorization

What's wrong with this approach? A complete separation of the two levels of parallelism, which causes
• potentially strong load unbalance
• heavy synchronizations due to the sequential nature of some operations (assembly)
• sub-optimal exploitation of the concurrency in the multifrontal method

A minimal sketch of this classical task-pool scheme follows.
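The sketch below is an illustration of the classical scheme under stated assumptions, not the code of any of the cited packages: front_t, assemble_front and factorize_front are hypothetical, each task is a whole front, and front parallelism comes only from the multithreaded BLAS called inside factorize_front.

```c
/* Hedged sketch of the classical scheme (hypothetical names): one task per
 * front, a shared pool protected by a mutex, worker threads popping whole
 * fronts; front parallelism only comes from the multithreaded BLAS called
 * inside factorize_front(). */
#include <pthread.h>
#include <stddef.h>

typedef struct front front_t;
void assemble_front(front_t *f);      /* sequential assembly                */
void factorize_front(front_t *f);     /* dense QR using multithreaded BLAS  */

typedef struct {
    front_t        **ready;           /* fronts whose children are done     */
    size_t           count;
    pthread_mutex_t  lock;
} pool_t;

static front_t *pop_ready_front(pool_t *p)
{
    front_t *f = NULL;
    pthread_mutex_lock(&p->lock);
    if (p->count > 0)
        f = p->ready[--p->count];
    pthread_mutex_unlock(&p->lock);
    return f;
}

void *worker(void *arg)
{
    pool_t  *pool = arg;
    front_t *f;
    while ((f = pop_ready_front(pool)) != NULL) {
        assemble_front(f);            /* sequential per front               */
        factorize_front(f);           /* parallel only inside the BLAS      */
    }
    return NULL;
}
```

The sketch glosses over how a parent front enters the pool once all of its children are done; that bookkeeping, together with the sequential assembly, is where the synchronizations mentioned above come from.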

Page 10: Multithreaded sparse QR

Parallelism: a new approach

fine-grained, data-flow parallel approach

• fine granularity: tasks are not defined as operations on fronts but as operations on portions of fronts defined by a 1-D partitioning

• data-flow parallelism: tasks are scheduled dynamically based on the dependencies between them

Both node and tree parallelism are handled the same way at any level of the tree.


Page 11: Multithreaded sparse QR

Fine-grained, asynchronous, parallel QR

Page 12: Multithreaded sparse QR

Parallelism: a new approach

Fine granularity is achieved through a 1-D block partitioning of fronts and the definition of five elementary operations (see the sketch below):

1. activate(front): the activation of a front corresponds to the full determination of its (staircase) structure and the allocation of the needed memory areas

2. panel(bcol): QR factorization (Level-2 BLAS) of a block-column

3. update(bcol): update of a block-column of the trailing submatrix with respect to a panel

4. assemble(bcol): assembly of a block-column of the contribution block into the father

5. clean(front): cleanup of the front in order to release all the memory areas that are no longer needed
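For illustration, the five operations could be encoded as task kinds in a data-flow runtime roughly as follows; all type and field names are hypothetical, not those of the actual software.

```c
/* Hedged sketch: the five elementary operations as tasks of a data-flow
 * runtime; names are hypothetical. */
typedef enum {
    TASK_ACTIVATE,   /* determine the front's staircase structure, allocate it */
    TASK_PANEL,      /* Level-2 BLAS QR factorization of one block-column      */
    TASK_UPDATE,     /* update a trailing block-column w.r.t. a panel          */
    TASK_ASSEMBLE,   /* assemble a block-column of the CB into the father      */
    TASK_CLEAN       /* release memory areas that are no longer needed         */
} task_kind_t;

typedef struct task {
    task_kind_t kind;
    int         front_id;   /* front the task operates on                  */
    int         bcol;       /* block-column index (unused for front tasks) */
    int         deps_left;  /* unsatisfied dependencies; 0 => ready        */
} task_t;
```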


Page 13: Multithreaded sparse QR

Parallelism: a new approach

If a task is defined as the execution of one elementary operation on a block-column or a front, then the entire multifrontal factorization can be represented as a Directed Acyclic Graph (DAG) where the nodes represent tasks and the edges the dependencies among them.


Page 14: Multithreaded sparse QR

Parallelism: a new approach

• d1: no other elementary operation can be executed on a front or on one of its block-columns until the front has been activated;

• d2: a block-column can be updated with respect to a panel only if the corresponding panel factorization is completed;

• d3: the panel operation can be executed on block-column i only if it is up-to-date with respect to panel i−1;

• d4: a block-column can be updated with respect to a panel i in its front only if it is up-to-date with respect to the previous panel i−1 in the same front;

• d5: a block-column can be assembled into the parent (if it exists) when it is up-to-date with respect to the last panel factorization to be performed on the front it belongs to (in this case it is assumed that block-column i is up-to-date with respect to panel i when the corresponding panel operation is executed);

• d6: no other elementary operation can be executed on a block-column until all the corresponding portions of the contribution blocks from the child nodes have been assembled into it, in which case the block-column is said to be assembled;

• d7: since the structure of a frontal matrix depends on the structure of its children, a front can be activated only if all of its children are already active.

A sketch of how such dependencies can drive the scheduling is given below.
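These dependencies are the edges of the DAG of the previous slide. As an illustration of how such edges are typically enforced in an asynchronous runtime (not necessarily how this particular software does it), each task can carry a counter of unsatisfied incoming edges that is decremented whenever a predecessor completes; the names below are hypothetical.

```c
/* Hedged sketch: enforcing DAG dependencies d1-d7. Each task keeps a counter
 * of unsatisfied incoming edges and a list of successors; when a task
 * completes, successors whose counters reach zero become ready. */
#include <stdatomic.h>

typedef struct dag_task {
    struct dag_task **succ;       /* tasks depending on this one (d1..d7)   */
    int               nsucc;
    atomic_int        deps_left;  /* unsatisfied incoming dependencies      */
} dag_task_t;

void push_ready(dag_task_t *t);   /* hand a ready task to the scheduler     */

void dag_task_completed(dag_task_t *t)
{
    for (int i = 0; i < t->nsucc; i++) {
        /* last incoming edge satisfied => successor is ready to run */
        if (atomic_fetch_sub(&t->succ[i]->deps_left, 1) == 1)
            push_ready(t->succ[i]);
    }
}
```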

Page 15: Multithreaded sparse QR

Parallelism: a new approach

Tasks are scheduled dynamically and asynchronously


Page 16: Multithreaded sparse QR

Results


Page 17: Multithreaded sparse QR

Task scheduling

Page 18: Multithreaded sparse QR

Target architecture

Our target architecture has four hexa-core AMD Istanbul processors connected through HyperTransport links in a ring topology, with a memory module attached to each of them:

The bandwidth depends on the number of hops



Page 20: Multithreaded sparse QR

Proximity scheduling

The scheduling can be guided by a proximity criterion: a task should be executed by the core which is closest to the concerned data. This can be implemented through a system of task queues, one per thread/core (see the sketch below):

• the thread that activates a front becomes its owner

• all the tasks related to a front are pushed on the queue associated with its owner

• work-stealing is used to feed threads that run out of tasks:
◦ a thread will first try to steal tasks from neighbor queues...
◦ ...and then from any other queue

No front-to-core mapping is done (yet)!
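A minimal sketch of such per-core queues with neighbor-first stealing follows; queue_t, queue_pop, queue_steal and the by_distance table (cores sorted by increasing NUMA distance) are hypothetical.

```c
/* Hedged sketch of proximity scheduling: one task queue per core; the owner
 * pushes its front's tasks locally, and an idle core steals from the
 * closest non-empty queue first. All names are hypothetical. */
#include <stddef.h>

#define NCORES 24                 /* 4 hexa-core sockets on the target machine */

typedef struct task  task_t;
typedef struct queue queue_t;

task_t *queue_pop(queue_t *q);    /* pop from own queue (NULL if empty)   */
task_t *queue_steal(queue_t *q);  /* steal from someone else's queue      */

extern queue_t *queues[NCORES];
/* for each core, the other cores sorted by increasing NUMA distance */
extern int      by_distance[NCORES][NCORES];

task_t *next_task(int me)
{
    /* 1. own queue first: tasks of fronts owned by this core */
    task_t *t = queue_pop(queues[me]);
    if (t) return t;

    /* 2. then steal, trying the closest cores first */
    for (int i = 1; i < NCORES; i++) {
        t = queue_steal(queues[by_distance[me][i]]);
        if (t) return t;
    }
    return NULL;   /* nothing to do anywhere */
}
```

The distance ordering can be computed once at startup from the topology detected with hwloc, as discussed on the next slide.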


Page 26: Multithreaded sparse QR

Proximity scheduling

Implementing all this requires the ability to:

• control the placement of threads: we have to bind each thread to a single core and prevent thread migrations. This can be done in a number of ways, e.g. by means of tools such as hwloc, which allows thread pinning

• control the placement of data: we have to make sure that one front physically resides on a specific NUMA module. This can be done with:
◦ the first-touch rule: the data is allocated close to the core that makes the first reference
◦ hwloc or numalib, which provide NUMA-aware allocators

• detect the architecture: we have to figure out the memory/cores layout in order to guide the work stealing. This can be done with hwloc

A sketch combining the first two ingredients is given below.
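The sketch below assumes hwloc is available; bind_me_to_core and alloc_front_local are hypothetical helpers, 'topo' is assumed to have been set up with hwloc_topology_init() and hwloc_topology_load(), and error handling is omitted.

```c
/* Hedged sketch: pin the calling thread to a given core with hwloc, then
 * rely on the first-touch rule so that a front allocated and touched by its
 * owner lands on that core's NUMA module. */
#include <hwloc.h>
#include <stdlib.h>
#include <string.h>

static void bind_me_to_core(hwloc_topology_t topo, unsigned core_idx)
{
    hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, core_idx);
    hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_THREAD);
}

static double *alloc_front_local(size_t nvalues)
{
    double *buf = malloc(nvalues * sizeof *buf);
    if (buf)
        memset(buf, 0, nvalues * sizeof *buf);  /* first touch => local pages */
    return buf;
}
```

hwloc can also place memory explicitly (e.g. hwloc_alloc_membind), and it exposes the memory/cores layout needed to guide the work stealing.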



Page 30: Multithreaded sparse QR

Proximity scheduling: experiments

Experimental results show that proximity scheduling pays off:

Matrix      Strat.    Time (s)   Dwords on HT (×10^9)
Rucci1      no loc.   155        2000
            loc.      144        1634
ohne2       no loc.   43         504
            loc.      39         375
lp_nug20    no loc.   89         879
            loc.      84         778

Why? Fewer Dwords are transferred through the HyperTransport links, as shown by the number of occurrences of the PAPI HYPERTRANSPORT_LINKx:DATA_DWORD_SENT event (a sketch of reading such a counter follows).

But this is not the whole story!
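For reference, a hedged sketch of measuring such traffic around a code section with PAPI native events; the event name is the AMD family 10h native event quoted above and factorize() is just a placeholder.

```c
/* Hedged sketch: counting HyperTransport traffic around a code section with
 * a PAPI native event (AMD family 10h); link with -lpapi. */
#include <papi.h>
#include <stdio.h>

static void factorize(void) { /* placeholder: run the factorization here */ }

int main(void)
{
    int evset = PAPI_NULL;
    long long dwords[1];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&evset);
    /* one link; repeat for HYPERTRANSPORT_LINK1/2 to cover all links */
    PAPI_add_named_event(evset, "HYPERTRANSPORT_LINK0:DATA_DWORD_SENT");

    PAPI_start(evset);
    factorize();
    PAPI_stop(evset, dwords);

    printf("Dwords sent on HT link 0: %lld\n", dwords[0]);
    return 0;
}
```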



Page 33: Multithreaded sparse QR

Target architecture

What happens when more threads are active at the same time on the system?

BW drops; why? memory conflicts!



Page 36: Multithreaded sparse QR

Experiments on conflicts

Should we do something to reduce conflicts?

Memory interleaving should provide a more uniform distribution of data that (supposedly) reduces conflicts and increases the memory bandwidth.
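A minimal sketch of interleaved allocation with libnuma (helper names are hypothetical); the same effect can be obtained process-wide by launching the code under numactl --interleave=all.

```c
/* Hedged sketch: interleaved allocation with libnuma, spreading the pages
 * of a front round-robin over all NUMA nodes; link with -lnuma. */
#include <numa.h>
#include <stddef.h>

double *alloc_front_interleaved(size_t nvalues)
{
    if (numa_available() < 0)
        return NULL;                         /* no NUMA support */
    return numa_alloc_interleaved(nvalues * sizeof(double));
}

void free_front_interleaved(double *buf, size_t nvalues)
{
    if (buf)
        numa_free(buf, nvalues * sizeof(double));
}
```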



Page 38: Multithreaded sparse QR

Experiments on conflicts

Matrix      Strat.    Time (s)   Dwords on HT (×10^9)   Conflicts on DCT (×10^9)
Rucci1      no loc.   155        2000                    23.8
            loc.      144        1634                    25.4
            r. r.     117        2306                    19.7
ohne2       no loc.   43         504                     6.63
            loc.      39         375                     6.71
            r. r.     38         665                     4.54
lp_nug20    no loc.   89         879                     11.4
            loc.      84         778                     11.9
            r. r.     66         985                     6.81

(Strat.: "no loc." = no locality-aware scheduling, "loc." = proximity scheduling, "r. r." = round-robin memory interleaving.)

Conflicts are DRAM_ACCESSES_PAGE:DCTx_PAGE_CONFLICT events measured by PAPI.

Analogous behavior was found on the HSL MA87 code.

The answer is: yes, we should reduce conflicts (work in progress).



Page 44: Multithreaded sparse QR

Conclusions

• fine granularity, asynchronism and data-flow parallelism are tough (especially all together) but the effort largely pays off

• a correct exploitation of the memory system is critical, especially on NUMA systems

• reducing memory transfers gives some performance improvement but...

• ...maybe it is not the most important thing: memory conflicts seem to be way more penalizing than data traffic. How do we solve this? Not clear for the moment. Maybe a careful front-to-memory mapping? Ideas and suggestions are very welcome


Page 45: Multithreaded sparse QR

Thank you all! Questions?