Cluster 2007, September 17-20, 2007
Satisfying Your Dependencies with SuperMatrix
Ernie Chan
Motivation
- Transparent parallelization of matrix operations for SMP and multi-core architectures
  - Schedule submatrix operations out-of-order via dependency analysis
- Programmability
  - High-level abstractions hide the details of parallelization from the user
Outline
- SuperMatrix
- Implementation
- Performance Results
- Conclusion
SuperMatrix
SuperMatrix
/* Blocked LU factorization without pivoting, written with the FLAME API.
   Each iteration factors the diagonal block A11, updates the row panel
   A12 and column panel A21 with triangular solves, and updates the
   trailing submatrix A22 with a rank-b matrix product. */
FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,     0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) &&
        FLA_Obj_width ( ATL ) < FLA_Obj_width ( A ) )
{
  b = min( FLA_Obj_length( ABR ), nb_alg );

  FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                      /* ************* */   /* ******************** */
                                              &A10, /**/ &A11, &A12,
                         ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                         b, b, FLA_BR );
  /*------------------------------------------------------------------*/
  FLA_LU_nopiv( A11 );                          /* A11 := LU( A11 )     */
  FLA_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
            FLA_NO_TRANSPOSE, FLA_UNIT_DIAG,
            FLA_ONE, A11, A12 );                /* A12 := inv(L11) A12  */
  FLA_Trsm( FLA_RIGHT, FLA_UPPER_TRIANGULAR,
            FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
            FLA_ONE, A11, A21 );                /* A21 := A21 inv(U11)  */
  FLA_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
            FLA_MINUS_ONE, A21, A12,
            FLA_ONE, A22 );                     /* A22 := A22 - A21 A12 */
  /*------------------------------------------------------------------*/
  FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                   A10, A11, /**/ A12,
                          /* ************** */   /* ****************** */
                            &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                            FLA_TL );
}
SuperMatrix
LU Factorization Without Pivoting, Iteration 1
[Figure: on a 3 x 3 matrix of blocks, iteration 1 generates 9 tasks: 1 LU, 4 TRSMs, and 4 GEMMs]
SuperMatrix
LU Factorization Without Pivoting, Iteration 2
[Figure: iteration 2 generates 4 tasks on the trailing 2 x 2 submatrix: 1 LU, 2 TRSMs, and 1 GEMM]
SuperMatrix
LU Factorization Without Pivoting, Iteration 3
[Figure: iteration 3 generates a single LU task on the last diagonal block]
SuperMatrix
FLASH: store a matrix hierarchically as a matrix of matrices, where each element is a submatrix block
SuperMatrix
/* The same algorithm, now over a FLASH matrix of blocks, advancing one
   block (1, 1) per iteration.  The FLASH_ routines do not compute
   immediately: they enqueue tasks, and FLASH_Queue_exec() then executes
   the queued tasks in parallel. */
FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,     0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) &&
        FLA_Obj_width ( ATL ) < FLA_Obj_width ( A ) )
{
  FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                      /* ************* */   /* ******************** */
                                              &A10, /**/ &A11, &A12,
                         ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                         1, 1, FLA_BR );
  /*------------------------------------------------------------------*/
  FLASH_LU_nopiv( A11 );
  FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
              FLA_NO_TRANSPOSE, FLA_UNIT_DIAG,
              FLA_ONE, A11, A12 );
  FLASH_Trsm( FLA_RIGHT, FLA_UPPER_TRIANGULAR,
              FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
  FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, A12, FLA_ONE, A22 );
  /*------------------------------------------------------------------*/
  FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                   A10, A11, /**/ A12,
                          /* ************** */   /* ****************** */
                            &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                            FLA_TL );
}
FLASH_Queue_exec( );  /* execute all enqueued tasks in parallel */
SuperMatrix
Analyzer
- Delay execution and place tasks on a queue
- Tasks are function pointers annotated with input/output information
- Compute dependence information (flow, anti, output) between all tasks
- Create a DAG of tasks
SuperMatrix
Dispatcher
- Use the DAG to execute tasks out-of-order in parallel
- Akin to Tomasulo's algorithm and instruction-level parallelism, applied to blocks of computation
- SuperScalar vs. SuperMatrix
SuperMatrix
Dispatcher
- 4 threads
- 5 x 5 matrix of blocks
- 55 tasks
- 18 stages

[Figure: DAG of the 55 LU, TRSM, and GEMM tasks for LU factorization of a 5 x 5 matrix of blocks]
Outline
- SuperMatrix
- Implementation
- Performance Results
- Conclusion
Implementation
Analyzer
[Figure: FLASH routines enqueue LU, TRSM, and GEMM tasks onto the task queue, from which the DAG of tasks is constructed]
Implementation
Analyzer
- FLASH routines enqueue tasks onto a global task queue
- Dependencies between tasks are calculated and stored in the task structure
  - Each submatrix block stores the last task enqueued that writes to it
  - Flow dependencies occur when a subsequent task reads that block
- The DAG is embedded in the task queue
Implementation
Dispatcher
[Figure: ready tasks move from the task queue to a waiting queue, from which threads dequeue and execute them]
Implementation
Dispatcher
- Place ready and available tasks on a global waiting queue
  - The first task on the task queue is always ready and available
- Threads asynchronously dequeue tasks from the head of the waiting queue
- Once a task completes execution, notify dependent tasks and update the waiting queue
- Loop until all tasks complete execution
Outline
- SuperMatrix
- Implementation
- Performance Results
- Conclusion
Performance Results
Target Architectures

  Platform   Processing Elements   Peak (GFLOPS)   BLAS Library
  Itanium2           16                96.0          MKL 8.1
  Xeon                8                41.6          MKL 9.0
  Opteron             8                41.6          ACML 3.6
  POWER5              8                60.8          ESSL 4.2
Performance Results
GotoBLAS 1.13 installed on all machines

Supported Operations
- LAPACK-level functions
  - Cholesky factorization
  - LU factorization without pivoting
- All level-3 BLAS
  - GEMM, TRMM, TRSM
  - SYMM, SYRK, SYR2K
  - HEMM, HERK, HER2K
Performance Results
- Implementations
  - SuperMatrix + serial BLAS
  - FLAME + multithreaded BLAS
  - LAPACK + multithreaded BLAS
- Block size = 192
- Processing elements = 8
Performance Results
SuperMatrix Implementation
- Fixed block size
  - Varying block sizes can lead to better performance
  - Experiments show 192 is generally the best
- Simplest scheduling
  - No sorting to execute tasks on the critical path earlier
  - No attempt to improve data locality in these experiments
Performance Results
[Figures: performance graphs for the target architectures; the plots are not preserved in this transcript]
Outline
- SuperMatrix
- Implementation
- Performance Results
- Conclusion
Conclusion
- Apply out-of-order execution techniques to schedule tasks
- The whole is greater than the sum of the parts
  - Exploit parallelism between operations
- Despite having to calculate dependencies, SuperMatrix incurs only small performance penalties
Conclusion
Programmability
- Code at a high level without needing to deal with aspects of parallelization
Authors
Ernie Chan
Field G. Van Zee
Enrique S. Quintana-Ortí
Gregorio Quintana-Ortí
Robert van de Geijn

The University of Texas at Austin
Universidad Jaume I
Acknowledgements
We thank the Texas Advanced Computing Center (TACC) for access to their machines and their support
Funding
- NSF Grants CCF-0540926 and CCF-0702714
References
[1] Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix Out-of-Order Scheduling of Matrix Operations on SMP and Multi-Core Architectures. In SPAA '07: Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 116-125, San Diego, CA, USA, June 2007.
[2] Ernie Chan, Field G. Van Zee, Paolo Bientinesi, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix: A Multithreaded Runtime Scheduling System for Algorithms-by-Blocks. Submitted to PPoPP 2008.
[3] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Robert A. van de Geijn, and Field G. Van Zee. Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures. Submitted to Euromicro PDP 2008.
Conclusion
More Information
http://www.cs.utexas.edu/users/flame
Questions?