Cluster 2007, September 17-20, 2007
Satisfying Your Dependencies with SuperMatrix
Ernie Chan
Motivation
- Transparent parallelization of matrix operations for SMP and multi-core architectures
  - Schedule submatrix operations out-of-order via dependency analysis
- Programmability
  - High-level abstractions hide the details of parallelization from the user
Outline
- SuperMatrix
- Implementation
- Performance Results
- Conclusion
SuperMatrix
SuperMatrix
/* Blocked LU factorization without pivoting, written with the FLAME API.
   Each iteration factors the diagonal block A11, updates the row panel
   A12 and column panel A21 with triangular solves, and updates the
   trailing submatrix A22 with a rank-b matrix product. */
FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,     0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) &&
        FLA_Obj_width ( ATL ) < FLA_Obj_width ( A ) )
{
  b = min( FLA_Obj_length( ABR ), nb_alg );

  FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                      /* ************* */   /* ******************** */
                                              &A10, /**/ &A11, &A12,
                         ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                         b, b, FLA_BR );
  /*------------------------------------------------------------------*/
  FLA_LU_nopiv( A11 );                          /* A11 := LU( A11 )     */
  FLA_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
            FLA_NO_TRANSPOSE, FLA_UNIT_DIAG,
            FLA_ONE, A11, A12 );                /* A12 := inv(L11) A12  */
  FLA_Trsm( FLA_RIGHT, FLA_UPPER_TRIANGULAR,
            FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
            FLA_ONE, A11, A21 );                /* A21 := A21 inv(U11)  */
  FLA_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
            FLA_MINUS_ONE, A21, A12,
            FLA_ONE, A22 );                     /* A22 := A22 - A21 A12 */
  /*------------------------------------------------------------------*/
  FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                   A10, A11, /**/ A12,
                          /* ************** */   /* ****************** */
                            &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                            FLA_TL );
}
SuperMatrix
LU Factorization Without Pivoting, Iteration 1
[Figure: on a 3 x 3 matrix of blocks, iteration 1 generates 9 tasks: 1 LU, 4 TRSMs, and 4 GEMMs]
SuperMatrix
LU Factorization Without Pivoting, Iteration 2
[Figure: iteration 2 generates 4 tasks on the trailing 2 x 2 submatrix: 1 LU, 2 TRSMs, and 1 GEMM]
SuperMatrix
LU Factorization Without Pivoting, Iteration 3
[Figure: iteration 3 generates a single LU task on the last diagonal block]
SuperMatrix
FLASH: store a matrix hierarchically as a matrix of matrices, where each element is a submatrix block
SuperMatrix
/* The same algorithm, now over a FLASH matrix of blocks, advancing one
   block (1, 1) per iteration.  The FLASH_ routines do not compute
   immediately: they enqueue tasks, and FLASH_Queue_exec() then executes
   the queued tasks in parallel. */
FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,     0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) &&
        FLA_Obj_width ( ATL ) < FLA_Obj_width ( A ) )
{
  FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                      /* ************* */   /* ******************** */
                                              &A10, /**/ &A11, &A12,
                         ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                         1, 1, FLA_BR );
  /*------------------------------------------------------------------*/
  FLASH_LU_nopiv( A11 );
  FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
              FLA_NO_TRANSPOSE, FLA_UNIT_DIAG,
              FLA_ONE, A11, A12 );
  FLASH_Trsm( FLA_RIGHT, FLA_UPPER_TRIANGULAR,
              FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
  FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, A12, FLA_ONE, A22 );
  /*------------------------------------------------------------------*/
  FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                   A10, A11, /**/ A12,
                          /* ************** */   /* ****************** */
                            &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                            FLA_TL );
}
FLASH_Queue_exec( );  /* execute all enqueued tasks in parallel */
SuperMatrix
Analyzer
- Delay execution and place tasks on a queue
- Tasks are function pointers annotated with input/output information
- Compute dependence information (flow, anti, output) between all tasks
- Create a DAG of tasks
SuperMatrix
Dispatcher
- Use the DAG to execute tasks out-of-order in parallel
- Akin to Tomasulo's algorithm and instruction-level parallelism, applied to blocks of computation
- SuperScalar vs. SuperMatrix
SuperMatrix
Dispatcher
- 4 threads
- 5 x 5 matrix of blocks
- 55 tasks
- 18 stages

[Figure: DAG of the 55 LU, TRSM, and GEMM tasks for LU factorization of a 5 x 5 matrix of blocks]
Outline
- SuperMatrix
- Implementation
- Performance Results
- Conclusion
Implementation
Analyzer
[Figure: FLASH routines enqueue LU, TRSM, and GEMM tasks onto the task queue, from which the DAG of tasks is constructed]
Implementation
Analyzer
- FLASH routines enqueue tasks onto a global task queue
- Dependencies between tasks are calculated and stored in the task structure
  - Each submatrix block stores the last task enqueued that writes to it
  - Flow dependencies occur when a subsequent task reads that block
- The DAG is embedded in the task queue
Implementation
Dispatcher
[Figure: ready tasks move from the task queue to a waiting queue, from which threads dequeue and execute them]
Implementation
Dispatcher
- Place ready and available tasks on a global waiting queue
  - The first task on the task queue is always ready and available
- Threads asynchronously dequeue tasks from the head of the waiting queue
- Once a task completes execution, notify dependent tasks and update the waiting queue
- Loop until all tasks complete execution
Outline
- SuperMatrix
- Implementation
- Performance Results
- Conclusion
Performance Results
Target Architectures

  Platform   Processing Elements   Peak (GFLOPS)   BLAS Library
  Itanium2           16                96.0          MKL 8.1
  Xeon                8                41.6          MKL 9.0
  Opteron             8                41.6          ACML 3.6
  POWER5              8                60.8          ESSL 4.2
Performance Results
GotoBLAS 1.13 installed on all machines

Supported Operations
- LAPACK-level functions
  - Cholesky factorization
  - LU factorization without pivoting
- All level-3 BLAS
  - GEMM, TRMM, TRSM
  - SYMM, SYRK, SYR2K
  - HEMM, HERK, HER2K
Performance Results
- Implementations
  - SuperMatrix + serial BLAS
  - FLAME + multithreaded BLAS
  - LAPACK + multithreaded BLAS
- Block size = 192
- Processing elements = 8
Performance Results
SuperMatrix Implementation
- Fixed block size
  - Varying block sizes can lead to better performance
  - Experiments show 192 is generally the best
- Simplest scheduling
  - No sorting to execute tasks on the critical path earlier
  - No attempt to improve data locality in these experiments
Performance Results
[Figures: performance graphs for the target architectures; the plots are not preserved in this transcript]
Outline
- SuperMatrix
- Implementation
- Performance Results
- Conclusion
Conclusion
- Apply out-of-order execution techniques to schedule tasks
- The whole is greater than the sum of the parts
  - Exploit parallelism between operations
- Despite having to calculate dependencies, SuperMatrix incurs only small performance penalties
Conclusion
Programmability
- Code at a high level without needing to deal with aspects of parallelization
Authors
Ernie Chan
Field G. Van Zee
Enrique S. Quintana-Ortí
Gregorio Quintana-Ortí
Robert van de Geijn

The University of Texas at Austin
Universidad Jaume I
Acknowledgements
We thank the Texas Advanced Computing Center (TACC) for access to their machines and their support
Funding
- NSF Grants CCF-0540926 and CCF-0702714
References
[1] Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix Out-of-Order Scheduling of Matrix Operations on SMP and Multi-Core Architectures. In SPAA '07: Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 116-125, San Diego, CA, USA, June 2007.
[2] Ernie Chan, Field G. Van Zee, Paolo Bientinesi, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix: A Multithreaded Runtime Scheduling System for Algorithms-by-Blocks. Submitted to PPoPP 2008.
[3] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Robert A. van de Geijn, and Field G. Van Zee. Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures. Submitted to Euromicro PDP 2008.
Conclusion
More Information
http://www.cs.utexas.edu/users/flame
Questions?