RWTH-‐Aachen, March 2011 HPC&A – UJI 1
Enrique S. Quintana-Ortí
Modern Linear Algebra Libraries for Graphics Processors
RWTH-‐Aachen, March 2011 HPC&A – UJI 2
Index
(Tentative)
1. Multi-threaded architectures (40 m)
2. Dense linear algebra for multi-core (40 m)
3. Programming in CUDA (40 m)
4. Dense linear algebra for GPUs (1.5 h)
5. Summary
RWTH-‐Aachen, March 2011 HPC&A – UJI 3
Index
1. Multi-threaded architectures
   1. Evolution in processor performance
   2. Multi-core processors
   3. Multi-core processors vs. many-core GPUs
   4. Present and future heterogeneity
2. Dense linear algebra for multi-core
3. Programming in CUDA
4. Dense linear algebra for GPUs
5. Summary
RWTH-‐Aachen, March 2011 HPC&A – UJI 4
Multi-threaded Architectures Evolution in Processor Performance
“Computer Architecture: A Quantitative Approach”, J. Hennessy, D. Patterson, 2008
RWTH-‐Aachen, March 2011 HPC&A – UJI 5
Multi-threaded Architectures Evolution in Processor Performance
RISC architectures: “the attack of the killer micros” (E. Brooks, 1989) or “Cray on a chip” (Intel i860)
CMOS technology
Pipelined and superscalar processors
High frequency
Large, multi-level cache memories
SIMD FP units
RWTH-‐Aachen, March 2011 HPC&A – UJI 6
Multi-threaded Architectures Evolution in Processor Performance
“Computer Architecture: A Quantitative Approach”, J. Hennessy, D. Patterson, 2008
RWTH-‐Aachen, March 2011 HPC&A – UJI 7
Multi-threaded Architectures Evolution in Processor Performance
“The free lunch is over” (H. Sutter, 2005)
Frequency wall
Instruction-level parallelism (ILP) wall
Memory wall
RWTH-‐Aachen, March 2011 HPC&A – UJI 8
Multi-threaded Architectures Evolution in Processor Performance
Frequency wall (end of the GHz race):
Energy consumption proportional to f²
Electricity is expensive and produces heat
RWTH-‐Aachen, March 2011 HPC&A – UJI 9
Multi-threaded Architectures Evolution in Processor Performance
ILP wall:
On average, 1 branch every 5 instructions
Control hazards block the pipeline
Branch prediction is costly
[Figure: four overlapping instructions flowing through the IF–ID–EX–WB pipeline stages]
RWTH-‐Aachen, March 2011 HPC&A – UJI 10
Multi-threaded Architectures Evolution in Processor Performance
Memory wall:
Memory access latency has hardly improved: 1 memory access ≈ 240 cycles (2008)
By design, large caches are slow
RWTH-‐Aachen, March 2011 HPC&A – UJI 11
Multi-threaded Architectures Evolution in Processor Performance
…but Moore’s Law still holds
≡ The number of transistors that can be integrated on a chip (at constant cost) doubles every 1.5–2 years

“Cramming more components onto integrated circuits”, G. E. Moore, 1965: “The complexity for minimum component costs has increased at a rate of roughly a factor of two per year... Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000. I believe that such a large circuit can be built on a single wafer.”
RWTH-‐Aachen, March 2011 HPC&A – UJI 12
Multi-threaded Architectures Evolution in Processor Performance
What can we do with these transistors?
Smaller chips to increase frequency?
More complex architectures to further exploit ILP?
Larger caches to hide memory latency?
NO!
RWTH-‐Aachen, March 2011 HPC&A – UJI 13
Multi-threaded Architectures Multi-core Processors
[Figure (P. Hofstee, IBM Austin): single-thread performance levels off around 2005; further gains come from multi-core processors, i.e. more active transistors rather than higher frequency]
Exploit ILP: minimize response time
Exploit TLP: while a thread is blocked waiting for memory, execute an alternative ready thread; maximize throughput
RWTH-‐Aachen, March 2011 HPC&A – UJI 14
Multi-threaded Architectures Multi-core Processors
The race for the multi-core:

DELL PowerEdge C6145: two four-socket servers in a 2U chassis with 12-core Opteron processors = 96 cores!

Processor          #cores   f (GHz)   L3 (MB)   Power (W)
Intel Xeon X7560        8      2.66        24         130
AMD Opteron            12      2.5         12         105
RWTH-‐Aachen, March 2011 HPC&A – UJI 15
Multi-threaded Architectures Multi-core Processors vs. Many-core GPUs
[Figure (P. Hofstee, IBM Austin): beyond multi-core processors, further performance growth is expected from heterogeneous and then specialized processors — again more active transistors rather than higher frequency]
RWTH-‐Aachen, March 2011 HPC&A – UJI 16
Multi-threaded Architectures Multi-core Processors vs. Many-core GPUs
The race for the many-core:

Tesla S2050: four Tesla C2070 GPUs in a 1U chassis = 1,792 cores!

Processor               #cores   f (GHz)   Memory (GB)   Power (W)
Tesla C2070 (Fermi)        448      1.15             6         238
AMD FireStream 9270        800         ?             2        <220
RWTH-‐Aachen, March 2011 HPC&A – UJI 17
Multi-threaded Architectures Multi-core Processors vs. Many-core GPUs
Intel Nehalem
RWTH-‐Aachen, March 2011 HPC&A – UJI 18
Multi-threaded Architectures Multi-core Processors vs. Many-core GPUs
NVIDIA GT200. Building block: the SM
8 scalar processors
1 double-precision FP unit
16 KB of shared memory
Register bank (16,384 registers × 4 bytes = 64 KB)

[Diagram: one SM with its 8 ALUs and the shared memory]
RWTH-‐Aachen, March 2011 HPC&A – UJI 19
Multi-threaded Architectures Multi-core Processors vs. Many-core GPUs
NVIDIA GT200: 30 SMs → 240 SP “cores”
RWTH-‐Aachen, March 2011 HPC&A – UJI 20
Multi-threaded Architectures Multi-core Processors vs. Many-core GPUs
CPU:
Large caches
Few processing elements
Tuned for spatial/temporal locality
Sophisticated control

GPU:
Small caches
High number of processing elements
Tuned for sequential access to data (streaming)

[Diagram: CPU die dominated by control logic and cache next to its DRAM, vs. GPU die dominated by ALUs next to its DRAM]
RWTH-‐Aachen, March 2011 HPC&A – UJI 21
Multi-threaded Architectures Present and Future Heterogeneity
Heterogeneous platforms

[Diagram: CPU with its RAM (6–30 GB/s) and cache (40 GB/s), connected through PCI-E (1–8 GB/s) to a GPU with its ALUs and its own RAM (20–160 GB/s)]
RWTH-‐Aachen, March 2011 HPC&A – UJI 22
Multi-threaded Architectures Present and Future Heterogeneity
NVIDIA Denver (2011/05/01): CPU with the ARM instruction set integrated on the same chip as the GPU

AMD Fusion APU: x86-compatible CPU with programmable vector processing engines on a single chip
RWTH-‐Aachen, March 2011 HPC&A – UJI 23
Index
1. Multi-threaded architectures (40 m?)
2. Dense linear algebra for multi-core
   1. Operations
   2. BLAS
   3. LAPACK
   4. High performance
3. Programming in CUDA
4. Dense linear algebra for GPUs
5. Summary
RWTH-‐Aachen, March 2011 HPC&A – UJI 24
Dense Linear Algebra for Multi-core Operations
Dense linear algebra is at the bottom of the food chain for many scientific/engineering applications:
Molecular dynamics simulations
Fast acoustic scattering problems
Dielectric polarization of nanostructures
Magneto-hydrodynamics
Macro-economics
RWTH-‐Aachen, March 2011 HPC&A – UJI 25
Dense Linear Algebra for Multi-core Operations
Radar cross-section problem (via BEM)
Solve A x = b, with dense A of size n × n; n = hundreds of thousands of boundary points (or panels)
RWTH-‐Aachen, March 2011 HPC&A – UJI 26
Dense Linear Algebra for Multi-core Operations
Estimation of Earth’s gravity field
www.csr.utexas.edu/grace
Solve y = H x₀ + ε, with dense H of size m × n; m = 66,000 observations, n = 26,000 parameters for a model of 250 km resolution
RWTH-‐Aachen, March 2011 HPC&A – UJI 27
Dense Linear Algebra for Multi-core Operations
Optimal cooling of steel profiles
Solve Aᵀ X + X A − X S X + Q = 0, with dense A of size n × n; n = 5,177 for a mesh width of 6.91·10⁻³
RWTH-‐Aachen, March 2011 HPC&A – UJI 28
Dense Linear Algebra for Multi-core BLAS (Basic Linear Algebra Subprograms)
What is BLAS? Routines that provide standard building blocks for performing basic vector and matrix operations
Why BLAS? Scientific/engineering applications often use these kernels.
Standardizing the BLAS interface allows:
Highly tuned, hardware-specific implementations of BLAS
Easily recognizable functions, fewer errors
Portability of scientific codes without sacrificing performance
RWTH-‐Aachen, March 2011 HPC&A – UJI 29
Dense Linear Algebra for Multi-core BLAS (Basic Linear Algebra Subprograms)
Implementations of BLAS (multi-threaded): IBM ESSL, Intel MKL, AMD ACML, NVIDIA CUBLAS; K. Goto's GotoBLAS2, C. Whaley's ATLAS

Organization of BLAS:
BLAS-1: O(n) flops on O(n) data — SCAL, AXPY, DOT, …
BLAS-2: O(n²) flops on O(n²) data — GEMV, TRSV, …
BLAS-3: O(n³) flops on O(n²) data — GEMM, SYRK, TRSM, …

Functionality of BLAS: google “BLAS quick reference guide”
RWTH-‐Aachen, March 2011 HPC&A – UJI 30
Dense Linear Algebra for Multi-core BLAS (Basic Linear Algebra Subprograms)
Example of BLAS-1: y := y + α x, with α ∈ ℜ, x, y ∈ ℜ^n

      SUBROUTINE DAXPY(N,DA,DX,INCX,DY,INCY)
*     .. Scalar Arguments ..
      DOUBLE PRECISION DA
      INTEGER INCX,INCY,N
*     ..
*     .. Array Arguments ..
      DOUBLE PRECISION DX(*),DY(*)
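A minimal sketch, not part of the original slides, of how such a BLAS-1 routine is typically called from C. It assumes the common Fortran naming convention (trailing underscore) and that every argument is passed by reference; the program links against any BLAS library (e.g. -lblas).

/* Calling the Fortran-77 DAXPY from C (assumed calling convention). */
#include <stdio.h>

extern void daxpy_( const int *n, const double *da, const double *dx,
                    const int *incx, double *dy, const int *incy );

int main( void ) {
  int    n = 3, incx = 1, incy = 1;
  double alpha = 2.0, x[3] = { 1.0, 2.0, 3.0 }, y[3] = { 0.0, 0.0, 0.0 };

  daxpy_( &n, &alpha, x, &incx, y, &incy );   /* y := y + alpha*x */
  printf( "y = %g %g %g\n", y[0], y[1], y[2] );
  return 0;
}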
RWTH-‐Aachen, March 2011 HPC&A – UJI 31
Dense Linear Algebra for Multi-core BLAS (Basic Linear Algebra Subprograms)
Example of BLAS-2: y := β y + α A x, with α, β ∈ ℜ, A ∈ ℜ^(m×n), x ∈ ℜ^n, y ∈ ℜ^m

      SUBROUTINE DGEMV(TRANS,M,N,ALPHA,A,LDA,X,INCX,BETA,Y,INCY)
*     .. Scalar Arguments ..
      DOUBLE PRECISION ALPHA,BETA
      INTEGER INCX,INCY,LDA,M,N
      CHARACTER TRANS
*     ..
*     .. Array Arguments ..
      DOUBLE PRECISION A(LDA,*),X(*),Y(*)
RWTH-‐Aachen, March 2011 HPC&A – UJI 32
Dense Linear Algebra for Multi-core BLAS (Basic Linear Algebra Subprograms)
Example of BLAS-3: C := β C + α A B, with α, β ∈ ℜ, A ∈ ℜ^(m×k), B ∈ ℜ^(k×n), C ∈ ℜ^(m×n)

      SUBROUTINE DGEMM(TRANSA,TRANSB,M,N,K,ALPHA,A,LDA,B,LDB,BETA,C,LDC)
*     .. Scalar Arguments ..
      DOUBLE PRECISION ALPHA,BETA
      INTEGER K,LDA,LDB,LDC,M,N
      CHARACTER TRANSA,TRANSB
*     ..
*     .. Array Arguments ..
      DOUBLE PRECISION A(LDA,*),B(LDB,*),C(LDC,*)
RWTH-‐Aachen, March 2011 HPC&A – UJI 33
Dense Linear Algebra for Multi-core BLAS (Basic Linear Algebra Subprograms)
Data types: real/complex, single/double precision:
SGEMV, DGEMV, CGEMV, ZGEMV, …

Matrix types: general, symmetric, triangular, band, …:
SGEMV, SSYMV, STRMV, SGBMV, …
RWTH-‐Aachen, March 2011 HPC&A – UJI 34
Dense Linear Algebra for Multi-core LAPACK (Linear Algebra Package)
Builds upon BLAS to provide more complex functionality:
Systems of linear equations
Linear least-squares problems
Eigenvalues and singular values
RWTH-‐Aachen, March 2011 HPC&A – UJI 35
Dense Linear Algebra for Multi-core High Performance
Three main sources of inefficiency in the exploitation of multi-core processors:
Memory latency
Load imbalance
Serial bottlenecks

It also holds for GPUs!
RWTH-‐Aachen, March 2011 HPC&A – UJI 36
Dense Linear Algebra for Multi-core High Performance
Memory latency: the fundamental obstacle to high performance (executing useful computations at the rate at which the CPU can process them) is the speed of memory: fetching and/or storing a data item from/to memory takes longer than performing a flop with it.

[Figure: memory hierarchy — registers (small, fast) → cache → RAM (large, slow)]
RWTH-‐Aachen, March 2011 HPC&A – UJI 37
Dense Linear Algebra for Multi-core High Performance
Impact of cache misses on performance:

laptop$ cat /proc/cpuinfo
processor  : 0
...
model name : Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz
cpu cores  : 2
...
laptop$ cat /proc/meminfo
MemTotal:  3079028 kB
...
google: "Intel Core 2 Duo P8400" --> 3 MB L2 cache, 2.26 GHz, 1066 MHz FSB
Peak performance is 2 cores × 2.26·10⁹ cycles/sec. × 2 flops/cycle = 9.04 GFLOPS
RWTH-‐Aachen, March 2011 HPC&A – UJI 38
Dense Linear Algebra for Multi-core High Performance
>> time_gemv
n 100   Time (secs.) 2.9543e-05   GFLOPS 6.7698e-01
n 200   Time (secs.) 5.9113e-05   GFLOPS 1.3533e+00
n 300   Time (secs.) 1.0537e-04   GFLOPS 1.7082e+00
Why only 2 GFLOPS?
Why the “hump” in the performance curve?
RWTH-‐Aachen, March 2011 HPC&A – UJI 39
Dense Linear Algebra for Multi-core High Performance
Matrix-matrix product has the potential of hiding memory latency!
Operation                                    flops    memops   flops/memops
AXPY (vector-vector)   y := y + α x             2n        3n        2/3
GEMV (matrix-vector)   y := β y + α A x        2n²        n²        2
GEMM (matrix-matrix)   C := β C + α A B        2n³       4n²        n/2
RWTH-‐Aachen, March 2011 HPC&A – UJI 40
Dense Linear Algebra for Multi-core High Performance
Naïve implementation: a sequence of matrix-vector products (C += A·B, column by column)

for (j=0; j<n; j++)
  for (i=0; i<n; i++)
    for (k=0; k<n; k++)
      C[i][j] += A[i][k] * B[k][j];
RWTH-‐Aachen, March 2011 HPC&A – UJI 41
Dense Linear Algebra for Multi-core High Performance
>> time_gemm
n 100   Time (secs.) 2.9543e-05   GFLOPS 6.7698e-01
n 200   Time (secs.) 5.9113e-05   GFLOPS 1.3533e+00
n 300   Time (secs.) 1.0537e-04   GFLOPS 1.7082e+00
Mimics the performance of matrix-vector product!
RWTH-‐Aachen, March 2011 HPC&A – UJI 42
Dense Linear Algebra for Multi-core High Performance
Blocked implementation: keep the blocks being updated in cache during the whole sequence of updates (see the sketch below).
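A minimal sketch of the blocked idea, not the code timed in these slides. It assumes a problem size N that is a multiple of the block size NB, chosen so that three NB × NB blocks fit in cache; the inner triple loop is the same as in the naïve code, but restricted to one block of each operand.

#define N  768      /* hypothetical problem size, multiple of NB */
#define NB  64      /* block size: three NB x NB blocks should fit in cache */

void gemm_blocked( float A[N][N], float B[N][N], float C[N][N] )
{
  int i, j, k, ii, jj, kk;
  /* loop over the blocks of C, A and B ... */
  for (j = 0; j < N; j += NB)
    for (i = 0; i < N; i += NB)
      for (k = 0; k < N; k += NB)
        /* ... and update the block C(i:i+NB-1, j:j+NB-1) while it stays in cache */
        for (jj = j; jj < j+NB; jj++)
          for (ii = i; ii < i+NB; ii++)
            for (kk = k; kk < k+NB; kk++)
              C[ii][jj] += A[ii][kk] * B[kk][jj];
}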
RWTH-‐Aachen, March 2011 HPC&A – UJI 43
Dense Linear Algebra for Multi-core High Performance
>> time_gemm_blocked
n 100   Time (secs.) 9.2465e-04   GFLOPS 2.1630e+00
n 200   Time (secs.) 7.2974e-03   GFLOPS 2.1926e+00
n 300   Time (secs.) 2.4632e-02   GFLOPS 2.1923e+00
Hides memory latency!
RWTH-‐Aachen, March 2011 HPC&A – UJI 44
Dense Linear Algebra for Multi-core High Performance
Load imbalance and serial bottlenecks: distribute the computational load evenly among the cores and prioritize the execution of work on the critical path.
RWTH-‐Aachen, March 2011 HPC&A – UJI 45
Dense Linear Algebra for Multi-core High Performance
Cholesky factorization (LAPACK)
Key in the solution of s.p.d. linear systems:
A x = b ≡ (L Lᵀ) x = b:  solve L y = b for y, then Lᵀ x = y for x

A = L · Lᵀ
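A sketch, not part of the slides, of this solution process using LAPACK from C: dpotrf computes the Cholesky factor, dpotrs performs the two triangular solves. It assumes the usual Fortran naming convention (trailing underscore), column-major storage, and omits error handling.

/* Solve A x = b for s.p.d. A via Cholesky (link against LAPACK, e.g. -llapack). */
extern void dpotrf_( const char *uplo, const int *n, double *A,
                     const int *lda, int *info );
extern void dpotrs_( const char *uplo, const int *n, const int *nrhs,
                     const double *A, const int *lda, double *b,
                     const int *ldb, int *info );

void solve_spd( int n, double *A, double *b )     /* A: n x n, b: n x 1 */
{
  int info, nrhs = 1;
  dpotrf_( "Lower", &n, A, &n, &info );                /* A = L * L^T        */
  dpotrs_( "Lower", &n, &nrhs, A, &n, b, &n, &info );  /* L y = b, L^T x = y */
}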
RWTH-‐Aachen, March 2011 HPC&A – UJI 46
Dense Linear Algebra for Multi-core High Performance
Cholesky factorization. Blocked implementation:
F:  A11 := L11, where A11 = L11 · L11ᵀ (Cholesky factorization)
T:  A21 := L21 = A21 · L11⁻ᵀ
P:  A22 := A22 − L21 · L21ᵀ
Multi-threaded (MT) processor: Employ a MT implementation of T and P
1st iteration
RWTH-‐Aachen, March 2011 HPC&A – UJI 47
Dense Linear Algebra for Multi-core High Performance
Cholesky factorization: n³/3 flops vs. n² memops
…
1st iteration 2nd iteration 3rd iteration
RWTH-‐Aachen, March 2011 HPC&A – UJI 48
Dense Linear Algebra for Multi-core High Performance
Cholesky factorization. Performance
Intel Xeon Quad-Core @ 2.3 GHz (8 cores)
MKL 10.1

[Performance plot: the curves attain 57%, 71% and 80% of peak performance — why?]
RWTH-‐Aachen, March 2011 HPC&A – UJI 49
Dense Linear Algebra for Multi-core High Performance
Cholesky factorization. Sequential bottlenecks!
…
1st iteration 2nd iteration 3rd iteration
RWTH-‐Aachen, March 2011 HPC&A – UJI 50
Dense Linear Algebra for Multi-core High Performance
Cholesky factorization. There is more parallelism! Tasks can proceed concurrently both inside the same iteration and across different iterations (1st iteration, 2nd iteration, …).
RWTH-‐Aachen, March 2011 HPC&A – UJI 51
Dense Linear Algebra for Multi-core High Performance
Exploit the task-level parallelism dictated by data dependencies.
Dependencies among the tasks define a graph:
…
for (k=0; k<nb; k++){
  Chol(A[k,k]);
  for (i=k+1; i<nb; i++)
    Trsm(A[k,k], A[i,k]);
  …
DAG: Directed Acyclic Graph
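A sketch, not from the slides, completing the loop above for the whole blocked Cholesky factorization. Chol, Trsm, Syrk and Gemm are hypothetical task routines operating on nb × nb blocks; a task-parallel run-time would turn each call into a node of the DAG and track the dependencies between them.

/* Complete task loop for the blocked Cholesky factorization (pseudocode in C). */
for (k = 0; k < nb; k++) {
  Chol( A[k][k] );                           /* A[k][k] := L[k][k]               */
  for (i = k+1; i < nb; i++)
    Trsm( A[k][k], A[i][k] );                /* A[i][k] := A[i][k] * L[k][k]^-T  */
  for (i = k+1; i < nb; i++) {
    Syrk( A[i][k], A[i][i] );                /* A[i][i] -= A[i][k] * A[i][k]^T   */
    for (j = k+1; j < i; j++)
      Gemm( A[i][k], A[j][k], A[i][j] );     /* A[i][j] -= A[i][k] * A[j][k]^T   */
  }
}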
RWTH-‐Aachen, March 2011 HPC&A – UJI 52
Dense Linear Algebra for Multi-core High Performance
Run-time system:
Identifies/extracts task-level parallelism (TLP)
Schedules tasks for execution
Maps tasks onto specific cores
RWTH-‐Aachen, March 2011 HPC&A – UJI 53
Dense Linear Algebra for Multi-core High Performance
Cholesky factorization. Performance
AMD Dual-Core @ 2.0 GHz (16 cores)
MKL 10.1
RWTH-‐Aachen, March 2011 HPC&A – UJI 54
Index
1. Multi-threaded architectures
2. Dense linear algebra for multi-core (40 m?)
3. Programming in CUDA
   1. Preliminaries
   2. A simple kernel
   3. Parallel programming
4. Dense linear algebra for GPUs
5. Summary
RWTH-‐Aachen, March 2011 HPC&A – UJI 55
Programming in CUDA Preliminaries
The origins of CUDA:
Beginning of the 1990s: use of 2D display accelerators
1992: Silicon Graphics opens OpenGL
Mid-1990s: boost of the gaming market
2001: GeForce 3, with programmable vertex and pixel shaders (DirectX)
2006: GeForce 8800 GTX, first GPU with CUDA
RWTH-‐Aachen, March 2011 HPC&A – UJI 56
Programming in CUDA Preliminaries
To use CUDA, you will need…
A CUDA-enabled GPU (GeForce 8800 GTX, Nov. 2006, or more recent)
An NVIDIA device driver (allows user programs to communicate with the CUDA hardware)
The CUDA development kit (includes a compiler for GPU code)
A standard C compiler (to compile CPU code)
RWTH-‐Aachen, March 2011 HPC&A – UJI 57
Programming in CUDA A Simple Kernel
Scalar addition: c := a + b, a,b,c ∈ ℜ
__global__ void scalar_addition( float a, float b, float *c ) {
  *c = a + b;
}

int main( void ) {
  float a_h = 5.0, b_h = 3.0, c_h, *c_d;

  cudaMalloc( (void**) &c_d, sizeof(float) );
  scalar_addition<<<1,1>>>( a_h, b_h, c_d );
  cudaMemcpy( &c_h, c_d, sizeof(float), cudaMemcpyDeviceToHost );
  printf( "%g + %g = %g\n", a_h, b_h, c_h );
  cudaFree( c_d );

  return 0;
}
RWTH-‐Aachen, March 2011 HPC&A – UJI 58
Programming in CUDA A Simple Kernel
Scalar addition: c := a + b, a,b,c ∈ ℜ
In the code above:
cudaMalloc allocates memory for the float number in the GPU
__global__ identifies a CUDA kernel (code to be executed on the GPU)
cudaFree frees the float number's memory in the GPU
RWTH-‐Aachen, March 2011 HPC&A – UJI 59
Programming in CUDA A Simple Kernel
Scalar addition: c := a + b, a,b,c ∈ ℜ
The <<<1,1>>> call invokes execution of the kernel in the GPU; scalar parameters are transferred automatically to device memory
cudaMemcpy retrieves the data back to host memory
RWTH-‐Aachen, March 2011 HPC&A – UJI 60
Programming in CUDA A Simple Kernel
scalar_addition.cu
  → compiler (nvcc, invoking the native C compiler)
  → CPU object code + GPU object code (scalar_addition.o)
  → linker
  → scalar_addition.x
RWTH-‐Aachen, March 2011 HPC&A – UJI 61
Programming in CUDA A Simple Kernel
Interleaved execution: CPU, GPU, CPU, …

int main( void ) {
  // Fragment of CPU code
  GPU_kernel1<<<a,b>>>(k1,k2,…);
  // Fragment of CPU code
  GPU_kernel2<<<c,d>>>(i1,i2,…);
  // Fragment of CPU code
  GPU_kernel3<<<e,f>>>(j1,j2,…);
  // Fragment of CPU code
  ...
}
RWTH-‐Aachen, March 2011 HPC&A – UJI 62
Programming in CUDA A Simple Kernel
Summary:
CUDA C looks much like standard C (with extensions)
The GPU is passive: it does what the CPU (via a kernel call) instructs it to do
The runtime takes care of transferring kernel parameters to GPU memory
Memory must be allocated/deallocated in GPU memory via the appropriate calls
Do not dereference pointers to GPU memory on the host!
RWTH-‐Aachen, March 2011 HPC&A – UJI 63
Programming in CUDA Parallel programming
Vector addition: vc := va + vb, va,vb,vc ∈ ℜn
#define N 256

int main( void ) {
  float va_h[N], vb_h[N], vc_h[N], *va_d, *vb_d, *vc_d;
  // Initialize va_h, vb_h, vc_h…

  // Allocate space for va_d, vb_d, vc_d in device memory
  cudaMalloc( (void**) &va_d, N*sizeof(float) );
  ...

  // Transfer the contents of va_h, vb_h from host to device
  cudaMemcpy( va_d, va_h, N*sizeof(float), cudaMemcpyHostToDevice );
  ...
RWTH-‐Aachen, March 2011 HPC&A – UJI 64
Programming in CUDA Parallel programming
Vector addition: vc := va + vb, va,vb,vc ∈ ℜn
  ...
  vector_addition<<<N,1>>>( va_d, vb_d, vc_d );

  // Retrieve the results back to host memory
  cudaMemcpy( vc_h, vc_d, N*sizeof(float), cudaMemcpyDeviceToHost );

  for (i=0; i<N; i++)
    printf( "%g + %g = %g\n", va_h[i], vb_h[i], vc_h[i] );

  // Free device memory for va_d, vb_d, vc_d
  cudaFree( va_d );
  ...
  return 0;
}
N copies of the kernel are executed on the GPU! Goal: each copy adds one pair of corresponding components of the vectors.
RWTH-‐Aachen, March 2011 HPC&A – UJI 65
Programming in CUDA Parallel programming
Vector addition: vc := va + vb, va,vb,vc ∈ ℜn
Data parallelism or SIMD programming model!
__global__ void vector_addition( float *va, float *vb, float *vc ) {
  int tid = blockIdx.x;
  if (tid < N)
    vc[tid] = va[tid] + vb[tid];
}

blockIdx.x is a built-in variable of the CUDA runtime: the index of the CUDA thread (block) running this copy of the kernel
RWTH-‐Aachen, March 2011 HPC&A – UJI 66
Programming in CUDA Parallel programming
CPU vs. GPU (CUDA)
void vector_addition( float va[], float vb[], float vc[] ) {
  int i;
  for (i=0; i<N; i++)
    vc[i] = va[i] + vb[i];
}

__global__ void vector_addition( float *va, float *vb, float *vc ) {
  int tid = blockIdx.x;
  if (tid < N)
    vc[tid] = va[tid] + vb[tid];
}
RWTH-‐Aachen, March 2011 HPC&A – UJI 67
Programming in CUDA Parallel programming
Summary: a CUDA kernel is executed by an array of threads
All threads execute the same code
Each thread has a unique identifier (blockIdx.x, …)
Creating/destroying threads is extremely cheap

int tid = blockIdx.x;
if (tid < N)
  vc[tid] = va[tid] + vb[tid];

[Figure: 8 threads with ids 0–7, each executing the kernel on one element of the vectors]
RWTH-‐Aachen, March 2011 HPC&A – UJI 68
Programming in CUDA Parallel programming
Threads are organized hierarchically (see the indexing sketch below):
Threads are grouped into blocks: threadIdx, blockDim
Blocks are grouped into grids: gridDim, blockIdx
[Figure: a grid of gridDim.x × gridDim.y thread blocks (here 3 × 2), each block containing blockDim.x × blockDim.y threads (here 4 × 3)]
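A sketch, not from the slides, of how the hierarchy is used in practice: launching several threads per block and combining blockIdx, blockDim and threadIdx into a global index. The block size of 128 and the extra length parameter n are assumptions for illustration.

// Vector addition with blocks of 128 threads (assumed size).
__global__ void vector_addition_blocks( float *va, float *vb, float *vc, int n )
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (tid < n)
        vc[tid] = va[tid] + vb[tid];
}

// Launch with enough blocks to cover the n elements:
//   vector_addition_blocks<<< (n + 127) / 128, 128 >>>( va_d, vb_d, vc_d, n );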
RWTH-‐Aachen, March 2011 HPC&A – UJI 69
Programming in CUDA Parallel programming
Exercise. Matrix addition: MC := MA + MB, with MA, MB, MC ∈ ℜ^(n×n)
#define M 256
#define N  16

int main( void ) {
  …
  matrix_addition<<<???,???>>>( MA_d, MB_d, MC_d );
}

__global__ void matrix_addition( float *MA, float *MB, float *MC ) {
  ???
}
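One possible solution, added here as a sketch and not part of the original slides: one thread per matrix element, organized as an (M/N) × (M/N) grid of N × N thread blocks, with the matrices stored row by row in 1-D arrays.

// Hypothetical solution sketch for the exercise above.
__global__ void matrix_addition( float *MA, float *MB, float *MC )
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < M)
        MC[row * M + col] = MA[row * M + col] + MB[row * M + col];
}

// Invocation:
//   dim3 grid( M/N, M/N ), block( N, N );
//   matrix_addition<<< grid, block >>>( MA_d, MB_d, MC_d );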
RWTH-‐Aachen, March 2011 HPC&A – UJI 70
Programming in CUDA Parallel programming
All kernel invocations are asynchronous: control returns to the CPU before the kernel has completed
Explicit blocking: cudaThreadSynchronize();
Memory copies are synchronous, though there are also asynchronous versions (overlap computation and communication!) — see the sketch below
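A sketch, not from the slides, of the asynchronous variant. Truly asynchronous copies require page-locked (pinned) host memory; some_kernel, blocks, threads and N are placeholder names for illustration.

// Overlapping a host-to-device copy and kernel execution using a stream.
cudaStream_t stream;
cudaStreamCreate( &stream );

float *x_h, *x_d;
cudaMallocHost( (void**) &x_h, N * sizeof(float) );   // pinned host buffer
cudaMalloc    ( (void**) &x_d, N * sizeof(float) );

cudaMemcpyAsync( x_d, x_h, N * sizeof(float), cudaMemcpyHostToDevice, stream );
some_kernel<<< blocks, threads, 0, stream >>>( x_d );  // hypothetical kernel
// ... independent CPU work can proceed here ...
cudaStreamSynchronize( stream );                        // wait for copy + kernel
cudaStreamDestroy( stream );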
RWTH-‐Aachen, March 2011 HPC&A – UJI 71
Programming in CUDA Parallel programming
Developing efficient CUDA code is far from trivial! Memory is controlled by the user and is key to high performance:
Use the fast shared memory
Coalesced accesses
Avoid conflicts
Avoid frequent transfers of small messages through PCI-E
Overlap computation and communication
Maximize the flops/memops ratio
Avoid branching in CUDA code
Etc.
A small illustration of the first points follows.
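A toy sketch, not from the slides, combining coalesced global accesses with staging in shared memory: each block loads its segment with consecutive (coalesced) reads, synchronizes, and then reorders data through the fast on-chip memory. It assumes the grid exactly covers the input vector.

// Each block of BLOCK threads reverses its BLOCK-element segment of d_in.
#define BLOCK 128

__global__ void reverse_per_block( const float *d_in, float *d_out )
{
    __shared__ float tile[BLOCK];                 // fast on-chip shared memory
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = d_in[gid];                // coalesced global read
    __syncthreads();                              // all loads complete
    d_out[gid] = tile[BLOCK - 1 - threadIdx.x];   // reorder via shared memory,
                                                  // coalesced global write
}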
RWTH-‐Aachen, March 2011 HPC&A – UJI 72
Index
1. Multi-threaded architectures (40 m?)
2. Dense linear algebra for multi-core (40 m?)
3. Programming in CUDA (40 m?)
4. Dense linear algebra for GPUs
   1. Preliminaries
   2. CUBLAS
   3. Building on top of CUBLAS
   4. Multi-GPU platforms
5. Summary
RWTH-‐Aachen, March 2011 HPC&A – UJI 73
Dense Linear Algebra for GPUs Preliminaries
Matrices are logically viewed as a 2-D data structure, but in memory they are stored in 1-D, by rows or by columns:
Because Fortran-77 was the standard language to program scientific codes for many decades, numerical libraries assume storage in column major order (Fortran-like). Be careful when using these libraries from C!
RWTH-‐Aachen, March 2011 HPC&A – UJI 74
Dense Linear Algebra for GPUs Preliminaries
All parameters are passed by reference:

      SUBROUTINE SGEMV(TRANS,M,N,ALPHA,A,LDA,X,INCX,BETA,Y,INCY)
*     .. Scalar Arguments ..
      REAL ALPHA,BETA
      INTEGER INCX,INCY,LDA,M,N
      CHARACTER TRANS
*     ..
*     .. Array Arguments ..
      REAL A(LDA,*),X(*),Y(*)

Invoked from a C program:

int m, n, incx, incy, lda;
float alpha, beta, *x, *y, *A;

// Initializations: m, n, alpha, beta, A, lda, x, y, incx, incy
sgemv( "No transpose", &m, &n, &alpha, A, &lda, x, &incx, &beta, y, &incy );
RWTH-‐Aachen, March 2011 HPC&A – UJI 75
Dense Linear Algebra for GPUs Preliminaries
BLAS assumes matrices stored in column-major order, but C compilers generate row-major order.
Fortran indices start at 1, but C indices at 0.
A useful trick: store Fortran-like (column-major) structures in C vectors:

#define Aref(a1,a2)  A[ (a2-1)*Alda + (a1-1) ]
#define xref(a)      x[ (a-1) ]
#define yref(a)      y[ (a-1) ]

float A[M*N], x[N], y[M];
int Alda = M;
…
for (i=1; i<=M; i++){
  tmp = yref( i );
  for (j=1; j<=N; j++)
    tmp += Aref( i, j ) * xref( j );
  yref( i ) = tmp;
}
RWTH-‐Aachen, March 2011 HPC&A – UJI 76
Dense Linear Algebra for GPUs CUBLAS
Programming dense linear algebra libraries on heterogeneous CPU+GPU architectures:
Multiple address spaces without hardware coherence (as difficult as message-passing)
Scheduling on heterogeneous resources (also much harder)
Possibly more than one accelerator per node
Take advantage of the single-precision speed-up: iterative refinement
RWTH-‐Aachen, March 2011 HPC&A – UJI 77
Dense Linear Algebra for GPUs CUBLAS
Requires initialization/termination:

#include "cublas.h"

cublasStatus cublasInit();
cublasStatus cublasShutdown();

enum cublasStatus:
CUBLAS_STATUS_SUCCESS
CUBLAS_STATUS_NOT_INITIALIZED
CUBLAS_STATUS_ALLOC_FAILED
CUBLAS_STATUS_INVALID_VALUE
CUBLAS_STATUS_ARCH_MISMATCH
CUBLAS_STATUS_MAPPING_ERROR
CUBLAS_STATUS_EXECUTION_FAILED
CUBLAS_STATUS_INTERNAL_ERROR
RWTH-‐Aachen, March 2011 HPC&A – UJI 78
Dense Linear Algebra for GPUs CUBLAS
Other interesting routines (wrappers in fortran.c):

int CUBLAS_SET_MATRIX( const int *rows, const int *cols, const int *elemSize,
                       const void *A, const int *lda,
                       const devptr_t *B, const int *ldb )
int CUBLAS_GET_MATRIX( ... )
int CUBLAS_SET_VECTOR( ... )
int CUBLAS_GET_VECTOR( ... )
RWTH-‐Aachen, March 2011 HPC&A – UJI 79
Dense Linear Algebra for GPUs CUBLAS
Same functionality and interface as the standard BLAS — but no need to pass parameters by reference!

      SUBROUTINE SGEMV(TRANS,M,N,ALPHA,A,LDA,X,INCX,BETA,Y,INCY)
*     .. Scalar Arguments ..
      REAL ALPHA,BETA
      INTEGER INCX,INCY,LDA,M,N
      CHARACTER TRANS
*     ..
*     .. Array Arguments ..
      REAL A(LDA,*),X(*),Y(*)

void cublasSgemv( char trans, int m, int n, float alpha,
                  const float *A, int lda, const float *x, int incx,
                  float beta, float *y, int incy );
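A usage sketch, not from the slides, built on the legacy CUBLAS API shown above. It assumes m, n, alpha, beta and the host arrays A (column-major, lda = m), x and y are already initialized; error checking is omitted.

/* y := beta*y + alpha*A*x on the GPU with (legacy) CUBLAS. */
float *A_d, *x_d, *y_d;
cublasInit();
cublasAlloc( m*n, sizeof(float), (void**) &A_d );
cublasAlloc( n,   sizeof(float), (void**) &x_d );
cublasAlloc( m,   sizeof(float), (void**) &y_d );

cublasSetMatrix( m, n, sizeof(float), A, m, A_d, m );   /* column-major, lda = m */
cublasSetVector( n, sizeof(float), x, 1, x_d, 1 );
cublasSetVector( m, sizeof(float), y, 1, y_d, 1 );

cublasSgemv( 'N', m, n, alpha, A_d, m, x_d, 1, beta, y_d, 1 );

cublasGetVector( m, sizeof(float), y_d, 1, y, 1 );       /* retrieve the result */
cublasFree( A_d ); cublasFree( x_d ); cublasFree( y_d );
cublasShutdown();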
RWTH-‐Aachen, March 2011 HPC&A – UJI 80
Dense Linear Algebra for GPUs CUBLAS
axpy: y := y + α x, with α ∈ ℜ, x, y ∈ ℜ^n

#include "cublas.h"
#define N 256

int main( void ) {
  float x_h[N], y_h[N], *x_d, *y_d, alpha;

  // Initialize x_h, y_h, alpha and the CUBLAS environment
  cublasInit();

  // Allocate space for x_d, y_d in device memory
  cudaMalloc( (void**) &x_d, N*sizeof(float) );
  cudaMalloc( (void**) &y_d, N*sizeof(float) );

  // Transfer the contents of x_h, y_h from host to device
  cudaMemcpy( x_d, x_h, N*sizeof(float), cudaMemcpyHostToDevice );
  cudaMemcpy( y_d, y_h, N*sizeof(float), cudaMemcpyHostToDevice );
  ...
RWTH-‐Aachen, March 2011 HPC&A – UJI 81
Dense Linear Algebra for GPUs CUBLAS
axpy: y := y + α x, with α ∈ ℜ, x, y ∈ ℜ^n

  ...
  // Invoke the CUBLAS kernel
  cublasSaxpy( N, alpha, x_d, 1, y_d, 1 );

  // Retrieve the result back to host memory
  cudaMemcpy( y_h, y_d, N*sizeof(float), cudaMemcpyDeviceToHost );

  for (i=0; i<N; i++)
    printf( "y(%d) = %g\n", i, y_h[i] );

  // Free device memory for x_d, y_d; destroy the CUBLAS environment
  cudaFree( x_d );
  cudaFree( y_d );
  cublasShutdown();

  return 0;
}
RWTH-‐Aachen, March 2011 HPC&A – UJI 84
Dense Linear Algebra for GPUs CUBLAS
Insights:
The cost of invoking a CUBLAS kernel is not zero! Take it into account for BLAS-1 or small problem dimensions
The cost of transferring data/results is not zero! Take it into account for BLAS-1 or small problem dimensions
In NVIDIA's Fermi generation, GeForce has more cores than Tesla, but less memory and lower reliability
RWTH-‐Aachen, March 2011 HPC&A – UJI 85
Dense Linear Algebra for GPUs Building on Top of CUBLAS
Cholesky factorization (LAPACK)
Key in the solution of s.p.d. linear systems:
A x = b ≡ (L Lᵀ) x = b:  solve L y = b for y, then Lᵀ x = y for x

A = L · Lᵀ
RWTH-‐Aachen, March 2011 HPC&A – UJI 86
Dense Linear Algebra for GPUs Building on Top of CUBLAS
Cholesky factorization. Unblocked code:
R:    a11 := l11 = a11^(1/2) (square root)
SCAL: a21 := l21 = a21 / l11
SYR:  A22 := A22 − l21 · l21ᵀ
1st iteration
RWTH-‐Aachen, March 2011 HPC&A – UJI 87
Dense Linear Algebra for GPUs Building on Top of CUBLAS
Cholesky factorization. Unblocked code:
…
1st iteration 2nd iteration 3rd iteration
RWTH-‐Aachen, March 2011 HPC&A – UJI 88
Dense Linear Algebra for GPUs Building on Top of CUBLAS
Cholesky factorization. CPU unblocked code:

      SUBROUTINE CPU_SPOTF2( UPLO, N, A, LDA, INFO )
      ...
      DO J = 1, N
*
         AJJ = SQRT( A( J, J ) )
         A( J, J ) = AJJ
*
         IF( J.LT.N ) THEN
            CALL SSCAL( N-J, ONE / AJJ, A( J+1, J ), 1 )
            CALL SSYR( 'Lower', N-J, -ONE, A( J+1, J ), 1,
     $                 A( J+1, J+1 ), LDA )
         END IF
      END DO
RWTH-‐Aachen, March 2011 HPC&A – UJI 89
Dense Linear Algebra for GPUs Building on Top of CUBLAS
Cholesky factorization. GPU unblocked code:

#define SIZEOF_REAL 4
#define IDX2F(I,J,LD) ((((J)-1)*(LD))+((I)-1))*SIZEOF_REAL

      SUBROUTINE GPU_SPOTF2( UPLO, N, DEVPTRA, LDA, INFO )
      ...
      DO J = 1, N
         CALL CUBLAS_GET_MATRIX( 1, 1, SIZEOF_REAL,
     $                           DEVPTRA+IDX2F(J,J,LDA), LDA, AJJ, 1 )
         AJJ = SQRT( AJJ )
         CALL CUBLAS_SET_MATRIX( 1, 1, SIZEOF_REAL,
     $                           AJJ, 1, DEVPTRA+IDX2F(J,J,LDA), LDA )
         IF( J.LT.N ) THEN
            CALL CUBLAS_SSCAL( N-J, ONE / AJJ, DEVPTRA+IDX2F(J+1,J,LDA), 1 )
            CALL CUBLAS_SSYR( 'Lower', N-J,
     $                        -ONE, DEVPTRA+IDX2F(J+1,J,LDA), 1,
     $                        DEVPTRA+IDX2F(J+1,J+1,LDA), LDA )
         END IF
      END DO
RWTH-‐Aachen, March 2011 HPC&A – UJI 90
Dense Linear Algebra for GPUs Building on Top of CUBLAS
Cholesky factorization. Blocked code:
F:  A11 := L11, where A11 = L11 · L11ᵀ (Cholesky factorization)
T:  A21 := L21 = A21 · L11⁻ᵀ
P:  A22 := A22 − L21 · L21ᵀ
1st iteration
RWTH-‐Aachen, March 2011 HPC&A – UJI 91
Dense Linear Algebra for GPUs Building on Top of CUBLAS
Cholesky factorization. Blocked code:
…
1st iteration 2nd iteration 3rd iteration
RWTH-‐Aachen, March 2011 HPC&A – UJI 92
Dense Linear Algebra for GPUs Building on Top of CUBLAS
Cholesky factorization. CPU blocked code:

      SUBROUTINE CPU_SPOTRF( UPLO, N, A, LDA, INFO )
      ...
      DO J = 1, N, NB
*
         JB = MIN( NB, N-J+1 )
         CALL CPU_SPOTF2( 'Lower', JB, A(J,J), LDA, INFO )
*
         IF( J+JB.LT.N ) THEN
            CALL STRSM( 'Right', 'Lower', 'Transpose', 'Non-unit',
     $                  N-J-JB+1, JB, ONE, A( J, J ), LDA,
     $                  A( J+JB, J ), LDA )
            CALL SSYRK( 'Lower', 'No transpose',
     $                  N-J-JB+1, JB, -ONE,
     $                  A( J+JB, J ), LDA, ONE,
     $                  A( J+JB, J+JB ), LDA )
         END IF
      END DO
RWTH-‐Aachen, March 2011 HPC&A – UJI 93
Dense Linear Algebra for GPUs Building on Top of CUBLAS
Cholesky factorization. GPU blocked code (1):

      SUBROUTINE GPU_SPOTRF_VAR1( UPLO, N, A, LDA, INFO )
      ...
      DO J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL GPU_SPOTF2( 'Lower', JB, DEVPTRA+IDX2F(J,J,LDA), LDA, INFO )
         IF( J+JB.LT.N ) THEN
            CALL CUBLAS_STRSM( 'Right', 'Lower', 'Transpose', 'Non-unit',
     $                         N-J-JB+1, JB,
     $                         ONE, DEVPTRA+IDX2F(J,J,LDA), LDA,
     $                         DEVPTRA+IDX2F(J+JB,J,LDA), LDA )
            CALL CUBLAS_SSYRK( 'Lower', 'No transpose', N-J-JB+1, JB,
     $                         -ONE, DEVPTRA+IDX2F(J+JB,J,LDA), LDA,
     $                         ONE, DEVPTRA+IDX2F(J+JB,J+JB,LDA), LDA )
         END IF
      END DO
RWTH-‐Aachen, March 2011 HPC&A – UJI 94
Dense Linear Algebra for GPUs Building on Top of CUBLAS
Cholesky factorization. GPU blocked code (2):

      SUBROUTINE GPU_SPOTRF_VAR2( UPLO, N, A, LDA, INFO )
      ...
      REAL WORK( NBMAX*NBMAX )
      ...
      DO J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL CUBLAS_GET_MATRIX( JB, JB, SIZEOF_REAL,
     $                           DEVPTRA+IDX2F(J,J,LDA), LDA, WORK, JB )
         CALL CPU_SPOTRF( 'Lower', JB, WORK, JB, INFO )
         CALL CUBLAS_SET_MATRIX( JB, JB, SIZEOF_REAL,
     $                           WORK, JB, DEVPTRA+IDX2F(J,J,LDA), LDA )
         IF( J+JB.LT.N ) THEN
            CALL CUBLAS_STRSM( ... )
            CALL CUBLAS_SSYRK( ... )
         END IF
      END DO
RWTH-‐Aachen, March 2011 HPC&A – UJI 96
Dense Linear Algebra for GPUs Multi-GPU Platforms
How do we program these?

View as a… shared-memory multiprocessor + DSM

[Diagram: the CPU(s) connected through the PCI-e bus to GPU #1, GPU #2, GPU #3 and GPU #4]
RWTH-‐Aachen, March 2011 HPC&A – UJI 97
Dense Linear Algebra for GPUs Multi-GPU Platforms
Software Distributed-Shared Memory (DSM):
Software: flexibility vs. efficiency
The underlying distributed memory is hidden from the users
Reduce memory transfers using write-back, write-invalidate, …
A well-known approach, not too efficient as a middleware for general applications

The regularity of dense linear algebra operations makes a difference!
RWTH-‐Aachen, March 2011 HPC&A – UJI 98
Dense Linear Algebra for GPUs Multi-GPU Platforms
Data transfer on the multi-GPU platform:
Before execution, transfer the data to the device
Upon completion, retrieve the results back to the host
→ poor data locality
RWTH-‐Aachen, March 2011 HPC&A – UJI 99
Dense Linear Algebra for GPUs Multi-GPU Platforms
[Diagram: a shared-memory system — a multi-core processor with cores P0+C0 … P3+C3 sharing the main memory (MP) — compared with the multi-GPU platform]
RWTH-‐Aachen, March 2011 HPC&A – UJI 100
Dense Linear Algebra for GPUs Multi-GPU Platforms
Reduce the number of data transfers on the multi-GPU platform: keep a software cache in the devices:
Operate at the block level
Software → flexibility
Write-back
Write-invalidate
RWTH-‐Aachen, March 2011 HPC&A – UJI 102
Index
1. Multi-threaded architectures (40 m?)
2. Dense linear algebra for multi-core (40 m?)
3. Programming in CUDA (40 m?)
4. Dense linear algebra for GPUs
5. Summary
RWTH-‐Aachen, March 2011 HPC&A – UJI 103
Summary (Wrap up) More FLOPS!

#   System                                                                 Rmax (TFLOPS)
1   Tianhe-1A – NUDT TH MPP, X5670 2.93 GHz 6C, NVIDIA GPU, FT-1000               2,566*
2   Jaguar – Cray XT5-HE, Opteron 6-core 2.6 GHz                                  1,759
3   Nebulae – Dawning TC3600 Blade, Intel X5650, NVIDIA Tesla C2050 GPU           1,271
4   TSUBAME 2.0 – HP ProLiant SL390s G7, Xeon 6C X5670, NVIDIA GPU                1,192
5   Hopper – Cray XE6, 12-core 2.1 GHz                                            1,054

* 1 day of Tianhe-1A = 95 years of all the people in the world (6 billion) with a calculator
RWTH-‐Aachen, March 2011 HPC&A – UJI 104
Summary (Wrap up) More Cores!
PFLOPS (10^15 flops/sec.) — 2010, JUGENE:
10^9 at the core level (3.4 GFLOPS)
10^1 at the node level (Quad-Core)
10^5 at the cluster level (73,728 nodes)

EFLOPS (10^18 flops/sec.):
10^9.5 at the core level
10^3 at the node level!
10^5.5 at the cluster level

GPUs?
RWTH-‐Aachen, March 2011 HPC&A – UJI 105
Summary (Wrap up) Better Energy Efficiency!

System                      Top500   Rmax (TFLOPS)   Green500   Power (kW)   MFLOPS/W   W to EFLOPS? (MW)
Tianhe-1A                        1           2,566         11     4,040.00     635.15            1,574.40
IBM TJ Watson Blue Gene/Q      115           65.35          1        38.80   1,684.20              593.75

The most powerful reactor under construction, in Flamanville (France): 1,630 MWe