6
Notes on Homework 1

Notes on Homework 1. 2x2 Matrix Multiply C 00 += A 00 B 00 + A 01 B 10 C 10 += A 10 B 00 + A 11 B 10 C 01 += A 00 B 01 + A 01 B 11 C 11 += A 10 B 01 +

Embed Size (px)

Citation preview

Page 1: Notes on Homework 1. 2x2 Matrix Multiply C 00 += A 00 B 00 + A 01 B 10 C 10 += A 10 B 00 + A 11 B 10 C 01 += A 00 B 01 + A 01 B 11 C 11 += A 10 B 01 +

Notes on Homework 1

Page 2: Notes on Homework 1. 2x2 Matrix Multiply C 00 += A 00 B 00 + A 01 B 10 C 10 += A 10 B 00 + A 11 B 10 C 01 += A 00 B 01 + A 01 B 11 C 11 += A 10 B 01 +

2x2 Matrix Multiply

C00 += A00B00 + A01B10

C10 += A10B00 + A11B10

C01 += A00B01 + A01B11

C11 += A10B01 + A11B11

Rewrite as SIMD algebra

C00_C10 += A00_A10 * B00_B00C01_C11 += A00_A10 * B01_B01C00_C10 += A01_A11 * B10_B10C01_C11 += A01_A11 * B11_B11

02/11/2009CS267 Lecture 7

Page 3: Notes on Homework 1. 2x2 Matrix Multiply C 00 += A 00 B 00 + A 01 B 10 C 10 += A 10 B 00 + A 11 B 10 C 01 += A 00 B 01 + A 01 B 11 C 11 += A 10 B 01 +

Summary of SSE intrinsics

#include <emmintrin.h>

Vector data type:• __m128d

Load and store operations:• _mm_load_pd• _mm_store_pd• _mm_loadu_pd• _mm_storeu_pd

Load and broadcast across vector• _mm_load1_pd

Arithmetic:• _mm_add_pd• _mm_mul_pd

Page 4: Notes on Homework 1. 2x2 Matrix Multiply C 00 += A 00 B 00 + A 01 B 10 C 10 += A 10 B 00 + A 11 B 10 C 01 += A 00 B 01 + A 01 B 11 C 11 += A 10 B 01 +

Example: multiplying 2x2 matrices

#include <emmintrin.h>c1 = _mm_loadu_pd( C+0*lda ); //load unaligned block in Cc2 = _mm_loadu_pd( C+1*lda );for( int i = 0; i < 2; i++ ){ a = _mm_load_pd( A+i*lda ); //load aligned i-th column of A b1 = _mm_load1_pd( B+i+0*lda ); //load i-th row of B b2 = _mm_load1_pd( B+i+1*lda ); c1=_mm_add_pd( c1, _mm_mul_pd( a, b1 ) ); //rank-1 update c2=_mm_add_pd( c2, _mm_mul_pd( a, b2 ) );}_mm_storeu_pd( C+0*lda, c1 ); //store unaligned block in C_mm_storeu_pd( C+1*lda, c2 );

Page 5: Notes on Homework 1. 2x2 Matrix Multiply C 00 += A 00 B 00 + A 01 B 10 C 10 += A 10 B 00 + A 11 B 10 C 01 += A 00 B 01 + A 01 B 11 C 11 += A 10 B 01 +

General suggestions for optimizations• Changing the order of the loops (i, j, k)

• changes which matrix stays in memory and the type of product• Blocking for multiple levels (registers or SIMD, L1, L2)

• L3 is not important as most matrices will fit and optimizations not likely to bring more benefits

• Unrolling the loops• Copy optimizations

• will be a large benefit if done for sub-matrix that is kept in memory• careful not to overdo (overhead is high for every level)

• Aligning memory and padding• can help with SIMD performance• eliminating special cases and bad performance

• Adding SIMD intrinsics• cannot get over 50% without explicit intrinsics or auto-vectorization

• Tuning parameters• various block sizes, register sizes (perhaps automate)

Page 6: Notes on Homework 1. 2x2 Matrix Multiply C 00 += A 00 B 00 + A 01 B 10 C 10 += A 10 B 00 + A 11 B 10 C 01 += A 00 B 01 + A 01 B 11 C 11 += A 10 B 01 +

Other issues• Checking compilers

• options available PGI, PathScale, GNU, Cray• Cray compiler seems to have an issue with explicit intrinsics

• Using compiler flags• remember not allowed to use MM specific flags (-matmul)

• Checking assembly code for SSE instructions • use the –S flag to see assembly• should have mainly ADDPD & MULPD ops, ADDSD & MULSD are

scalar computations• Remember the write-up

• want to know what optimizations you tried and what worked and failed

• try and have an incremental design and show the performance of multiple iterations

• DUE DATE changed, now due Monday Feb 17th at 11:59pm