SAR Benchmark
tdFIR Filter
Revision r1
Craig Lund, email clund at localk.com
December 3, 2012
Created under contract to BittWare of Concord NH, www.bittware.com
Copyright 2012 BittWare, Inc.







Slide 2

We are going to show various implementations of a simple FIR filter. This algorithm is also called direct convolution. Expressed in MATLAB, it is:

OutputArray = conv (InputArray, FilterArray);

Most DSP libraries contain an optimized implementation.

It requires 8*N*K FLOPS to execute for each m.

y_m[i] = Σ_{k=0}^{K−1} x_m[i−k] · w_m[k],   for i = 0, 1, …, N+K−2

where
  x_m is the InputArray of length N
  w_m is the FilterArray of length K
  y_m is the OutputArray of length N+K−1
  m indexes a stream of M such filters

FIR Filter


Slide 3

[Diagram: FIR filter, K=3 in this diagram. Each input sample x[i−K] … x[i−1], x[i] is multiplied by the matching FilterArray weight w[0], w[1], …, w[K−1], and the products are summed to produce y[i].]

Some people prefer a diagram. This is a FIR filter (direct convolution).

The first and last (K-1) elements of the OutputArray are computed from fewer than K input samples, so they differ slightly from the rest (not illustrated here; see the equation). Some FIR implementations simply toss out those elements. Ours does not.

[Diagram: the InputArray spans i = 0 … N−1; the OutputArray extends K−1 elements beyond it.]


Slide 4

We need some real-world data sizes if we are going to show results. For that we turn to the radar-oriented HPEC Challenge benchmark suite. See the TDFIR kernel: http://www.ll.mit.edu/HPECchallenge/tdfir.html

Their benchmark offers two datasets. The timing data we present in later slides uses Data Set 1.

Note that you won’t find direct convolution implementations optimized for giant arrays: in that case it is faster to use an FFT-based algorithm. That approach is benchmarked by the HPEC Challenge FDFIR kernel.

Parameter  Description                                Data Set 1  Data Set 2
M          Number of filters to stream                64          20
N          Length of InputArray in complex elements   4096        1024
K          Length of FilterArray in complex elements  128         12
           Workload in MFLOP                          268.44      1.97


Slide 5

% TDFIR comes with a file containing separate input data for each filter.
% The input data provided for each filter is exactly the same.
% The benchmark implementation ignores that fact and reads in all the copies.

% For this illustration we assume the data for each filter was read into rows
% and a separate row used for each filter.

for f = 1:M
    OutputStream(f,:) = conv(InputStream(f,:), FilterStream(f,:));
end

This slide illustrates how the HPEC Challenge tdFIR benchmark uses the data that it supplies, expressed in MATLAB.

In words, the benchmark presents a batch of M separate filters to compute.

Using HPEC Challenge Data


Slide 6

#include <complex.h>
#include <tdFIR.h>

void tdFIR (
    unsigned short const N,  // Size of input in number of complex elements
    unsigned short const K,  // Size of filter in number of complex elements
    float complex const * const restrict InputArray,   // Length N
    float complex const * const restrict FilterArray,  // Length K
    float complex * const restrict OutputArray         // Length N+K-1
)
{

// Code examples go here

}

This slide presents our example function’s definition using C. We show it here and don’t duplicate the definition in the slides that follow.

We created this definition. This specific argument list isn’t defined in the benchmark.

Declarations


Slide 7

// Loop assumes OutputArray starts out full of zeros.
for ( int FilterElement = 0; FilterElement < K; FilterElement++ )
{
    for ( int InputElement = 0; InputElement < N; InputElement++ )
    {
        OutputArray [ InputElement + FilterElement ] +=
            InputArray [ InputElement ] * FilterArray [ FilterElement ];
    }
}

The obvious implementation in C is very simple. This implementation is often “good enough”. Don’t study it. Just squint your eyes to get a feeling for the complexity.

This implementation exactly matches the equation shown on slide 2.

Upcoming slides offer alternative implementations which add complexity to improve performance.

The code above is available in the repository as “scalar/simple.c”.

Simple Implementation


Slide 8

// Process first K-1 elements differently
// Code hidden

// This is the loop to optimize as it does most of the work
for ( int InputElement = (K-1); InputElement < N; InputElement++ )
{
    float complex sum = 0.0f;
    for ( int FilterElement = 0; FilterElement < K; FilterElement++ )
    {
        sum += InputArray[InputElement - FilterElement] * FilterArray[FilterElement];
    }
    OutputArray[InputElement] = sum;
}

// Process last K-1 elements differently
// Code hidden

Here we swap the order of the loops. Making this change brings a little extra performance in some cases (the memory access pattern is friendlier to caches and you don’t need to zero OutputArray). The tradeoff is more instructions (larger text segment).

The code above is available in the repository as “scalar/swap.c”.

Swap Loop Order


Slide 9

Here are measured results from an old (2008) Intel Xeon E5462 at 2.8 GHz.

gcc version 4.7.2 on Ubuntu 12.10
-march=native -std=gnu99 -Ofast -funroll-loops -fno-tree-vectorize

All of the results shown in this slide presentation are available as raw data in the repository in the spreadsheet “results.xlsx”.

Performance

Implementation   Performance   Lines of Code    Binary Text Size  % of core’s peak GFLOPS
Simple           Our baseline  Our baseline     Our baseline      16%
Swap Loop Order  123% faster   173% more lines  108% more memory  20%


Slide 10

static inline void tdFIRwork ( /* parameters not shown */ )
{
    float sumReal = 0.0f, sumImag = 0.0f;
    for ( ; FilterCount != 0; FilterCount-- )
    {
        // Load this filter element into registers
        const float FilterReal = *(float *)FilterElement;
        const float FilterImag = *(float *)(FilterElement + sizeof(float));
        // Load this input element into registers
        const float InputReal = *(float *)InputElement;
        const float InputImag = *(float *)(InputElement + sizeof(float));
        // Complex multiply
        sumReal += InputReal*FilterReal - InputImag*FilterImag;
        sumImag += InputReal*FilterImag + InputImag*FilterReal;
        // Update pointers for next round
        FilterElement += sizeof(float complex);
        InputElement -= sizeof(float complex);
    }
    // Store the sum
    *(float complex *)OutputElement = sumReal + I*sumImag;
    return;
}

Sometimes compilers generate better code if you avoid the “complex” type and if you use pointers instead of array indexing. Our resulting code uses an inline “tdFIRwork” loop to do the math.

Even if it isn’t faster, we need this formulation for upcoming implementations using SIMD and OpenCL which do not support float complex.

The code above is available in the repository as “scalar/pointers.c”.

Drop Complex, Use Pointers


Slide 11

static inline void tdFIRwork ( /* parameters not shown */ )
{
    float complex sum;
    __m128 sumSIMD = _mm_setzero_ps();
    unsigned short FilterDoubleCount = FilterCount / 2;
    // Start with a loop manually unrolled
    for ( ; FilterDoubleCount != 0; FilterDoubleCount-- )
    {
        // Load this filter and input element into registers
        const __m128 FilterPair = _mm_loadu_ps ((float *)FilterElement);
        __m128 InputPair = _mm_loadu_ps ((float *)(InputElement - sizeof(float complex)));
        // The scalar algorithm loads the elements in the opposite order
        InputPair = _mm_shuffle_ps (InputPair, InputPair, _MM_SHUFFLE(1,0,3,2));
        // Complex multiply
        sumSIMD = _mm_add_ps (sumSIMD, _mm_addsub_ps (
            _mm_mul_ps ( InputPair, _mm_moveldup_ps ( FilterPair ) ),
            _mm_mul_ps ( _mm_shuffle_ps (InputPair, InputPair, _MM_SHUFFLE(2,3,0,1)),
                         _mm_movehdup_ps ( FilterPair ) ) ));
        FilterElement += 2*sizeof(float complex);
        InputElement -= 2*sizeof(float complex);
    }
    // Next line results in {r1+r2, i1+i2, r2+r1, i2+i1}
    sumSIMD = _mm_add_ps (sumSIMD, _mm_shuffle_ps (sumSIMD, sumSIMD, _MM_SHUFFLE(1,0,3,2)));
    _mm_storel_pi ((__m64 *)&sum, sumSIMD);
    // If FilterCount was odd, we need an extra pass
    if ( (FilterCount % 2) != 0 )
        // Code not shown but it is similar to the code on the previous slide
}

If your processor supports SIMD instructions, don’t assume that your compiler can generate them. The tdFIRwork loop on the previous page did not vectorize automatically. The implementation above uses Intel’s SSE3 intrinsics (a little old but widely deployed).

The code above is available in the repository as “scalar/sse3.c”.

Use SIMD Extensions


Slide 12

When we created the “pointers.c” implementation it ran faster than the “swap.c” implementation. A compiler update changed that. It is now the same speed.

“simple.c” is the only tdFIR implementation on this list from which the compiler can automatically generate SIMD instructions.

Performance

Implementation  Performance  Lines of Code    Binary Text Size  % of core’s peak GFLOPS
Use Pointers    123% faster  373% more lines  108% more memory  20%
Hand SIMD       170% faster  462% more lines  108% more memory  28%
Automatic SIMD  128% faster  0% more lines    168% more memory  21%


Slide 13

Modern processors have multiple cores. We can use more than one core to speed up our tdFIR. However, we must first decide if we want the extra cores to make a single tdFIR complete faster (optimize for turnaround) or if we want to finish the complete set of tdFIR filters sooner (optimize for throughput).

Yes, you can create an implementation that does both, but that is rational only if you have lots of cores available (hundreds or more, which some GPUs will have).

Multiple Cores

for f = 1:M
    OutputStream(f,:) = conv(InputStream(f,:), FilterStream(f,:));
end

“Turnaround” means speeding up this conv function.
“Throughput” means speeding up this loop.


Slide 14

The tdFIR function contains no data dependencies that interfere with the obvious options for parallelizing the loops. This function is thus almost “embarrassingly parallel” which means that our examples will not illustrate synchronization. We say “almost” because there is an important data dependence issue that we discuss when we introduce OpenCL implementations.

Data Dependencies

for f = 1:M
    OutputStream(f,:) = conv(InputStream(f,:), FilterStream(f,:));
end


Slide 15

These implementations are getting too complicated to illustrate with code snippets on a slide. Jump ahead to the final slide to learn how to download all of the code referenced in this slide presentation.

Parallel Implementations


Slide 16

We ran this on an 8-core system. The code used within each thread was our hand-optimized SIMD implementation (scalar/sse3.c).

Our throughput implementation generates MTHREADS threads that run to completion. The parent thread handles any extra filters (extraM = M % MTHREADS).

This throughput change added complexity to the loop which calls tdFIR (threads/MtdFIR.c). That code grew from 24 lines to 77 lines, or to 321% of its original size.

Threads for Throughput

Threads Used  Performance                                                Binary Text Size                % of system’s peak GFLOPS
2             307% faster than baseline, 180% faster than scalar sse3.c  173% more memory than baseline  6% (note: 6 cores sit idle here but are counted in the system GFLOPS)
4             577% faster than baseline, 338% faster than scalar sse3.c                                  12%
8             895% faster than baseline, 526% faster than scalar sse3.c                                  18% (down from 28% on a single core, but much faster overall)


Slide 17

Our turnaround version is less efficient than the throughput version. This is because it creates MTHREADS threads for each filter, and they run only for the duration of that filter. The turnaround version therefore creates M times more threads than the throughput version. Beyond MTHREADS=6 it gets slower*.

This turnaround change added complexity to the sse3.c code. The revised implementation is in “threads/sse3threads.c”. The turnaround code grew that implementation from 120 lines to 182 lines, or to 152% of its size (700% of the scalar simple.c baseline).

* A better implementation keeps the same threads running and sends work to them as each filter is processed. That change is on our “to do” list because the result might be the best starting point for a fast OpenCL implementation.

Threads for Turnaround

Threads Used  Performance                                                                         Binary Text Size                % of system’s peak GFLOPS
2             253% faster than baseline, 149% faster than scalar sse3.c, 83% of throughput speed  177% more memory than baseline  6% (note: 6 cores sit idle here but are counted in the system GFLOPS)
4             406% faster than baseline, 238% faster than scalar sse3.c, 70% of throughput speed                                  12%
8             391% faster than baseline, 230% faster than scalar sse3.c, 44% of throughput speed                                  18% (down from 28% on a single core, but much faster overall)


Slide 18

OpenCL works best when all the loops are pushed into the runtime and the computation kernel is thereby reduced as much as possible. Doing this for tdFIR yields the kernel shown below. Unfortunately, it won’t work.

The problem is a race condition between kernels operating simultaneously on the same OutputArray element. Multiple kernels can read the same intermediate sum as their starting value.

OpenCL has “atomic” memory calls that handle this situation.

OpenCL Data Dependence

size_t MElement = get_global_id(0);
size_t InputElement = get_global_id(1);
size_t FilterElement = get_global_id(2);

#define OutputBase MElement*(N+K-1)
#define InputBase  MElement*N
#define FilterBase MElement*K

OutputArray[OutputBase + InputElement + FilterElement] +=
    cmult( InputArray[InputBase + InputElement],
           FilterArray[FilterBase + FilterElement] );


Slide 19

OpenCL doesn’t necessarily increase overhead. OpenCL on an SMP server with native kernels and memory-mapped objects will resemble our threaded implementations. However, OpenCL is usually associated with PCIe-attached acceleration hardware, and the DMA of code and data over PCIe consumes time. The “Nanoseconds” column in the table assumes PCIe Gen 2 at 16x with everything moved as a single DMA. That transfer time alone represents a 7% increase over our best time, and the actual transfers will require multiple DMA operations.

OpenCL Overhead

Dataset  M   N     K    Input Bytes  Filter Bytes  Output Bytes  Text Size        Total MB Transferred  Nanoseconds
Large    64  4096  128  2097152      65536         2162176       Less than 24024  4.35 MB               543611
Small    20  1024  12   163840       1920          165600        Same             0.36 MB               44423


Slide 20

OpenCL GPU Results Missing

Our GPU results are not yet optimal. Also, we have not yet started to explore optimal implementations for FPGAs.


Slide 21

It is easy to write portable OpenCL code. However, it is impossible to write high-performance OpenCL code that targets multiple accelerator types.

The various acceleration architectures are wildly different (multicore CPU, GPU, FPGA, Adapteva, etc.). High-performance implementations will vary just as greatly.

Portable?


Slide 22

These slides and the underlying code are available from a public access Subversion repository. The Windows client for Subversion is available from http://tortoisesvn.net/

The Linux command to use our repository is:

svn checkout http://svn.sarbenchmark.com/tdFIR/trunk tdFIR

If you type the URL above into any web browser you can see our code.

Code Repository