Accelerating Real-time Program Performance
An Introduction to SIMD and Hardware Intrinsics
Koray Hagen

Introduction to SIMD Programming


An introductory presentation for students regarding SIMD programming.


Page 1: Introduction to SIMD Programming

Accelerating Real-time Program Performance

Koray Hagen

An Introduction to SIMD and Hardware Intrinsics

Page 2: Introduction to SIMD Programming

Prerequisite Knowledge

1. Exposure to C++

2. Exposure to an assembly language

And that’s it.

Page 3: Introduction to SIMD Programming

The Agenda

1. Terminology and an introduction to SIMD

2. Examples and benefits of SIMD Programming in C++

3. Some tradeoffs, caveats, and insights

4. Closing thoughts and further reading

Page 4: Introduction to SIMD Programming

Instructions and Data
Classification of computer architectures

Page 5: Introduction to SIMD Programming

Terminology and Definitions

1. Concurrent – Events or processes which appear to occur or progress at the same time
   1. Related terms often heard:
      1. Threads
      2. Mutexes
      3. Semaphores
      4. Monitors

2. Parallel – Events or processes which do occur or progress at the same time
   1. Parallel programming is not the same as concurrent programming; they are distinct
   2. It refers to techniques that provide parallel execution of operations

SIMD is a computer architecture that makes use of parallel execution

Page 6: Introduction to SIMD Programming

Flynn’s Taxonomy

A classification of computer architectures that was proposed by Michael Flynn in 1966

1. Characterized by two criteria
   1. Parallelism exhibited in its instruction streams
   2. Parallelism exhibited in its data streams

2. Four distinct classifications
   1. SISD
   2. SIMD
   3. MISD
   4. MIMD

The instruction stream and data stream can both be either single (S) or multiple (M)

Page 7: Introduction to SIMD Programming

SISD – Single Instruction Single Data

1. Single CPU systems
   1. Specifically uniprocessors
   2. Co-processors do not count

2. Concurrent processing
   1. Pipelined execution
   2. Prefetching

3. Concurrent execution
   1. Independent concurrent tasks can execute different sequences of operations

4. Examples
   1. Personal computers

Page 8: Introduction to SIMD Programming

SIMD – Single Instruction Multiple Data

1. One instruction stream is broadcast to all processors
   1. In practice, each processor is simple, such as an ALU (Arithmetic Logic Unit)

2. All active processors execute the same instructions synchronously, but on different data

3. The data items are aligned in an array (or vector)

4. An instruction can act on the complete array in one cycle

5. Examples later on

Page 9: Introduction to SIMD Programming

MISD – Multiple Instruction Single Data

1. Uncommon architecture type that is used for specialized purposes
   1. Redundancy
   2. Fault tolerance
   3. Task replication

2. Example
   1. Space shuttle flight control computer

Page 10: Introduction to SIMD Programming

MIMD – Multiple Instruction Multiple Data

1. Processors are asynchronous
   1. Have the ability to independently execute on different data sets

2. Communication is handled by either
   1. Shared or distributed memory
   2. Message passing

3. Examples
   1. Computer-aided manufacturing
   2. Simulation
   3. Modeling
   4. Communication switches

Page 11: Introduction to SIMD Programming

The importance of SIMD

1. SIMD intrinsics are supported by most modern CPUs in commercial use
   1. As time passes, that support is ever increasing
   2. The Intel Haswell (2013) microarchitecture supports MMX, SSE 1 through SSE 4.2, AVX, AVX2, etc.
   3. The Intel Pentium (1996) microarchitecture supported MMX for the first time

2. SIMD as an architecture is conceptually easier for programmers to leverage, understand, and debug
   1. It bears large similarities to sequential and concurrent programming

3. SIMD usage has been successful in multimedia applications for decades
   1. Instruction- and data-level parallelism have many real-world applications

Page 12: Introduction to SIMD Programming

SIMD Programming
C++ parallel computation examples

Page 13: Introduction to SIMD Programming

But first, some additional information

1. I will be showing two real-world examples in this portion
   1. The first example is simple and showcases SISD
   2. The second example is simple and showcases SIMD

2. All programs were written in C++ on Windows 8.1, using Visual Studio 2013

3. Visual Studio 2013 features auto-vectorization, which I have turned off
   1. In other words, I have disabled the compiler's ability to use SIMD automatically

Page 14: Introduction to SIMD Programming

SISD Simple Program Source
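The slide's source listing is an image and is not reproduced in this transcript. A minimal sketch of the kind of scalar addition routine the slide likely shows — the function name `add` matches the disassembly caption on the next slide, but the exact body is an assumption:

```cpp
#include <cstddef>

// Scalar (SISD) addition: elements are added one pair at a time,
// compiling to one sequential add instruction per element.
void add(const float* a, const float* b, float* out, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        out[i] = a[i] + b[i];
}
```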

Page 15: Introduction to SIMD Programming

SISD Simple Program Disassembly

[Figure: disassembly of function add, showing SISD register usage and sequential addition operations]

Page 16: Introduction to SIMD Programming

SIMD Simple Program Source, Part 1

Definition of vec4_t

The xmmintrin.h header gives the programmer access to intrinsic types and intrinsic operations

There is an alignment requirement: all four 32-bit floating-point values are stored in a 128-bit (16-byte) aligned intrinsic type

A four-dimensional vector is composed of x, y, z, and w components
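The slide's type definition is also an image; a plausible reconstruction of a vec4_t that meets the alignment requirement and exposes both the raw intrinsic value and named components (the union layout is an assumption consistent with the description above; anonymous structs in unions are a widely supported extension):

```cpp
#include <xmmintrin.h> // SSE intrinsic type __m128 and operations

// A 16-byte-aligned four-component vector. The union views the same
// 128 bits either as the raw SIMD type or as named x/y/z/w floats.
union alignas(16) vec4_t
{
    __m128 simd;
    struct { float x, y, z, w; };
};
```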

Page 17: Introduction to SIMD Programming

SIMD Simple Program Visualization

These example programs are using Streaming SIMD Extensions (SSE) intrinsics

1. SSE originally provided access to eight 128-bit registers, known as XMM0 through XMM7

2. SSE is not the only intrinsics instruction set; there are many others, such as AVX, AltiVec, and F16C

Page 18: Introduction to SIMD Programming

SIMD Simple Program Source, Part 2

Both functions compute the addition of two four-dimensional vectors

1. The first function uses normal SISD instructions, similar to our first SISD example

2. The second function uses _mm_add_ps, an intrinsic function that consumes two __m128 values, performs parallelized component addition using SIMD, and returns the resulting __m128

3. As shown in the first function, we typically pass objects to function parameters by const reference for read operations. With SIMD, however, the cost of indirection through references is ill-advised; it is cheaper to copy the values.
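The listing itself is an image in this transcript; a sketch of the two functions described above, reusing the vec4_t union from the earlier slide (the function names are illustrative, not taken from the slides):

```cpp
#include <xmmintrin.h>

// Same 16-byte-aligned vector type described on the earlier slide.
union alignas(16) vec4_t
{
    __m128 simd;
    struct { float x, y, z, w; };
};

// SISD version: four sequential scalar additions; arguments are
// passed by const reference, as is usual for read-only parameters.
vec4_t add_sisd(const vec4_t& a, const vec4_t& b)
{
    vec4_t r;
    r.x = a.x + b.x;
    r.y = a.y + b.y;
    r.z = a.z + b.z;
    r.w = a.w + b.w;
    return r;
}

// SIMD version: a single addps instruction adds all four components.
// Values are passed by copy so they can travel in XMM registers.
vec4_t add_simd(vec4_t a, vec4_t b)
{
    vec4_t r;
    r.simd = _mm_add_ps(a.simd, b.simd);
    return r;
}
```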

Page 19: Introduction to SIMD Programming

SIMD Simple Program Disassembly, Part 1

[Figure: partial disassembly of function add (SISD), showing FPU (SISD) register usage and sequential addition operations]

Page 20: Introduction to SIMD Programming

SIMD Simple Program Disassembly, Part 2

[Figure: disassembly of function add (SIMD), showing immediate storage into SIMD registers and a single, parallelized add operation using addps]

Page 21: Introduction to SIMD Programming

Challenges and Insights
A primer to thinking about SIMD programming

Page 22: Introduction to SIMD Programming

SIMD is not an automatic silver bullet

1. SIMD and intrinsic usage has a steep learning curve
   1. Documentation and good tutorials that show best practice are rare
   2. Advanced usage requires expert-level knowledge

2. It can be misused very easily
   1. Programmers may not fully understand or analyze the flow of data through SIMD registers
   2. This can ultimately result in slower programs

Page 23: Introduction to SIMD Programming

Awareness when using SIMD

1. Do not allow implicit casts between SIMD types and SISD scalar types
   1. To get scalar values into the right SIMD locations, values must be copied from SISD registers to SIMD registers
   2. There is a large performance penalty associated with this copying, since the FPU and SIMD pipelines must be flushed

2. Do not create a class to abstract the SIMD type; instead, use a typedef
   1. This makes it easier for the compiler to perform platform-specific optimizations

3. Exposing the raw SIMD value, as was shown with the Vector4 struct, benefits manual optimization

4. Do not mix operator overloading with SIMD
   1. This can prevent the compiler from properly reordering and scheduling instructions, because it is now bound by mathematical operation rules
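Points 2 and 4 above can be sketched together: a plain typedef of the intrinsic type plus free named functions instead of a wrapper class with overloaded operators (the names here are illustrative, not from the slides):

```cpp
#include <xmmintrin.h>

// A plain typedef: the compiler stays free to keep values in XMM
// registers and to reorder and schedule instructions per platform.
typedef __m128 simd4f;

// Free named functions instead of overloaded operators on a wrapper
// class, so the compiler is not bound by mathematical operation rules.
inline simd4f simd4f_add(simd4f a, simd4f b) { return _mm_add_ps(a, b); }
inline simd4f simd4f_mul(simd4f a, simd4f b) { return _mm_mul_ps(a, b); }
```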

Page 24: Introduction to SIMD Programming

How to approach SIMD

1. As with all things, jump in and start small

2. “Premature optimization is the root of all evil” – Donald Knuth
   1. SIMD usage should be the result of identifying a potential and/or needed optimization within a program’s run-time

3. Profile performance metrics as the program matures and SIMD usage increases

4. Ensure that the intrinsics used are compatible with all targeted hardware configurations
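One way to act on point 4 is a run-time feature check before dispatching to an intrinsics-based code path. A sketch using GCC/Clang's __builtin_cpu_supports — this builtin is compiler-specific and not from the slides; MSVC offers the __cpuid intrinsic for the same purpose:

```cpp
// Query CPU feature support at run time (GCC/Clang builtin) before
// selecting an SSE2 code path over a scalar fallback.
bool cpu_has_sse2()
{
    return __builtin_cpu_supports("sse2") != 0;
}
```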

Page 25: Introduction to SIMD Programming

Further Reading and References
For those who seek knowledge regarding SIMD

Page 26: Introduction to SIMD Programming

The definitive reference for this presentation

Computer Architecture: A Quantitative Approach, Fifth Edition

1. In-depth looks at modern CPU and GPU architectures

2. Thorough analysis of memory hierarchy design

3. Covers data-, instruction-, and thread-level parallelism

This book is a highly recommended read.

Page 27: Introduction to SIMD Programming

References

1. Corden, Martyn. "Intel® Developer Zone." Intel® Compiler Options for Intel® SSE and Intel® AVX Generation (SSE2, SSE3, SSSE3, ATOM_SSSE3, SSE4.1, SSE4.2, ATOM_SSE4.2, AVX, AVX2) and Processor-specific Optimizations. N.p., n.d. Web. 30 Sept. 2013.

2. Hennessy, John L., and David A. Patterson. Computer Architecture: A Quantitative Approach. Amsterdam: Elsevier, 2012. Print.

3. Jha, Ashish, and Darren Yee. "Intel® Developer Zone." Increasing Memory Throughput With Intel® Streaming SIMD Extensions 4 (Intel® SSE4) Streaming Load. N.p., n.d. Web. 30 Sept. 2013.

4. Siewert, Sam. "Intel® Developer Zone." Using Intel® Streaming SIMD Extensions and Intel® Integrated Performance Primitives to Accelerate Algorithms. N.p., n.d. Web. 30 Sept. 2013.