Accelerating Real-time Program Performance
An Introduction to SIMD and Hardware Intrinsics
Koray Hagen

Introduction to SIMD Programming


An introductory presentation for students regarding SIMD programming.


Page 1: Introduction to SIMD Programming

Accelerating Real-time Program Performance

Koray Hagen

An Introduction to SIMD and Hardware Intrinsics

Page 2: Introduction to SIMD Programming

Prerequisite Knowledge

1. Exposure to C++

2. Exposure to an assembly language

And that’s it.

Page 3: Introduction to SIMD Programming

The Agenda

1. Terminology and an introduction to SIMD

2. Examples and benefits of SIMD Programming in C++

3. Some tradeoffs, caveats, and insights

4. Closing thoughts and further reading

Page 4: Introduction to SIMD Programming

Instructions and Data
Classification of computer architectures

Page 5: Introduction to SIMD Programming

Terminology and Definitions

1. Concurrent – Events or processes which appear to occur or progress at the same time
   1. Related terms often heard:
      1. Threads
      2. Mutexes
      3. Semaphores
      4. Monitors

2. Parallel – Events or processes which do occur or progress at the same time
   1. Parallel programming is not the same as concurrent programming; they are distinct
   2. It refers to techniques that provide parallel execution of operations

SIMD is a computer architecture that makes use of parallel execution

Page 6: Introduction to SIMD Programming

Flynn’s Taxonomy

A classification of computer architectures that was proposed by Michael Flynn in 1966

1. Characterized by two criteria
   1. Parallelism exhibited in its instruction streams
   2. Parallelism exhibited in its data streams

2. Four distinct classifications
   1. SISD
   2. SIMD
   3. MISD
   4. MIMD

The instruction stream and data stream can both be either single (S) or multiple (M)

Page 7: Introduction to SIMD Programming

SISD – Single Instruction Single Data

1. Single CPU systems
   1. Specifically uniprocessors
   2. Co-processors do not count

2. Concurrent processing
   1. Pipelined execution
   2. Prefetching

3. Concurrent execution
   1. Independent concurrent tasks can execute different sequences of operations

4. Examples
   1. Personal computers

Page 8: Introduction to SIMD Programming

SIMD – Single Instruction Multiple Data

1. One instruction stream is broadcast to all processors
   1. In practice, each processor is simple, such as an ALU (Arithmetic Logic Unit)

2. All active processors execute the same instructions synchronously, but on different data

3. The data items are aligned in an array (or vector)

4. An instruction can act on the complete array in one cycle

5. Examples later on

Page 9: Introduction to SIMD Programming

MISD – Multiple Instruction Single Data

1. Uncommon architecture type that is used for specialized purposes
   1. Redundancy
   2. Fault tolerance
   3. Task replication

2. Example
   1. Space shuttle flight control computer

Page 10: Introduction to SIMD Programming

MIMD – Multiple Instruction Multiple Data

1. Processors are asynchronous
   1. Have the ability to independently execute on different data sets

2. Communication is handled by either
   1. Shared or distributed memory
   2. Message passing

3. Examples
   1. Computer-aided manufacturing
   2. Simulation
   3. Modeling
   4. Communication switches

Page 11: Introduction to SIMD Programming

The importance of SIMD

1. SIMD intrinsics are supported by most modern CPUs in commercial use
   1. As time passes, that support is ever increasing
   2. The Intel Haswell (2013) microarchitecture supports MMX, SSE 1 through SSE 4.2, AVX, AVX2, etc.
   3. The Intel Pentium (1996) microarchitecture supported MMX for the first time

2. SIMD as an architecture is conceptually easier for programmers to leverage, understand, and debug
   1. It bears large similarities to sequential and concurrent programming

3. SIMD usage has been successful in multimedia applications for decades
   1. Instruction- and data-level parallelism have many real-world applications

Page 12: Introduction to SIMD Programming

SIMD Programming
C++ parallel computation examples

Page 13: Introduction to SIMD Programming

But first, some additional information

1. I will be showing two real-world examples in this portion
   1. The first example is simple and showcases SISD
   2. The second example is simple and showcases SIMD

2. All programs were written in C++ on Windows 8.1, using Visual Studio 2013

3. Visual Studio 2013 features auto-vectorization, which I have turned off
   1. In other words, I have disabled the compiler's ability to use SIMD automatically

Page 14: Introduction to SIMD Programming

SISD Simple Program Source
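The slide's source listing is an image and is not reproduced in this transcript. A minimal sketch of the kind of scalar addition routine the slide likely shows — the function name `add` matches the disassembly caption on the next slide, but the exact body is an assumption:

```cpp
#include <cstddef>

// Scalar (SISD) addition: elements are added one pair at a time,
// compiling to one sequential add instruction per element.
void add(const float* a, const float* b, float* out, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        out[i] = a[i] + b[i];
}
```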

Page 15: Introduction to SIMD Programming

SISD Simple Program Disassembly

[Figure: disassembly of function add, showing SISD register usage and sequential addition operations]

Page 16: Introduction to SIMD Programming

SIMD Simple Program Source, Part 1

Definition of vec4_t

The xmmintrin.h header gives the programmer access to intrinsic types and intrinsic operations

There is an alignment requirement: all four 32-bit floating-point values are stored in a 128-bit (16-byte) aligned intrinsic type

A four-dimensional vector is composed of x, y, z, and w components
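The slide's type definition is also an image; a plausible reconstruction of a vec4_t that meets the alignment requirement and exposes both the raw intrinsic value and named components (the union layout is an assumption consistent with the description above; anonymous structs in unions are a widely supported extension):

```cpp
#include <xmmintrin.h> // SSE intrinsic type __m128 and operations

// A 16-byte-aligned four-component vector. The union views the same
// 128 bits either as the raw SIMD type or as named x/y/z/w floats.
union alignas(16) vec4_t
{
    __m128 simd;
    struct { float x, y, z, w; };
};
```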

Page 17: Introduction to SIMD Programming

SIMD Simple Program Visualization

These example programs are using Streaming SIMD Extensions (SSE) intrinsics

1. SSE originally provided access to eight 128-bit registers, known as XMM0 through XMM7

2. SSE is not the only intrinsics instruction set; there are many others, such as AVX, AltiVec, and F16C

Page 18: Introduction to SIMD Programming

SIMD Simple Program Source, Part 2

Both functions compute the addition of two four-dimensional vectors

1. The first function uses normal SISD instructions, similar to our first SISD example

2. The second function uses _mm_add_ps, an intrinsic function that consumes two __m128 values, performs parallelized component addition using SIMD, and returns the resulting __m128

3. As shown in the first function, we typically pass objects to function parameters by const reference for read operations. With SIMD, however, the cost of indirection through references is ill-advised; it is cheaper to copy the values.
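The listing itself is an image in this transcript; a sketch of the two functions described above, reusing the vec4_t union from the earlier slide (the function names are illustrative, not taken from the slides):

```cpp
#include <xmmintrin.h>

// Same 16-byte-aligned vector type described on the earlier slide.
union alignas(16) vec4_t
{
    __m128 simd;
    struct { float x, y, z, w; };
};

// SISD version: four sequential scalar additions; arguments are
// passed by const reference, as is usual for read-only parameters.
vec4_t add_sisd(const vec4_t& a, const vec4_t& b)
{
    vec4_t r;
    r.x = a.x + b.x;
    r.y = a.y + b.y;
    r.z = a.z + b.z;
    r.w = a.w + b.w;
    return r;
}

// SIMD version: a single addps instruction adds all four components.
// Values are passed by copy so they can travel in XMM registers.
vec4_t add_simd(vec4_t a, vec4_t b)
{
    vec4_t r;
    r.simd = _mm_add_ps(a.simd, b.simd);
    return r;
}
```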

Page 19: Introduction to SIMD Programming

SIMD Simple Program Disassembly, Part 1

[Figure: partial disassembly of function add (SISD), showing FPU (SISD) register usage and sequential addition operations]

Page 20: Introduction to SIMD Programming

SIMD Simple Program Disassembly, Part 2

[Figure: disassembly of function add (SIMD), showing immediate storage into SIMD registers and a single, parallelized add operation using addps]

Page 21: Introduction to SIMD Programming

Challenges and Insights
A primer to thinking about SIMD programming

Page 22: Introduction to SIMD Programming

SIMD is not an automatic silver bullet

1. SIMD and intrinsic usage has a steep learning curve
   1. Documentation and good tutorials that show best practice are rare
   2. Advanced usage requires expert-level knowledge

2. It can be misused very easily
   1. Programmers may not fully understand or analyze the flow of data through SIMD registers
   2. This can ultimately result in slower programs

Page 23: Introduction to SIMD Programming

Awareness when using SIMD

1. Do not allow implicit casts between SIMD types and SISD scalar types
   1. To get scalar values into the right SIMD locations, values must be copied from SISD registers to SIMD registers
   2. There is a large performance penalty associated with this copying, since the FPU and SIMD pipelines must be flushed

2. Do not create a class to abstract the SIMD type; instead, use a typedef
   1. This makes it easier for the compiler to perform platform-specific optimizations

3. Exposing the raw SIMD value, as was shown with the Vector4 struct, benefits manual optimization

4. Do not mix operator overloading with SIMD
   1. This can prevent the compiler from properly reordering and scheduling instructions, because it is now bound by mathematical operation rules
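Points 2 and 4 above can be sketched together: a plain typedef of the intrinsic type plus free named functions instead of a wrapper class with overloaded operators (the names here are illustrative, not from the slides):

```cpp
#include <xmmintrin.h>

// A plain typedef: the compiler stays free to keep values in XMM
// registers and to reorder and schedule instructions per platform.
typedef __m128 simd4f;

// Free named functions instead of overloaded operators on a wrapper
// class, so the compiler is not bound by mathematical operation rules.
inline simd4f simd4f_add(simd4f a, simd4f b) { return _mm_add_ps(a, b); }
inline simd4f simd4f_mul(simd4f a, simd4f b) { return _mm_mul_ps(a, b); }
```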

Page 24: Introduction to SIMD Programming

How to approach SIMD

1. As with all things, jump in and start small

2. “Premature optimization is the root of all evil” – Donald Knuth
   1. SIMD usage should be the result of identifying a potential and/or needed optimization within a program’s run-time

3. Profile performance metrics as the program matures and SIMD usage increases

4. Ensure that the intrinsics used are compatible with all targeted hardware configurations
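One way to act on point 4 is a run-time feature check before dispatching to an intrinsics-based code path. A sketch using GCC/Clang's __builtin_cpu_supports — this builtin is compiler-specific and not from the slides; MSVC offers the __cpuid intrinsic for the same purpose:

```cpp
// Query CPU feature support at run time (GCC/Clang builtin) before
// selecting an SSE2 code path over a scalar fallback.
bool cpu_has_sse2()
{
    return __builtin_cpu_supports("sse2") != 0;
}
```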

Page 25: Introduction to SIMD Programming

Further Reading and References
For those who seek knowledge regarding SIMD

Page 26: Introduction to SIMD Programming

The definitive reference for this presentation

Computer Architecture: A Quantitative Approach, Fifth Edition

1. In-depth looks at modern CPU and GPU architectures

2. Thorough analysis of memory hierarchy design

3. Covers data-, instruction-, and thread-level parallelism

This book is a highly recommended read.

Page 27: Introduction to SIMD Programming

References

1. Corden, Martyn. "Intel® Developer Zone." Intel® Compiler Options for Intel® SSE and Intel® AVX Generation (SSE2, SSE3, SSSE3, ATOM_SSSE3, SSE4.1, SSE4.2, ATOM_SSE4.2, AVX, AVX2) and Processor-specific Optimizations. N.p., n.d. Web. 30 Sept. 2013.

2. Hennessy, John L., and David A. Patterson. Computer Architecture: A Quantitative Approach. Amsterdam: Elsevier, 2012. Print.

3. Jha, Ashish, and Darren Yee. "Intel® Developer Zone." Increasing Memory Throughput With Intel® Streaming SIMD Extensions 4 (Intel® SSE4) Streaming Load. N.p., n.d. Web. 30 Sept. 2013.

4. Siewert, Sam. "Intel® Developer Zone." Using Intel® Streaming SIMD Extensions and Intel® Integrated Performance Primitives to Accelerate Algorithms. N.p., n.d. Web. 30 Sept. 2013.