Multi-core Programming: Threading Concepts

Page 1: Multi-core Programming

Multi-core Programming

Threading Concepts

Page 2: Multi-core Programming

Topics

• A Generic Development Cycle
• Case Study: Prime Number Generation
• Common Performance Issues

Page 3: Multi-core Programming

What is Parallelism?

• Two or more processes or threads execute at the same time

• Parallelism for threading architectures
  – Multiple processes
    • Communication through Inter-Process Communication (IPC)
  – Single process, multiple threads
    • Communication through shared memory

Page 4: Multi-core Programming

Amdahl’s Law

Describes the upper bound of parallel execution speedup

  Tparallel = {(1-P) + P/n} Tserial
  Speedup   = Tserial / Tparallel

  P = proportion of the computation where parallelization may occur
  n = number of processors

Serial code limits speedup. Example with P = 0.5:
  n = 2: Tparallel = (0.5 + 0.25) Tserial, so Speedup = 1.0/0.75 ≈ 1.33
  n = ∞: Tparallel = (0.5 + 0.0) Tserial, so Speedup = 1.0/0.5 = 2.0
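As a quick illustration, here is a small C sketch (not part of the original slides) that evaluates Amdahl’s Law; the function name amdahl_speedup is made up for this example:

#include <stdio.h>

/* Upper-bound speedup predicted by Amdahl's Law for a parallel
   fraction P of the work running on n processors.              */
double amdahl_speedup(double P, int n)
{
    double t_parallel = (1.0 - P) + P / n;   /* normalized to Tserial = 1 */
    return 1.0 / t_parallel;
}

int main(void)
{
    printf("P = 0.5, n = 2:    %.2fX\n", amdahl_speedup(0.5, 2));  /* 1.33X */
    printf("P = 0.5, n -> inf: %.2fX\n", 1.0 / (1.0 - 0.5));       /* 2.00X */
    return 0;
}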

Page 5: Multi-core Programming

Processes and Threads

• Modern operating systems load programs as processes

  – Resource holder
  – Execution

• A process starts executing at its entry point as a thread

• Threads can create other threads within the process
  – Each thread gets its own stack

• All threads within a process share code & data segments

[Diagram: one process with shared code and data segments; the main() thread and the worker threads it creates each have their own stack]
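A minimal Win32 sketch (not from the deck) of the picture above: gShared lives in the shared data segment and is visible to every thread, while each thread’s local variable lives on that thread’s own stack. The Worker function and the variable names are invented for illustration:

#include <windows.h>
#include <stdio.h>

static int gShared = 42;                   /* data segment: shared by all threads   */

static DWORD WINAPI Worker(LPVOID arg)     /* each thread starts here               */
{
    int local = (int)(INT_PTR)arg;         /* lives on this thread's private stack  */
    printf("thread %lu sees gShared = %d, local = %d\n",
           (unsigned long)GetCurrentThreadId(), gShared, local);
    return 0;
}

int main(void)
{
    HANDLE t[2];
    for (int i = 0; i < 2; i++)            /* main() thread creates two more threads */
        t[i] = CreateThread(NULL, 0, Worker, (LPVOID)(INT_PTR)i, 0, NULL);
    WaitForMultipleObjects(2, t, TRUE, INFINITE);
    for (int i = 0; i < 2; i++)
        CloseHandle(t[i]);
    return 0;
}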

Page 6: Multi-core Programming

Threads – Benefits & Risks

• Benefits
  – Increased performance and better resource utilization
    • Even on single-processor systems – for hiding latency and increasing throughput
  – IPC through shared memory is more efficient
• Risks
  – Increases complexity of the application
  – Difficult to debug (data races, deadlocks, etc.)

Page 7: Multi-core Programming

Commonly Encountered Questions with Threading Applications

• Where to thread?
• How long would it take to thread?
• How much re-design/effort is required?
• Is it worth threading a selected region?
• What should the expected speedup be?
• Will the performance meet expectations?
• Will it scale as more threads/data are added?
• Which threading model to use?

Page 8: Multi-core Programming

Prime Number Generation

bool TestForPrime(int val)
{
    // let's start checking from 3
    int limit, factor = 3;
    limit = (long)(sqrtf((float)val) + 0.5f);
    while( (factor <= limit) && (val % factor) )
        factor++;

    return (factor > limit);
}

void FindPrimes(int start, int end)
{
    int range = end - start + 1;
    for( int i = start; i <= end; i += 2 )
    {
        if( TestForPrime(i) )
            globalPrimes[gPrimesFound++] = i;
        ShowProgress(i, range);
    }
}

Factors tested for each candidate i:

  i    factors
  61   3 5 7
  63   3
  65   3 5
  67   3 5 7
  69   3
  71   3 5 7
  73   3 5 7 9
  75   3 5
  77   3 5 7
  79   3 5 7 9
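For context, a minimal driver sketch (not on the slide) showing the globals the code above assumes; MAX_PRIMES and the hard-coded range stand in for the real project’s command-line handling:

#include <stdio.h>
#include <stdbool.h>

#define MAX_PRIMES 100000                /* made-up capacity for this sketch       */
int  globalPrimes[MAX_PRIMES];           /* shared result array used by FindPrimes */
int  gPrimesFound = 0;                   /* shared counter of primes found         */

bool TestForPrime(int val);              /* defined on the slide above             */
void FindPrimes(int start, int end);     /* defined on the slide above             */
void ShowProgress(int val, int range);   /* progress display, shown later in deck  */

int main(void)
{
    FindPrimes(3, 1000000);              /* e.g., PrimeSingle 1 1000000            */
    printf("\n%d primes found\n", gPrimesFound);
    return 0;
}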

Page 9: Multi-core Programming

Development Methodology

• Analysis
  – Find computationally intensive code
• Design (Introduce Threads)
  – Determine how to implement the threading solution
• Debug for correctness
  – Detect any problems resulting from using threads
• Tune for performance
  – Achieve the best parallel performance

Page 10: Multi-core Programming

Development Cycle

• Analysis
  – VTune™ Performance Analyzer
• Design (Introduce Threads)
  – Intel® Performance Libraries: IPP and MKL
  – OpenMP* (Intel® Compiler)
  – Explicit threading (Win32*, Pthreads*)
• Debug for correctness
  – Intel® Thread Checker
  – Intel Debugger
• Tune for performance
  – Intel® Thread Profiler
  – VTune™ Performance Analyzer

Page 11: Multi-core Programming

Analysis - Sampling

• Use VTune™ Sampling to find hotspots in the application
• Let’s use the project PrimeSingle for analysis
  – PrimeSingle <start> <end>
  – Usage: ./PrimeSingle 1 1000000

bool TestForPrime(int val)
{
    // let's start checking from 3
    int limit, factor = 3;
    limit = (long)(sqrtf((float)val) + 0.5f);
    while( (factor <= limit) && (val % factor) )
        factor++;

    return (factor > limit);
}

void FindPrimes(int start, int end)
{
    // start is always odd
    int range = end - start + 1;
    for( int i = start; i <= end; i += 2 )
    {
        if( TestForPrime(i) )
            globalPrimes[gPrimesFound++] = i;
        ShowProgress(i, range);
    }
}

Identifies the time-consuming regions

Page 12: Multi-core Programming

Analysis - Call Graph

This is the level in the call tree where we need to thread

Used to find the proper level in the call tree at which to thread

Page 13: Multi-core Programming

Analysis

• Where to thread?
  – FindPrimes()
• Is it worth threading a selected region?
  – Appears to have minimal dependencies
  – Appears to be data-parallel
  – Consumes over 95% of the run time

Baseline measurement

Page 14: Multi-core Programming

Foster’s Design Methodology

• From Designing and Building Parallel Programs by Ian Foster

• Four steps:
  – Partitioning
    • Dividing computation and data
  – Communication
    • Sharing data between computations
  – Agglomeration
    • Grouping tasks to improve performance
  – Mapping
    • Assigning tasks to processors/threads

Page 15: Multi-core Programming

Designing Threaded Programs

• Partition

– Divide problem into tasks

• Communicate

– Determine amount and pattern of communication

• Agglomerate

– Combine tasks

• Map

– Assign agglomerated tasks to created threads

[Diagram: The Problem → initial tasks → communication → combined tasks → final program]

Page 16: Multi-core Programming

Parallel Programming Models

• Functional Decomposition
  – Task parallelism
  – Divide the computation, then associate the data
  – Independent tasks of the same problem

• Data Decomposition
  – Same operation performed on different data
  – Divide the data into pieces, then associate computation
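A minimal OpenMP sketch (not from the deck) contrasting the two models; the work functions solveSubproblemA/B and processChunk are hypothetical stand-ins:

#include <stdio.h>
#include <omp.h>

/* Hypothetical work functions, just for illustration */
static void solveSubproblemA(void) { printf("task A on thread %d\n", omp_get_thread_num()); }
static void solveSubproblemB(void) { printf("task B on thread %d\n", omp_get_thread_num()); }
static void processChunk(int i)    { (void)i; /* same operation on element i */ }

int main(void)
{
    /* Functional decomposition: independent tasks of the same problem */
    #pragma omp parallel sections
    {
        #pragma omp section
        solveSubproblemA();
        #pragma omp section
        solveSubproblemB();
    }

    /* Data decomposition: the same operation applied to different pieces of the data */
    int n = 1000;
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        processChunk(i);

    return 0;
}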

Page 17: Multi-core Programming

Decomposition Methods

• Functional Decomposition
  – Focusing on computations can reveal structure in a problem
  – Example: a coupled simulation whose independent components are an Atmosphere Model, Ocean Model, Land Surface Model, and Hydrology Model
    (Grid reprinted with permission of Dr. Phu V. Luong, Coastal and Hydraulics Laboratory, ERDC)

• Domain Decomposition
  – Focus on the largest or most frequently accessed data structure
  – Data parallelism: the same operation applied to all data

Page 18: Multi-core Programming

Pipelined Decomposition

• Computation is done in independent stages
• Functional decomposition
  – Threads are assigned a stage to compute
  – Analogy: an automobile assembly line
• Data decomposition
  – Each thread processes all stages of a single instance
  – Analogy: one worker builds an entire car

Page 19: Multi-core Programming

LAME Pipeline Strategy

Stages for each frame:
  – Prelude: fetch next frame, frame characterization, set encode parameters
  – Acoustics: psycho analysis, FFT long/short, filter assemblage
  – Encoding: apply filtering, noise shaping, quantize & count bits
  – Other: add frame header, check correctness, write to disk

[Timeline: threads T1–T4 each work on a different frame (N, N+1, N+2, N+3) at the same time, with a hierarchical barrier keeping each frame’s stages in order behind the preceding frame]

Page 20: Multi-core Programming

Design

• What is the expected benefit?

• How do you achieve this with the least effort?

• How long would it take to thread?
• How much re-design/effort is required?

Rapid prototyping with OpenMP

With ~96% of the run time in FindPrimes(), Amdahl’s Law predicts Speedup(2P) = 100/(96/2 + 4) ≈ 1.92X

Page 21: Multi-core Programming

OpenMP

• Fork-join parallelism:
  – Master thread spawns a team of threads as needed
  – Parallelism is added incrementally: the sequential program evolves into a parallel program

[Diagram: the master thread forks a team of threads at each parallel region and joins them at the end of the region]
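A minimal fork-join sketch (not from the deck): the master thread forks a team for the parallel region and continues alone after the join:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("before: only the master thread\n");

    #pragma omp parallel          /* fork: master spawns a team of threads */
    {
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                             /* join: team disbands, master continues */

    printf("after: back to the master thread\n");
    return 0;
}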

Page 22: Multi-core Programming

Design

#pragma omp parallel for
    for( int i = start; i <= end; i += 2 )
    {
        if( TestForPrime(i) )
            globalPrimes[gPrimesFound++] = i;
        ShowProgress(i, range);
    }

OpenMP creates the threads for this parallel region and divides the iterations of the for loop among them

Page 23: Multi-core Programming

Design

• What is the expected benefit?
• How do you achieve this with the least effort?
• How long would it take to thread?
• How much re-design/effort is required?
• Is this the best speedup possible?

Speedup of 1.40X (less than 1.92X)

Page 24: Multi-core Programming

Debugging for Correctness

• Is this threaded implementation right?
• No! The answers are different each time …

Page 25: Multi-core Programming

Debugging for Correctness

Intel® Thread Checker pinpoints notorious threading bugs like data races, stalls, and deadlocks

[Workflow: Primes.exe → binary instrumentation → Primes.exe and DLLs (instrumented) → runtime data collector → threadchecker.thr (result file), viewed with Intel® Thread Checker inside the VTune™ Performance Analyzer]

Page 26: Multi-core Programming

Thread Checker

Page 27: Multi-core Programming

Debugging for Correctness

• How much re-design/effort is required?

• How long would it take to thread?

Thread Checker reported only 2 dependencies, so the effort required should be low

Page 28: Multi-core Programming

Debugging for Correctness

#pragma omp parallel for
    for( int i = start; i <= end; i += 2 )
    {
        if( TestForPrime(i) )
#pragma omp critical
            globalPrimes[gPrimesFound++] = i;
        ShowProgress(i, range);
    }

#pragma omp critical
    {
        gProgress++;
        percentDone = (int)(gProgress/range*200.0f + 0.5f);
    }

Will create a critical section for this reference

Will create a critical section for both these references

Page 29: Multi-core Programming

Correctness

• Correct answer, but performance has slipped to ~1.33X

• Is this the best we can expect from this algorithm?

No! From Amdahl’s Law, we expect speedup close to 1.9X

Page 30: Multi-core Programming

Common Performance Issues

• Parallel overhead
  – Due to thread creation, scheduling, …
• Synchronization
  – Excessive use of global data, contention for the same synchronization object
• Load imbalance
  – Improper distribution of parallel work
• Granularity
  – Not enough parallel work per thread
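As a rough illustration of parallel overhead (not from the deck), the sketch below times an almost-empty parallel region; the absolute numbers are system-dependent:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    double t0 = omp_get_wtime();
    for (int r = 0; r < 1000; r++) {
        #pragma omp parallel
        { /* no useful work: all cost is thread creation/scheduling overhead */ }
    }
    double t1 = omp_get_wtime();
    printf("avg overhead per parallel region: %g us\n",
           (t1 - t0) / 1000 * 1e6);
    return 0;
}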

Page 31: Multi-core Programming

Tuning for Performance

Thread Profiler pinpoints performance bottlenecks in threaded applications

[Workflow: either compile Primes.c with source instrumentation (/Qopenmp_profile) or apply binary instrumentation to Primes.exe and its DLLs; the instrumented Primes.exe feeds the runtime data collector, which writes Bistro.tp / guide.gvs (result files) viewed with Thread Profiler inside the VTune™ Performance Analyzer]

Page 32: Multi-core Programming

Thread Profiler for OpenMP

Page 33: Multi-core Programming

Thread Profiler for OpenMP: Speedup Graph

• Estimates threading speedup and potential speedup
  – Based on Amdahl’s Law computation
  – Gives upper and lower bound estimates

Page 34: Multi-core Programming

Thread Profiler for OpenMP

[Timeline view: serial regions before and after the parallel region]

Page 35: Multi-core Programming

Thread Profiler for OpenMP

[Per-thread view: Thread 0, Thread 1, Thread 2, Thread 3]

Page 36: Multi-core Programming

Thread Profiler (for Explicit Threads)

Page 37: Multi-core Programming

Thread Profiler (for Explicit Threads)

Why so many transitions?

Page 38: Multi-core Programming

Performance

• This implementation has implicit synchronization calls
• This limits scaling performance due to the resulting context switches

Back to the design stage

Page 39: Multi-core Programming

Performance

• Is that much contention expected?
  – The algorithm has many more updates than the 10 needed for showing progress

Original:

void ShowProgress( int val, int range )
{
    int percentDone;
    gProgress++;
    /* x200 because only odd candidates are tested, so gProgress only reaches range/2 */
    percentDone = (int)((float)gProgress/(float)range*200.0f + 0.5f);

    if( percentDone % 10 == 0 )
        printf("\b\b\b\b%3d%%", percentDone);
}

Modified:

void ShowProgress( int val, int range )
{
    int percentDone;
    static int lastPercentDone = 0;
#pragma omp critical
    {
        gProgress++;
        percentDone = (int)((float)gProgress/(float)range*200.0f + 0.5f);
    }
    if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 )
    {
        printf("\b\b\b\b%3d%%", percentDone);
        lastPercentDone++;
    }
}

This change should fix the contention issue

Page 40: Multi-core Programming

Design

• Goals
  – Eliminate the contention due to implicit synchronization

Speedup is 2.32X ! Is that right?

Page 41: Multi-core Programming

Performance

• Our original baseline measurement had the “flawed” progress update algorithm

• Is this the best we can expect from this algorithm?

Speedup is actually 1.40X (<<1.9X)!

Page 42: Multi-core Programming

Performance Re-visited

Still have 62% of the execution time in locks and synchronization

Page 43: Multi-core Programming

Performance Re-visited

Let’s look at the OpenMP locks…

Before (the lock is inside the loop):

void FindPrimes(int start, int end)
{
    // start is always odd
    int range = end - start + 1;

#pragma omp parallel for
    for( int i = start; i <= end; i += 2 )
    {
        if( TestForPrime(i) )
#pragma omp critical
            globalPrimes[gPrimesFound++] = i;
        ShowProgress(i, range);
    }
}

After (replace the critical section with an atomic increment; gPrimesFound assumed declared as a volatile LONG for InterlockedIncrement):

void FindPrimes(int start, int end)
{
    // start is always odd
    int range = end - start + 1;

#pragma omp parallel for
    for( int i = start; i <= end; i += 2 )
    {
        if( TestForPrime(i) )
            // -1 keeps the original post-increment indexing (InterlockedIncrement returns the new value)
            globalPrimes[InterlockedIncrement(&gPrimesFound) - 1] = i;
        ShowProgress(i, range);
    }
}
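If the Win32 interlocked intrinsics are not available, a portable variant of the function above (not from the deck) can use OpenMP’s atomic capture clause to do the same fetch-and-increment:

void FindPrimes(int start, int end)
{
    // start is always odd
    int range = end - start + 1;

#pragma omp parallel for
    for( int i = start; i <= end; i += 2 )
    {
        if( TestForPrime(i) )
        {
            int idx;
            #pragma omp atomic capture
            idx = gPrimesFound++;        /* atomically fetch the old count and increment it */
            globalPrimes[idx] = i;
        }
        ShowProgress(i, range);
    }
}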

Page 44: Multi-core Programming

Performance Re-visited

Let’s look at the second lock

Before (this lock is also acquired inside the loop, once per candidate):

void ShowProgress( int val, int range )
{
    int percentDone;
    static int lastPercentDone = 0;

#pragma omp critical
    {
        gProgress++;
        percentDone = (int)((float)gProgress/(float)range*200.0f + 0.5f);
    }
    if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 )
    {
        printf("\b\b\b\b%3d%%", percentDone);
        lastPercentDone++;
    }
}

After (replace the critical section with an atomic increment; gProgress assumed declared as a volatile LONG):

void ShowProgress( int val, int range )
{
    long percentDone, localProgress;
    static int lastPercentDone = 0;

    localProgress = InterlockedIncrement(&gProgress);
    percentDone = (int)((float)localProgress/(float)range*200.0f + 0.5f);
    if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 )
    {
        printf("\b\b\b\b%3d%%", percentDone);
        lastPercentDone++;
    }
}

Page 45: Multi-core Programming

Thread Profiler for OpenMP

[Per-thread work for the range 1–1,000,000, split into four contiguous blocks at 250000, 500000, 750000, 1000000:
  Thread 0: ~342 factors to test near i = 116747
  Thread 1: ~612 factors to test near i = 373553
  Thread 2: ~789 factors to test near i = 623759
  Thread 3: ~934 factors to test near i = 873913
Higher-numbered threads get larger candidates and therefore more work]

Page 46: Multi-core Programming

Fixing the Load Imbalance

Distribute the work more evenly

void FindPrimes(int start, int end)
{
    // start is always odd
    int range = end - start + 1;

    // schedule(static, 8): chunks of 8 iterations are dealt out round-robin,
    // interleaving small and large candidates across the threads
#pragma omp parallel for schedule(static, 8)
    for( int i = start; i <= end; i += 2 )
    {
        if( TestForPrime(i) )
            globalPrimes[InterlockedIncrement(&gPrimesFound) - 1] = i;   // -1 keeps the original indexing
        ShowProgress(i, range);
    }
}

Speedup achieved is 1.68X
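An alternative sketch (not from the deck): replacing the loop above with dynamic scheduling lets idle threads grab the next chunk at run time, which can also even out the imbalance at the cost of a little scheduling overhead:

#pragma omp parallel for schedule(dynamic, 16)
    for( int i = start; i <= end; i += 2 )
    {
        if( TestForPrime(i) )
            globalPrimes[InterlockedIncrement(&gPrimesFound) - 1] = i;
        ShowProgress(i, range);
    }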

Page 47: Multi-core Programming

Final Thread Profiler Run

Speedup achieved is 1.80X

Page 48: Multi-core Programming

Comparative Analysis

Threading applications require multiple iterations of going through the software development cycle

Page 49: Multi-core Programming

Threading Methodology: What’s Been Covered

• Four-step development cycle for writing threaded code from serial code, and the Intel® tools that support each step
  – Analysis
  – Design (Introduce Threads)
  – Debug for correctness
  – Tune for performance

• Threading applications require multiple iterations of the design, debug, and performance-tuning steps

• Use tools to improve productivity