Introduction to Parallel Programming and Algorithms CS599 David Monismith

Page 1: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Introduction to Parallel Programming and Algorithms

CS599
David Monismith

Page 2: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Assumptions

• You have some basic understanding of processes and threads

• You have experience programming with both a plain text editor (e.g., Notepad) and an IDE

• You have had experience programming in Java and in creating data structures

• You will be able to pick up C/C++ programming and learn to use a remote Linux system relatively quickly

Page 3: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Outline

• Serial Computing
• Moore's Law
• The Power Wall
• Why Parallel Programming?
• Amdahl's Law
• Flynn's Taxonomy
• Hello world!
  – Thread-based
  – Directive-based
  – Process-based

Page 4: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Serial Computing

• Instructions executed one at a time.
• Advances in microprocessor technology over the past ten years have put multiprocessor chips in almost every new laptop and desktop.

• Importance of concurrency in computation recognized for many years.

• Serial processors have made use of concurrency (they have given the appearance of parallelism) for years.

Page 5: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Moore’s Law - Gordon Moore, Intel Corp.

• Gordon Moore (co-founder of Intel) coined Moore's Law:
  – The number of transistors on a chip doubles roughly every 18 months.
• For some time, processor clock speeds followed this trend as well, but eventually a power wall was reached.

Page 6: Introduction to Parallel Programming and Algorithms CS599 David Monismith

http://en.wikipedia.org/wiki/Moore%27s_law#mediaviewer/File:Transistor_Count_and_Moore%27s_Law_-_2011.svg

Page 7: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Energy Consumption and the Power Wall

• How is power used by the CPU? How much power is required per unit of computation?
  – Note that there is a "power wall" that currently limits on-chip clock frequency.
  – Dynamic power grows roughly with the square of the supply voltage times the clock frequency, and reaching higher frequencies generally requires higher voltage (the relation is summarized below).
• The voltage applied is directly related to power (wattage), which in turn is directly related to heat.
• Faster clocks = more heat.
• Since the late 1990s/early 2000s, there has been a push to add more parallelism on chip rather than higher clock speeds.
• Dr. Edward Bosworth provides a more detailed description of the "power wall" at http://www.edwardbosworth.com/My5155_Slides/Chapter01/ThePowerWall.doc

Page 8: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Why Parallel Computing?

• Ubiquity of parallel hardware/architectures
• Lowered cost of components
• Increased ease of parallel programming

Page 9: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Means of Parallel Computing

• Multiple datapaths
  – For example, superscalar systems, additional busses, etc.
• Higher accessibility to storage (e.g., parallel data access)
  – PVFS/Lustre filesystems
• Performance that scales
  – Improved performance both with better resources on a system (scaling up) and by using more systems (scaling out)
• Threads – lightweight processes
• Processes – running programs

Page 10: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Parallel Computing

• Simply put, using more than one computational resource to solve a computational problem.

Page 11: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Lowered Cost of Components

• Xeon E5-2680 - $1745 tray price - 12 CPU cores, 2 threads/core, 2.5 GHz, 30 MB L3 cache, 120 W
  – Source: http://intel.com
• ARM Cortex-A9 (AM4379, 32-bit) - $15/unit volume price - 1 A9 core + 4 PRU-ICSS (Programmable Real-time Unit and Industrial Communication Subsystem) cores, 1 GHz, 256 KB L2 cache, approx. 1 W max
  – Sources: http://www.ti.com/product/AM4379/technicaldocuments?dcmp=dsproject&hqs=td&#doctype2
  – http://linuxgizmos.com/ti-spins-cortex-a9-sitara-soc/

Page 12: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Examples of Parallel Computing Problems

• Web servers (Amazon.com, Google, etc.)
• Database servers (University data system)
• Graphics and visualization (XSEDE)
• Weather (forecasting – NOAA Rapid Refresh)
• Biology (DNA/RNA Longest Common Subsequence)
• Physics (firing a hydrogen ion at a surface of metal atoms)
• Many more . . .

Page 13: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Speedup

• Parallelizing a problem can result in speedup.
• Patterson and Hennessy define performance as 1/(execution time) for a given task.
• Performance A / Performance B is the speedup of Machine A over Machine B.
• If PerfA/PerfB = 2, then Machine A is twice as fast as Machine B for this task.
• This means the speedup for running this task on Machine A is 2.
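Written as a formula (a restatement of the definition above, not new material):

$\text{Speedup}_{A/B} = \dfrac{\text{Performance}_A}{\text{Performance}_B} = \dfrac{\text{Execution time}_B}{\text{Execution time}_A}$

For example, if a task takes 10 s on Machine B and 5 s on Machine A, the speedup of A over B is 10/5 = 2.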

Page 14: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Amdahl's Law
• But wait! We can't speed up every task.
• Some tasks are inherently serial, and sometimes a group of parallel tasks must complete before a program can continue.
• So, there are limitations to parallelization.
• Gene Amdahl observed that "a program can only run as fast as its slowest part."
• When speeding up a sequential program with parallelism, the runtime can only improve down to the runtime of the part that cannot be parallelized.
• Effectively, you can't go any faster than your slowest part.
• The serial portion dictates the highest possible speedup, in both hardware and software.
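In equation form (the standard statement of Amdahl's Law; the symbols p and N are introduced here for illustration and do not appear in the slides): if a fraction p of the work can be parallelized across N processors, the speedup is

$S(N) = \dfrac{1}{(1 - p) + p/N}, \qquad \lim_{N \to \infty} S(N) = \dfrac{1}{1 - p}.$

The next slide works this out with concrete runtimes (there, p = 80/90, so the best possible speedup is 9).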

Page 15: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Computing Parallel Execution Time

New_Execution_Time = Parallelizable_Execution_Time / Parallel_Factor + Sequential_Execution_Time

• Assume a program with a 90 s run time on a sequential processor, where 80 s is parallelizable. 10 s is the fastest the program could ever run. Note: 10 s = 80 s/infinity + 10 s.
• If we use a dual core processor, the run time is 50 s = 80 s/2 + 10 s.
• If we use a quad core processor, the run time is 30 s = 80 s/4 + 10 s.
• If we use an 8 core processor, the run time is 20 s = 80 s/8 + 10 s.

The speedup using a quad (4) core processor vs. a sequential processor is 90 s (seq. runtime) / 30 s (quad core runtime) = 3.

The maximum theoretical speedup is 90 s (seq. runtime) / 10 s (infinite parallelism) = 9.

• Based on Amdahl's Law, we should make the most commonly executed code (e.g., loops) parallel if possible.
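A minimal C sketch of the arithmetic above (the function and variable names are illustrative and not part of the course code):

#include <stdio.h>

/* Amdahl-style model: the parallelizable part shrinks with the core count,
   while the sequential part stays fixed. */
double new_execution_time(double parallel_s, double sequential_s, int cores)
{
    return parallel_s / cores + sequential_s;
}

int main(void)
{
    const double parallel_s = 80.0, sequential_s = 10.0;
    const double sequential_total = parallel_s + sequential_s;  /* 90 s */
    int cores[] = {2, 4, 8};
    int i;

    for (i = 0; i < 3; i++) {
        double t = new_execution_time(parallel_s, sequential_s, cores[i]);
        printf("%d cores: %.0f s, speedup %.2f\n", cores[i], t, sequential_total / t);
    }
    /* As the core count grows, the runtime approaches 10 s and the speedup approaches 9. */
    return 0;
}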

Page 16: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Parallel Patterns (According to Michael McCool of Intel)

• Superscalar Sequences/Task Graphs
• Speculative Selection
• Map
• Gather
• Stencils
• Partition
• Reduce
• Pack
• Scatter
• Histogram

Page 17: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Memory/Disk Speed Issues

Graph source: Hennessy and Patterson, Computer Architecture: A Quantitative Approach
Image source: http://dave.cheney.net/wp-content/uploads/2014/06/Gocon-2014-10.jpg

Page 18: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Flynn’s Taxonomy

• SISD = Single Instruction, Single Data = Serial Programming
• SIMD = Single Instruction, Multiple Data = Implicit Parallelism (instruction/architecture level)
• MISD = Multiple Instruction, Single Data (rarely implemented)
• MIMD = Multiple Instruction, Multiple Data = Multiprocessor

                       Single Data    Multiple Data
Single Instruction     SISD           SIMD
Multiple Instruction   MISD           MIMD

Page 19: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Flynn’s Taxonomy

• SIMD instructions and architectures allow for implicit parallelism when writing programs (a short example follows below)

• To provide a sense of how these work, pipelines and superscalar architectures are discussed in the next set of slides

• Our focus, however, will be on MIMD through the use of processes and threads, and we will look at examples shortly
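As a small illustration of the implicit (SIMD) parallelism mentioned above, here is a generic sketch (not from the slides): the iterations of this loop are independent, so an optimizing compiler (e.g., gcc at -O3) may auto-vectorize it and process several array elements with one SIMD instruction.

/* Independent iterations: a candidate for compiler auto-vectorization (SIMD). */
void add_arrays(float * a, const float * b, const float * c, int n)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}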

Page 20: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Pipelined Instructions
• Divide instructions into multiple stages.
• Allow at most one instruction to occupy a given stage at a time.
• The maximum speedup from pipelining is ideally equal to the number of stages (throughput approaches one instruction per cycle).
• Memory access is performed only for instructions that require it; otherwise that stage is generally skipped.

Instruction Stages

IF = Instruction Fetch

ID = Instruction Decode

EX = Execute

MA = Memory Access

WB = Writeback

Page 21: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Pipelined Instructions

One new instruction enters the pipeline each cycle; once the pipeline is full, one instruction completes per cycle.

Cycle:     1   2   3   4   5   6   7   8   9   10  11  12  13
Instr 1:   IF  ID  EX  MA  WB
Instr 2:       IF  ID  EX  MA  WB
Instr 3:           IF  ID  EX  MA  WB
Instr 4:               IF  ID  EX  MA  WB
Instr 5:                   IF  ID  EX  MA  WB
Instr 6:                       IF  ID  EX  MA  WB
Instr 7:                           IF  ID  EX  MA  WB
Instr 8:                               IF  ID  EX  MA  WB
Instr 9:                                   IF  ID  EX  MA  WB

Page 22: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Superscalar Architecture

• Allow for more than one pipeline to run at the same time

• Allows for parallelism, but only provides ideal speedup if instructions are independent

• Most instructions are, however, not independent (e.g., dependent arithmetic operations and branches), so complex logic may be needed for branch prediction and hazard detection.

Page 23: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Superscalar Pipeline

Two instructions are issued per cycle (a two-wide superscalar pipeline):

Cycle:      1   2   3   4   5   6   7   8   9   10  11  12  13
Instr 1:    IF  ID  EX  MA  WB
Instr 2:    IF  ID  EX  MA  WB
Instr 3:        IF  ID  EX  MA  WB
Instr 4:        IF  ID  EX  MA  WB
Instr 5:            IF  ID  EX  MA  WB
Instr 6:            IF  ID  EX  MA  WB
Instr 7:                IF  ID  EX  MA  WB
Instr 8:                IF  ID  EX  MA  WB
Instr 9:                    IF  ID  EX  MA  WB
Instr 10:                   IF  ID  EX  MA  WB
Instr 11:                       IF  ID  EX  MA  WB
Instr 12:                       IF  ID  EX  MA  WB
Instr 13:                           IF  ID  EX  MA  WB
Instr 14:                           IF  ID  EX  MA  WB
Instr 15:                               IF  ID  EX  MA  WB
Instr 16:                               IF  ID  EX  MA  WB
Instr 17:                                   IF  ID  EX  MA  WB
Instr 18:                                   IF  ID  EX  MA  WB

Page 24: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Processes and Threads

• Our main focus for this course

• These exist only at execution time

• They have fast state changes, e.g., between running in memory and waiting

• A process:
  – is a fundamental computation unit
  – can have one or more threads
  – is handled by the process management module
  – requires system resources

Page 25: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Process

• Process (job): a program in execution, ready to execute, or waiting for execution.
• A program is static, whereas a process is dynamic.
• We will implement processes using an API called the Message Passing Interface (MPI).
• MPI provides us with an abstraction layer that allows us to create and identify processes without worrying about the creation of data structures for sockets or shared memory.
• We will discuss process-based programming in detail starting in week 4.

Page 26: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Threads

• Threads – lightweight processes
  – The dynamic component of processes
  – Often, many threads are part of a process
• Current OSes support multithreading
  – multiple threads (tasks) per process
• Execution of threads is handled more efficiently than that of full-weight processes (although there are other costs).

• At process creation, one thread is created, the "main" thread.

• Other threads are created from the "main" thread

Page 27: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Threads in C (POSIX)

• Use a function to represent a thread.
• This function must have the following format:

void * ThreadName(void * threadArg)
{
    // Do thread work here

    // Cause the thread to exit
    pthread_exit(NULL);
}

Page 28: Introduction to Parallel Programming and Algorithms CS599 David Monismith

POSIX Threads (pthreads)

• pthread_exit(NULL) terminates the calling thread.

• Even though a void pointer is supposed to be returned, there is no explicit return statement within the thread function.

• The thread function actually provides its return value through the pthread_exit function.

• Here, NULL is the return value.
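A minimal sketch (the function and variable names are illustrative) of returning a non-NULL value through pthread_exit and collecting it with pthread_join:

#include <pthread.h>
#include <stdio.h>

void * worker(void * arg)
{
    static int result = 42;   /* static so the value outlives the thread */
    pthread_exit(&result);    /* the pointer passed here becomes the thread's return value */
}

int main(void)
{
    pthread_t tid;
    void * retval;

    pthread_create(&tid, NULL, worker, NULL);
    pthread_join(tid, &retval);   /* retval receives the pointer given to pthread_exit */
    printf("thread returned %d\n", *(int *)retval);
    return 0;
}

Compile with gcc and the -pthread flag, as shown later in these slides.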

Page 29: Introduction to Parallel Programming and Algorithms CS599 David Monismith

pthread_create

• pthread_create takes:
  1) the address of the variable where the identifier of the thread will be stored,
  2) the thread attributes,
  3) the name of the function containing the code for the thread, and
  4) the parameter passed into the thread as its argument.

Page 30: Introduction to Parallel Programming and Algorithms CS599 David Monismith

pthread_create

• The prototype for this function follows:

int pthread_create(pthread_t * thread,
                   const pthread_attr_t * attr,
                   void * (*start_routine)(void *),
                   void * arg);

• The function then creates the thread, stores its identifier in the variable thread, starts the thread using start_routine, and passes in the argument arg.

• See "man pthread_create" on Littlefe.

Page 31: Introduction to Parallel Programming and Algorithms CS599 David Monismith

pthread_exit

• Causes a thread to exit.
• See "man pthread_exit" for more info, especially for information about returning values from a thread.

Page 32: Introduction to Parallel Programming and Algorithms CS599 David Monismith

pthread_t

• pthread_t is the pthreads thread type.
• This type can store a unique identifier for a thread.
• Note that thread ids are only guaranteed to be unique within a single process, as threads are specific to a process.
• This means that you should not attempt to pass a thread identifier to a process different from the one in which the thread was created.

Page 33: Introduction to Parallel Programming and Algorithms CS599 David Monismith

What is OpenMP? (Answers per LLNL)

• Open Multi-Processing
  – Provided via open specifications developed jointly by academia, corporations, and government
• Provides a standardized shared-memory multiprocessing programming environment
• A directive- and function-based API
• Parallelism is easily achieved with but a few simple directives

Page 34: Introduction to Parallel Programming and Algorithms CS599 David Monismith

OpenMP Threads

• OpenMP Threads are implemented using directive based programming

• The number of threads is determined using an environment variable called OMP_NUM_THREADS

• In C and C++, these are created using #pragma omp statements

• For example, a block of code that is preceded by #pragma omp parallel would be threaded
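As a concrete illustration (a generic sketch based on standard OpenMP usage, not code from the course materials), the directive below splits a loop's iterations among threads; the thread count can be set beforehand in the shell with, for example, export OMP_NUM_THREADS=4.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int i;
    double sum = 0.0;

    /* Iterations are divided among the OpenMP threads; the reduction
       clause safely combines each thread's partial sum. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < 1000000; i++)
        sum += 1.0 / (i + 1);

    printf("sum = %f\n", sum);
    return 0;
}

Compile with gcc -fopenmp, as shown on the compilation slide later in this deck.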

Page 35: Introduction to Parallel Programming and Algorithms CS599 David Monismith

In-class Exercise

• Log in to one of the LittleFe clusters and complete the Linux tutorial if you have not completed it in the past.
  – Complete and run "Hello, world!" programs for OpenMP, pthreads, and MPI.
• Install Cygwin and Eclipse for Parallel Application Developers on your laptop.

Page 36: Introduction to Parallel Programming and Algorithms CS599 David Monismith

A Simple (Serial) C Program
• Open an editor by typing:
• nano hello_world.c

#include <stdio.h>  // Similar to import

// argc is the number of command line arguments
// argv is the array of command line arguments (strings)
int main(int argc, char ** argv)
{
    printf("Hello, world\n");
    return 0;
}

Page 37: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Compilation
• Compile by typing the following:

• gcc hello_world.c -o hello_world.exe

• Notice that you get an executable file that is really in machine language (not byte code).

• You should see errors if you have made mistakes. You'll see no output if your program compiles correctly.

• In bash at the $ prompt, type:

• ./hello_world.exe to run

Page 38: Introduction to Parallel Programming and Algorithms CS599 David Monismith

C Program using the MPI Library

#include <stdio.h>   /* printf, scanf, . . . */
#include <stdlib.h>  /* atoi, atof, . . ., malloc, calloc, free, . . . */
#include <mpi.h>     /* MPI function calls */

int main(int argc, char ** argv)
{
    int my_rank;              // This process's id
    int number_of_processes;
    int mpi_error_code;

    mpi_error_code = MPI_Init(&argc, &argv);
    mpi_error_code = MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    mpi_error_code = MPI_Comm_size(MPI_COMM_WORLD, &number_of_processes);

    printf("%d of %d: Hello, world!\n", my_rank, number_of_processes);

    mpi_error_code = MPI_Finalize();
    return 0;
}

Page 39: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Compile and Run a C/MPI Program
• Compile with:

• mpicc hello_world_mpi.c -o hello_world_mpi.exe

• Run with:

• mpirun -n 10 -machinefile machines hello_world_mpi.exe

• Everything between MPI_Init and MPI_Finalize runs in every process, so it executes as many times as there are processes.

• Copies of each variable are made for each process. Each of these copies is distinct.

Page 40: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Identifying a Process

• A process's rank identifies it within MPI.
• The rank is retrieved using the MPI_Comm_rank function and stored in my_rank in the example program.
• The number of processes is retrieved using the MPI_Comm_size function and stored in number_of_processes in the example program.
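A common use of the rank is to divide work among processes. The sketch below is generic (the value of n and the loop body are illustrative, not from the example program) and assumes my_rank and number_of_processes have been set as on the previous slides, e.g., inside main after MPI_Comm_size:

    int n = 1000;                         /* total number of work items (illustrative) */
    int chunk = n / number_of_processes;  /* items per process */
    int start = my_rank * chunk;
    int end = (my_rank == number_of_processes - 1) ? n : start + chunk;
    int i;

    /* Each process handles its own contiguous block of items. */
    for (i = start; i < end; i++) {
        /* do the work for item i in this process */
    }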

Page 41: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Example Program Execution

Process 0: rank = 0, size = 10
Process 1: rank = 1, size = 10
Process 2: rank = 2, size = 10
. . .
Process 9: rank = 9, size = 10

Processes each have their own copies of all variables declared within the code.

They also have their own memory environment, but they all execute the same code.

Page 42: Introduction to Parallel Programming and Algorithms CS599 David Monismith

A Simple Pthreads Program

#include <stdio.h>
#include <pthread.h>

#define NUM_THREADS 4

void * helloThread(void * tid)
{
    int threadId = (int)(long) tid;  // cast via long so a 64-bit pointer is not truncated directly to int
    printf("Hello from thread %d\n", threadId);
    pthread_exit(NULL);
}

Page 43: Introduction to Parallel Programming and Algorithms CS599 David Monismith

A Simple Pthreads Program

int main(int argc, char ** argv)
{
    pthread_t threadList[NUM_THREADS];
    int i;

    for (i = 0; i < NUM_THREADS; i++)
        pthread_create(&threadList[i], NULL, helloThread, (void *)(long) i);

    for (i = 0; i < NUM_THREADS; i++)
        pthread_join(threadList[i], NULL);

    return 0;
}

Page 44: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Compile and Run a PThreads Program

• Compile with:

• gcc pthreadsExample.c -pthread -o pthreadsExample.exe

• Run with:

• ./pthreadsExample.exe

Page 45: Introduction to Parallel Programming and Algorithms CS599 David Monismith

A Simple OpenMP Example

// Based upon LLNL example code from
// https://computing.llnl.gov/tutorials/openMP/
#include <omp.h>
#include <stdio.h>

int main(int argc, char ** argv)
{
    int tid;

    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf("Hello from thread number %d\n", tid);
    }  // Threads join and the program finishes

    return 0;
}

Page 46: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Compile and Run an OpenMP Program

• Compile with:

• gcc ompExample.c -fopenmp -o ompExample.exe

• Run with:

• ./ompExample.exe

Page 47: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Referenced Texts

• Patterson & Hennessey, Computer Organization: The Hardware/Software Interface

• Patterson & Hennessy, Computer Architecture: A Quantitative Approach

• Grama, Gupta, Karypis, and Kumar, Introduction to Parallel Computing, Second Edition

Page 48: Introduction to Parallel Programming and Algorithms CS599 David Monismith

Reading Assignment for This Week

• C Programming Slides – see course website – http://monismith.info/cs599/notes.html

• LLNL pthreads tutorial - https://computing.llnl.gov/tutorials/pthreads/

• LLNL OpenMP tutorial - https://computing.llnl.gov/tutorials/openMP/