Advanced Computing Techniques & Applications


Advanced Computing Techniques &

Applications

Dr. Bo Yuan

E-mail: yuanb@sz.tsinghua.edu.cn

4

Course Profile

• Lecturer: Dr. Bo Yuan

• Contact
– Phone: 2603 6067
– E-mail: yuanb@sz.tsinghua.edu.cn
– Room: F-301B

• Time: 10:25 am – 12:00pm, Friday

• Venue: CI-107 & B-204 (Lab)

• Teaching Assistant
– Mr. Pengtao Huang
– hpt13@mails.tsinghua.edu.cn

5

We will study ...

• MPI
– Message Passing Interface
– API for distributed memory parallel computing (multiple processes)
– The dominant model used in cluster computing
• OpenMP
– Open Multi-Processing
– API for shared memory parallel computing (multiple threads)
• GPU Computing with CUDA
– Graphics Processing Unit
– Compute Unified Device Architecture
– API for shared memory parallel computing in C (multiple threads)
• Parallel Matlab
– A popular high-level technical computing language and interactive environment

6

Aims & Objectives

• Learning Objectives
– Understand the main issues and core techniques in parallel computing.
– Be able to develop MPI-based parallel programs.
– Be able to develop OpenMP-based parallel programs.
– Be able to develop GPU-based parallel programs.
– Be able to develop Matlab-based parallel programs.

• Graduate Attributes
– In-depth Knowledge of the Field of Study

– Effective Communication

– Independence and Teamwork

– Critical Judgment

7

Learning Activities

• Lecture (9)
– Introduction (3)
– MPI and OpenMP (3)
– GPU Computing (3)
• Practice (4)
– MPI (1)
– OpenMP (1)
– GPU Programming (1)
– Parallel Matlab (1)
• Others (3)
– Industry Tour (1)
– Presentation (1)
– Final Exam (1)

8

Learning Resources

9

Learning Resources

• Books
– http://www.mcs.anl.gov/~itf/dbpp/
– https://computing.llnl.gov/tutorials/parallel_comp/
– http://www-users.cs.umn.edu/~karypis/parbook/
• Journals
– http://www.computer.org/tpds
– http://www.journals.elsevier.com/parallel-computing/
– http://www.journals.elsevier.com/journal-of-parallel-and-distributed-computing/
• Amazon Cloud Computing Services
– http://aws.amazon.com
• CUDA
– http://developer.nvidia.com

10

Learning Resources

https://www.coursera.org/course/hetero

13

Rules & Policies

• Plagiarism
– Plagiarism is the act of misrepresenting as one's own original work the ideas, interpretations, words or creative works of another.
– Direct copying of paragraphs, sentences, a single sentence or significant parts of a sentence.
– Presenting work done in collaboration with others as independent work.
– Copying ideas, concepts, research results, computer code, statistical tables, designs, images, sounds or text, or any combination of these.
– Paraphrasing, summarizing or simply rearranging another person's words or ideas without changing the basic structure and/or meaning of the text.
– Copying or adapting another student's original work into a submitted assessment item.

15

Half Adder

A: Augend B: Addend

S: Sum C: Carry

16

Full Adder
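As a reference for these two circuits, a minimal C sketch using bitwise operators (the function names and test values are illustrative, not from the slides):

#include <stdio.h>

/* Half adder: S = A XOR B, C = A AND B (1-bit inputs) */
void half_adder(int a, int b, int *s, int *c) {
    *s = a ^ b;
    *c = a & b;
}

/* Full adder: adds a, b and an incoming carry cin */
void full_adder(int a, int b, int cin, int *s, int *cout) {
    *s = a ^ b ^ cin;
    *cout = (a & b) | (cin & (a ^ b));
}

int main(void) {
    int s, c;
    full_adder(1, 1, 1, &s, &c);      /* 1 + 1 + 1 = 3 -> sum 1, carry 1 */
    printf("sum=%d carry=%d\n", s, c);
    return 0;
}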

17

SR Latch

S  R  Q
0  0  Q (hold)
0  1  0
1  0  1
1  1  N/A

18

Address Decoder

19

Address Decoder

20

Electronic Numerical Integrator And Computer

• Speed (10-digit decimal numbers)
– Machine Cycle: 5,000 cycles per second
– Multiplication: 357 times per second
– Division/Square Root: 35 times per second
• Programming
– Programmable via switches and cables
– Reprogramming usually took days.
– I/O: Punched Cards

21

Stored-Program Computer

22

Personal Computer in 1980s

BASIC IBM PC/AT

24

25

Top 500 Supercomputers (performance in GFLOPS)

26

Cost of Computing

Date            Approximate cost per GFLOPS   Inflation-adjusted to 2013 dollars
1984            $15,000,000                   $33,000,000
1997            $30,000                       $42,000
April 2000      $1,000                        $1,300
May 2000        $640                          $836
August 2003     $82                           $100
August 2007     $48                           $52
March 2011      $1.80                         $1.80
August 2012     $0.75                         $0.73
December 2013   $0.12                         $0.12

27

Complexity of Computing

• A: 10×100   B: 100×5   C: 5×50
• (AB)C vs. A(BC): the two orders give the same result but need very different numbers of scalar multiplications (see the worked count below).
• A: N×N   B: N×N   C = AB
• Time Complexity: O(N^3)
• Space Complexity: O(1) extra space
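Worked count, using the rule that multiplying an m×n matrix by an n×p matrix takes m·n·p scalar multiplications:
(AB)C: 10·100·5 + 10·5·50 = 5,000 + 2,500 = 7,500 multiplications
A(BC): 100·5·50 + 10·100·50 = 25,000 + 50,000 = 75,000 multiplications
Choosing the association order therefore changes the cost by a factor of ten.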

28

Why Parallel Computing?

• Why we need ever-increasing performance:
– Big Data Analysis
– Climate Modeling
– Gaming
• Why we need to build parallel systems:
– Increasing the clock speed of integrated circuits leads to overheating.
– Increasing the number of transistors instead leads to multi-core processors.
• Why we need to learn parallel programming:
– Running multiple instances of the same program is unlikely to help.
– Serial programs need to be rewritten to make them parallel.

29

Parallel Sum

Data: 1,4,3   9,2,8   5,1,1   6,2,7   2,5,0   4,1,8   6,5,1   2,3,9
Cores 0-7 compute the partial sums 8, 19, 7, 15, 7, 13, 12, 14.
Core 0 collects all partial sums and adds them to obtain the total 95.

30

Parallel Sum (tree reduction)

Data: 1,4,3   9,2,8   5,1,1   6,2,7   2,5,0   4,1,8   6,5,1   2,3,9
Cores 0-7 compute the partial sums 8, 19, 7, 15, 7, 13, 12, 14.
Cores 0, 2, 4, 6 add their neighbours' results: 27, 22, 20, 26.
Cores 0 and 4 combine again: 49 and 46.
Core 0 adds the last pair to obtain the total 95.
A code sketch of this reduction pattern follows.
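As an illustration only (OpenMP is covered later in the course), a minimal C/OpenMP sketch of a parallel sum; the runtime combines the threads' private partial sums much like the tree shown above:

#include <stdio.h>
#include <omp.h>

#define N 24

int main(void) {
    int data[N], i, sum = 0;

    for (i = 0; i < N; i++)
        data[i] = i % 10;                 /* sample data */

    /* each thread accumulates a private partial sum;
       the partial sums are then reduced into sum */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++)
        sum += data[i];

    printf("sum = %d\n", sum);
    return 0;
}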

31

Prefix Scan

Original Vector:        3   5   2   5   7   9   4   6
Inclusive Prefix Scan:  3   8  10  15  22  31  35  41
Exclusive Prefix Scan:  0   3   8  10  15  22  31  35

prefixScan[0] = A[0];
for (i = 1; i < N; i++)
    prefixScan[i] = prefixScan[i-1] + A[i];
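For comparison, a serial exclusive scan (an illustrative sketch, not from the slides; the array name exclusiveScan is hypothetical) shifts everything by one position and starts from zero:

exclusiveScan[0] = 0;
for (i = 1; i < N; i++)
    exclusiveScan[i] = exclusiveScan[i-1] + A[i-1];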

32

Parallel Prefix Scan

Input (16 elements, 4 cores):   3 5 2 5 | 7 9 -4 6 | 7 -3 1 7 | 6 8 -1 2
Step 1: each core scans its own block:   3 8 10 15 | 7 16 12 18 | 7 4 5 12 | 6 14 13 15
Step 2: an exclusive scan of the block totals 15, 18, 12, 15 gives the offsets 0, 15, 33, 45.
Step 3: each core adds its offset to its block:   3 8 10 15 | 22 31 27 33 | 40 37 38 45 | 51 59 58 60
A code sketch of this scheme follows.
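A minimal C sketch of the three-step scheme above; serial loops over the blocks stand in for the per-core work, and the array and variable names are illustrative:

#include <stdio.h>

#define N 16
#define CORES 4
#define BLOCK (N / CORES)

int main(void) {
    int A[N] = {3,5,2,5, 7,9,-4,6, 7,-3,1,7, 6,8,-1,2};
    int scan[N], blockSum[CORES], offset[CORES];
    int c, i;

    /* Step 1: inclusive scan within each block (one block per core) */
    for (c = 0; c < CORES; c++) {
        scan[c*BLOCK] = A[c*BLOCK];
        for (i = 1; i < BLOCK; i++)
            scan[c*BLOCK + i] = scan[c*BLOCK + i - 1] + A[c*BLOCK + i];
        blockSum[c] = scan[c*BLOCK + BLOCK - 1];
    }

    /* Step 2: exclusive scan of the block totals */
    offset[0] = 0;
    for (c = 1; c < CORES; c++)
        offset[c] = offset[c-1] + blockSum[c-1];

    /* Step 3: each core adds its offset to every element of its block */
    for (c = 0; c < CORES; c++)
        for (i = 0; i < BLOCK; i++)
            scan[c*BLOCK + i] += offset[c];

    for (i = 0; i < N; i++)
        printf("%d ", scan[i]);
    printf("\n");
    return 0;
}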

33

Levels of Parallelism

• Embarrassingly Parallel
– No dependency or communication between parallel tasks
• Coarse-Grained Parallelism
– Infrequent communication, large amounts of computation
• Fine-Grained Parallelism
– Frequent communication, small amounts of computation
– Greater potential for parallelism
– More overhead
• Not Parallel
– Having a baby takes 9 months.
– Can this be done in 1 month by having 9 women?

34

Data Decomposition

2 Cores

35

Granularity

8 Cores

36

Coordination

• Communication
– Sending partial results to other cores
• Load Balancing
– Wooden Barrel Principle
• Synchronization
– Race Condition

Thread A                          Thread B
1A: Read variable V               1B: Read variable V
2A: Add 1 to variable V           2B: Add 1 to variable V
3A: Write back to variable V      3B: Write back to variable V
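A minimal C/OpenMP sketch of this race and one way to remove it (illustrative only; the number of lost updates varies from run to run):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int v = 0, i;

    /* Race: threads read, increment and write back v concurrently,
       so some increments can be lost. */
    #pragma omp parallel for
    for (i = 0; i < 100000; i++)
        v++;                          /* unsynchronized update */
    printf("racy count:   %d\n", v);

    v = 0;
    /* Fix: make each read-modify-write atomic. */
    #pragma omp parallel for
    for (i = 0; i < 100000; i++) {
        #pragma omp atomic
        v++;
    }
    printf("atomic count: %d\n", v);
    return 0;
}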

37

Data Dependency

• Bernstein's Conditions

• Examples

function Dep(a, b)
    c = a·b
    d = 3·c
end function

function NoDep(a, b)
    c = a·b
    d = 3·b
    e = a+b
end function

Two segments P_i and P_j can be executed in parallel when all three conditions hold:
O_i ∩ I_j = ∅   (no flow dependency)
I_i ∩ O_j = ∅   (no anti-dependency)
O_i ∩ O_j = ∅   (no output dependency)
where I and O denote the input and output variable sets of each segment. In Dep(a, b), d = 3·c reads the c written on the previous line, which is a flow dependency; two statements writing the same variable would be an output dependency.

38

What is not parallel?

Recurrences

for (i=1; i<N; i++)
    a[i] = a[i-1] + b[i];

Loop-Carried Dependence

for (k=5; k<N; k++) {
    b[k] = DoSomething(k);
    a[k] = b[k-5] + MoreStuff(k);
}

Atypical Loop-Carried Dependence

wrap = a[0]*b[0];
for (i=1; i<N; i++) {
    c[i] = wrap;
    wrap = a[i]*b[i];
    d[i] = 2*wrap;
}

Solution

for (i=1; i<N; i++) {
    wrap = a[i-1]*b[i-1];
    c[i] = wrap;
    wrap = a[i]*b[i];
    d[i] = 2*wrap;
}

39

What is not parallel?

Induction Variables

i1 = 4;
i2 = 0;
for (k=1; k<N; k++) {
    B[i1++] = function1(k,q,r);
    i2 += k;
    A[i2] = function2(k,r,q);
}

Solution

i1 = 4;
i2 = 0;
for (k=1; k<N; k++) {
    B[k+3] = function1(k,q,r);
    i2 = (k*k+k)/2;
    A[i2] = function2(k,r,q);
}

40

Types of Parallelism

• Instruction Level Parallelism

• Task Parallelism
– Different tasks on the same/different sets of data
• Data Parallelism
– Similar tasks on different sets of the data
• Example
– 5 TAs, 100 exam papers, 5 questions
– How to make it task parallel?
– How to make it data parallel?

41

Assembly Line

Three stages with times 15, 20 and 5.

• How long does it take to produce a single car?

• How many cars can be operated at the same time?

• How long is the gap between producing the first and the second car?

• The longest stage on the assembly line determines the throughput.
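Assuming the stage times shown above (15, 20 and 5 units) and one car per stage: a single car takes 15 + 20 + 5 = 40 units from start to finish; up to three cars can be in the line at once; and the gap between the first and the second finished car is 20 units, the longest stage, which is what limits the throughput.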

42

Instruction Pipeline

IF: Instruction fetch

ID: Instruction decode and register fetch

EX: Execute

MEM: Memory access

WB: Register write back

1: Add 1 to R5.

2: Copy R5 to R6.

Instruction 2 reads R5, which instruction 1 writes, so in the five-stage pipeline above this read-after-write dependency must be handled by forwarding or by stalling.

43

Superscalar

44

Computing Models

• Concurrent Computing
– Multiple tasks can be in progress at any instant.
• Parallel Computing
– Multiple tasks can be run simultaneously.
• Distributed Computing
– Multiple programs on networked computers work collaboratively.
• Cluster Computing
– Homogeneous, Dedicated, Centralized
• Grid Computing
– Heterogeneous, Loosely Coupled, Autonomous, Geographically Distributed

45

Concurrent vs. Parallel

• One core time-slicing between Job 1 and Job 2: concurrent, but not parallel.
• Two cores running Job 1 and Job 2 at the same time: parallel.
• Two cores, each time-slicing between two of Jobs 1-4: concurrent and parallel.

46

Process & Thread

• Process
– An instance of a computer program being executed
• Threads
– The smallest units of processing scheduled by the OS
– Exist as a subset of a process.
– Share the resources of the process.
– Switching between threads is much faster than switching between processes.
• Multithreading
– Better use of computing resources
– Concurrent execution
– Makes the application more responsive

A process contains one or more threads that share its address space (see the sketch below).
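As a concrete illustration, a minimal C sketch that starts two threads inside one process; POSIX threads are used here only as an example, the course itself works with OpenMP and other APIs:

#include <stdio.h>
#include <pthread.h>

/* Both threads run this function and share the process's memory. */
void *worker(void *arg) {
    int id = *(int *)arg;
    printf("thread %d running inside the same process\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int id[2] = {0, 1};

    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &id[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);      /* wait for both threads to finish */
    return 0;
}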

47

Parallel Processes

One program is launched as Process 1, Process 2 and Process 3 on Node 1, Node 2 and Node 3: Single Program, Multiple Data (SPMD).

48

Parallel Threads

49

Graphics Processing Unit

50

CPU vs. GPU

51

CUDA

52

CUDA

53

GPU Computing Showcase

54

MapReduce vs. GPU

• Pros (of MapReduce):
– Runs on clusters of hundreds or thousands of commodity computers.
– Can handle excessive amounts of data with fault tolerance.
– Minimum effort required from programmers: Map & Reduce
• Cons:
– Intermediate results are stored on disk and transferred via network links.
– Suitable only for processing independent or loosely coupled jobs.
– High upfront hardware cost and operational cost
– Low efficiency: GFLOPS per Watt, GFLOPS per Dollar

55

Parallel Computing in Matlab

for i=1:1024
    A(i) = sin(i*2*pi/1024);
end
plot(A);

matlabpool open local 3

parfor i=1:1024
    A(i) = sin(i*2*pi/1024);
end
plot(A);

matlabpool close

56

GPU Computing in Matlab

http://www.mathworks.cn/discovery/matlab-gpu.html

57

Cloud Computing

59

Five Attributes of Cloud Computing

• Service Based
– What the service needs to do is more important than how the technologies are used to implement the solution.
• Scalable and Elastic
– The service can scale capacity up or down as the consumer demands, at the speed of full automation.
• Shared
– Services share a pool of resources to build economies of scale.
• Metered by Use
– Services are tracked with usage metrics to enable multiple payment models.
• Uses Internet Technologies
– The service is delivered using Internet identifiers, formats and protocols.

60

Flynn’s Taxonomy

• Single Instruction, Single Data (SISD)
– von Neumann System
• Single Instruction, Multiple Data (SIMD)
– Vector Processors, GPU
• Multiple Instruction, Single Data (MISD)
– Generally used for fault tolerance
• Multiple Instruction, Multiple Data (MIMD)
– Distributed Systems
– Single Program, Multiple Data (SPMD)
– Multiple Program, Multiple Data (MPMD)

61

Flynn’s Taxonomy

62

Von Neumann Architecture

Harvard Architecture

63

Inside a PC ...

Front-Side Bus (Core 2 Extreme)
8 B × 400 MHz × 4 transfers/cycle = 12.8 GB/s

Memory (DDR3-1600)
8 B × 200 MHz × 4 × 2 transfers/cycle = 12.8 GB/s

PCI Express 3.0 (×16)
1 GB/s × 16 lanes = 16 GB/s

64

Shared Memory System

Several CPUs are connected through an interconnect to a single shared memory.

65

Non-Uniform Memory Access

Each node has its own cores and local memory, and the nodes are joined by an interconnect; a core accesses its own node's memory locally and the other node's memory remotely (more slowly).

66

Distributed Memory System

Each CPU has its own private memory; the CPU-memory pairs are connected by communication networks.

67

Crossbar Switch

Processors P1-P4 and memory modules M1-M4 are connected through a crossbar switch, so any processor can reach any memory module.

68

Cache

• A component that transparently stores data so that future requests for that data can be served faster
– Compared to main memory: smaller, faster, more expensive
– Spatial Locality
– Temporal Locality
• Cache Line
– A block of data that is accessed together
• Cache Miss
– A failed attempt to read or write a piece of data in the cache
– Main memory access required
– Read Miss, Write Miss
– Compulsory Miss, Capacity Miss, Conflict Miss

69

Writing Policies

70

Cache Mapping

Memory blocks (index 0, 1, 2, 3, 4, 5, ...) map into a four-line cache in two ways:
– Direct Mapped: each memory block maps to exactly one cache line (typically block index mod 4).
– 2-Way Associative: each memory block may be placed in either line of one of two sets.

71

Cache Miss

Matrix A stored in memory (4×4 elements):
(0,0) (0,1) (0,2) (0,3)
(1,0) (1,1) (1,2) (1,3)
(2,0) (2,1) (2,2) (2,3)
(3,0) (3,1) (3,2) (3,3)

Row-major vs. column-major access:

#define MAX 4
double A[MAX][MAX], x[MAX], y[MAX];

/* Initialize A and x, assign y=0 */
for (i=0; i<MAX; i++)
    for (j=0; j<MAX; j++)
        y[i] += A[i][j]*x[j];

/* Assign y=0 */
for (j=0; j<MAX; j++)
    for (i=0; i<MAX; i++)
        y[i] += A[i][j]*x[j];

How many cache misses does each loop order cause?

72

Cache Coherence

Core 0 (with Cache 0) and Core 1 (with Cache 1) are connected by an interconnect and both cache the shared variable x = 2.

Time   Core 0                          Core 1
0      y0 = x;                         y1 = 3*x;
1      x = 7;                          statements not involving x
2      statements not involving x      z1 = 4*x;

What is the value of z1?
With the write-through policy ...
With the write-back policy ...

73

Cache Coherence

– Core 0 and Core 1 both cache variable A. When Core 0 updates A (A = 5), the coherence protocol invalidates Core 1's copy, and Core 1 must reload A before using it to compute B.
– Core 0 and Core 1 both cache a line that holds A and B. Core 0 updates A (A = 5) and Core 1 updates B (B = B + 1). Although they touch different variables, each update invalidates the other core's copy of the whole line and forces a reload.

Sharing a cache line between unrelated variables such as A and B is called false sharing.

74

False Sharing

int i, j, m, n;
double y[m];

/* Assign y=0 */
for (i=0; i<m; i++)
    for (j=0; j<n; j++)
        y[i] += f(i, j);

/* Private variables */
int i, j, iter_count;

/* Shared variables */
int m, n, core_count;
double y[m];

iter_count = m/core_count;

/* Core 0 does this */
for (i=0; i<iter_count; i++)
    for (j=0; j<n; j++)
        y[i] += f(i, j);

/* Core 1 does this */
for (i=iter_count; i<2*iter_count; i++)
    for (j=0; j<n; j++)
        y[i] += f(i, j);

m = 8, two cores, cache line: 64 bytes (a remedy is sketched below)
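With m = 8, all eight doubles of y can occupy a single 64-byte cache line, so the two cores keep invalidating each other's copy. A minimal sketch of one common remedy, assuming the same variables as above: let each core accumulate into a private temporary and write each y[i] only once.

/* Core 0 (Core 1 is analogous over its own index range) */
for (i = 0; i < iter_count; i++) {
    double tmp = 0.0;            /* private accumulator: the shared line is not written in the inner loop */
    for (j = 0; j < n; j++)
        tmp += f(i, j);
    y[i] = tmp;                  /* a single write per element of y */
}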

75

Virtual Memory

• Virtualization of various forms of computer data storage into a unified address space
– Logically increases the capacity of main memory (e.g., DOS can only access 1 MB of RAM).
• Page
– A block of contiguous virtual memory addresses
– The smallest unit to be swapped in/out of main memory from/into secondary storage
• Page Table
– Used to store the mapping between virtual addresses and physical addresses
• Page Fault
– The accessed page is not in the physical memory.

76

Interleaving Statements

Threads T0 and T1 each execute s1 followed by s2. The scheduler may interleave the four statements in any order that preserves each thread's own program order; the possible schedules form a tree of interleavings.

The number of ways to interleave M statements of one thread with N statements of another is
C(M+N, N) = (M+N)! / (M! · N!)
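For the example above (M = N = 2) that is 4!/(2!·2!) = 6 interleavings, and the count grows combinatorially with longer threads, which is why exhaustively checking every possible schedule quickly becomes infeasible.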

77

Critical Region

• A portion of code where shared resources are accessed and updated

• Resources: data structure (variables), device (printer)

• Threads are disallowed from entering the critical region when another thread is occupying the critical region.

• A means of mutual exclusion is required.

• If a thread is not executing within the critical region, that thread must not prevent another thread seeking entry from entering the region.

• We consider two threads and one core in the following examples.

78

First Attempt

int threadNumber = 0;

void ThreadZero()
{
    while (TRUE) do {
        while (threadNumber == 1) do {}
        CriticalRegionZero;
        threadNumber = 1;
        OtherStuffZero;
    }
}

void ThreadOne()
{
    while (TRUE) do {
        while (threadNumber == 0) do {}
        CriticalRegionOne;
        threadNumber = 0;
        OtherStuffOne;
    }
}

• Q1: Can T1 enter the critical region more times than T0?

• Q2: What would happen if T0 terminates (by design or by accident)?

79

Second Attempt

int Thread0inside = 0;
int Thread1inside = 0;

void ThreadZero()
{
    while (TRUE) do {
        while (Thread1inside) do {}
        Thread0inside = 1;
        CriticalRegionZero;
        Thread0inside = 0;
        OtherStuffZero;
    }
}

void ThreadOne()
{
    while (TRUE) do {
        while (Thread0inside) do {}
        Thread1inside = 1;
        CriticalRegionOne;
        Thread1inside = 0;
        OtherStuffOne;
    }
}

• Q1: Can T1 enter the critical region multiple times when T0 is not within the critical region?

• Q2: Can T0 and T1 be allowed to enter the critical region at the same time?

80

Third Attempt

int Thread0WantsToEnter = 0;
int Thread1WantsToEnter = 0;

void ThreadZero()
{
    while (TRUE) do {
        Thread0WantsToEnter = 1;
        while (Thread1WantsToEnter) do {}
        CriticalRegionZero;
        Thread0WantsToEnter = 0;
        OtherStuffZero;
    }
}

void ThreadOne()
{
    while (TRUE) do {
        Thread1WantsToEnter = 1;
        while (Thread0WantsToEnter) do {}
        CriticalRegionOne;
        Thread1WantsToEnter = 0;
        OtherStuffOne;
    }
}

81

Fourth Attempt

int Thread0WantsToEnter = 0;
int Thread1WantsToEnter = 0;

void ThreadZero()
{
    while (TRUE) do {
        Thread0WantsToEnter = 1;
        while (Thread1WantsToEnter) do {
            Thread0WantsToEnter = 0;
            delay(someRandomCycles);
            Thread0WantsToEnter = 1;
        }
        CriticalRegionZero;
        Thread0WantsToEnter = 0;
        OtherStuffZero;
    }
}

void ThreadOne()
{
    while (TRUE) do {
        Thread1WantsToEnter = 1;
        while (Thread0WantsToEnter) do {
            Thread1WantsToEnter = 0;
            delay(someRandomCycles);
            Thread1WantsToEnter = 1;
        }
        CriticalRegionOne;
        Thread1WantsToEnter = 0;
        OtherStuffOne;
    }
}

82

Dekker's Algorithm

int Thread0WantsToEnter = 0, Thread1WantsToEnter = 0, favored = 0;

void ThreadZero()
{
    while (TRUE) do {
        Thread0WantsToEnter = 1;
        while (Thread1WantsToEnter) do {
            if (favored == 1) {
                Thread0WantsToEnter = 0;
                while (favored == 1) do {}
                Thread0WantsToEnter = 1;
            }
        }
        CriticalRegionZero;
        favored = 1;
        Thread0WantsToEnter = 0;
        OtherStuffZero;
    }
}

void ThreadOne()
{
    while (TRUE) do {
        Thread1WantsToEnter = 1;
        while (Thread0WantsToEnter) do {
            if (favored == 0) {
                Thread1WantsToEnter = 0;
                while (favored == 0) do {}
                Thread1WantsToEnter = 1;
            }
        }
        CriticalRegionOne;
        favored = 0;
        Thread1WantsToEnter = 0;
        OtherStuffOne;
    }
}

83

Parallel Program Design

• Foster's Methodology
• Partitioning
– Divide the computation to be performed and the data operated on by the computation into small tasks.
• Communication
– Determine what communication needs to be carried out among the tasks.
• Agglomeration
– Combine tasks that communicate intensively with each other or must be executed sequentially into larger tasks.
• Mapping
– Assign the composite tasks to processes/threads to minimize inter-processor communication and maximize processor utilization.

84

Parallel Histogram

Bin boundaries: 0 1 2 3 4 5
Each element data[i-1], data[i], data[i+1], ... goes through Find_bin() to determine its bin b, and the corresponding counter (bin_counts[b-1], bin_counts[b], ...) is incremented.

85

Parallel Histogram (with local bins)

Each thread runs Find_bin() on its own elements (data[i-1], data[i], data[i+1], data[i+2], ...) and increments its private counters loc_bin_cts[b-1], loc_bin_cts[b]; the local counts are then added into the shared bin_counts. A code sketch follows.
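A minimal C/OpenMP sketch of this two-step histogram; the bin count, sample data and unit-width Find_bin() rule are illustrative assumptions:

#include <stdio.h>
#include <omp.h>

#define N    1000
#define BINS 5

int main(void) {
    float data[N];
    int bin_counts[BINS] = {0};
    int i;

    for (i = 0; i < N; i++)
        data[i] = (i % 50) / 10.0f;            /* sample data in [0, 5) */

    #pragma omp parallel
    {
        int loc_bin_cts[BINS] = {0};           /* private per-thread counters */
        int b;

        #pragma omp for
        for (i = 0; i < N; i++) {
            b = (int)data[i];                  /* Find_bin(): unit-width bins */
            loc_bin_cts[b]++;
        }

        /* merge local counts into the shared histogram, one thread at a time */
        #pragma omp critical
        for (b = 0; b < BINS; b++)
            bin_counts[b] += loc_bin_cts[b];
    }

    for (i = 0; i < BINS; i++)
        printf("bin %d: %d\n", i, bin_counts[i]);
    return 0;
}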

86

Performance

• Speedup
– S = T_serial / T_parallel
• Efficiency
– E = S / N = T_serial / (N × T_parallel)
• Scalability
– Problem Size, Number of Processors (N)
• Strongly Scalable
– Same efficiency for larger N with a fixed total problem size
• Weakly Scalable
– Same efficiency for larger N with a fixed problem size per processor

87

Amdahl's Law

S(N) = 1 / ((1 - P) + P/N)

where P is the fraction of the program that can be parallelized and N is the number of processors.
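For example, if P = 0.9 then with N = 8 the speedup is 1/(0.1 + 0.9/8) ≈ 4.7, and even with N → ∞ it is bounded by 1/(1 - P) = 10.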

88

Gustafson's Law

T_parallel = a + b      (a: sequential part, b: part that runs in parallel on N processors)
T_serial   = a + N·b

S(N) = (a + N·b) / (a + b) = N - (N - 1)·a/(a + b) ≈ N   for a ≪ b

• Linear speedup can be achieved when:
– The problem size is allowed to grow monotonically with N.
– The sequential part is fixed or grows slowly.

• Is it possible to achieve super linear speedup?
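For example, normalizing a + b = 1 with a = 0.05 and using N = 100 processors gives a scaled speedup of S(100) = (0.05 + 100 × 0.95) / 1 = 95.05, which is nearly linear.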

89

Review

• Why is parallel computing important?

• What is data dependency?

• What are the benefits and issues of fine-grained parallelism?

• What are the three types of parallelism?

• What is the difference between concurrent and parallel computing?

• What are the essential features of cloud computing?

• What is Flynn’s Taxonomy?

90

Review

• Name the four categories of memory systems.

• What are the two common cache writing policies?

• Name the two types of cache mapping strategies.

• What is a cache miss and how to avoid it?

• What may cause the false sharing issue?

• What is a critical region?

• How to verify the correctness of a concurrent program?

91

Review

• Name three major APIs for parallel computing.

• What are the benefits of GPU computing compared to MapReduce?

• What is the basic procedure of parallel program design?

• What are the key performance factors in parallel programming?

• What is a strongly/weakly scalable parallel program?

• What is the implication of Amdahl's Law?

• What does Gustafson's Law tell us?