Parallel Architecture/Programming
Hung-Wei Tseng


Page 1

Parallel Architecture/Programming
Hung-Wei Tseng

Page 2

Von Neumann architecture

2

[Diagram: CPU connected to memory; the CPU fetches instructions and data from memory]

The CPU is a dominant factor in performance because we rely heavily on it to execute programs.

By pointing the PC (program counter) to different parts of memory, we can perform different functions!

Page 3

3

History of Processor Performance

CPU performance scaled well before 2002, improving at roughly 52% per year.

Page 4


4

The slowdown of CPU scaling

[Chart: SPEC Rate over time, Sep-06 through Sep-17]

5x in 67 months

1.5x in 67 months

Page 5

5

Intel P4 (2000): 1 core

Intel Nehalem (2010): 4 cores

NVIDIA Tegra 3 (2011): 5 cores

SPARC T3 (2010): 16 cores

AMD Zambezi (2011): 16 cores

AMD Athlon 64 X2 (2005): 2 cores

Page 6

Die photo of a CMP processor

6

Page 7

• Each processor has its own local cache

7

Memory hierarchy on CMP

[Diagram: Core 0 through Core 3, each with its own Local $, connected by a Bus to a Shared $]

Page 8

• Assume applications X and Y have a similar instruction mix, say 60% ALU, 20% load/store, and 20% branches. Consider two processors:
  P1: a CMP with a 2-issue pipeline on each core; each core has a private 32KB L1 D-cache
  P2: an SMT processor with a 4-issue pipeline and a 64KB L1 D-cache
  Which one do you think is better?
  A. P1
  B. P2

8

Comparing SMT and CMP

Page 9

Parallelism

9

Page 10

• Instruction-level parallelism
• Data-level parallelism
• Thread-level parallelism

10

Parallelism in modern computers

Page 11

• SISD — single instruction stream, single data
  • Pipelining instructions within a single program
  • Superscalar

• SIMD — single instruction stream, multiple data
  • Vector instructions
  • GPUs

• MIMD — multiple instruction streams (e.g., multiple threads, multiple processes), multiple data
  • Multicore processors
  • Multiple processors
  • Simultaneous multithreading

11

Processing models

Page 12

• SISD — single instruction stream, single data
  • Pipelining instructions within a single program
  • Superscalar

• SIMD — single instruction stream, multiple data
  • Vector instructions
  • GPUs
  (A short scalar-vs-SSE sketch follows this slide.)

• MIMD — multiple instruction streams (e.g., multiple threads, multiple processes), multiple data
  • Multicore processors
  • Multiple processors
  • Simultaneous multithreading

12

Processing models

We will focus on this today
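To make the SISD/SIMD contrast concrete, here is a minimal sketch (not from the slides; the array size and the gcc-style alignment attribute are illustrative assumptions). The scalar loop performs one addition per iteration, while the SSE version adds four packed floats per instruction, just as the vector-add demo later in the deck does.

#include <xmmintrin.h>   /* SSE intrinsics: _mm_load_ps, _mm_add_ps, _mm_store_ps */

#define ARRAY_SIZE 1024
/* 16-byte alignment is required by _mm_load_ps/_mm_store_ps */
float a[ARRAY_SIZE] __attribute__((aligned(16)));
float b[ARRAY_SIZE] __attribute__((aligned(16)));
float c[ARRAY_SIZE] __attribute__((aligned(16)));

/* SISD: one addition per loop iteration */
void vadd_scalar(void) {
    for (int i = 0; i < ARRAY_SIZE; i++)
        c[i] = a[i] + b[i];
}

/* SIMD: one SSE instruction adds four packed floats */
void vadd_sse(void) {
    for (int i = 0; i < ARRAY_SIZE; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);
        __m128 vb = _mm_load_ps(&b[i]);
        _mm_store_ps(&c[i], _mm_add_ps(va, vb));
    }
}

Compiled with something like gcc -O2, the SIMD version expresses data-level parallelism explicitly through the _mm_* intrinsics.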

Page 13

• How can we create programs to utilize these cores?
  A. Parallel programming is easy; programmers will just write parallel programs from now on.
  B. Parallel programming was hard, but architects have generally solved the problem in the 10 years since we first saw it.
  C. You don't need to write parallel code; Intel's new compilers know how to extract thread-level parallelism.
  D. Intel (and everyone else) is just building the chips; it's on you to figure out how to use them.

13

How to utilize these cores?

Page 14

Speeding up an application with multi-threaded programming models

14

Page 15

• To exploit parallelism, you need to break your computation into multiple "processes" or multiple "threads"

• Processes (in OS/software systems)
  • Separate programs actually running (not sitting idle) on your computer at the same time
  • Each process has its own virtual memory space, and you must explicitly exchange data using inter-process communication APIs

• Threads (in OS/software systems)
  • Independent portions of your program that can run in parallel
  • All threads share the same virtual memory space

• We will refer to these collectively as "threads"
  • A typical user system might have 1–8 actively running threads
  • Servers can have more if needed (the sysadmins will hopefully configure them that way)

15

Parallel programming

Page 16

Multi-processed model

16

[Diagram: three processes, each with its own virtual address space (Code, Static Data, Data, Heap, Stack) spanning 0x000000000000 to 0xFFFFFFFFFFFF, mapped onto one physical memory]

• You can use fork() to create a child process
• You need to use files/sockets/MPI to exchange data (a minimal fork-and-pipe sketch follows)
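A minimal sketch of the multi-processed model (hypothetical example, not from the slides): the parent fork()s a child, and because the two processes have separate address spaces, the result comes back through a pipe, one of the IPC mechanisms the slide mentions.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void) {
    int fd[2];                       /* fd[0]: read end, fd[1]: write end */
    if (pipe(fd) != 0) { perror("pipe"); return 1; }

    pid_t pid = fork();              /* child gets its own copy of the address space */
    if (pid == 0) {                  /* child process */
        int result = 6 * 7;          /* some computation */
        write(fd[1], &result, sizeof(result));   /* send it back to the parent */
        exit(0);
    }
    int result;
    read(fd[0], &result, sizeof(result));        /* parent receives the value */
    waitpid(pid, NULL, 0);
    printf("child computed %d\n", result);
    return 0;
}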

Page 17

Multi-processed model

17

[Diagram repeated from the previous slide: three process address spaces mapped onto physical memory]

• You can use fork() to create a child process
• You need to use files/sockets/MPI to exchange data

Page 18

Multi-threaded model

18

[Diagram: one process with a single virtual address space (Code, Static Data, Data, Heap, Stack) shared by three threads, mapped onto physical memory]

Page 19

Multi-threaded model

19

[Diagram repeated from the previous slide: one process, multiple threads sharing the same address space]

Page 20

• All threads within the same process share the same memory space
• You may use pthread_create to spawn a thread

20

POSIX threads

/* Do matrix multiplication */
for(i = 0 ; i < NUM_OF_THREADS ; i++) {
    tids[i] = i;
    pthread_create(&thread[i], NULL, threaded_blockmm, &tids[i]);
}
for(i = 0 ; i < NUM_OF_THREADS ; i++)
    pthread_join(thread[i], NULL);

pthread_create: spawn a thread
pthread_join: synchronize, waiting for a thread to terminate
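For reference, a self-contained sketch of the same create/join pattern (the worker function and the array it fills are illustrative; threaded_blockmm itself is not shown on the slide):

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define NUM_OF_THREADS 4
#define N 1024

int data[N];                          /* shared by every thread */

/* Every pthread entry point takes a void* and returns a void* */
void *worker(void *thread_id) {
    int tid = *(int *)thread_id;
    int chunk = N / NUM_OF_THREADS;
    for (int i = tid * chunk; i < (tid + 1) * chunk; i++)
        data[i] = i;                  /* each thread fills its own chunk */
    return NULL;
}

int main(void) {
    pthread_t thread[NUM_OF_THREADS];
    int tids[NUM_OF_THREADS];
    for (int i = 0; i < NUM_OF_THREADS; i++) {
        tids[i] = i;
        pthread_create(&thread[i], NULL, worker, &tids[i]);
    }
    for (int i = 0; i < NUM_OF_THREADS; i++)
        pthread_join(thread[i], NULL);   /* wait for all workers to finish */
    printf("data[N-1] = %d\n", data[N - 1]);
    return 0;
}

Build with -pthread; every thread entry point must have the signature void *f(void *).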

Page 21

Demo — vector add

21

for(i = 0; i < ARRAY_SIZE; i += 4) {
    va = _mm_load_ps(&a[i]);
    vb = _mm_load_ps(&b[i]);
    vt = _mm_add_ps(va, vb);
    _mm_store_ps(&c[i], vt);
}

NUM_OF_THREADS = ARRAY_SIZE/4;
thread = (pthread_t *)malloc(sizeof(pthread_t)*NUM_OF_THREADS);
tids = (int *)malloc(sizeof(pthread_t)*NUM_OF_THREADS);

for(i = 0 ; i < NUM_OF_THREADS ; i++) {
    tids[i] = i;
    pthread_create(&thread[i], NULL, threaded_vadd, &tids[i]);
}
for(i = 0 ; i < NUM_OF_THREADS ; i++)
    pthread_join(thread[i], NULL);

void *threaded_vadd(void *thread_id) {
    __m128 va, vb, vt;
    int tid = *(int *)thread_id;
    int i = tid * 4;
    va = _mm_load_ps(&a[i]);
    vb = _mm_load_ps(&b[i]);
    vt = _mm_add_ps(va, vb);
    _mm_store_ps(&c[i], vt);
    return NULL;
}

Page 22

Cache and shared memory model

22

Page 23

• Provides a single memory space that all processors can share
• All threads within the same program share the same address space
• Threads communicate with each other through shared variables in memory
• Provides the same memory abstraction as single-threaded programming

23

Supporting POSIX threads

Page 24

• Coherency
  • Guarantees all processors see the same value for a variable/memory address in the system when the processors need the value at the same time
  • Defines what value should be seen

• Consistency
  • All threads see changes to data in the same order
  • Defines when a memory operation should be done

24

Cache on Multiprocessor

Page 25

• Snooping protocol
  • Each processor broadcasts / listens to cache misses

• State associated with each block (cache line)
  • Invalid
    • The data in the current block is invalid
  • Shared
    • The processor can read the data
    • The data may also exist in other processors' caches
  • Exclusive
    • The processor has full permission on the data
    • The processor is the only one that has the up-to-date data

25

Simple cache coherency protocol

Page 26

Accessing the cache

26

[Diagram: each cache entry holds a state, a dirty bit, a tag, and data; the tag field of the address is compared (=?) against the stored tags to determine hit or miss]

A memory address is divided into tag, index, and offset fields.
  Offset: the position of the requested word within a cache block
  Hit: the data was found in the cache
  Miss: the data was not found in the cache

Example memory address: 0x80000158 = 1000 0000 0000 0000 0000 0001 0101 1000
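As a sketch of how the split works for the example address 0x80000158, assuming a geometry the slide does not state (64-byte blocks and 1024 sets, i.e. 6 offset bits and 10 index bits):

#include <stdio.h>
#include <stdint.h>

/* Assumed geometry (not given on the slide): 64-byte blocks, 1024 sets */
#define BLOCK_OFFSET_BITS 6
#define INDEX_BITS        10

int main(void) {
    uint32_t addr = 0x80000158;
    uint32_t offset = addr & ((1u << BLOCK_OFFSET_BITS) - 1);          /* low 6 bits   */
    uint32_t index  = (addr >> BLOCK_OFFSET_BITS) & ((1u << INDEX_BITS) - 1);  /* next 10 bits */
    uint32_t tag    = addr >> (BLOCK_OFFSET_BITS + INDEX_BITS);        /* remaining bits */
    printf("tag=0x%x index=0x%x offset=0x%x\n", tag, index, offset);
    return 0;
}

With these assumptions the address decodes to tag 0x8000, index 0x5, offset 0x18.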

Page 27

Simple cache coherency protocol

[State diagram, reconstructed from the labels on the slide:]
  Invalid → Shared: read miss (processor)
  Invalid → Exclusive: write miss (processor)
  Invalid → Invalid: read/write miss (bus)
  Shared → Exclusive: write request (processor)
  Shared → Invalid: write miss (bus)
  Shared → Shared: read miss/hit
  Exclusive → Shared: read miss (bus), write back data
  Exclusive → Invalid: write miss (bus), write back data
  Exclusive → Exclusive: write hit

27
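One way to read the diagram is as a transition function for a single cache's copy of a block. The sketch below encodes the transitions listed above; the event names and the writeback flag are mine, not from the slides.

#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } state_t;
typedef enum { PR_READ, PR_WRITE, BUS_READ_MISS, BUS_WRITE_MISS } event_t;

/* Next state of this cache's copy of a block; *writeback is set when the
   (possibly dirty) data must be written back to memory. */
state_t next_state(state_t s, event_t e, int *writeback) {
    *writeback = 0;
    switch (s) {
    case INVALID:
        if (e == PR_READ)  return SHARED;          /* read miss (processor)  */
        if (e == PR_WRITE) return EXCLUSIVE;       /* write miss (processor) */
        return INVALID;                            /* bus traffic: nothing to do */
    case SHARED:
        if (e == PR_WRITE)       return EXCLUSIVE; /* write request (processor) */
        if (e == BUS_WRITE_MISS) return INVALID;   /* another core wants to write */
        return SHARED;                             /* read hit/miss keeps it shared */
    case EXCLUSIVE:
        if (e == BUS_READ_MISS)  { *writeback = 1; return SHARED;  }
        if (e == BUS_WRITE_MISS) { *writeback = 1; return INVALID; }
        return EXCLUSIVE;                          /* own read/write hit */
    }
    return s;
}

int main(void) {
    int wb;
    state_t s = next_state(SHARED, BUS_WRITE_MISS, &wb);  /* another core writes */
    printf("SHARED -> %s, writeback=%d\n", s == INVALID ? "INVALID" : "other", wb);
    return 0;
}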

Page 28

• What happens when core 0 modifies 0x1000?

28

Cache coherency practice

[Diagram: Core 0 through Core 3, each with a Local $, connected by a Bus to a Shared $]

Before: Core 0: Shared 0x1000, Core 1: Shared 0x1000, Core 2: Shared 0x1000, Core 3: Shared 0x1000
Core 0 broadcasts "write miss 0x1000" on the bus, invalidating the other copies.
After: Core 0: Excl. 0x1000, Core 1: Invalid 0x1000, Core 2: Invalid 0x1000, Core 3: Invalid 0x1000

Page 29

• Then, what happens when core 2 reads 0x1000?

29

Cache coherency practice

[Diagram: Core 0 through Core 3, each with a Local $, connected by a Bus to a Shared $]

Before: Core 0: Excl. 0x1000, Core 1: Invalid 0x1000, Core 2: Invalid 0x1000, Core 3: Invalid 0x1000
Core 2 broadcasts "read miss 0x1000" on the bus; Core 0 writes back 0x1000, and Core 2 fetches it.
After: Core 0: Shared 0x1000, Core 1: Invalid 0x1000, Core 2: Shared 0x1000, Core 3: Invalid 0x1000

Page 30

• Assuming we run the following code on a CMP with a cache coherency protocol, how many of the following outputs are possible? (a is initialized to 0; assume we output more than 10 numbers)

① 0 1 2 3 4 5 6 7 8 9
② 1 2 5 9 3 6 8 10 12 13
③ 1 1 1 1 1 1 1 1 64 100
④ 1 1 1 1 1 1 1 1 1 100
A. 0
B. 1
C. 2
D. 3
E. 4

30

Cache coherency

thread 1:
    while(1) printf("%d ", a);

thread 2:
    while(1) a++;

Page 31

• Demo!

31

It’s show time!

thread 1:
    while(1) printf("%d ", a);

thread 2:
    while(1) a++;
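The lecture demo itself is not included in the slides, but a compilable reconstruction could look like this (the one-second run time and thread names are my choices):

#include <stdio.h>
#include <pthread.h>
#include <unistd.h>

volatile int a = 0;

void *printer(void *x) {           /* thread 1: print whatever value it sees */
    while (1)
        printf("%d ", a);
    return NULL;
}

void *incrementer(void *x) {       /* thread 2: keep incrementing */
    while (1)
        a++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, printer, NULL);
    pthread_create(&t2, NULL, incrementer, NULL);
    sleep(1);                      /* let the two threads race for a second */
    return 0;                      /* exiting main terminates both threads */
}

Because a++ is not atomic and each core works out of its own cache, the printed sequence can skip or repeat values, which is exactly what the quiz on the previous slide probes.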

Page 32

• Assuming we run the following code on a CMP with a cache coherency protocol, how many of the following outputs are possible? (a is initialized to 0)

① 0 1 2 3 4 5 6 7 8 9
② 1 2 5 9 3 6 8 10 12 13
③ 1 1 1 1 1 1 1 1 64 100
④ 1 1 1 1 1 1 1 1 1 100
A. 0
B. 1
C. 2
D. 3
E. 4

32

Cache coherency

thread 1:
    while(1) printf("%d ", a);

thread 2:
    while(1) a++;

Page 33

Observer

33

thread 1 (main):
    int loop;
    int main() {
        pthread_t thread;
        loop = 1;
        pthread_create(&thread, NULL, modifyloop, NULL);
        while(loop == 1) {
            continue;
        }
        pthread_join(thread, NULL);
        fprintf(stderr, "User input: %d\n", loop);
        return 0;
    }

thread 2:
    void* modifyloop(void *x) {
        sleep(1);
        printf("Please input a number:\n");
        scanf("%d", &loop);
        return NULL;
    }

Page 34

Observer

34

thread 1 (main):
    volatile int loop;
    int main() {
        pthread_t thread;
        loop = 1;
        pthread_create(&thread, NULL, modifyloop, NULL);
        while(loop == 1) {
            continue;
        }
        pthread_join(thread, NULL);
        fprintf(stderr, "User input: %d\n", loop);
        return 0;
    }

thread 2:
    void* modifyloop(void *x) {
        sleep(1);
        printf("Please input a number:\n");
        scanf("%d", &loop);
        return NULL;
    }

The volatile keyword prevents the compiler from caching loop in a register.

Page 35

• Comparing the two implementations of threaded_vadd below (L and R), identify which one will perform better and why.

A. L is better, because the cache miss rate is lower
B. R is better, because the cache miss rate is lower
C. L is better, because the instruction count is lower
D. R is better, because the instruction count is lower
E. Both are about the same

35

Performance comparison

Main thread:
    for(i = 0 ; i < NUM_OF_THREADS ; i++) {
        tids[i] = i;
        pthread_create(&thread[i], NULL, threaded_vadd, &tids[i]);
    }
    for(i = 0 ; i < NUM_OF_THREADS ; i++)
        pthread_join(thread[i], NULL);

Version L:
    void *threaded_vadd(void *thread_id) {
        __m128 va, vb, vt;
        int tid = *(int *)thread_id;
        int i = tid * 4;
        for(i = tid * 4; i < ARRAY_SIZE; i += 4*NUM_OF_THREADS) {
            va = _mm_load_ps(&a[i]);
            vb = _mm_load_ps(&b[i]);
            vt = _mm_add_ps(va, vb);
            _mm_store_ps(&c[i], vt);
        }
        return NULL;
    }

Version R:
    void *threaded_vadd(void *thread_id) {
        __m128 va, vb, vt;
        int tid = *(int *)thread_id;
        int i = tid * 4;
        for(i = tid*(ARRAY_SIZE/NUM_OF_THREADS); i < (tid+1)*(ARRAY_SIZE/NUM_OF_THREADS); i += 4) {
            va = _mm_load_ps(&a[i]);
            vb = _mm_load_ps(&b[i]);
            vt = _mm_add_ps(va, vb);
            _mm_store_ps(&c[i], vt);
        }
        return NULL;
    }

Page 36

L vs. R

36

[Version L and Version R threaded_vadd code repeated from the previous slide]

[Diagram: which elements of array a each thread touches under Version L vs. Version R]

Page 37

• Assume a[0]'s address is 0x1000
• Now core 1 updates a[4]–a[7] (addresses 0x1010–0x101F), which belong to the same block as 0x1000

37

Cache coherency practice

[Diagram: Core 0 through Core 3, each with a Local $, connected by a Bus to a Shared $]

Before: Core 0 and Core 2 hold the block containing 0x1000 as Shared (the other cores are Invalid).
Core 1 broadcasts "write miss 0x1010" on the bus; the other copies of the block are invalidated.
After: Core 1: Excl. 0x1000, all other cores: Invalid 0x1000

• Then, if Core 0 accesses a[0]–a[3], starting at 0x1000, it will be a miss! (Core 0 must broadcast a read miss for 0x1000.)
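To connect this back to Versions L and R, here is a small sketch (ARRAY_SIZE and NUM_OF_THREADS are illustrative values, not from the slides) that prints which elements each thread writes under each version:

#include <stdio.h>

#define ARRAY_SIZE     32
#define NUM_OF_THREADS 4

int main(void) {
    /* Version L: thread tid writes c[tid*4 .. tid*4+3], then jumps ahead by 4*NUM_OF_THREADS */
    for (int tid = 0; tid < NUM_OF_THREADS; tid++) {
        printf("L thread %d:", tid);
        for (int i = tid * 4; i < ARRAY_SIZE; i += 4 * NUM_OF_THREADS)
            printf(" [%d..%d]", i, i + 3);
        printf("\n");
    }
    /* Version R: thread tid owns one contiguous chunk of the array */
    int chunk = ARRAY_SIZE / NUM_OF_THREADS;
    for (int tid = 0; tid < NUM_OF_THREADS; tid++)
        printf("R thread %d: [%d..%d]\n", tid, tid * chunk, (tid + 1) * chunk - 1);
    return 0;
}

With 4-byte floats and 64-byte blocks, c[0..15] share one block, so under L all four threads repeatedly invalidate each other's copy of that block, while under R each thread mostly stays within its own blocks.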

Page 38

• Comparing the two implementations of threaded_vadd below (L and R), identify which one will perform better and why.

A. L is better, because the cache miss rate is lower
B. R is better, because the cache miss rate is lower
C. L is better, because the instruction count is lower
D. R is better, because the instruction count is lower
E. Both are about the same

38

Performance comparison

Main thread:
    for(i = 0 ; i < NUM_OF_THREADS ; i++) {
        tids[i] = i;
        pthread_create(&thread[i], NULL, threaded_vadd, &tids[i]);
    }
    for(i = 0 ; i < NUM_OF_THREADS ; i++)
        pthread_join(thread[i], NULL);

Version L:
    void *threaded_vadd(void *thread_id) {
        __m128 va, vb, vt;
        int tid = *(int *)thread_id;
        int i = tid * 4;
        for(i = tid * 4; i < ARRAY_SIZE; i += 4*NUM_OF_THREADS) {
            va = _mm_load_ps(&a[i]);
            vb = _mm_load_ps(&b[i]);
            vt = _mm_add_ps(va, vb);
            _mm_store_ps(&c[i], vt);
        }
        return NULL;
    }

Version R:
    void *threaded_vadd(void *thread_id) {
        __m128 va, vb, vt;
        int tid = *(int *)thread_id;
        int i = tid * 4;
        for(i = tid*(ARRAY_SIZE/NUM_OF_THREADS); i < (tid+1)*(ARRAY_SIZE/NUM_OF_THREADS); i += 4) {
            va = _mm_load_ps(&a[i]);
            vb = _mm_load_ps(&b[i]);
            vt = _mm_add_ps(va, vb);
            _mm_store_ps(&c[i], vt);
        }
        return NULL;
    }

Page 39

• 3Cs: Compulsory, Conflict, Capacity misses

• Coherency miss: a block invalidated because of sharing among processors

39

4C model

Page 40

• True sharing
  • Processor A modifies X; processor B also wants to access X.

• False sharing
  • Processor A modifies X; processor B wants to access Y. However, Y is invalidated because X and Y are in the same block! (A minimal sketch follows this slide.)

40

Types of coherence misses
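A minimal sketch of false sharing (the struct layout, iteration count, and 64-byte padding are my assumptions, not from the slides): two threads update two different variables that happen to sit in the same cache block.

#include <stdio.h>
#include <pthread.h>

#define ITERS 100000000L

/* Both counters fit in the same 64-byte cache block -> false sharing.
   Uncomment the padding to give each counter its own block. */
struct {
    volatile long x;
    /* char pad[64]; */
    volatile long y;
} counters;

void *inc_x(void *arg) { for (long i = 0; i < ITERS; i++) counters.x++; return NULL; }
void *inc_y(void *arg) { for (long i = 0; i < ITERS; i++) counters.y++; return NULL; }

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, inc_x, NULL);
    pthread_create(&t2, NULL, inc_y, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x=%ld y=%ld\n", counters.x, counters.y);
    return 0;
}

With the padding commented out, every increment on one core invalidates the other core's copy of the shared block; enabling the padding removes those coherence misses and usually speeds the program up noticeably.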

Page 41

• Consider the given program. You can safely assume the caches are coherent. How many of the following outputs will you see?
  ① (0, 0)
  ② (0, 1)
  ③ (1, 0)
  ④ (1, 1)
  A. 0
  B. 1
  C. 2
  D. 3
  E. 4

41

Again — how many values are possible?

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>

volatile int a, b;
volatile int x, y;
volatile int f;

void* modifya(void *z) {
    a = 1;
    x = b;
    return NULL;
}

void* modifyb(void *z) {
    b = 1;
    y = a;
    return NULL;
}

int main() {
    int i;
    pthread_t thread[2];
    pthread_create(&thread[0], NULL, modifya, NULL);
    pthread_create(&thread[1], NULL, modifyb, NULL);
    pthread_join(thread[0], NULL);
    pthread_join(thread[1], NULL);
    fprintf(stderr, "(%d, %d)\n", x, y);
    return 0;
}

Page 42

• The processor/compiler may reorder your memory operations/instructions
  • The coherence protocol only guarantees ordering for updates to the same memory address
  • The processor can serve memory requests that hit in the cache before earlier ones that miss
  • The compiler may keep values in registers and perform the memory operations later

• Each processor core may not run at the same speed (cache misses, branch mis-predictions, I/O, voltage scaling, etc.)

• Threads may not be executed/scheduled right after they are spawned

42

Why (0,0)?

Page 43

• Consider the given program. You can safely assume the caches are coherent. How many of the following outputs will you see?
  ① (0, 0)
  ② (0, 1)
  ③ (1, 0)
  ④ (1, 1)
  A. 0
  B. 1
  C. 2
  D. 3
  E. 4

43

Again — how many values are possible?

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>

volatile int a, b;
volatile int x, y;
volatile int f;

void* modifya(void *z) {
    a = 1;
    x = b;
    return NULL;
}

void* modifyb(void *z) {
    b = 1;
    y = a;
    return NULL;
}

int main() {
    int i;
    pthread_t thread[2];
    pthread_create(&thread[0], NULL, modifya, NULL);
    pthread_create(&thread[1], NULL, modifyb, NULL);
    pthread_join(thread[0], NULL);
    pthread_join(thread[1], NULL);
    fprintf(stderr, "(%d, %d)\n", x, y);
    return 0;
}

Page 44

• x86 provides an “mfence” instruction to prevent reordering across the fence instruction

44

fence instructions

thread 1:
    a = 1;
    mfence      (a=1 must occur/update before the mfence)
    x = b;

thread 2:
    b = 1;
    mfence      (b=1 must occur/update before the mfence)
    y = a;

You won't see (0,0), at least…

• x86 only supports this kind of "relaxed consistency" model; you still have to be careful to make sure that your code behaves as you expect.
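As a sketch, the fences could be added to the modifya/modifyb functions from the earlier example using the SSE2 intrinsic _mm_mfence() (gcc's __sync_synchronize() would also work); these are drop-in replacements for the thread bodies in that program.

#include <emmintrin.h>   /* _mm_mfence() */

volatile int a, b, x, y;

void *modifya(void *z) {
    a = 1;
    _mm_mfence();   /* the store to a becomes globally visible before the load of b */
    x = b;
    return NULL;
}

void *modifyb(void *z) {
    b = 1;
    _mm_mfence();   /* the store to b becomes globally visible before the load of a */
    y = a;
    return NULL;
}

With both fences in place, at least one thread must observe the other's store, so the (0, 0) outcome is ruled out.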

Page 45

• Processor behavior is non-deterministic
  • You cannot predict which processor is going faster
  • You cannot predict when the OS is going to schedule your thread

• Cache coherency only guarantees that everyone eventually sees a coherent view of the data, but not when

• Cache consistency is hard to support

45

Why is parallel programming hard?