
Parallel Programming - Architectures

Prof. Paolo Bientinesi

HPAC, RWTH Aachen
[email protected]

WS18/19


Quick architecture review


Clock, cycle, frequency

The clock determines when events take place in the hardware.

Frequency (or clock rate): the number of cycles per second.
For instance: 2 GHz → 2 × 10⁹ cycles per second.
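A quick worked consequence of the definition (the 2 GHz figure is the slide's example): the cycle time is the reciprocal of the clock rate.

\[
  T_{\text{cycle}} = \frac{1}{f} = \frac{1}{2 \times 10^{9}\ \text{Hz}} = 0.5\ \text{ns}.
\]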


Basic processor architecture

Source: Computer Organization and Design, Patterson & Hennessy.

Instruction Fetch (IF): read the instruction from cache
Instruction Decode (ID): read register data
Execute (EX): execute the arithmetic/logic operation
Store (ST): store the result
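To make the four stages concrete, here is a minimal C sketch (mine, not from the slides) of a non-pipelined datapath: each instruction goes through IF, ID, EX, and ST before the next one begins. The two-operation instruction set and the register-file size are invented purely for illustration.

/* Minimal, illustrative model of the IF-ID-EX-ST cycle (hypothetical encoding). */
#include <stdio.h>

enum op { ADD, SUB };

struct instr { enum op op; int rd, rs1, rs2; };   /* "instruction memory" entry */

int regs[8];                                      /* register file */

int main(void) {
    struct instr imem[] = { {ADD, 2, 0, 1}, {SUB, 3, 2, 0} };  /* tiny program */
    int n = 2;
    regs[0] = 5; regs[1] = 7;

    for (int pc = 0; pc < n; pc++) {
        struct instr i = imem[pc];          /* IF: fetch the instruction      */
        int a = regs[i.rs1];                /* ID: read the register operands */
        int b = regs[i.rs2];
        int r = (i.op == ADD) ? a + b       /* EX: perform the ALU operation  */
                              : a - b;
        regs[i.rd] = r;                     /* ST: store the result           */
        printf("instr %d -> r%d = %d\n", pc, i.rd, r);
    }
    return 0;
}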


The laundry analogy

Source: Computer Organization and Design, Patterson & Hennessy.

Latency: 1 load takes 2 hours

Throughput: n loads take 2n hours ⇒ 1/2 load per hour

How can we improve the throughput? Pipelining
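Spelling out the arithmetic behind the serial numbers above:

\[
  \text{Throughput}_{\text{serial}} = \frac{n\ \text{loads}}{2n\ \text{hours}} = \frac{1}{2}\ \text{load per hour}\quad\text{for every } n.
\]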


Pipelining

Source: Computer Organization and Design, Patterson & Hennessy.

Latency: still 2 hours

Throughput: n / (2 + (n − 1) · 0.5) loads per hour (for n = 4: 4/3.5 ≈ 1.14 loads/hour)

lim_{n→∞} Throughput = 2 loads per hour (vs. the original 1/2)
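Filling in the limit step the slide leaves implicit:

\[
  \lim_{n \to \infty} \frac{n}{2 + (n-1)\cdot 0.5}
  = \lim_{n \to \infty} \frac{1}{2/n + 0.5 - 0.5/n}
  = \frac{1}{0.5} = 2\ \text{loads per hour}.
\]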


Pipelined processors

Serial execution:

Instr 1          Instr 2          Instr 3
IF ID EX ST   IF ID EX ST   IF ID EX ST

Pipelined execution:

Instr 1:  IF  ID  EX  ST
Instr 2:      IF  ID  EX  ST
Instr 3:          IF  ID  EX  ST

Each step is known as a stage; this is a 4-stage pipeline.
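The staggered picture can be generated mechanically: in an s-stage pipeline, instruction i (counting from 0) occupies stage j during cycle i + j. The following small C sketch (mine, not from the slides) prints that schedule for the 4-stage IF/ID/EX/ST pipeline:

/* Print which stage each instruction occupies in each cycle (illustrative sketch). */
#include <stdio.h>

int main(void) {
    const char *stage[] = { "IF", "ID", "EX", "ST" };
    const int s = 4;            /* number of pipeline stages                 */
    const int n = 3;            /* number of instructions, as in the figure  */

    for (int i = 0; i < n; i++) {             /* one row per instruction     */
        printf("Instr %d:", i + 1);
        for (int c = 0; c < n + s - 1; c++) { /* cycles 0 .. n+s-2           */
            int j = c - i;                    /* stage occupied this cycle   */
            if (j >= 0 && j < s)
                printf("  %s", stage[j]);
            else
                printf("    ");               /* idle before/after the instr */
        }
        printf("\n");
    }
    return 0;
}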


Throughput

Program P: n instructions; s-stage pipeline; latency(stage) = k secs

Single resource ⇒ serial execution

Latency(P) = n·s·k secs
Throughput: n / (n·s·k) = 1/(s·k) instr/sec

Multiple resources ⇒ pipelined execution

Latency(P) = s·k + (n − 1)·k secs
Throughput: lim_{n→∞} n / (s·k + (n − 1)·k) = 1/k instr/sec

Moral: the throughput of the pipelined execution is s times as large as the one of the serial execution. Also, the more stages (the larger s), the smaller k, and the higher the throughput.
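A quick numerical check of these formulas (my own sketch; the parameter values n = 10⁶, s = 5, k = 1 ns are arbitrary assumptions, not from the slides):

/* Compare serial vs. pipelined latency and throughput for an s-stage pipeline. */
#include <stdio.h>

int main(void) {
    double n = 1e6;      /* number of instructions (assumed)          */
    double s = 5.0;      /* pipeline stages (assumed)                 */
    double k = 1e-9;     /* latency of one stage in seconds (assumed) */

    double lat_serial = n * s * k;            /* every instruction takes s*k     */
    double lat_pipe   = s * k + (n - 1) * k;  /* fill the pipe, then 1 per cycle */

    double thr_serial = n / lat_serial;       /* -> 1/(s*k) instr/sec            */
    double thr_pipe   = n / lat_pipe;         /* -> 1/k instr/sec as n grows     */

    printf("serial:    latency = %.6f s, throughput = %.3e instr/s\n",
           lat_serial, thr_serial);
    printf("pipelined: latency = %.6f s, throughput = %.3e instr/s\n",
           lat_pipe, thr_pipe);
    printf("throughput ratio (approx. s): %.3f\n", thr_pipe / thr_serial);
    return 0;
}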


Multiple-issue processors

Replicate internal components to launch multiple instructions per cycle

Allows the instruction execution rate to exceed the clock rate

That is, it allows completing more than one instruction per cycle (IPC > 1)

Instr 1:  IF  ID  EX  ST
Instr 2:  IF  ID  EX  ST
Instr 3:      IF  ID  EX  ST
Instr 4:      IF  ID  EX  ST
Instr 5:          IF  ID  EX  ST
Instr 6:          IF  ID  EX  ST
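As a worked example (the 4-way width and 2 GHz clock are assumed values, not from the slide): a w-way processor running at clock rate f can complete at most w instructions per cycle, so its peak instruction rate is w · f.

\[
  \text{Peak rate} = w \cdot f = 4 \times \left(2 \times 10^{9}\ \text{cycles/sec}\right) = 8 \times 10^{9}\ \text{instructions/sec}.
\]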


Towards the multi-core era: Limitations in ILP

Trends in multiple-issue processors.

          486   Pentium   Pentium II   Pentium 4   Itanium   Itanium 2   Core2
Year     1989      1993         1998        2001      2002        2004    2006
Width       1         2            3           3         3           6       4

High-performance processors:
    Issue width has stabilized at 4-6
    Alpha 21464 (8-way) was canceled (2001)
    Need hardware/compiler scheduling to exploit the width

Embedded/Low-power processors:
    Typical width of 2
    Simpler architectures, no advanced scheduling


Towards the multi-core era: Limitations in ILP

Microarchitecture        Pipeline stages
i486                                   3
P5 (Pentium)                           5
P6 (Pentium Pro/II)                   14
P6 (Pentium 3)                         8
P6 (Pentium M)                        10
NetBurst (Northwood)                  20
NetBurst (Prescott)                   31
Core                                  12
Nehalem                               20
Sandy Bridge                          14
Haswell                               14

Table: Evolution of the pipeline depth for a sample of Intel microarchitectures. Source: wikipedia.org
