Dr. Martin Land — Thread Level Parallelism
Advanced Computer Architecture — Hadassah College — Fall 2016

Beyond Instruction Level Parallelism
Summary of Superscalar Processing

[Figure: single CPU — instruction memory → IF → ID → instruction pool → execution units (EX, Load EX, Store EX) → reorder buffer → registers / data memory]

Single CPU
- Virtual registers and architectural registers prevent false dependencies
- Multiple execution units
- Multiple instructions issued per CC from instruction pool
- Branch prediction and trace cache minimize branch penalties
- Out-of-order execution, in-order retirement
- Predication for conditional cancellation of instructions
- Prefetch minimizes cache misses
- Stream buffer minimizes cache misses
ILP Scalability Limit

Require larger source of independent instructions
Exploit inherent parallelism in software operations

With u execution units (EUs), each superpipelined into s stages:

    instruction window    w = u × s
    ideal issue rate      λ_ideal ≈ u instructions per CC

Scaling u → u' = α × u and s → s' = β × s scales the window to w' = α × β × (u × s)
Scaling instruction window and decoder rate — difficulties
- Decode 15 instructions per CC, despite cache misses, mispredictions, …
- Maintain window of 120 independent instructions
- Branches ≈ 20% of instructions ⇒ 25 – 30 branches in window ⇒ large misprediction probability
Scaling 6 → 15 EUs with 2 → 8 superpipelined stages:

    window: (15/6) × (8/2) = 10× larger ⇒ w = 15 × 8 = 120 instructions executing in parallel
    decoder: λ_ideal ≈ 15 ⇒ must sustain ≥ 14.9 instructions decoded per CC
Sequential and Parallel Operations

Programs combine parallel + sequential constructs
- High-level job → model-dependent sections: processes, threads, classes, procedures, control blocks
- Sections compiled → ISA = low-level CPU operations: data transfers, arithmetic/logic operations, control operations
- High-level job → execution: machine instructions — small sequential operations with local information on 2 or 3 operands
- CPU cannot recognize abstract model-dependent structures — information about inherent parallelism lost in translation to CPU
Parallelism in Sequential Jobs

Concurrency in high-level job
- Two or more independent activities defined to execute at the same time
- Parallel — execute simultaneously on multiple copies of hardware
- Interleave — single hardware unit alternates between activities
- Example: respond to mouse events, respond to keyboard input, accept network messages

Functional concurrency — procedure maps A' = R(θ) × A; code performs sequential operations:

    Ax' =  Ax cos θ + Ay sin θ
    Ay' = -Ax sin θ + Ay cos θ

Data concurrency — procedure maps C = A + B; code performs sequential operations:

    for (i = 0; i < n; i++) C[i] = A[i] + B[i];
[Figure: vectors A and B summing to C; vector A rotated through angle θ to A']
Extracting Concurrency in Sequential Programming

Programmer
- Codes in high-level language
- Code reflects abstract programming models: procedural, object oriented, frameworks, structures, system calls, ...

Compiler
- Converts high-level code to sequential list of localized CPU instructions and operands
- Information about inherent parallelism lost in translation

Hardware
- Applies heuristics to partially recover concurrency as ILP

    Concurrency Identified / Reconstructed       Technique
    Parallelism in single instruction execution  Pipelining
    Operation independence                       Dynamic scheduling superscalar
    Control blocks                               Branch and trace prediction
    Decision trees                               Predication
Extracting Parallelism in Parallel Programming

Programmer
- Identifies inherently parallel operations in high-level job: functional concurrency, data concurrency
- Translates parallel algorithm into source code
- Specifies parallel operations to compiler: parallel threads for functional decomposition, parallel threads for data decomposition

Hardware
- Receives deterministic instructions reflecting inherent parallelism: code + threading instructions
- Disperses instructions to multiple processors or execution units: vectorized operations, pre-grouped independent operations
- Thread Level Parallelism
The "Old" Parallel Processing

1958 — research at IBM on parallelism in arithmetic operations

1960 – 1980
- Mainframe SMP machines with N = 4 to 24 CPUs
- OS dispatches process from shared ready queue to idle processor

1980 – 1995 — research boom
- Automated parallelization by compiler: limited success — compilers cannot identify inherent parallelism
- Parallel constructs in high-level languages: long learning curve — parallel programmers are typically specialists
- Inherent complexities
  - Processing and communication overhead
  - Inter-process message passing — spawning/assembling with many CPUs
  - Synchronization to prevent race conditions (data hazards)
  - Data structures: shared memory model, good blocking to cache organization

1999 — fashionable to consider parallel processing a dead end
Rise and Fall of Multiprocessor R&D

Ref: Mark D. Hill and Ravi Rajwar, "The Rise and Fall of Multiprocessor Papers in the International Symposium on Computer Architecture (ISCA)", http://pages.cs.wisc.edu/~markhill/mp2001.html

Topics of papers submitted to ISCA, 1973 to 2001, sorted as percent of total
ISCA — International Symposium on Computer Architecture

Hennessy and Patterson joke that the proper place for multiprocessing in their book is Chapter 11 (a section of US business law on bankruptcy)
It's Back — the "New" Parallel Processing

Crisis rebranded as opportunity
- Processor clock speed near physical limit (speed of light = 3 × 10^10 cm/s)
- Heating: clock rate ↑ ⇒ heat output ↑; CPU power ↑ ⇒ chip size ↑ ⇒ heat transfer rate ↓ ⇒ CPU overheats
- Superscalar ILP cannot rise significantly: instruction window ~ 100 independent instructions
- "Old" parallel processing is not sufficient

Some interesting possibilities
- Multicore processors cheaper and easier to manufacture
- User-level thread management
- Multithreaded OS kernels and OS-level thread scheduling
- Compiler support for thread management APIs
- New debugging tools
[Figure: signal crossing a CPU die ~3 cm across, in → out]

    τ_delay > (3 cm) / (3 × 10^10 cm/s) = 10^-10 s
    τ_clock ≳ τ_delay  ⇒  R_max < 1/τ_delay ≈ 10^10 Hz = 10 GHz
Processes and Threads

Process
- One instance of an independently executable program
- Basic unit of OS kernel scheduling (on traditional kernel)
- Entry in process control block (PCB) defines resources: ID, state, PC, register values, stack + memory space, I/O descriptors, …
- Process context switch → high-volume transfer operation
- Organized into one or more owned threads

Thread
- One instance of an independently executable instruction sequence
- Not organized into smaller multitasked units
- Limited private resources — PC, stack, and register values; other resources shared with other threads owned by process
- Scheduled by kernel or threaded user code
- Thread switch → low-volume transfer operation
Multithreaded Software

Threaded OS kernel
- Process = one or more threads

Multithreaded application
- Organized as more than one thread
- Threads scheduled by OS or application code
- Not specific to parallel algorithms

Classic multithreading example — multithreaded web server
- Serves multiple clients; creates thread per client
- Server process creates listen thread
- Listen thread blocks — waits for service request
- Service request → listen thread creates new serve thread
- Serve thread handles web service request
- Listen thread returns to blocking
[Figure: client sends request to server; listen thread spawns a new serve thread; serve thread returns response]
Decomposing Work

Decomposition
- Break down program into basic activities
- Identify dependencies between activities
- "Chunking" — choose size parameters for coded activities

Functional decomposition — each thread assigned a different activity
- Example — 3D game: thread 1 updates ground, thread 2 updates sky, thread 3 updates character

Data decomposition — each thread runs same code on separate block of data
- Example — 3D game: divide sky into n sections; threads 1 — n each update one section of sky
Hardware Implementation of Multithreading

No special hardware requirements
- Multithreaded code runs on single- or multiple-CPU systems
- Run-time efficiency depends on hardware/software interaction

Coarse-grained multithreading — single CPU swaps among threads on long stall
Fine-grained multithreading — single CPU swaps among threads on each clock cycle
Simultaneous multithreading (SMT) — superscalar CPU pools instructions from multiple threads, enlarging the instruction window
Hyper-Threading — Intel technology combining fine-grained multithreading and SMT
Multiprocessing — dispatches threads to CPUs
Superscalar CPU Multithreading

Single thread on superscalar
[Figure: fetch/decode and ROB issue instructions to the execution units across clock cycles; many EUs sit empty]

Coarse-grained multithreading on superscalar
[Figure: threads 1 – 4 take turns in the execution units, each running for a long block of clock cycles]

Fine-grained multithreading on superscalar
[Figure: threads 1 – 4 alternate in the execution units on every clock cycle]
Simultaneous Multithreading

[Figure: instructions from threads 1 – 4 share the execution units within the same clock cycle; few EUs sit empty]

Simultaneous multithreading on superscalar
- Pool instructions from multiple threads
- Instructions labeled in reorder buffer (ROB): PC, thread number, operands, status
- Large instruction window
- Advantage on mispredictions: only the thread with the misprediction is cancelled; other threads continue to execute
- Cancellation rate from mispredictions → ¼ the single-thread cancellation rate
Hyper-Threading

Two copies of architectural state + one execution core
- Fine-grained N = 2 multithreading: interleaves threads on in-order fetch/decode/retire units, issuing instructions to shared out-of-order execution core
- Simultaneous N = 2 multithreading (SMT): executes instructions from shared instruction pool (ROB)
- Stall in one thread ⇒ other thread continues; both logical CPUs keep working on most clock cycles — the advantage of coarse-grained N = 2 multithreading

[Figure: two architectural states (CPU 0 and CPU 1) share one execution core, cache, main memory, PCI bridge, and I/O bus]

Architectural state — registers, stack pointers, and program counter
Execution core — ALU, FPU, vector processors, memory unit
Thread Coexistence

Multiprocessor code
- Provides source of independent instructions
- Permits high processor utilization

Independent applications running in parallel
- Unrelated instructions with no data dependencies
- Independence can create resource conflicts: require different data blocks in cache; use different branch prediction cache and trace cache

Parallel threads of single application
- Different pieces of same program, run in coordinated fashion
- Communicate, synchronize, exchange data
- Stall in one thread can stall related threads: cache miss, page fault, branch misprediction, ...
Helper Thread Model

Performs no committed work
- Does not change any program result
- Results not committed to memory
- Requires no additional hardware support

Performs loads and branches that appear in work thread
- Encounters cache misses before work thread does
- Prepares caches, preventing costly misses
Helper Thread Example

Work Thread:
    L:  MUL  R4, R6, R8
        ADD  R4, R6, R9
        ADD  R1, R2, R3
        SUB  R3, R4, R5
        LW   R6, 0(R1)   ; cache miss
        ADD  R6, R3, R2
        BEQZ R6, L       ; misprediction

Helper Thread (runs ahead):
    L:  ADD  R1, R2, R3
        LW   R6, 0(R1)   ; cache miss → cache update
        BEQZ R6, L       ; misprediction → update predictor

Work Thread (after helper):
    L:  MUL  R4, R6, R8
        ADD  R4, R6, R9
        ADD  R1, R2, R3
        SUB  R3, R4, R5
        LW   R6, 0(R1)   ; no cache miss
        ADD  R6, R3, R2
        BEQZ R6, L       ; no misprediction
Flynn Taxonomy for CPU Architectures

SISD — standard single-CPU machine with single or multiple pipelines
SIMD — vector processor or processor array; performs one operation on a data set on each CC
MISD — performs multiple operations on one data set each CC; few products — IBM Watson AI applies multiple algorithms to same data
MIMD — multiprocessor or cluster computer; performs multiple operations on multiple data sets on each CC

Ref: M.J. Flynn, "Very High-Speed Computers", Proceedings of the IEEE, Dec. 1966.

                         Single Data    Multiple Data
    Single Instruction   SISD           SIMD
    Multiple Instruction MISD           MIMD
Multiprocessor Architecture

SISD/SIMD workstation
- Dual-core CPU: architectural registers, cache, execution units
- I/O system: long-term storage, peripheral devices, system support functions
- Main memory
- Internal network system

MIMD multiprocessor
- Multiple CPUs
- I/O system
- Main memory: unified or partitioned
- Internal network: from simple bus to complex mesh
[Figure: MIMD multiprocessor — multiple CPUs and multiple memories on an internal network, with I/O, user interface, and external network]

[Figure: SISD/SIMD workstation — dual-core central processing unit (CPU): two processor cores with registers plus cache memory on a front side bus; a bus adapter joins the memory bus to main memory (RAM) and the I/O bus to I/O controllers for the disk, user interface, and communications network]
Network Topology → Parallelization Model

Message passing system
- N nodes — processors with private address space A
- Processors communicate by passing messages over internal network
- Messages combine data and memory synchronization

Shared memory system
- Global memory space A physically partitioned into M blocks
- N processors access full memory space via internal network
- Processors communicate by write/read to shared addresses
- Synchronize memory accesses to prevent data hazards
[Figure: shared memory — CPUs 0 … N−1 reach memory blocks 0 … M−1, holding address ranges (0, …, A/M−1) … ((M−1)×A/M, …, A−1), through a switching fabric shared with I/O, user interface, and external network]

[Figure: message passing — CPUs 0 … N−1, each with a private memory spanning addresses 0, …, A−1, connected by a switching fabric to I/O, user interface, and external network]
Flynn-Johnson Taxonomy

Ref: E. E. Johnson, "Completing an MIMD Multiprocessor Taxonomy", Computer Architecture News, June 1988.

                         Single Data    Multiple Data
    Single Instruction   SISD           SIMD
    Multiple Instruction MISD           MIMD

    MIMD subdivides by memory and communication model:

                          Shared Memory    Message Passing
    Distributed Memory    DMSM             DMMP
    Global Memory         GMSM             GMMP
Shared Memory versus Message Passing

                            Shared Memory                      Message Passing
    Interprocess            Multiple CPUs access shared        Multiple CPUs exchange
    communication           addresses in common address space  messages
    API                     OpenMP                             Message Passing Interface (MPI)
    Applicability           Fine-grain parallelism —           Coarse-grain parallelism —
                            light parallel threads,            heavy parallel threads,
                            short code length,                 long code length,
                            small data volume                  large data volume
    Scalability             Limited by complexity of           Independent of number of CPUs;
                            CPU access to shared memory        limited by network capacity
    Communication           Cache / RAM updates,               Message formulation, message
    overhead                cache coherency                    distribution, network overhead
Amdahl's Law for Multiprocessors

Parallelization — divide work among processors
F_P = fraction of program that can be parallelized

For parallel work:

    IC' = IC,   CPI' = (1 − F_P) × CPI + (F_P × CPI)/N

    S = (CPI × IC × τ) / (CPI' × IC' × τ) = CPI / CPI' = 1 / ((1 − F_P) + F_P/N)

    S ⟶ (N → ∞) 1 / (1 − F_P)

With contemporary technology, for most applications F_P ≈ 80%:

    S ⟶ (N → ∞) 1 / (1 − 0.8) = 1 / 0.2 = 5
MP and HT Performance Enhancements

Speed-up for On-Line Transaction Processing (OLTP):

    Configuration                       S       S/CPU
    MP without Hyper-Threading (2 CPUs) 1.72    0.85
    MP without Hyper-Threading (4 CPUs) 2.64    0.65
    Hyper-Threading without MP          1.21    0.60

    S(2) = 1 / ((1 − F_P) + F_P/2) ≈ 1.7  ⇒  F_P ≈ 0.8
    S(4) = 1 / ((1 − F_P) + F_P/4) ≈ 2.6
On-Line Transaction Processing (OLTP)

Model
- Transactions — client requests to server + database
- Banking, order processing, inventory management, student info system

Independent work — inherently multithreaded
- 1 thread per request; server sees large batch of small parallel threads
- Short sequential code: SQL transactions — short accesses to multiple tables
- Complex (DB) access ⇒ memory latency ⇒ CPU stalls per thread
- CPI_OLTP = 1.27 on 8-pipeline dynamic scheduling superscalar; CPI_SPEC = 0.31 on same hardware
[Figure: clients → network → request buffer → server ↔ database]
Memory Access Complexities in OLTP

SQL thread accesses multiple tables
- Example: order processing ⇒ customer account, inventory, shipping, ...
- Tables in separate areas of memory ⇒ cache conflicts
- Generates multiple memory latencies per thread

Multiple threads
- Threads access same tables
- Requires atomic SQL transactions and thread synchronization
- Synchronization ⇒ locks on parallel threads ⇒ memory latencies

SMT advantage
- Process many threads to hide memory latency
Multiprocessor Efficiency

    Speedup:                   S = 1 / ((1 − F_P) + F_P/N)
    Ideal (linear) speedup:    S_ideal = N

Efficiency — actual speedup relative to ideal speedup = speedup per processor:

    E = S / S_ideal = S × (1/N) = 1 / (N × (1 − F_P) + F_P)

Efficiency of a large system:

    E ⟶ (N → ∞) 0
Grosch's Law versus Amdahl's Law

Computers enjoy economies of scale
- Claim formulated by Herbert R. J. Grosch at IBM in 1953
- Performance-to-price ratio rises as price rises:

      performance = k_G × cost²,   k_G = constant

      performance / cost = k_G × cost — rises with cost

If the cost of a multiprocessor system is linear in the unit price of one CPU,

      Cost(N) = α × N   for some constant α

then Grosch's law predicts

      performance(N) = k_G × (α × N)² = k_G × α² × N²  ⇒  S_Grosch(N) = N²

But Amdahl's law implies

      S_Amdahl(N) = 1 / ((1 − F_P) + F_P/N) ≤ 1 / (1 − F_P)

so performance(N) / Cost(N) = S_Amdahl(N) / (α × N) falls as N rises — multiprocessors do not obey Grosch's law.
Claims Against Amdahl's Law

Assumption in Amdahl's law: F_P = constant. Suppose instead F_P = F_P(N) with F_P(N) ⟶ (N → ∞) 1. Then

    S = 1 / ((1 − F_P(N)) + F_P(N)/N) ⟶ (N → ∞) N

    E = S/N ⟶ (N → ∞) 1

Gustafson-Barsis Law — the parallel part of a large problem can scale with problem size. With n = size of problem and s, p = serial and parallelizable run time in serial execution:

    S = (s + p × n) / (s + p × n / N) ⟶ (n large) N

where S = speedup compared to serial execution.
Interconnection Network Types

Static — permanent point-to-point connections between end nodes
- Full connectivity: requires N × (N−1) point-to-point connections
- Limited connectivity: requires multiple hops between end nodes

Dynamic — switch elements configured specifically for each transfer
- Single bus: nodes perform arbitration for bus access; simplest implementation, with standard I/O bus types (VME, SCSI, PCI, datakit, etc.)
- Multiple bus: end nodes connect to N identical buses in parallel
- Crossbar: N × N simultaneous non-blocking connections
- Single-stage switch: N × N switch with limited connectivity; data makes multiple node-to-node hops between end nodes (source to destination)
- Multistage: full-connectivity switch assembled from multiple single-stage switches; not simultaneously non-blocking
Communication Overhead and Amdahl's Law

Parallelization, ideally:

    F_P = fraction of program that can be parallelized
    CPI' = (1 − F_P) × CPI + (F_P × CPI)/N,   IC' = IC

Communication adds overhead. Let CPI_comm = processor clock cycles devoted to communication per instruction executed in parallel:

    T_comm = CPI_comm × F_P × IC × τ

Including communication overhead in the speedup, with overhead factor F_overhead = CPI_comm / CPI:

    S = (CPI × IC) / (CPI × (1 − F_P) × IC + (CPI/N) × F_P × IC + CPI_comm × F_P × IC)

      = 1 / ((1 − F_P) + F_P/N + F_P × F_overhead)
Large Communication Overhead

Communication overhead can eliminate the benefits of parallelization:

    S = 1 / ((1 − F_P) + F_P/N + F_P × F_overhead)

    S_max = lim (N → ∞) S = 1 / ((1 − F_P) + F_P × F_overhead) = 1 / (1 − F_P × (1 − F_overhead))

where

    F_overhead = CPI_comm / CPI = communication activity / processing activity

As F_overhead → 1 (communication as heavy as processing):

    S_max ⟶ 1 / ((1 − F_P) + F_P) = 1