Dr. Martin Land — Thread Level Parallelism
Advanced Computer Architecture — Hadassah College — Fall 2016

Beyond Instruction Level Parallelism
Summary of Superscalar Processing

[Figure: single CPU — instruction memory → IF → ID → instruction pool → execution units (EX, Load EX, Store EX) → reorder buffer → registers / data memory]

Single CPU
- Virtual registers and architectural registers prevent false dependencies
- Multiple execution units
- Multiple instructions issued per CC from instruction pool
- Branch prediction and trace cache minimize branch penalties
- Out-of-order execution, in-order retirement
- Predication for conditional cancellation of instructions
- Prefetch minimizes cache misses
- Stream buffer minimizes cache misses
ILP Scalability Limit

Require larger source of independent instructions
Exploit inherent parallelism in software operations

With u execution units (EUs), each superpipelined into s stages:

    instruction window    w = u × s
    ideal issue rate      λ_ideal ≈ u instructions per CC

Scaling u → u' = α × u and s → s' = β × s scales the window to w' = α × β × (u × s)
Scaling instruction window and decoder rate — difficulties
- Decode 15 instructions per CC, despite cache misses, mispredictions, …
- Maintain window of 120 independent instructions
- Branches ≈ 20% of instructions ⇒ 25 – 30 branches in window ⇒ large misprediction probability
Scaling 6 → 15 EUs with 2 → 8 superpipelined stages:

    window: (15/6) × (8/2) = 10× larger ⇒ w = 15 × 8 = 120 instructions executing in parallel
    decoder: λ_ideal ≈ 15 ⇒ must sustain ≥ 14.9 instructions decoded per CC
Sequential and Parallel Operations

Programs combine parallel + sequential constructs
- High-level job → model-dependent sections: processes, threads, classes, procedures, control blocks
- Sections compiled → ISA = low-level CPU operations: data transfers, arithmetic/logic operations, control operations
- High-level job → execution: machine instructions — small sequential operations with local information on 2 or 3 operands
- CPU cannot recognize abstract model-dependent structures — information about inherent parallelism lost in translation to CPU
Parallelism in Sequential Jobs

Concurrency in high-level job
- Two or more independent activities defined to execute at the same time
- Parallel — execute simultaneously on multiple copies of hardware
- Interleave — single hardware unit alternates between activities
- Example: respond to mouse events, respond to keyboard input, accept network messages

Functional concurrency — procedure maps A' = R(θ) × A; code performs sequential operations:

    Ax' =  Ax cos θ + Ay sin θ
    Ay' = -Ax sin θ + Ay cos θ

Data concurrency — procedure maps C = A + B; code performs sequential operations:

    for (i = 0; i < n; i++) C[i] = A[i] + B[i];
[Figure: vectors A and B summing to C; vector A rotated through angle θ to A']
Extracting Concurrency in Sequential Programming

Programmer
- Codes in high-level language
- Code reflects abstract programming models: procedural, object oriented, frameworks, structures, system calls, ...

Compiler
- Converts high-level code to sequential list of localized CPU instructions and operands
- Information about inherent parallelism lost in translation

Hardware
- Applies heuristics to partially recover concurrency as ILP

    Concurrency Identified / Reconstructed       Technique
    Parallelism in single instruction execution  Pipelining
    Operation independence                       Dynamic scheduling superscalar
    Control blocks                               Branch and trace prediction
    Decision trees                               Predication
Extracting Parallelism in Parallel Programming

Programmer
- Identifies inherently parallel operations in high-level job: functional concurrency, data concurrency
- Translates parallel algorithm into source code
- Specifies parallel operations to compiler: parallel threads for functional decomposition, parallel threads for data decomposition

Hardware
- Receives deterministic instructions reflecting inherent parallelism: code + threading instructions
- Disperses instructions to multiple processors or execution units: vectorized operations, pre-grouped independent operations
- Thread Level Parallelism
The "Old" Parallel Processing

1958 — research at IBM on parallelism in arithmetic operations

1960 – 1980
- Mainframe SMP machines with N = 4 to 24 CPUs
- OS dispatches process from shared ready queue to idle processor

1980 – 1995 — research boom
- Automated parallelization by compiler: limited success — compilers cannot identify inherent parallelism
- Parallel constructs in high-level languages: long learning curve — parallel programmers are typically specialists
- Inherent complexities
  - Processing and communication overhead
  - Inter-process message passing — spawning/assembling with many CPUs
  - Synchronization to prevent race conditions (data hazards)
  - Data structures: shared memory model, good blocking to cache organization

1999 — fashionable to consider parallel processing a dead end
Rise and Fall of Multiprocessor R&D

Ref: Mark D. Hill and Ravi Rajwar, "The Rise and Fall of Multiprocessor Papers in the International Symposium on Computer Architecture (ISCA)", http://pages.cs.wisc.edu/~markhill/mp2001.html

Topics of papers submitted to ISCA, 1973 to 2001, sorted as percent of total
ISCA — International Symposium on Computer Architecture

Hennessy and Patterson joke that the proper place for multiprocessing in their book is Chapter 11 (a section of US business law on bankruptcy)
It's Back — the "New" Parallel Processing

Crisis rebranded as opportunity
- Processor clock speed near physical limit (speed of light = 3 × 10^10 cm/s)
- Heating: clock rate ↑ ⇒ heat output ↑; CPU power ↑ ⇒ chip size ↑ ⇒ heat transfer rate ↓ ⇒ CPU overheats
- Superscalar ILP cannot rise significantly: instruction window ~ 100 independent instructions
- "Old" parallel processing is not sufficient

Some interesting possibilities
- Multicore processors cheaper and easier to manufacture
- User-level thread management
- Multithreaded OS kernels and OS-level thread scheduling
- Compiler support for thread management APIs
- New debugging tools
[Figure: signal crossing a CPU die ~3 cm across, in → out]

    τ_delay > (3 cm) / (3 × 10^10 cm/s) = 10^-10 s
    τ_clock ≳ τ_delay  ⇒  R_max < 1/τ_delay ≈ 10^10 Hz = 10 GHz
Processes and Threads

Process
- One instance of an independently executable program
- Basic unit of OS kernel scheduling (on traditional kernel)
- Entry in process control block (PCB) defines resources: ID, state, PC, register values, stack + memory space, I/O descriptors, …
- Process context switch → high-volume transfer operation
- Organized into one or more owned threads

Thread
- One instance of an independently executable instruction sequence
- Not organized into smaller multitasked units
- Limited private resources — PC, stack, and register values; other resources shared with other threads owned by process
- Scheduled by kernel or threaded user code
- Thread switch → low-volume transfer operation
Multithreaded Software

Threaded OS kernel
- Process = one or more threads

Multithreaded application
- Organized as more than one thread
- Threads scheduled by OS or application code
- Not specific to parallel algorithms

Classic multithreading example — multithreaded web server
- Serves multiple clients; creates thread per client
- Server process creates listen thread
- Listen thread blocks — waits for service request
- Service request → listen thread creates new serve thread
- Serve thread handles web service request
- Listen thread returns to blocking
[Figure: client sends request to server; listen thread spawns a new serve thread; serve thread returns response]
Decomposing Work

Decomposition
- Break down program into basic activities
- Identify dependencies between activities
- "Chunking" — choose size parameters for coded activities

Functional decomposition — each thread assigned a different activity
- Example — 3D game: thread 1 updates ground, thread 2 updates sky, thread 3 updates character

Data decomposition — each thread runs same code on separate block of data
- Example — 3D game: divide sky into n sections; threads 1 — n each update one section of sky
Hardware Implementation of Multithreading

No special hardware requirements
- Multithreaded code runs on single- or multiple-CPU systems
- Run-time efficiency depends on hardware/software interaction

Coarse-grained multithreading — single CPU swaps among threads on long stall
Fine-grained multithreading — single CPU swaps among threads on each clock cycle
Simultaneous multithreading (SMT) — superscalar CPU pools instructions from multiple threads, enlarging the instruction window
Hyper-Threading — Intel technology combining fine-grained multithreading and SMT
Multiprocessing — dispatches threads to CPUs
Superscalar CPU Multithreading

Single thread on superscalar
[Figure: fetch/decode and ROB issue instructions to the execution units across clock cycles; many EUs sit empty]

Coarse-grained multithreading on superscalar
[Figure: threads 1 – 4 take turns in the execution units, each running for a long block of clock cycles]

Fine-grained multithreading on superscalar
[Figure: threads 1 – 4 alternate in the execution units on every clock cycle]
Simultaneous Multithreading

[Figure: instructions from threads 1 – 4 share the execution units within the same clock cycle; few EUs sit empty]

Simultaneous multithreading on superscalar
- Pool instructions from multiple threads
- Instructions labeled in reorder buffer (ROB): PC, thread number, operands, status
- Large instruction window
- Advantage on mispredictions: only the thread with the misprediction is cancelled; other threads continue to execute
- Cancellation rate from mispredictions → ¼ the single-thread cancellation rate
Hyper-Threading

Two copies of architectural state + one execution core
- Fine-grained N = 2 multithreading: interleaves threads on in-order fetch/decode/retire units, issuing instructions to shared out-of-order execution core
- Simultaneous N = 2 multithreading (SMT): executes instructions from shared instruction pool (ROB)
- Stall in one thread ⇒ other thread continues; both logical CPUs keep working on most clock cycles — the advantage of coarse-grained N = 2 multithreading

[Figure: two architectural states (CPU 0 and CPU 1) share one execution core, cache, main memory, PCI bridge, and I/O bus]

Architectural state — registers, stack pointers, and program counter
Execution core — ALU, FPU, vector processors, memory unit
Thread Coexistence

Multiprocessor code
- Provides source of independent instructions
- Permits high processor utilization

Independent applications running in parallel
- Unrelated instructions with no data dependencies
- Independence can create resource conflicts: require different data blocks in cache; use different branch prediction cache and trace cache

Parallel threads of single application
- Different pieces of same program, run in coordinated fashion
- Communicate, synchronize, exchange data
- Stall in one thread can stall related threads: cache miss, page fault, branch misprediction, ...
Helper Thread Model

Performs no committed work
- Does not change any program result
- Results not committed to memory
- Requires no additional hardware support

Performs loads and branches that appear in work thread
- Encounters cache misses before work thread does
- Prepares caches, preventing costly misses
Helper Thread Example

Work Thread:
    L:  MUL  R4, R6, R8
        ADD  R4, R6, R9
        ADD  R1, R2, R3
        SUB  R3, R4, R5
        LW   R6, 0(R1)   ; cache miss
        ADD  R6, R3, R2
        BEQZ R6, L       ; misprediction

Helper Thread (runs ahead):
    L:  ADD  R1, R2, R3
        LW   R6, 0(R1)   ; cache miss → cache update
        BEQZ R6, L       ; misprediction → update predictor

Work Thread (after helper):
    L:  MUL  R4, R6, R8
        ADD  R4, R6, R9
        ADD  R1, R2, R3
        SUB  R3, R4, R5
        LW   R6, 0(R1)   ; no cache miss
        ADD  R6, R3, R2
        BEQZ R6, L       ; no misprediction
Flynn Taxonomy for CPU Architectures

SISD — standard single-CPU machine with single or multiple pipelines
SIMD — vector processor or processor array; performs one operation on a data set on each CC
MISD — performs multiple operations on one data set each CC; few products — IBM Watson AI applies multiple algorithms to same data
MIMD — multiprocessor or cluster computer; performs multiple operations on multiple data sets on each CC

Ref: M.J. Flynn, "Very High-Speed Computers", Proceedings of the IEEE, Dec. 1966.

                         Single Data    Multiple Data
    Single Instruction   SISD           SIMD
    Multiple Instruction MISD           MIMD
Multiprocessor Architecture

SISD/SIMD workstation
- Dual-core CPU: architectural registers, cache, execution units
- I/O system: long-term storage, peripheral devices, system support functions
- Main memory
- Internal network system

MIMD multiprocessor
- Multiple CPUs
- I/O system
- Main memory: unified or partitioned
- Internal network: from simple bus to complex mesh
[Figure: MIMD multiprocessor — multiple CPUs and multiple memories on an internal network, with I/O, user interface, and external network]

[Figure: SISD/SIMD workstation — dual-core central processing unit (CPU): two processor cores with registers plus cache memory on a front side bus; a bus adapter joins the memory bus to main memory (RAM) and the I/O bus to I/O controllers for the disk, user interface, and communications network]
Network Topology → Parallelization Model

Message passing system
- N nodes — processors with private address space A
- Processors communicate by passing messages over internal network
- Messages combine data and memory synchronization

Shared memory system
- Global memory space A physically partitioned into M blocks
- N processors access full memory space via internal network
- Processors communicate by write/read to shared addresses
- Synchronize memory accesses to prevent data hazards
[Figure: shared memory — CPUs 0 … N−1 reach memory blocks 0 … M−1, holding address ranges (0, …, A/M−1) … ((M−1)×A/M, …, A−1), through a switching fabric shared with I/O, user interface, and external network]

[Figure: message passing — CPUs 0 … N−1, each with a private memory spanning addresses 0, …, A−1, connected by a switching fabric to I/O, user interface, and external network]
Flynn-Johnson Taxonomy

Ref: E. E. Johnson, "Completing an MIMD Multiprocessor Taxonomy", Computer Architecture News, June 1988.

                         Single Data    Multiple Data
    Single Instruction   SISD           SIMD
    Multiple Instruction MISD           MIMD

    MIMD subdivides by memory and communication model:

                          Shared Memory    Message Passing
    Distributed Memory    DMSM             DMMP
    Global Memory         GMSM             GMMP
Shared Memory versus Message Passing

                            Shared Memory                      Message Passing
    Interprocess            Multiple CPUs access shared        Multiple CPUs exchange
    communication           addresses in common address space  messages
    API                     OpenMP                             Message Passing Interface (MPI)
    Applicability           Fine-grain parallelism —           Coarse-grain parallelism —
                            light parallel threads,            heavy parallel threads,
                            short code length,                 long code length,
                            small data volume                  large data volume
    Scalability             Limited by complexity of           Independent of number of CPUs;
                            CPU access to shared memory        limited by network capacity
    Communication           Cache / RAM updates,               Message formulation, message
    overhead                cache coherency                    distribution, network overhead
Amdahl's Law for Multiprocessors

Parallelization — divide work among processors
F_P = fraction of program that can be parallelized

For parallel work:

    IC' = IC,   CPI' = (1 − F_P) × CPI + (F_P × CPI)/N

    S = (CPI × IC × τ) / (CPI' × IC' × τ) = CPI / CPI' = 1 / ((1 − F_P) + F_P/N)

    S ⟶ (N → ∞) 1 / (1 − F_P)

With contemporary technology, for most applications F_P ≈ 80%:

    S ⟶ (N → ∞) 1 / (1 − 0.8) = 1 / 0.2 = 5
MP and HT Performance Enhancements

Speed-up for On-Line Transaction Processing (OLTP):

    Configuration                       S       S/CPU
    MP without Hyper-Threading (2 CPUs) 1.72    0.85
    MP without Hyper-Threading (4 CPUs) 2.64    0.65
    Hyper-Threading without MP          1.21    0.60

    S(2) = 1 / ((1 − F_P) + F_P/2) ≈ 1.7  ⇒  F_P ≈ 0.8
    S(4) = 1 / ((1 − F_P) + F_P/4) ≈ 2.6
On-Line Transaction Processing (OLTP)

Model
- Transactions — client requests to server + database
- Banking, order processing, inventory management, student info system

Independent work — inherently multithreaded
- 1 thread per request; server sees large batch of small parallel threads
- Short sequential code: SQL transactions — short accesses to multiple tables
- Complex (DB) access ⇒ memory latency ⇒ CPU stalls per thread
- CPI_OLTP = 1.27 on 8-pipeline dynamic scheduling superscalar; CPI_SPEC = 0.31 on same hardware
[Figure: clients → network → request buffer → server ↔ database]
Memory Access Complexities in OLTP

SQL thread accesses multiple tables
- Example: order processing ⇒ customer account, inventory, shipping, ...
- Tables in separate areas of memory ⇒ cache conflicts
- Generates multiple memory latencies per thread

Multiple threads
- Threads access same tables
- Requires atomic SQL transactions and thread synchronization
- Synchronization ⇒ locks on parallel threads ⇒ memory latencies

SMT advantage
- Process many threads to hide memory latency
Multiprocessor Efficiency

    Speedup:                   S = 1 / ((1 − F_P) + F_P/N)
    Ideal (linear) speedup:    S_ideal = N

Efficiency — actual speedup relative to ideal speedup = speedup per processor:

    E = S / S_ideal = S × (1/N) = 1 / (N × (1 − F_P) + F_P)

Efficiency of a large system:

    E ⟶ (N → ∞) 0
Grosch's Law versus Amdahl's Law

Computers enjoy economies of scale
- Claim formulated by Herbert R. J. Grosch at IBM in 1953
- Performance-to-price ratio rises as price rises:

      performance = k_G × cost²,   k_G = constant

      performance / cost = k_G × cost — rises with cost

If the cost of a multiprocessor system is linear in the unit price of one CPU,

      Cost(N) = α × N   for some constant α

then Grosch's law predicts

      performance(N) = k_G × (α × N)² = k_G × α² × N²  ⇒  S_Grosch(N) = N²

But Amdahl's law implies

      S_Amdahl(N) = 1 / ((1 − F_P) + F_P/N) ≤ 1 / (1 − F_P)

so performance(N) / Cost(N) = S_Amdahl(N) / (α × N) falls as N rises — multiprocessors do not obey Grosch's law.
Claims Against Amdahl's Law

Assumption in Amdahl's law: F_P = constant. Suppose instead F_P = F_P(N) with F_P(N) ⟶ (N → ∞) 1. Then

    S = 1 / ((1 − F_P(N)) + F_P(N)/N) ⟶ (N → ∞) N

    E = S/N ⟶ (N → ∞) 1

Gustafson-Barsis Law — the parallel part of a large problem can scale with problem size. With n = size of problem and s, p = serial and parallelizable run time in serial execution:

    S = (s + p × n) / (s + p × n / N) ⟶ (n large) N

where S = speedup compared to serial execution.
Interconnection Network Types

Static — permanent point-to-point connections between end nodes
- Full connectivity: requires N × (N−1) point-to-point connections
- Limited connectivity: requires multiple hops between end nodes

Dynamic — switch elements configured specifically for each transfer
- Single bus: nodes perform arbitration for bus access; simplest implementation, with standard I/O bus types (VME, SCSI, PCI, datakit, etc.)
- Multiple bus: end nodes connect to N identical buses in parallel
- Crossbar: N × N simultaneous non-blocking connections
- Single-stage switch: N × N switch with limited connectivity; data makes multiple node-to-node hops between end nodes (source to destination)
- Multistage: full-connectivity switch assembled from multiple single-stage switches; not simultaneously non-blocking
Communication Overhead and Amdahl's Law

Parallelization, ideally:

    F_P = fraction of program that can be parallelized
    CPI' = (1 − F_P) × CPI + (F_P × CPI)/N,   IC' = IC

Communication adds overhead. Let CPI_comm = processor clock cycles devoted to communication per instruction executed in parallel:

    T_comm = CPI_comm × F_P × IC × τ

Including communication overhead in the speedup, with overhead factor F_overhead = CPI_comm / CPI:

    S = (CPI × IC) / (CPI × (1 − F_P) × IC + (CPI/N) × F_P × IC + CPI_comm × F_P × IC)

      = 1 / ((1 − F_P) + F_P/N + F_P × F_overhead)
Large Communication Overhead

Communication overhead can eliminate the benefits of parallelization:

    S = 1 / ((1 − F_P) + F_P/N + F_P × F_overhead)

    S_max = lim (N → ∞) S = 1 / ((1 − F_P) + F_P × F_overhead) = 1 / (1 − F_P × (1 − F_overhead))

where

    F_overhead = CPI_comm / CPI = communication activity / processing activity

As F_overhead → 1 (communication as heavy as processing):

    S_max ⟶ 1 / ((1 − F_P) + F_P) = 1