CS152: Computer Systems Architecture
Multiprocessing and Parallelism
Sang-Woo Jun
Winter 2019
Large amount of material adapted from MIT 6.004 “Computation Structures”, Morgan Kaufmann “Computer Organization and Design: The Hardware/Software Interface: RISC-V Edition”, and CS 152 slides by Isaac Scherson
High performance via parallelism
Given a fixed amount of work, we want to do as much of it at once as possible, to reduce the time it takes
o Subject to available computing power
o Subject to the inherently exploitable parallelism in the work
What we looked at: Instruction-Level Parallelism
o Pipelining works on multiple instructions, each at a different stage
• Subject to pipeline depth, and to dependencies between instructions
o Superscalar works on multiple instructions, even some at the same stage
• Subject to the replicated pipeline count (ways), and to dependencies between instructions
Some more exploitable parallelism
Data parallelism
o When operations on multiple data elements do not have much dependency between each other, they can be done in parallel
o (Typically the same work on multiple data)
Task parallelism
o Work can be divided into multiple independent processes, which do not have much dependency between each other
o (Typically different work)
Savni, “Data parallelism in matrix multiplication”
Of course, the distinction is often muddled in the real world
How much performance improvement?
Amdahl’s law (Gene Amdahl, 1967)
o Theoretical speedup for a given job as system resources are improved
o Speedup = 1 / ((1 − p) + p/s)
• p is the fraction of the job that is parallelizable
• s is the improvement of system resources (e.g., processor count)
o If p is zero, there is no speedup no matter how many processors we have!
o If p is 1, performance scales perfectly with more processors
Assume no synchronization overhead, for an upper-bound estimate
How much performance improvement?
Example: Can we get 90× speedup using 100 processors?
o Remember: Speedup = 1 / ((1 − p) + p/s)
o s = 100 with 100 processors, and we want the speedup to be 90
o Solving 1 / ((1 − p) + p/100) = 90 gives p ≈ 0.999
Need the non-parallelizable part to be only 0.1% of the original time
Scaling Example
Workload: sum of 10 scalars, plus a 10 × 10 matrix sum
Single processor: Time = (10 + 100) × tadd
10 processors
o Assuming adding the 10 scalars can’t be parallelized
o Time = 10 × tadd + 100/10 × tadd = 20 × tadd
o Speedup = 110/20 = 5.5 (55% of potential peak)
100 processors
o Time = 10 × tadd + 100/100 × tadd = 11 × tadd
o Speedup = 110/11 = 10 (10% of potential peak)
Scaling Example
What if the matrix size is 100 × 100?
Single processor: Time = (10 + 10000) × tadd
10 processors
o Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
o Speedup = 10010/1010 = 9.9 (99% of potential peak)
100 processors
o Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
o Speedup = 10010/110 = 91 (91% of potential peak)
Why focus on parallelism?
Of course, large jobs require large machines with many processors
o Exploiting parallelism to make the best use of supercomputers has always been an extremely important topic
But now even desktops and phones are multicore!
o Why? The end of “Dennard Scaling”
What happened?
Dennard Scaling
Robert Dennard, 1974
“Power density stays constant as transistors get smaller”
Intuitively:
o Smaller transistors → shorter propagation delay → higher frequency
o Smaller transistors → smaller capacitance → lower voltage
o Power ∝ Capacitance × Voltage² × Frequency
o Moore’s law → faster performance at constant power!
Unfortunately, as transistors became too small, there was a limit to how low the voltage could go
Option 1: Continue Scaling Frequency at Increased Power Budget
Option 2: Stop Frequency Scaling
Dennard Scaling ended (~2006)
Danowitz et al., “CPU DB: Recording Microprocessor History,” Communications of the ACM, 2012
Looking Back: Change of Predictions
But Moore’s Law Continues Beyond 2006
State of Things at This Point (2006)
Single-thread performance scaling ended
o Frequency scaling ended (Dennard Scaling)
o Instruction-level parallelism scaling stalled … also around 2005
Moore’s law continues
o Transistor count doubles every two years
o What do we do with them?
K. Olukotun, “Intel CPU Trends”
Instruction Level Parallelism
Crisis Averted With Manycores?
Crisis Averted With Manycores?
We’ll get back to this point later. For now, multiprocessing!
The hardware for parallelism:
Flynn taxonomy (1966) recap
Instruction Stream \ Data Stream | Single                        | Multi
Single                           | SISD (Single-Core Processors) | SIMD (GPUs, Intel SSE/AVX extensions, …)
Multi                            | MISD (Systolic Arrays, …)     | MIMD (VLIW, Parallel Computers)
Flynn taxonomy
Single-Instruction Single-Data (Single-Core Processors)
Multi-Instruction Single-Data (Systolic Arrays, …)
Single-Instruction Multi-Data (GPUs, Intel SIMD Extensions)
Multi-Instruction Multi-Data (Parallel Processors)
Today
Shared memory multiprocessor
SMP: shared memory multiprocessor
o Hardware provides a single physical address space for all processors
o Shared variables are synchronized using locks
o Memory access time
• UMA (uniform) vs. NUMA (nonuniform)
SMP also often stands for symmetric multiprocessor
o The processors in the system are identical, and are treated equally
Typical of chip-multiprocessor (“multicore”) consumer computers
Memory System Architecture
[Diagram: two packages connected via QPI/UPI, each with its own DRAM; each package contains multiple cores with private L1 I$/D$ and L2 $, sharing an L3 $]
UMA between cores sharing a package, but NUMA across cores in different packages. Overall, this is a NUMA system
Memory System Bandwidth Snapshot (2020)
[Diagram: two processors connected via QPI/UPI, each with its own DRAM]
o DDR4 2666 MHz: 128 GB/s
o Ultra Path Interconnect (unidirectional): 20.8 GB/s
o Cache bandwidth estimate: 64 bytes/cycle ≈ 200 GB/s/core
The memory/PCIe controller used to be on a separate “North bridge” chip, but is now integrated on-die
All sorts of things are now on-die! Even network controllers!
Memory system issues with multiprocessing
Suppose two CPU cores share a physical address space
o Distributed caches (typically L1)
o Write-through caches here, but the same problem occurs with write-back

Time step | Event               | CPU A’s cache | CPU B’s cache | Memory
0         |                     |               |               | 0
1         | CPU A reads X       | 0             |               | 0
2         | CPU B reads X       | 0             | 0             | 0
3         | CPU A writes 1 to X | 1             | 0             | 1

Wrong data!
Memory system issues with multiprocessing
What are the possible outcomes from the following two code fragments?
o A and B are initially zero
o 1,2,3,4 or 3,4,1,2 etc.: “01”
o 1,3,2,4 or 1,3,4,2 etc.: “11”
o Can it print “10”, or “00”? Should it be able to?

Processor 1:
1: A = 1;
2: print B

Processor 2:
3: B = 1;
4: print A
Memory problems with multiprocessing
Cache coherency (the two-CPU example)
o Informally: a read of each address must return the most recent value
o Complex and difficult with many processors
o Typically: all writes must become visible at some point, and in the proper order
Memory consistency (the two-processor example)
o How updates to different addresses become visible (to other processors)
o Many models define various types of consistency
• Sequential consistency, causal consistency, relaxed consistency, …
o In our previous example, some models may allow “10” to happen, and we must program such a machine accordingly
Keeping multiple caches coherent
A bad idea: broadcast all writes to all other cores
o Lots of unnecessary on-chip network traffic
o Each cache must do many lookups to see if incoming write notifications need to be applied
• An even worse idea: applying all write requests results in all caches having copies of all data
No point in having distributed caches!
Keeping multiple caches coherent
Basic idea
o If a cache line is only read, many caches can have a copy
o If a cache line is written to, only one cache at a time may have a copy
Typically two ways of implementing this
o “Snooping-based”: all cores listen to requests made by others on the memory bus
o “Directory-based”: all cores consult a separate entity called a “directory” for each cache access
All caches listen (“snoop”) to the traffic on the memory bus
o There is enough information just from that to ensure coherence
Snoopy cache coherence
[Diagram: multiple cores, each with a private cache, connected by a memory bus to a shared cache and DRAM]
Detour: A bus interconnect
A bus is an interconnect that connects multiple entities
o Cores, memory, (peripheral devices…)
Clock-synchronous, shared resource
o Only one entity may be transmitting at any given clock cycle
o All data transfers are broadcast, and all entities on the bus can listen to all communication
o If multiple entities want to send data (a “multi-master” configuration), a separate entity called the “bus arbiter” must assign which master can write at a given cycle
Contrast with switched interconnects, such as packet-switched networks or crossbars
o We’ll talk about these later!
MSI: A simple cache coherency protocol
Each cache line can be in one of three states
o M (“Modified”): dirty data, no other cache has a copy
• Safe to write to the cache; DRAM is out of date
o S (“Shared”): clean data, other caches may have a copy
o I (“Invalid”): data not in cache
Cache tags are augmented with state information
Snoopy implementation of MSI
Originally, each cache could send two requests on the memory bus
o Read: for both read and write requests from the core (write-back)
o Flush: for cache evictions
The coherent cache is extended with two more request types, for four in total
o Read: for read requests from the core
o ReadExclusive: for write requests from the core
• Data is read from DRAM into the cache, with the intention to update it
o Upgrade: when a previously read cache line will be written to
o Flush: write-back to DRAM, in the case of cache evictions
Snoopy implementation of MSI
Core wants to read: read from DRAM (via “Read” command), set to “S”
Core wants to write:
o If the cache has the data: send “Upgrade” command, set to “M”
o If not: read from DRAM (via “ReadExclusive” command), set to “M”
Snooping a “Read” command from another core:
o If I have the data as “S”: supply the data
o If I have the data as “M”: flush, and downgrade myself to “S”
Snooping “ReadExclusive” or “Upgrade”:
o If I have the data as “S”: invalidate
o If I have the data as “M”: flush, then invalidate
• Snooping “Upgrade” while I am “M” is impossible; I should be the only one with this data while I am “M”
Snoopy implementation of MSI
Jos van Eijndhoven et al., “Dynamic and Robust Streaming in and between Connected Consumer-Electronic Devices”
Back to coherence example
Writes invalidate other caches, ensuring correct coherence
o This time assuming write-back

Time step | Event               | CPU A’s cache | CPU B’s cache | Memory
0         |                     |               |               | 0
1         | CPU A reads X       | 0 (S)         |               | 0
2         | CPU B reads X       | 0 (S)         | 0 (S)         | 0
3         | CPU A writes 1 to X | 1 (M)         | (I)           | 0
4         | CPU B reads X       | 1 (S)         | 1 (S)         | 1
Beyond MSI
Many improvements on the MSI protocol have been proposed and implemented
o MESI (adds “Exclusive”), MOESI (adds “Owner”), MESIF (adds “Forward”)
Example: MESI
o Addresses the situation where data is read, and then updated in memory, by one core
• A very common situation! (LW, process, SW)
• MSI would send Read then Upgrade commands, but if no other processor accesses this cache line, the Upgrade is unnecessary
o When MESI snoops a Read command, each core checks its cache to see if it has that line
• If the line exists in another core’s cache, the data is copied between caches (a new command)
• If not, the line is installed in the “E” state, which can upgrade to “M” without consulting others
Cache-to-cache data copies reduce main memory accesses in practice, a performance improvement!
Limitations of snoopy caches
Inter-core cache coherence communications are all broadcast
o The memory bus becomes a single point of serialization
o Not very scalable beyond a handful of cores
Perhaps not that big an issue with desktop-class CPUs
o With ~8 cores and a typical ratio of memory instructions, the bus is not yet the primary bottleneck
o At least Intel Haswell seems to still use snooping
• Rumored to use an adaptive MESIF protocol
Directory-based cache coherence
Uses a separate entity called a “directory” which keeps track of each cache line’s state (MSI?)
o Cache accesses first consult the directory, which interacts with either DRAM or remote caches
o The directory can be implemented in a distributed way, to disperse the bottleneck
Pros of directory-based coherence
o Given a scalable interconnect, much more scalable than a bus
Cons of directory-based coherence
o Much more complex than snooping, involving circuit implementations of distributed algorithms
We won’t cover this in detail right now…
Performance issue with cache coherence:
False sharing
Different memory locations, written to by different cores, mapped to the same cache line
o Core 1 performing “results[0]++;”
o Core 2 performing “results[1]++;”
Remember cache coherence
o Every time a cache line is written to, all other copies must be invalidated!
o The “results” cache line is ping-ponged between caches on every write
o Bad when it happens on-chip, terrible over a processor interconnect (QPI/UPI)
Solution: store often-written data in local variables
Some performance numbers with false sharing
Memory problems with multiprocessing
Cache coherency (the two-CPU example)
o Informally: a read of each address must return the most recent value
o Complex and difficult with many processors
o Typically: all writes must become visible at some point, and in the proper order
Memory consistency (the two-processor example)
o How updates to different addresses become visible (to other processors)
o Many models define various types of consistency
• Sequential consistency, causal consistency, relaxed consistency, …
o In our previous example, some models may allow “10” to happen, and we must program such a machine accordingly
Again: Memory consistency issue example
What are the possible outcomes from the following two code fragments?
o A and B are initially zero
o 1,2,3,4 or 3,4,1,2 etc.: “01”
o 1,3,2,4 or 1,3,4,2 etc.: “11”
o Can it print “10”, or “00”? Should it be able to?
• Given cache coherency?

Processor 1:
1: A = 1;
2: print B

Processor 2:
3: B = 1;
4: print A
Why we need a consistency model
Given an in-order multicore processor, cache coherence should solve memory consistency automatically
o But in-order execution leaves a lot of performance on the table
o e.g., out-of-order superscalar execution brings large performance gains
A non-OoO inconsistency example: write buffers
o To hide cache miss latency for writes, write requests are often stored in a small “write buffer”, instead of waiting until cache processing is done
o Requests in the write buffer may be applied out of order, depending on cache operation
o A remote core may see memory writes applied out of order! (e.g., “00”)
The memory model is specified as part of the ISA
From a single processor’s perspective, it looks like there is no dependency!
Why we need a consistency model
Once requests can be haphazardly reordered, algorithmic correctness becomes really hard to enforce!
o e.g., how would we implement a mutex?
o If “locked=true” is buffered in a write buffer while a process is in a critical section…
Memory consistency model: “A specification of the allowed behavior of multithreaded programs executing with shared memory”
The memory model is defined and specified as part of the ISA
o Compilers and programmers write/optimize software to work with the model
o A simpler model is (typically) simpler for the compiler, but needs complex hardware
o A more complex model is (typically) harder for the compiler, but allows simple hardware
The most stringent model: “Sequential consistency” (SC)
The following is what the model requires, and the processor hardware must abide by it
Within a core, all memory I/O is applied in program order
o Regardless of OoO or write buffers, memory I/O leaving the core must be re-ordered back into program order
Across cores, all cores must see the same global order
o Typically solved already via cache coherency
Parallel program execution behaves “as you would expect”
Too stringent for efficient hardware implementation
o SC requires I/O with no dependency to be ordered as well!
Weaker memory models for performance
To satisfy SC, 2 and 4 can’t execute until 1 and 3 become visible globally
o 2 and 4 are stalled until the writes are flushed to the L3 cache!
Instead, the two writes can be buffered in write buffers, and execution continues on to the reads
o Reads check the write buffers as well as the cache, ensuring correct single-thread behavior
o Significant performance improvement*
o BUT, this makes “00” or “10” possible

[Diagram: two-package cache hierarchy as before, with write buffers added between each core and its L1 D$]

Processor 1:
1: A = 1;
2: print B

Processor 2:
3: B = 1;
4: print A

*The exact impact has been debated, especially with speculation-based improvements on SC
A weaker memory model: “Total Store Order” (TSO)
Allows later reads to finish before earlier writes
o Reads and writes among themselves must maintain program order
o But the ordering between reads and writes can be relaxed
o Basically, the model allows write buffers
Now our example can emit both “00” and “10”
It is important that weaker models do not affect single-core ordering, or accesses to the same location
o Intra-core dependencies are always maintained
o Only the order in which memory I/O becomes visible to other cores is affected

Processor 1:
1: A = 1;
2: print B

Processor 2:
3: B = 1;
4: print A
Aside: x86 seems to follow a modified version of TSO
Aside: Weaker memory models in the wild
A wide variety of weaker memory models exist
o Processor Consistency (PC)
o Release Consistency (RC)
o Eventual Consistency (EC), …
Some known memory models
o x86 seems to follow a modified form of TSO
o The ARM model is notoriously underspecified
o RISC-V allows a choice between RC and TSO
How beneficial these weaker memory models are is debated…
o TSO consistently outperforms SC, but everything beyond that is muddy
Weaker memory models and synchronization
How do we implement synchronization in the face of memory reordering?
A new instruction, “fence”, specifies where ordering must be enforced
o All memory operations before a fence must finish before the fence is executed
• And be visible to other processors/threads!
o All memory operations after a fence must wait until the fence is executed
The RISC-V base ISA includes two fence operations
o FENCE, and FENCE.I (for instruction memory)
So far…
Why parallelism?
Memory issues with parallelism
o Cache coherence
o Memory consistency
Hardware support for synchronization
In parallel software, critical sections implemented via mutexes are critical for algorithmic correctness
Can we implement a mutex with the instructions we’ve seen so far?
o e.g.,
while (lock == True);
lock = True;
// critical section
lock = False;
o Does this work with parallel threads?
By chance, both threads can think the lock is not taken
o e.g., thread 2 sees the lock as not taken, before thread 1 takes it
o Both threads think they have the lock
Hardware support for synchronization
Thread 1:
while (lock == True);
lock = True;
// critical section
lock = False;

Thread 2:
while (lock == True);
lock = True;
// critical section
lock = False;
Algorithmic solutions exist! Dekker’s algorithm, Lamport’s bakery algorithm…
Synchronization algorithm example:
Dekker’s algorithm
The solution is attributed to the Dutch mathematician Th. J. Dekker by Edsger W. Dijkstra in an unpublished paper
Does this still work under TSO?
Also, there is performance overhead…
Hardware support for synchronization
A high-performance solution is to add an “atomic instruction”
o Memory read/write in a single instruction
o No other instruction can read/write between the atomic read and write
o e.g., “if (lock == False) lock = True”
Fences are algorithmically sufficient to implement atomic operations, but this is more efficient
A single-instruction read/write is in the grey area of the RISC paradigm…
RISC-V example
Atomic instructions are provided as part of the “A” (Atomic) extension
Two types of atomic instructions
o Atomic memory operations (read, operate, write)
• Operations: swap, add, or, xor, …
o Pairs of linked read/write instructions, where the write returns failure if memory has been written to after the read
• More in the RISC spirit!
• With bad luck, this may cause livelock, where writes always fail
Aside: it is known that all synchronization primitives can be implemented with only atomic compare-and-swap (CAS)
o RISC-V doesn’t define a CAS instruction, though
Pipelined implementation of atomic operations
In a pipelined implementation, even a single-instruction read-modify-write can be interleaved with other instructions
o Multiple cycles through the pipeline
Atomic memory operations
o Modify cache coherence so that once an atomic operation starts, no other cache can access the line
o Other solutions?
Linked read/write operations
o Need only keep track of a small number of linked read addresses