CS152: Computer Systems Architecture
Multiprocessing and Parallelism
Sang-Woo Jun
Winter 2019
Large amount of material adapted from MIT 6.004 “Computation Structures”, Morgan Kaufmann “Computer Organization and Design: The Hardware/Software Interface: RISC-V Edition”, and CS 152 slides by Isaac Scherson
High performance via parallelism
Given a fixed amount of work, we want to do as much of it at once as possible, to reduce the time it takes
o Subject to available computing power
o Subject to the inherently exploitable parallelism in the work
What we looked at: Instruction-Level Parallelism
o Pipelining works on multiple instructions, each at a different stage
• Subject to pipeline depth, and to dependencies between instructions
o Superscalar works on multiple instructions, even some at the same stage
• Subject to the replicated pipeline count (ways), and to dependencies between instructions
Some more exploitable parallelism
Data parallelism
o When operations on multiple data elements do not have much dependency between each other, they can be done in parallel
o (Typically the same work on multiple data)
Task parallelism
o Work can be divided into multiple independent processes, which do not have much dependency between each other
o (Typically different work)
Savni, “Data parallelism in matrix multiplication”
Of course, the distinction is often muddled in the real world
How much performance improvement?
Amdahl’s law (Gene Amdahl, 1967)
o Theoretical speedup for a given job as system resources are improved
o Speedup = 1 / ((1 − p) + p/s)
• p is the fraction of the job that is parallelizable
• s is the improvement of system resources (e.g., processor count)
o If p is zero, there is no speedup no matter how many processors we have!
o If p is 1, performance scales perfectly with more processors
Assume no synchronization overhead, for an upper-bound estimate
How much performance improvement?
Example: Can we get 90× speedup using 100 processors?
o Remember: Speedup = 1 / ((1 − p) + p/s)
o s = 100 with 100 processors, and we want the speedup to be 90
o Solving 1 / ((1 − p) + p/100) = 90 gives p ≈ 0.999
Need the non-parallelizable part to be only 0.1% of the original time
Scaling Example
Workload: sum of 10 scalars, plus a 10 × 10 matrix sum
Single processor: Time = (10 + 100) × tadd
10 processors
o Assuming adding the 10 scalars can’t be parallelized
o Time = 10 × tadd + 100/10 × tadd = 20 × tadd
o Speedup = 110/20 = 5.5 (55% of potential peak)
100 processors
o Time = 10 × tadd + 100/100 × tadd = 11 × tadd
o Speedup = 110/11 = 10 (10% of potential peak)
Scaling Example
What if the matrix size is 100 × 100?
Single processor: Time = (10 + 10000) × tadd
10 processors
o Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
o Speedup = 10010/1010 = 9.9 (99% of potential peak)
100 processors
o Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
o Speedup = 10010/110 = 91 (91% of potential peak)
Why focus on parallelism?
Of course, large jobs require large machines with many processors
o Exploiting parallelism to make the best use of supercomputers has always been an extremely important topic
But now even desktops and phones are multicore!
o Why? The end of “Dennard Scaling”
What happened?
Dennard Scaling
Robert Dennard, 1974
“Power density stays constant as transistors get smaller”
Intuitively:
o Smaller transistors → shorter propagation delay → higher frequency
o Smaller transistors → smaller capacitance → lower voltage
o Power ∝ Capacitance × Voltage² × Frequency
o Moore’s law → faster performance at constant power!
Unfortunately, as transistors became too small, there was a limit to how low the voltage could go
Option 1: Continue Scaling Frequency at Increased Power Budget
Option 2: Stop Frequency Scaling
Dennard Scaling ended (~2006)
Danowitz et al., “CPU DB: Recording Microprocessor History,” Communications of the ACM, 2012
Looking Back: Change of Predictions
But Moore’s Law Continues Beyond 2006
State of Things at This Point (2006)
Single-thread performance scaling ended
o Frequency scaling ended (Dennard Scaling)
o Instruction-level parallelism scaling stalled … also around 2005
Moore’s law continues
o Transistor count doubles every two years
o What do we do with them?
K. Olukotun, “Intel CPU Trends”
Instruction Level Parallelism
Crisis Averted With Manycores?
Crisis Averted With Manycores?
We’ll get back to this point later. For now, multiprocessing!
The hardware for parallelism:
Flynn taxonomy (1966) recap
Instruction Stream \ Data Stream | Single                        | Multi
Single                           | SISD (Single-Core Processors) | SIMD (GPUs, Intel SSE/AVX extensions, …)
Multi                            | MISD (Systolic Arrays, …)     | MIMD (VLIW, Parallel Computers)
Flynn taxonomy
Single-Instruction Single-Data (Single-Core Processors)
Multi-Instruction Single-Data (Systolic Arrays, …)
Single-Instruction Multi-Data (GPUs, Intel SIMD Extensions)
Multi-Instruction Multi-Data (Parallel Processors)
Today
Shared memory multiprocessor
SMP: shared memory multiprocessor
o Hardware provides a single physical address space for all processors
o Shared variables are synchronized using locks
o Memory access time
• UMA (uniform) vs. NUMA (nonuniform)
SMP also often stands for symmetric multiprocessor
o The processors in the system are identical, and are treated equally
Typical of chip-multiprocessor (“multicore”) consumer computers
Memory System Architecture
[Diagram: two packages connected via QPI/UPI, each with its own DRAM; each package contains multiple cores with private L1 I$/D$ and L2 $, sharing an L3 $]
UMA between cores sharing a package, but NUMA across cores in different packages. Overall, this is a NUMA system
Memory System Bandwidth Snapshot (2020)
[Diagram: two processors connected via QPI/UPI, each with its own DRAM]
o DDR4 2666 MHz: 128 GB/s
o Ultra Path Interconnect (unidirectional): 20.8 GB/s
o Cache bandwidth estimate: 64 bytes/cycle ≈ 200 GB/s/core
The memory/PCIe controller used to be on a separate “North bridge” chip, but is now integrated on-die
All sorts of things are now on-die! Even network controllers!
Memory system issues with multiprocessing
Suppose two CPU cores share a physical address space
o Distributed caches (typically L1)
o Write-through caches here, but the same problem occurs with write-back

Time step | Event               | CPU A’s cache | CPU B’s cache | Memory
0         |                     |               |               | 0
1         | CPU A reads X       | 0             |               | 0
2         | CPU B reads X       | 0             | 0             | 0
3         | CPU A writes 1 to X | 1             | 0             | 1

Wrong data!
Memory system issues with multiprocessing
What are the possible outcomes from the following two code fragments?
o A and B are initially zero
o 1,2,3,4 or 3,4,1,2 etc.: “01”
o 1,3,2,4 or 1,3,4,2 etc.: “11”
o Can it print “10”, or “00”? Should it be able to?

Processor 1:
1: A = 1;
2: print B

Processor 2:
3: B = 1;
4: print A
Memory problems with multiprocessing
Cache coherency (the two-CPU example)
o Informally: a read of each address must return the most recent value
o Complex and difficult with many processors
o Typically: all writes must become visible at some point, and in the proper order
Memory consistency (the two-processor example)
o How updates to different addresses become visible (to other processors)
o Many models define various types of consistency
• Sequential consistency, causal consistency, relaxed consistency, …
o In our previous example, some models may allow “10” to happen, and we must program such a machine accordingly
Keeping multiple caches coherent
A bad idea: broadcast all writes to all other cores
o Lots of unnecessary on-chip network traffic
o Each cache must do many lookups to see if incoming write notifications need to be applied
• An even worse idea: applying all write requests results in all caches having copies of all data
No point in having distributed caches!
Keeping multiple caches coherent
Basic idea
o If a cache line is only read, many caches can have a copy
o If a cache line is written to, only one cache at a time may have a copy
Typically two ways of implementing this
o “Snooping-based”: all cores listen to requests made by others on the memory bus
o “Directory-based”: all cores consult a separate entity called a “directory” for each cache access
All caches listen (“snoop”) to the traffic on the memory bus
o There is enough information just from that to ensure coherence
Snoopy cache coherence
[Diagram: multiple cores, each with a private cache, connected by a memory bus to a shared cache and DRAM]
Detour: A bus interconnect
A bus is an interconnect that connects multiple entities
o Cores, memory, (peripheral devices…)
Clock-synchronous, shared resource
o Only one entity may be transmitting at any given clock cycle
o All data transfers are broadcast, and all entities on the bus can listen to all communication
o If multiple entities want to send data (a “multi-master” configuration), a separate entity called the “bus arbiter” must assign which master can write at a given cycle
Contrast with switched interconnects, such as packet-switched networks or crossbars
o We’ll talk about these later!
MSI: A simple cache coherency protocol
Each cache line can be in one of three states
o M (“Modified”): dirty data, no other cache has a copy
• Safe to write to the cache; DRAM is out of date
o S (“Shared”): clean data, other caches may have a copy
o I (“Invalid”): data not in cache
Cache tags are augmented with state information
Snoopy implementation of MSI
Originally, each cache could send two requests on the memory bus
o Read: for both read and write requests from the core (write-back)
o Flush: for cache evictions
The coherent cache is extended with two more request types, for four in total
o Read: for read requests from the core
o ReadExclusive: for write requests from the core
• Data is read from DRAM into the cache, with the intention to update it
o Upgrade: when a previously read cache line will be written to
o Flush: write-back to DRAM, in the case of cache evictions
Snoopy implementation of MSI
Core wants to read: read from DRAM (via “Read” command), set to “S”
Core wants to write:
o If the cache has the data: send “Upgrade” command, set to “M”
o If not: read from DRAM (via “ReadExclusive” command), set to “M”
Snooping a “Read” command from another core:
o If I have the data as “S”: supply the data
o If I have the data as “M”: flush, and downgrade myself to “S”
Snooping “ReadExclusive” or “Upgrade”:
o If I have the data as “S”: invalidate
o If I have the data as “M”: flush, then invalidate
• Snooping “Upgrade” while I am “M” is impossible; I should be the only one with this data while I am “M”
Snoopy implementation of MSI
Jos van Eijndhoven et al., “Dynamic and Robust Streaming in and between Connected Consumer-Electronic Devices”
Back to coherence example
Writes invalidate other caches, ensuring correct coherence
o This time assuming write-back

Time step | Event               | CPU A’s cache | CPU B’s cache | Memory
0         |                     |               |               | 0
1         | CPU A reads X       | 0 (S)         |               | 0
2         | CPU B reads X       | 0 (S)         | 0 (S)         | 0
3         | CPU A writes 1 to X | 1 (M)         | (I)           | 0
4         | CPU B reads X       | 1 (S)         | 1 (S)         | 1
Beyond MSI
Many improvements on the MSI protocol have been proposed and implemented
o MESI (adds “Exclusive”), MOESI (adds “Owner”), MESIF (adds “Forward”)
Example: MESI
o Addresses the situation where data is read, and then updated in memory, by one core
• A very common situation! (LW, process, SW)
• MSI would send Read then Upgrade commands, but if no other processor accesses this cache line, the Upgrade is unnecessary
o When MESI snoops a Read command, each core checks its cache to see if it has that line
• If the line exists in another core’s cache, the data is copied between caches (a new command)
• If not, the line is installed in the “E” state, which can upgrade to “M” without consulting others
Cache-to-cache data copies reduce main memory accesses in practice, a performance improvement!
Limitations of snoopy caches
Inter-core cache coherence communications are all broadcast
o The memory bus becomes a single point of serialization
o Not very scalable beyond a handful of cores
Perhaps not that big an issue with desktop-class CPUs
o With ~8 cores and a typical ratio of memory instructions, the bus is not yet the primary bottleneck
o At least Intel Haswell seems to still use snooping
• Rumored to use an adaptive MESIF protocol
Directory-based cache coherence
Uses a separate entity called a “directory” which keeps track of each cache line’s state (MSI?)
o Cache accesses first consult the directory, which interacts with either DRAM or remote caches
o The directory can be implemented in a distributed way, to disperse the bottleneck
Pros of directory-based coherence
o Given a scalable interconnect, much more scalable than a bus
Cons of directory-based coherence
o Much more complex than snooping, involving circuit implementations of distributed algorithms
We won’t cover this in detail right now…
Performance issue with cache coherence:
False sharing
Different memory locations, written to by different cores, mapped to the same cache line
o Core 1 performing “results[0]++;”
o Core 2 performing “results[1]++;”
Remember cache coherence
o Every time a cache line is written to, all other copies must be invalidated!
o The “results” cache line is ping-ponged between caches on every write
o Bad when it happens on-chip, terrible over a processor interconnect (QPI/UPI)
Solution: store often-written data in local variables
Some performance numbers with false sharing
Memory problems with multiprocessing
Cache coherency (the two-CPU example)
o Informally: a read of each address must return the most recent value
o Complex and difficult with many processors
o Typically: all writes must become visible at some point, and in the proper order
Memory consistency (the two-processor example)
o How updates to different addresses become visible (to other processors)
o Many models define various types of consistency
• Sequential consistency, causal consistency, relaxed consistency, …
o In our previous example, some models may allow “10” to happen, and we must program such a machine accordingly
Again: Memory consistency issue example
What are the possible outcomes from the following two code fragments?
o A and B are initially zero
o 1,2,3,4 or 3,4,1,2 etc.: “01”
o 1,3,2,4 or 1,3,4,2 etc.: “11”
o Can it print “10”, or “00”? Should it be able to?
• Given cache coherency?

Processor 1:
1: A = 1;
2: print B

Processor 2:
3: B = 1;
4: print A
Why we need a consistency model
Given an in-order multicore processor, cache coherence should solve memory consistency automatically
o But in-order execution leaves a lot of performance on the table
o e.g., out-of-order superscalar execution brings large performance gains
A non-OoO inconsistency example: write buffers
o To hide cache miss latency for writes, write requests are often stored in a small “write buffer”, instead of waiting until cache processing is done
o Requests in the write buffer may be applied out of order, depending on cache operation
o A remote core may see memory writes applied out of order! (e.g., “00”)
The memory model is specified as part of the ISA
From a single processor’s perspective, it looks like there is no dependency!
Why we need a consistency model
Once requests can be haphazardly reordered, algorithmic correctness becomes really hard to enforce!
o e.g., how would we implement a mutex?
o If “locked=true” is buffered in a write buffer while a process is in a critical section…
Memory consistency model: “A specification of the allowed behavior of multithreaded programs executing with shared memory”
The memory model is defined and specified as part of the ISA
o Compilers and programmers write/optimize software to work with the model
o A simpler model is (typically) simpler for the compiler, but needs complex hardware
o A more complex model is (typically) harder for the compiler, but allows simple hardware
The most stringent model: “Sequential consistency” (SC)
The following is what the model requires, and the processor hardware must abide by it
Within a core, all memory I/O is applied in program order
o Regardless of OoO or write buffers, memory I/O leaving the core must be re-ordered back into program order
Across cores, all cores must see the same global order
o Typically solved already via cache coherency
Parallel program execution behaves “as you would expect”
Too stringent for efficient hardware implementation
o SC requires I/O with no dependency to be ordered as well!
Weaker memory models for performance
To satisfy SC, 2 and 4 can’t execute until 1 and 3 become visible globally
o 2 and 4 are stalled until the writes are flushed to the L3 cache!
Instead, the two writes can be buffered in write buffers, and execution continues on to the reads
o Reads check the write buffers as well as the cache, ensuring correct single-thread behavior
o Significant performance improvement*
o BUT, this makes “00” or “10” possible

[Diagram: two-package cache hierarchy as before, with write buffers added between each core and its L1 D$]

Processor 1:
1: A = 1;
2: print B

Processor 2:
3: B = 1;
4: print A

*The exact impact has been debated, especially with speculation-based improvements on SC
A weaker memory model: “Total Store Order” (TSO)
Allows later reads to finish before earlier writes
o Reads and writes among themselves must maintain program order
o But the ordering between reads and writes can be relaxed
o Basically, the model allows write buffers
Now our example can emit both “00” and “10”
It is important that weaker models do not affect single-core ordering, or accesses to the same location
o Intra-core dependencies are always maintained
o Only the order in which memory I/O becomes visible to other cores is affected

Processor 1:
1: A = 1;
2: print B

Processor 2:
3: B = 1;
4: print A
Aside: x86 seems to follow a modified version of TSO
Aside: Weaker memory models in the wild
A wide variety of weaker memory models exist
o Processor Consistency (PC)
o Release Consistency (RC)
o Eventual Consistency (EC), …
Some known memory models
o x86 seems to follow a modified form of TSO
o The ARM model is notoriously underspecified
o RISC-V allows a choice between RC and TSO
How beneficial these weaker memory models are is debated…
o TSO consistently outperforms SC, but everything beyond that is muddy
Weaker memory models and synchronization
How do we implement synchronization in the face of memory reordering?
A new instruction, “fence”, specifies where ordering must be enforced
o All memory operations before a fence must finish before the fence is executed
• And be visible to other processors/threads!
o All memory operations after a fence must wait until the fence is executed
The RISC-V base ISA includes two fence operations
o FENCE, and FENCE.I (for instruction memory)
So far…
Why parallelism?
Memory issues with parallelism
o Cache coherence
o Memory consistency
Hardware support for synchronization
In parallel software, critical sections implemented via mutexes are critical for algorithmic correctness
Can we implement a mutex with the instructions we’ve seen so far?
o e.g.,
while (lock == True);
lock = True;
// critical section
lock = False;
o Does this work with parallel threads?
By chance, both threads can think the lock is not taken
o e.g., thread 2 sees the lock as not taken, before thread 1 takes it
o Both threads think they have the lock
Hardware support for synchronization
Thread 1:
while (lock == True);
lock = True;
// critical section
lock = False;

Thread 2:
while (lock == True);
lock = True;
// critical section
lock = False;
Algorithmic solutions exist! Dekker’s algorithm, Lamport’s bakery algorithm…
Synchronization algorithm example:
Dekker’s algorithm
The solution is attributed to the Dutch mathematician Th. J. Dekker by Edsger W. Dijkstra in an unpublished paper
Does this still work under TSO?
Also, there is performance overhead…
Hardware support for synchronization
A high-performance solution is to add an “atomic instruction”
o Memory read/write in a single instruction
o No other instruction can read/write between the atomic read and write
o e.g., “if (lock == False) lock = True”
Fences are algorithmically sufficient to implement atomic operations, but this is more efficient
A single-instruction read/write is in the grey area of the RISC paradigm…
RISC-V example
Atomic instructions are provided as part of the “A” (Atomic) extension
Two types of atomic instructions
o Atomic memory operations (read, operate, write)
• Operations: swap, add, or, xor, …
o Pairs of linked read/write instructions, where the write returns failure if memory has been written to after the read
• More in the RISC spirit!
• With bad luck, this may cause livelock, where writes always fail
Aside: it is known that all synchronization primitives can be implemented with only atomic compare-and-swap (CAS)
o RISC-V doesn’t define a CAS instruction, though
Pipelined implementation of atomic operations
In a pipelined implementation, even a single-instruction read-modify-write can be interleaved with other instructions
o Multiple cycles through the pipeline
Atomic memory operations
o Modify cache coherence so that once an atomic operation starts, no other cache can access the line
o Other solutions?
Linked read/write operations
o Need only keep track of a small number of linked read addresses