Lecture 27: Multiprocessor Scheduling
• Last lecture: VMM
  • Two old problems: CPU virtualization and memory virtualization
  • I/O virtualization
• Today
  • Issues related to multi-core: scheduling and scalability
The cache coherence problem
• Since we have multiple private caches: how do we keep the data consistent across caches?
• Each core should perceive memory as a monolithic array, shared by all the cores.
The cache coherence problem
[Figure: a multi-core chip with four cores, each with one or more levels of private cache, above a shared main memory. Core 1 and Core 2 have both read x, so each caches x=15213; main memory holds x=15213.]
The cache coherence problem
[Figure: Core 1 writes x=21660 into its cache. Assuming write-back caches, main memory still holds x=15213, and Core 2's cached copy is the stale value x=15213.]
The cache coherence problem
[Figure: initial state again — Core 1 and Core 2 each cache x=15213; main memory holds x=15213.]
The cache coherence problem
[Figure: Core 1 writes x=21660. Assuming write-through caches, main memory is updated to x=21660, but Core 2's cache still holds the stale value x=15213.]
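The two figures above differ only in when memory is updated. As a toy illustration (names and structure are my own, not from the slides), the contrast can be sketched like this: write-through stores to memory on every write, while write-back defers the memory update until the dirty line is evicted.

```python
# Illustrative toy model: write-through vs. write-back for a single cache.
# In both cases the cache holds the newest value; they differ in when main
# memory sees it.

class WriteThroughCache:
    def __init__(self, memory):
        self.memory, self.line = memory, {}

    def write(self, addr, value):
        self.line[addr] = value
        self.memory[addr] = value          # memory is always up to date

class WriteBackCache:
    def __init__(self, memory):
        self.memory, self.line, self.dirty = memory, {}, set()

    def write(self, addr, value):
        self.line[addr] = value
        self.dirty.add(addr)               # memory is now stale

    def evict(self, addr):
        if addr in self.dirty:             # flush dirty data on eviction
            self.memory[addr] = self.line[addr]
            self.dirty.discard(addr)
        self.line.pop(addr, None)

mem_wt, mem_wb = {"x": 15213}, {"x": 15213}
WriteThroughCache(mem_wt).write("x", 21660)
wb = WriteBackCache(mem_wb)
wb.write("x", 21660)
print(mem_wt["x"], mem_wb["x"])   # 21660 15213: write-back memory is stale
wb.evict("x")
print(mem_wb["x"])                # 21660 once the dirty line is evicted
```

Either way, the other core's private cache can still hold a stale copy — which is exactly the coherence problem the next slides address.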
Solutions for cache coherence
• There exist many solution algorithms, coherence protocols, etc.
• A simple solution: invalidation protocol with bus snooping
[Figure: four cores, each with one or more levels of cache, connected to main memory by a shared inter-core bus.]
Invalidation protocol with snooping
• Invalidation: if a core writes to a data item, all other copies of this data item in other caches are invalidated.
• Snooping: all cores continuously “snoop” (monitor) the bus connecting the cores.
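The protocol can be captured in a toy simulation (a sketch with invented class names, not real hardware behavior): every cache registers with a shared bus, and a write broadcasts an invalidation that all other caches act on.

```python
# Toy model of an invalidation protocol with bus snooping (illustrative only).
# Each "cache" is a dict of private copies; a write goes on the shared bus,
# and every other cache snoops it and drops its own copy of that address.

class Bus:
    def __init__(self):
        self.caches = []                    # all snooping caches

    def broadcast_invalidate(self, addr, writer):
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(addr, None)  # invalidate stale copy, if any

class Cache:
    def __init__(self, bus, memory):
        self.lines = {}                      # addr -> private cached value
        self.bus, self.memory = bus, memory
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:           # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):            # write-through for simplicity
        self.bus.broadcast_invalidate(addr, self)
        self.lines[addr] = value
        self.memory[addr] = value

memory = {"x": 15213}
bus = Bus()
c1, c2 = Cache(bus, memory), Cache(bus, memory)

c1.read("x"); c2.read("x")   # both cores now cache x=15213
c1.write("x", 21660)         # c2's copy is invalidated via the bus
print(c2.read("x"))          # c2 misses and fetches the new value: 21660
```

This mirrors the slide sequence that follows: write, invalidate, then a re-read that misses and fetches the fresh value.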
The cache coherence problem
[Figure: initial state — Core 1 and Core 2 each cache x=15213; main memory holds x=15213.]
The cache coherence problem
[Figure: assuming write-through caches, Core 1 writes x=21660 and sends an invalidation request on the bus; main memory becomes x=21660 and Core 2's cached copy of x is INVALIDATED.]
The cache coherence problem
[Figure: Core 2's next read of x misses and fetches the new value; Core 1 and Core 2 now both cache x=21660, matching main memory.]
Alternative to invalidate protocol: update protocol
[Figure: assuming write-through caches, Core 1 writes x=21660 and broadcasts the updated value on the bus; main memory becomes x=21660 while Core 2 still holds x=15213.]
Alternative to invalidate protocol: update protocol
[Figure: Core 2 receives the broadcast and updates its copy in place; both caches and main memory now hold x=21660.]
Invalidation vs. update
• Multiple writes to the same location:
  • invalidation: bus traffic only on the first write
  • update: must broadcast each write (which includes the new value)
• Invalidation generally performs better: it generates less bus traffic.
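The traffic difference is easy to make concrete. Under the simplifying assumption that each invalidation or each value broadcast costs one bus message, repeated writes to the same line cost O(1) messages with invalidation but O(n) with update:

```python
# Back-of-the-envelope bus-traffic comparison (illustrative assumption:
# one bus message per invalidation or per broadcast update, no other sharers
# re-reading the line in between).

def invalidation_messages(writes_to_same_line):
    # Only the first write must invalidate other copies; later writes hit a
    # line the writer already owns exclusively.
    return 1 if writes_to_same_line > 0 else 0

def update_messages(writes_to_same_line):
    # Every write broadcasts the new value to the other caches.
    return writes_to_same_line

for n in (1, 10, 100):
    print(n, invalidation_messages(n), update_messages(n))
```

If other cores re-read the line between writes, invalidation pays again, so the real gap depends on the sharing pattern — but for write-heavy access the conclusion on the slide holds.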
Programmers still need to worry about concurrency
• Mutexes
• Condition variables
• Lock-free data structures
Single-Queue Multiprocessor Scheduling (SQMS)
• Reuse the basic framework of single-processor scheduling
• Put all jobs that need to be scheduled into a single queue
• Pick the best two jobs to run, if there are two CPUs
• Advantage: simple
• Disadvantage: does not scale
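A minimal sketch of the idea (function and variable names are illustrative): every CPU takes the next job from one shared queue each scheduling round, and a finished time slice sends the job to the back of that same queue.

```python
# Illustrative single-queue multiprocessor scheduling (SQMS) sketch.
# All CPUs contend on ONE shared queue; jobs rotate round-robin through it.
from collections import deque

def sqms_schedule(jobs, num_cpus, rounds):
    """Return which jobs each CPU ran, round by round."""
    queue = deque(jobs)
    runs = {cpu: [] for cpu in range(num_cpus)}
    for _ in range(rounds):
        for cpu in range(num_cpus):
            if not queue:
                return runs
            job = queue.popleft()   # every CPU pops from the same queue
            runs[cpu].append(job)
            queue.append(job)       # back of the line after its time slice
    return runs

print(sqms_schedule(["A", "B", "C", "D", "E"], num_cpus=2, rounds=2))
```

Even this toy version shows both slide points: the logic is trivially simple, but the shared queue is a single point of contention (a real implementation would need a lock around it), and jobs drift across CPUs from round to round — which is exactly the cache-affinity problem discussed next.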
SQMS and Cache Affinity
[Figure: with one shared queue, jobs are handed out round-robin, so each job keeps moving from CPU to CPU instead of staying where its cached state is warm.]
Cache affinity
• Thread migration is costly:
  • the execution pipeline must restart
  • cached data is invalidated
• The OS scheduler tries to avoid migration as much as possible: it tends to keep a thread on the same core.
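One way a scheduler can honor affinity (a simplified sketch, not the actual Linux policy; all names are invented) is to prefer, for each CPU, a ready job that last ran on that CPU, and migrate a job only when no warm candidate exists:

```python
# Illustrative affinity-aware job selection: prefer a job whose cached state
# is still warm on this CPU, i.e. a job that last ran here.

def pick_job(ready_jobs, cpu, last_cpu):
    """ready_jobs: ordered list of runnable jobs.
    last_cpu: dict mapping job -> CPU it last ran on."""
    for job in ready_jobs:
        if last_cpu.get(job) == cpu:   # warm cache: no migration needed
            return job
    # No affine job available: migrate the highest-priority one anyway.
    return ready_jobs[0] if ready_jobs else None

last_cpu = {"A": 1, "B": 0, "C": 1}
print(pick_job(["A", "B", "C"], cpu=1, last_cpu=last_cpu))  # "A" is warm on 1
print(pick_job(["B"], cpu=1, last_cpu=last_cpu))            # forced migration
```

On Linux, applications can also pin threads explicitly (e.g. via `sched_setaffinity`), but the default behavior is this kind of best-effort affinity.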
SQMS and Cache Affinity
[Figure: an affinity-preserving variant of the schedule, keeping most jobs on the same CPU across rounds.]
Multi-Queue Multiprocessor Scheduling (MQMS)
• One scheduling queue per CPU
• Scalable: no contention on a single shared queue
• Preserves cache affinity: jobs tend to stay on their queue's CPU
Load imbalance
• With per-CPU queues, one queue can run dry while another stays full
• Fix: migration of jobs across queues
Work stealing
• A (source) queue that is low on jobs will occasionally peek at another (target) queue.
• If the target queue is (notably) more full than the source queue, the source will “steal” one or more jobs from the target to help balance load.
• The source cannot look around at other queues too often, or the checking itself becomes overhead.
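The stealing step can be sketched as follows (a toy version with invented names; real work stealers typically use lock-free deques and steal from the tail):

```python
# Illustrative work stealing: a low source queue peeks at ONE randomly chosen
# target and, if the target is notably fuller, steals half the surplus.
import random

def maybe_steal(source, queues, threshold=2, rng=random):
    """source: index of the low queue; queues: list of job lists."""
    target = rng.choice([i for i in range(len(queues)) if i != source])
    if len(queues[target]) >= len(queues[source]) + threshold:
        n_steal = (len(queues[target]) - len(queues[source])) // 2
        for _ in range(n_steal):
            queues[source].append(queues[target].pop())  # take from the tail
    return queues

queues = [[], ["j1", "j2", "j3", "j4"]]
maybe_steal(source=0, queues=queues)
print([len(q) for q in queues])    # load is now balanced: [2, 2]
```

Checking only one random target per attempt, and only occasionally, is what keeps the balancing cost from eating the scalability the per-CPU queues bought.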
Linux multiprocessor schedulers
• Both approaches can be successful:
  • O(1) scheduler (per-CPU queues)
  • Completely Fair Scheduler (CFS) (per-CPU queues)
  • BF Scheduler (BFS): uses a single queue
An Analysis of Linux Scalability to Many Cores
• This paper asks whether traditional kernel designs can be used and implemented in a way that allows applications to scale.
Amdahl's Law
• N: the number of threads of execution
• B: the fraction of the algorithm that is strictly serial
• The theoretical speedup: S(N) = 1 / (B + (1 − B)/N)
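In code, using the slide's variables (B = serial fraction, N = thread count):

```python
# Amdahl's Law: speedup is capped by the serial fraction B, no matter how
# many threads N are thrown at the parallel part.

def amdahl_speedup(B, N):
    return 1.0 / (B + (1.0 - B) / N)

print(amdahl_speedup(0.0, 8))      # perfectly parallel: 8.0
print(amdahl_speedup(0.1, 8))      # 10% serial already drags 8 threads down
print(amdahl_speedup(0.1, 10**6))  # approaches the ceiling 1/B = 10 as N grows
```

The last line is the punchline for scalability work: with even 10% of the code serialized (say, under a global lock), a million cores cannot deliver more than a 10x speedup — hence the focus below on shrinking serial sections.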
Scalability issues
• Global lock used for a shared data structure
  • longer lock wait times
• Shared memory locations
  • overhead caused by the cache-coherence algorithms
• Tasks compete for the limited-size shared hardware cache
  • increased cache miss rates
• Tasks compete for shared hardware resources (interconnects, DRAM interfaces)
  • more time wasted waiting
• Too few available tasks
  • less efficiency
How to avoid/fix these issues
• These issues can often be avoided (or limited) using popular parallel programming techniques:
  • Lock-free algorithms
  • Per-core data structures
  • Fine-grained locking
  • Cache alignment
  • Sloppy counters
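Sloppy counters combine two of these ideas (per-core data and reduced sharing). A sketch of the concept, with invented names and a single-threaded stand-in for per-core state: each core increments a private counter and only folds it into the shared global counter every `threshold` increments, trading exactness of reads for far less cross-core cache traffic.

```python
# Illustrative sloppy counter: per-core local counts, flushed to the shared
# global count only once per `threshold` increments.

class SloppyCounter:
    def __init__(self, num_cores, threshold=1024):
        self.global_count = 0              # the only shared location
        self.local = [0] * num_cores       # one private counter per core
        self.threshold = threshold

    def increment(self, core):
        self.local[core] += 1
        if self.local[core] >= self.threshold:  # touch shared state rarely
            self.global_count += self.local[core]
            self.local[core] = 0

    def read(self):
        # Fast but "sloppy": recent local increments may not be flushed yet.
        return self.global_count

    def read_exact(self):
        # Slow path: gathers every core's local count.
        return self.global_count + sum(self.local)

c = SloppyCounter(num_cores=4, threshold=100)
for core in range(4):
    for _ in range(250):
        c.increment(core)
print(c.read(), c.read_exact())   # 800 1000: 200 increments still local
```

A real kernel version would put each local counter on its own cache line (the cache-alignment item above) with a per-counter lock; the threshold tunes the accuracy-vs-traffic trade-off.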
Current bottlenecks
• https://www.usenix.org/conference/osdi10/analysis-linux-scalability-many-cores