Lecture 27: Multiprocessor Scheduling
• Last lecture: VMM
  • Two old problems: CPU virtualization and memory virtualization
  • I/O virtualization
• Today
  • Issues related to multi-core: scheduling and scalability
The cache coherence problem
• Since we have multiple private caches: how do we keep the data consistent across caches?
• Each core should perceive memory as a monolithic array, shared by all the cores.
The cache coherence problem
[Figure: a multi-core chip with four cores, each with one or more levels of private cache, above a shared main memory. Core 1 and Core 2 have both read x, so each caches x=15213; main memory holds x=15213.]
The cache coherence problem
[Figure: Core 1 writes x=21660 into its cache. Assuming write-back caches, main memory still holds x=15213, and Core 2's cached copy is the stale value x=15213.]
The cache coherence problem
[Figure: initial state again — Core 1 and Core 2 each cache x=15213; main memory holds x=15213.]
The cache coherence problem
[Figure: Core 1 writes x=21660. Assuming write-through caches, main memory is updated to x=21660, but Core 2's cache still holds the stale value x=15213.]
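The two figures above differ only in when memory is updated. As a toy illustration (names and structure are my own, not from the slides), the contrast can be sketched like this: write-through stores to memory on every write, while write-back defers the memory update until the dirty line is evicted.

```python
# Illustrative toy model: write-through vs. write-back for a single cache.
# In both cases the cache holds the newest value; they differ in when main
# memory sees it.

class WriteThroughCache:
    def __init__(self, memory):
        self.memory, self.line = memory, {}

    def write(self, addr, value):
        self.line[addr] = value
        self.memory[addr] = value          # memory is always up to date

class WriteBackCache:
    def __init__(self, memory):
        self.memory, self.line, self.dirty = memory, {}, set()

    def write(self, addr, value):
        self.line[addr] = value
        self.dirty.add(addr)               # memory is now stale

    def evict(self, addr):
        if addr in self.dirty:             # flush dirty data on eviction
            self.memory[addr] = self.line[addr]
            self.dirty.discard(addr)
        self.line.pop(addr, None)

mem_wt, mem_wb = {"x": 15213}, {"x": 15213}
WriteThroughCache(mem_wt).write("x", 21660)
wb = WriteBackCache(mem_wb)
wb.write("x", 21660)
print(mem_wt["x"], mem_wb["x"])   # 21660 15213: write-back memory is stale
wb.evict("x")
print(mem_wb["x"])                # 21660 once the dirty line is evicted
```

Either way, the other core's private cache can still hold a stale copy — which is exactly the coherence problem the next slides address.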
Solutions for cache coherence
• There exist many solution algorithms, coherence protocols, etc.
• A simple solution: invalidation protocol with bus snooping
[Figure: four cores, each with one or more levels of cache, connected to main memory by a shared inter-core bus.]
Invalidation protocol with snooping
• Invalidation: if a core writes to a data item, all other copies of this data item in other caches are invalidated.
• Snooping: all cores continuously “snoop” (monitor) the bus connecting the cores.
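The protocol can be captured in a toy simulation (a sketch with invented class names, not real hardware behavior): every cache registers with a shared bus, and a write broadcasts an invalidation that all other caches act on.

```python
# Toy model of an invalidation protocol with bus snooping (illustrative only).
# Each "cache" is a dict of private copies; a write goes on the shared bus,
# and every other cache snoops it and drops its own copy of that address.

class Bus:
    def __init__(self):
        self.caches = []                    # all snooping caches

    def broadcast_invalidate(self, addr, writer):
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(addr, None)  # invalidate stale copy, if any

class Cache:
    def __init__(self, bus, memory):
        self.lines = {}                      # addr -> private cached value
        self.bus, self.memory = bus, memory
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:           # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):            # write-through for simplicity
        self.bus.broadcast_invalidate(addr, self)
        self.lines[addr] = value
        self.memory[addr] = value

memory = {"x": 15213}
bus = Bus()
c1, c2 = Cache(bus, memory), Cache(bus, memory)

c1.read("x"); c2.read("x")   # both cores now cache x=15213
c1.write("x", 21660)         # c2's copy is invalidated via the bus
print(c2.read("x"))          # c2 misses and fetches the new value: 21660
```

This mirrors the slide sequence that follows: write, invalidate, then a re-read that misses and fetches the fresh value.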
The cache coherence problem
[Figure: initial state — Core 1 and Core 2 each cache x=15213; main memory holds x=15213.]
The cache coherence problem
[Figure: assuming write-through caches, Core 1 writes x=21660 and sends an invalidation request on the bus; main memory becomes x=21660 and Core 2's cached copy of x is INVALIDATED.]
The cache coherence problem
[Figure: Core 2's next read of x misses and fetches the new value; Core 1 and Core 2 now both cache x=21660, matching main memory.]
Alternative to invalidate protocol: update protocol
[Figure: assuming write-through caches, Core 1 writes x=21660 and broadcasts the updated value on the bus; main memory becomes x=21660 while Core 2 still holds x=15213.]
Alternative to invalidate protocol: update protocol
[Figure: Core 2 receives the broadcast and updates its copy in place; both caches and main memory now hold x=21660.]
Invalidation vs. update
• Multiple writes to the same location:
  • invalidation: bus traffic only on the first write
  • update: must broadcast each write (which includes the new value)
• Invalidation generally performs better: it generates less bus traffic.
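The traffic difference is easy to make concrete. Under the simplifying assumption that each invalidation or each value broadcast costs one bus message, repeated writes to the same line cost O(1) messages with invalidation but O(n) with update:

```python
# Back-of-the-envelope bus-traffic comparison (illustrative assumption:
# one bus message per invalidation or per broadcast update, no other sharers
# re-reading the line in between).

def invalidation_messages(writes_to_same_line):
    # Only the first write must invalidate other copies; later writes hit a
    # line the writer already owns exclusively.
    return 1 if writes_to_same_line > 0 else 0

def update_messages(writes_to_same_line):
    # Every write broadcasts the new value to the other caches.
    return writes_to_same_line

for n in (1, 10, 100):
    print(n, invalidation_messages(n), update_messages(n))
```

If other cores re-read the line between writes, invalidation pays again, so the real gap depends on the sharing pattern — but for write-heavy access the conclusion on the slide holds.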
Programmers still need to worry about concurrency
• Mutexes
• Condition variables
• Lock-free data structures
Single-Queue Multiprocessor Scheduling (SQMS)
• Reuse the basic framework of single-processor scheduling
• Put all jobs that need to be scheduled into a single queue
• Pick the best two jobs to run, if there are two CPUs
• Advantage: simple
• Disadvantage: does not scale
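A minimal sketch of the idea (function and variable names are illustrative): every CPU takes the next job from one shared queue each scheduling round, and a finished time slice sends the job to the back of that same queue.

```python
# Illustrative single-queue multiprocessor scheduling (SQMS) sketch.
# All CPUs contend on ONE shared queue; jobs rotate round-robin through it.
from collections import deque

def sqms_schedule(jobs, num_cpus, rounds):
    """Return which jobs each CPU ran, round by round."""
    queue = deque(jobs)
    runs = {cpu: [] for cpu in range(num_cpus)}
    for _ in range(rounds):
        for cpu in range(num_cpus):
            if not queue:
                return runs
            job = queue.popleft()   # every CPU pops from the same queue
            runs[cpu].append(job)
            queue.append(job)       # back of the line after its time slice
    return runs

print(sqms_schedule(["A", "B", "C", "D", "E"], num_cpus=2, rounds=2))
```

Even this toy version shows both slide points: the logic is trivially simple, but the shared queue is a single point of contention (a real implementation would need a lock around it), and jobs drift across CPUs from round to round — which is exactly the cache-affinity problem discussed next.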
SQMS and Cache Affinity
[Figure: with one shared queue, jobs are handed out round-robin, so each job keeps moving from CPU to CPU instead of staying where its cached state is warm.]
Cache affinity
• Thread migration is costly:
  • the execution pipeline must restart
  • cached data is invalidated
• The OS scheduler tries to avoid migration as much as possible: it tends to keep a thread on the same core.
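One way a scheduler can honor affinity (a simplified sketch, not the actual Linux policy; all names are invented) is to prefer, for each CPU, a ready job that last ran on that CPU, and migrate a job only when no warm candidate exists:

```python
# Illustrative affinity-aware job selection: prefer a job whose cached state
# is still warm on this CPU, i.e. a job that last ran here.

def pick_job(ready_jobs, cpu, last_cpu):
    """ready_jobs: ordered list of runnable jobs.
    last_cpu: dict mapping job -> CPU it last ran on."""
    for job in ready_jobs:
        if last_cpu.get(job) == cpu:   # warm cache: no migration needed
            return job
    # No affine job available: migrate the highest-priority one anyway.
    return ready_jobs[0] if ready_jobs else None

last_cpu = {"A": 1, "B": 0, "C": 1}
print(pick_job(["A", "B", "C"], cpu=1, last_cpu=last_cpu))  # "A" is warm on 1
print(pick_job(["B"], cpu=1, last_cpu=last_cpu))            # forced migration
```

On Linux, applications can also pin threads explicitly (e.g. via `sched_setaffinity`), but the default behavior is this kind of best-effort affinity.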
SQMS and Cache Affinity
[Figure: an affinity-preserving variant of the schedule, keeping most jobs on the same CPU across rounds.]
Multi-Queue Multiprocessor Scheduling (MQMS)
• One scheduling queue per CPU
• Scalable: no contention on a single shared queue
• Preserves cache affinity: jobs tend to stay on their queue's CPU
Load imbalance
• With per-CPU queues, one queue can run dry while another stays full
• Fix: migration of jobs across queues
Work stealing
• A (source) queue that is low on jobs will occasionally peek at another (target) queue.
• If the target queue is (notably) more full than the source queue, the source will “steal” one or more jobs from the target to help balance load.
• The source cannot look around at other queues too often, or the checking itself becomes overhead.
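The stealing step can be sketched as follows (a toy version with invented names; real work stealers typically use lock-free deques and steal from the tail):

```python
# Illustrative work stealing: a low source queue peeks at ONE randomly chosen
# target and, if the target is notably fuller, steals half the surplus.
import random

def maybe_steal(source, queues, threshold=2, rng=random):
    """source: index of the low queue; queues: list of job lists."""
    target = rng.choice([i for i in range(len(queues)) if i != source])
    if len(queues[target]) >= len(queues[source]) + threshold:
        n_steal = (len(queues[target]) - len(queues[source])) // 2
        for _ in range(n_steal):
            queues[source].append(queues[target].pop())  # take from the tail
    return queues

queues = [[], ["j1", "j2", "j3", "j4"]]
maybe_steal(source=0, queues=queues)
print([len(q) for q in queues])    # load is now balanced: [2, 2]
```

Checking only one random target per attempt, and only occasionally, is what keeps the balancing cost from eating the scalability the per-CPU queues bought.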
Linux multiprocessor schedulers
• Both approaches can be successful:
  • O(1) scheduler (per-CPU queues)
  • Completely Fair Scheduler (CFS) (per-CPU queues)
  • BF Scheduler (BFS): uses a single queue
An Analysis of Linux Scalability to Many Cores
• This paper asks whether traditional kernel designs can be used and implemented in a way that allows applications to scale.
Amdahl's Law
• N: the number of threads of execution
• B: the fraction of the algorithm that is strictly serial
• The theoretical speedup: S(N) = 1 / (B + (1 − B)/N)
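In code, using the slide's variables (B = serial fraction, N = thread count):

```python
# Amdahl's Law: speedup is capped by the serial fraction B, no matter how
# many threads N are thrown at the parallel part.

def amdahl_speedup(B, N):
    return 1.0 / (B + (1.0 - B) / N)

print(amdahl_speedup(0.0, 8))      # perfectly parallel: 8.0
print(amdahl_speedup(0.1, 8))      # 10% serial already drags 8 threads down
print(amdahl_speedup(0.1, 10**6))  # approaches the ceiling 1/B = 10 as N grows
```

The last line is the punchline for scalability work: with even 10% of the code serialized (say, under a global lock), a million cores cannot deliver more than a 10x speedup — hence the focus below on shrinking serial sections.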
Scalability issues
• Global lock used for a shared data structure
  • longer lock wait times
• Shared memory locations
  • overhead caused by the cache-coherence algorithms
• Tasks compete for the limited-size shared hardware cache
  • increased cache miss rates
• Tasks compete for shared hardware resources (interconnects, DRAM interfaces)
  • more time wasted waiting
• Too few available tasks
  • less efficiency
How to avoid/fix these issues
• These issues can often be avoided (or limited) using popular parallel programming techniques:
  • Lock-free algorithms
  • Per-core data structures
  • Fine-grained locking
  • Cache alignment
  • Sloppy counters
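Sloppy counters combine two of these ideas (per-core data and reduced sharing). A sketch of the concept, with invented names and a single-threaded stand-in for per-core state: each core increments a private counter and only folds it into the shared global counter every `threshold` increments, trading exactness of reads for far less cross-core cache traffic.

```python
# Illustrative sloppy counter: per-core local counts, flushed to the shared
# global count only once per `threshold` increments.

class SloppyCounter:
    def __init__(self, num_cores, threshold=1024):
        self.global_count = 0              # the only shared location
        self.local = [0] * num_cores       # one private counter per core
        self.threshold = threshold

    def increment(self, core):
        self.local[core] += 1
        if self.local[core] >= self.threshold:  # touch shared state rarely
            self.global_count += self.local[core]
            self.local[core] = 0

    def read(self):
        # Fast but "sloppy": recent local increments may not be flushed yet.
        return self.global_count

    def read_exact(self):
        # Slow path: gathers every core's local count.
        return self.global_count + sum(self.local)

c = SloppyCounter(num_cores=4, threshold=100)
for core in range(4):
    for _ in range(250):
        c.increment(core)
print(c.read(), c.read_exact())   # 800 1000: 200 increments still local
```

A real kernel version would put each local counter on its own cache line (the cache-alignment item above) with a per-counter lock; the threshold tunes the accuracy-vs-traffic trade-off.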
Current bottlenecks
• https://www.usenix.org/conference/osdi10/analysis-linux-scalability-many-cores