
Page 1: Multi-processor Scheduling

Multi-processor Scheduling

Two implementation choices:

  • Single, global ready queue
  • Per-processor run queue

Which is better?

Page 2: Multi-processor Scheduling

Queue-per-processor

Advantages of queue per processor:

  • Promotes processor affinity (better cache locality)
  • Removes a centralized bottleneck (which runs in global memory)

Supported by default in Linux 2.6

Java 1.6 support: a double-ended queue (java.util.Deque)

  • Use a bounded buffer per consumer
  • If nothing is in a consumer’s queue, steal work from somebody else
  • If too much is in the queue, push work somewhere else
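The stealing policy above can be sketched in plain Java. This is a single-threaded illustration of the policy only, with invented class and method names; a real scheduler would need a concurrent deque and proper synchronization.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Single-threaded sketch of the work-stealing policy described above.
// Worker B drains its own deque from the head; when it is empty, it
// steals from the tail of worker A's deque.
public class WorkStealingDemo {
    static Deque<Runnable> queueA = new ArrayDeque<>();  // worker A's queue
    static Deque<Runnable> queueB = new ArrayDeque<>();  // worker B's queue

    // Owner takes work from the head of its own queue.
    static Runnable take(Deque<Runnable> own) {
        return own.pollFirst();
    }

    // An idle worker steals from the tail of somebody else's queue,
    // staying out of the owner's way.
    static Runnable steal(Deque<Runnable> victim) {
        return victim.pollLast();
    }

    // Worker B runs until no work is left anywhere; returns tasks run.
    static int drain() {
        int executed = 0;
        while (true) {
            Runnable task = take(queueB);
            if (task == null) task = steal(queueA);  // nothing local: steal
            if (task == null) break;                 // no work anywhere
            task.run();
            executed++;
        }
        return executed;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) queueA.addLast(() -> { });
        System.out.println("worker B ran " + drain() + " tasks");
    }
}
```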

Page 3: Multi-processor Scheduling

Thread Implementation Issues

Andrew Whitaker

Page 4: Multi-processor Scheduling

Where do Threads Come From?

A few choices:

  • The operating system
  • A user-mode library
  • Some combination of the two…

Page 5: Multi-processor Scheduling

Option #1: Kernel Threads

Threads implemented inside the OS

  • Thread operations (creation, deletion, yield) are system calls
  • Scheduling handled by the OS scheduler

Described as “one-to-one”

  • One user thread mapped to one kernel thread
  • Every invocation of Thread.start() creates a kernel thread

[Figure: a process whose user threads map one-to-one onto OS threads]

Page 6: Multi-processor Scheduling

Option #2: User threads

Implemented as a library inside a process

  • All operations (creation, destruction, yield) are normal procedure calls

Described as “many-to-one”

  • Many user-perceived threads map to a single OS process/thread

[Figure: a process with many user threads multiplexed onto a single OS thread]

Page 7: Multi-processor Scheduling

Process Address Space Review

Every process has a user stack and a program counter

In addition, each process has a kernel stack and program counter (not shown here)

[Figure: process address space, from bottom to top: code (text segment), static data (data segment), heap (dynamically allocated memory), and stack; SP points into the stack and PC into the code]

Page 8: Multi-processor Scheduling

Threaded Address Space

Every thread always has its own user stack and program counter

  • For both user and kernel threads

For user threads, there is only a single kernel stack, program counter, PCB, etc.

[Figure: one user address space (for both user and kernel threads) containing code (text segment), static data (data segment), heap, and three per-thread stacks, with separate SP and PC values for threads T1, T2, and T3]

Page 9: Multi-processor Scheduling

User Threads vs. Kernel Threads

User threads are faster

  • Operations do not pass through the OS

But, user threads suffer from:

  • Lack of physical parallelism: they only run on a single processor!
  • Poor performance with I/O: a single blocking operation stalls the entire application

For these reasons, most (all?) major OSes provide some form of kernel threads

Page 10: Multi-processor Scheduling

When Would User Threads Be Useful?

  • The calculator?
  • The web server?
  • The Fibonacci GUI?

Page 11: Multi-processor Scheduling

Option #3: Two-level Model

OS supports native multi-threading

And, a user library maps multiple user threads to a single kernel thread

Described as “many-to-many”

  • Potentially captures the best of both worlds: cheap thread operations and parallelism

[Figure: a process with many user threads multiplexed onto several OS threads]

Page 12: Multi-processor Scheduling

Problems with Many-to-Many Threads

Lack of coordination between user and kernel schedulers: “the left hand not talking to the right”

Specific problems:

  • Poor performance, e.g., the OS preempts a thread holding a crucial lock
  • Deadlock: given K kernel threads, at most K user threads can block
    • Other runnable threads are starved out!

Page 13: Multi-processor Scheduling

Scheduler Activations, UW 1991

Add a layer of communication between kernel and user schedulers

Examples:

  • Kernel tells user mode that a task has blocked; the user scheduler can re-use this execution context
  • Kernel tells user mode that a task is ready to resume

Allows the user scheduler to alter the user-thread/kernel-thread mapping

Supported by the newest release of NetBSD

Page 14: Multi-processor Scheduling

Implementation Spilling Over into the Interface

In practice, programmers have learned to live with expensive kernel threads

For example, thread pools:

  • Re-use a static set of threads throughout the lifetime of the program
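As a sketch of the idea, Java's java.util.concurrent executors implement exactly this pattern: a fixed set of threads is created once and re-used for many tasks. The class name and counting helper below are invented for illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// A fixed pool of 4 kernel threads is created once and re-used for
// many short tasks. distinctThreads() reports how many worker threads
// actually ran the tasks; it can never exceed the pool size.
public class PoolDemo {
    static int distinctThreads(int tasks) throws InterruptedException {
        Set<String> names = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> names.add(Thread.currentThread().getName()));
        }
        pool.shutdown();                           // accept no new tasks
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return names.size();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("100 tasks ran on " + distinctThreads(100) + " threads");
    }
}
```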

Page 15: Multi-processor Scheduling

Locks

Used for implementing critical sections

Modern languages (Java, C#) implicitly acquire and release locks

interface Lock {
    public void acquire();  // only one thread allowed between an
                            // acquire and a release
    public void release();
}

Page 16: Multi-processor Scheduling

Two Varieties of Locks

Spin locks

  • Threads busy-wait until the lock is freed
  • The thread stays in the ready/running state

Blocking locks

  • Threads yield the processor until the lock is freed
  • The thread transitions to the blocked state

Page 17: Multi-processor Scheduling

Why Use Spin Locks?

Spin locks can be faster

  • No context switching required

Sometimes, blocking is not an option

  • For example, in the kernel scheduler implementation

Spin locks are never used on a uniprocessor

Page 18: Multi-processor Scheduling

Bogus Spin Lock Implementation

class SpinLock implements Lock {
    private volatile boolean isLocked = false;

    public void acquire() {
        while (isLocked) { ; }  // busy wait
        isLocked = true;        // not atomic with the test above!
    }

    public void release() {
        isLocked = false;
    }
}

Multiple threads can acquire this lock! A thread can be preempted between observing isLocked == false and setting it to true.

Page 19: Multi-processor Scheduling

Hardware Support for Locking

Problem: lack of atomicity in testing and setting the isLocked flag

Solution: hardware-supported atomic instructions, e.g., atomic test-and-set

Java conveniently abstracts these primitives (AtomicInteger and friends)
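A minimal illustration of the semantics Java exposes: AtomicBoolean.getAndSet returns the previous value and installs the new one in a single indivisible step, which is exactly what the broken lock was missing. The demo class name is invented.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// getAndSet = atomic test-and-set: return the old value and store the
// new one as one indivisible operation.
public class TestAndSetDemo {
    public static void main(String[] args) {
        AtomicBoolean flag = new AtomicBoolean(false);
        boolean first  = flag.getAndSet(true);  // false: this caller "won"
        boolean second = flag.getAndSet(true);  // true: already taken
        System.out.println(first + " " + second);  // prints "false true"
    }
}
```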

Page 20: Multi-processor Scheduling

Corrected Spin Lock

class SpinLock implements Lock {
    private final AtomicBoolean isLocked = new AtomicBoolean(false);

    public void acquire() {
        // get the old value, set a new value, in one atomic step
        while (isLocked.getAndSet(true)) { ; }
    }

    public void release() {
        assert (isLocked.get() == true);
        isLocked.set(false);
    }
}
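To see the corrected lock in action, the hypothetical harness below (the lock is reproduced inline, minus the Lock interface, so the file is self-contained) uses it to protect a shared counter across two threads:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Two threads each increment a shared counter under the corrected spin
// lock; the final count is exact because increments never interleave.
public class SpinLockExample {
    static class SpinLock {
        private final AtomicBoolean isLocked = new AtomicBoolean(false);
        void acquire() { while (isLocked.getAndSet(true)) { ; } }
        void release() { isLocked.set(false); }
    }

    static final SpinLock lock = new SpinLock();
    static int counter = 0;

    static int run(int perThread) throws InterruptedException {
        counter = 0;
        Runnable work = () -> {
            for (int i = 0; i < perThread; i++) {
                lock.acquire();
                counter++;          // critical section
                lock.release();
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        return counter;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("counter = " + run(100_000)); // always 200000
    }
}
```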

Page 21: Multi-processor Scheduling

Blocking Locks: Acquire Implementation

Atomically test-and-set the locked status

If the lock is already held:

  • Set the thread state to blocked
  • Add the PCB (task_struct) to a wait queue
  • Invoke the scheduler

Problem: must ensure thread-safe access to the wait queue!

Page 22: Multi-processor Scheduling

Disabling Interrupts

Prevents the processor from being interrupted

  • Serves as a coarse-grained lock

Must be used with extreme care

  • No I/O or timers can be processed while interrupts are disabled

Page 23: Multi-processor Scheduling

Thread-safe Blocking Locks

Atomically test-and-set the locked status

If the lock is already held:

  • Set the thread state to blocked
  • Disable interrupts
  • Add the PCB (task_struct) to a wait queue
  • Invoke the scheduler
  • The next task re-enables interrupts

Page 24: Multi-processor Scheduling

Disabling Interrupts on a Multiprocessor

Disabling interrupts can be done locally or globally (for all processors)

  • Global disabling is extremely heavyweight

Linux: spin_lock_irq

  • Disables interrupts on the local processor
  • Grabs a spin lock to lock out other processors

Page 25: Multi-processor Scheduling

Preview For Next Week

public class Example extends Thread {
    private static int x = 1;
    private static int y = 1;
    private static boolean ready = false;

    public static void main(String[] args) {
        Thread t = new Example();
        t.start();

        x = 2;
        y = 2;
        ready = true;
    }

    public void run() {
        while (!ready)
            Thread.yield();  // give up the processor
        System.out.println("x= " + x + " y= " + y);
    }
}

Page 26: Multi-processor Scheduling

What Does This Program Print?

Answer: it’s a race condition. Many different outputs are possible:

  • x=2, y=2
  • x=1, y=2
  • x=2, y=1
  • x=1, y=1
  • Or, the program may print nothing! The ready loop runs forever
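A hedged sketch of the standard remedy (presumably where the next lecture is headed): declaring the flag volatile makes the writes visible, so the loop is guaranteed to terminate, and by the Java Memory Model's happens-before rule for volatile the reader is also guaranteed to see x = 2. The class and method names below are invented for this sketch.

```java
// With `ready` declared volatile, the spawned thread must eventually
// see ready == true (loop terminates), and the volatile write to ready
// happens-before the read that exits the loop, publishing x = 2.
public class VolatileDemo {
    static volatile boolean ready = false;
    static int x = 1;
    static int seen = 0;

    static int observe() throws InterruptedException {
        ready = false;
        x = 1;
        Thread t = new Thread(() -> {
            while (!ready) Thread.yield();  // volatile read: must terminate
            seen = x;                       // guaranteed to see x == 2
        });
        t.start();
        x = 2;
        ready = true;                       // volatile write publishes x
        t.join();
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("thread saw x = " + observe());
    }
}
```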