CILK: An Efficient Multithreaded Runtime System


People

Project at MIT & now at UT Austin

– Bobby Blumofe (now UT Austin, Akamai)

– Chris Joerg

– Brad Kuszmaul (now Yale)

– Charles Leiserson (MIT, Akamai)

– Keith Randall (Bell Labs)

– Yuli Zhou (Bell Labs)


Outline

– Introduction
– Programming environment
– The work-stealing thread scheduler
– Performance of applications
– Modeling performance
– Proven properties
– Conclusions


Introduction

Why multithreading? To implement dynamic, asynchronous, concurrent programs.

The Cilk programmer optimizes:

– total work

– critical path

A Cilk computation is viewed as a dynamic, directed acyclic graph (dag).


Introduction ...

A Cilk program is a set of procedures.

A procedure is a sequence of threads.

Cilk threads are:

– represented by nodes in the dag

– non-blocking: each runs to completion, with no waiting or suspension, as an atomic unit of execution


Introduction ...

Threads can spawn child threads.

– Downward edges connect a parent to its children.

A child & parent can run concurrently.

– Because threads are non-blocking, a child cannot return a value to its parent.

– Instead, the parent spawns a successor that receives values from its children.


Introduction ...

A thread & its successor are parts of the same Cilk procedure.

– They are connected by horizontal arcs.

Values sent by children must be received before the parent’s successor begins:

– They constitute data dependencies.

– They are drawn as curved arcs.


Introduction: Execution Time

Execution time of a Cilk program using P processors depends on:

– Work (T1): the time for the Cilk program to complete on 1 processor.

– Critical path (T∞): the time to execute the longest directed path in the dag.

These give two lower bounds on the P-processor execution time TP:

– TP >= T1 / P (not true for some searches)

– TP >= T∞
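As a rough illustration of these two quantities, the sketch below computes the work and the critical path of a small dag and the resulting lower bound on execution time. The dag, its thread names, and the durations are invented for this example; they do not come from the slides.

```python
from functools import lru_cache

# Hypothetical dag: thread name -> (duration, successor threads).
# The shape and the durations are made up for illustration only.
DAG = {
    "a": (1, ["b", "c"]),
    "b": (2, ["d"]),
    "c": (4, ["d"]),
    "d": (1, []),
}

def work(dag):
    """T1: total time of all threads, i.e. the 1-processor execution time."""
    return sum(duration for duration, _ in dag.values())

def critical_path(dag):
    """T-infinity: time of the longest directed path in the dag."""
    @lru_cache(maxsize=None)
    def longest_from(node):
        duration, successors = dag[node]
        return duration + max((longest_from(s) for s in successors), default=0)
    return max(longest_from(node) for node in dag)

def lower_bound(dag, processors):
    """max(T-infinity, T1 / P): a lower bound on the P-processor time."""
    return max(critical_path(dag), work(dag) / processors)
```

For this dag the work is 8 and the critical path (a, c, d) is 6, so adding processors beyond 2 cannot push the execution time below 6.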


Introduction: Scheduling

Cilk uses a runtime scheduling technique called work stealing.

Works well on dynamic, asynchronous, MIMD-style programs.

For “fully strict” programs, Cilk achieves asymptotic optimality for space, time, & communication.


Introduction: language

Cilk is an extension of C

Cilk programs are:

– preprocessed to C

– linked with a runtime library


Programming Environment

Declaring a thread:

thread T ( <args> ) { <stmts> }

T is preprocessed into a C function of 1 argument and return type void.

The 1 argument is a pointer to a closure.


Environment: Closure

A closure is a data structure that has:

– a pointer to the C function for T

– a slot for each argument (inputs & continuations)

– a join counter: count of the missing argument values

A closure is ready when its join counter == 0; otherwise it is waiting.

Closures are allocated from a runtime heap.
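A minimal sketch of such a closure, written in Python rather than as the C structure the Cilk runtime uses. The names `Closure`, `MISSING`, `fill`, and `is_ready` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

MISSING = object()  # marker for an argument slot that has not been filled yet

@dataclass
class Closure:
    function: Callable[..., None]          # stands in for the pointer to T's C function
    slots: List[Any]                       # one slot per argument
    join_counter: int = field(init=False)  # number of argument values still missing

    def __post_init__(self):
        self.join_counter = sum(1 for s in self.slots if s is MISSING)

    def is_ready(self) -> bool:
        """A closure is ready when its join counter is 0, waiting otherwise."""
        return self.join_counter == 0

    def fill(self, slot: int, value: Any) -> None:
        """Supply one missing argument and decrement the join counter."""
        assert self.slots[slot] is MISSING, "slot already filled"
        self.slots[slot] = value
        self.join_counter -= 1
```

A closure created with two `MISSING` slots starts out waiting and becomes ready after two calls to `fill`.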


Environment: Continuation

A Cilk continuation is a data type, denoted by the keyword cont:

cont int x;

It is a global reference to an empty slot of a closure.

It is implemented as 2 items:

– a pointer to the closure (what thread)

– an int value: the slot number (what input)


Environment: spawn

To spawn a child, a thread creates its closure:

spawn T ( <args> )

– creates the child’s closure

– sets the available arguments

– sets the join counter

To specify a missing argument, prefix it with a “?”:

spawn T (k, ?x);


Environment: spawn_next

A successor thread is spawned the same way as a child, except the keyword spawn_next is used:

spawn_next T (k, ?x)

Children typically have no missing arguments; successors do.


Explicit continuation passing

Because threads are nonblocking, a parent cannot block on its children’s results.

Instead, it spawns a successor thread.

This communication paradigm is called explicit continuation passing.

Cilk provides a primitive to send a value from one closure to another.


send_argument

Cilk provides the primitive

send_argument( k, value )

which sends value to the argument slot of a waiting closure specified by continuation k.

[Figure: parent, child, and successor threads, linked by spawn, spawn_next, and send_argument.]


Cilk Procedure for computing a Fibonacci number

thread fib ( cont int k, int n )
{
    if ( n < 2 )
        send_argument( k, n );
    else {
        cont int x, y;
        spawn_next sum ( k, ?x, ?y );
        spawn fib ( x, n - 1 );
        spawn fib ( y, n - 2 );
    }
}

thread sum ( cont int k, int x, int y )
{
    send_argument( k, x + y );
}
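To make the control flow of the Fibonacci procedure concrete, here is a small single-processor simulation of explicit continuation passing in Python. It is only a sketch under simplifying assumptions: there is one ready queue, no levels, and no stealing, and `spawn` serves for both spawn and spawn_next. The helper names (`Closure`, `MISSING`, `run_fib`, `sum_thread`, `collect`) are invented for the example and are not part of Cilk.

```python
from collections import deque

MISSING = object()  # marks an empty argument slot (the "?x" of the Cilk syntax)

class Closure:
    def __init__(self, thread, args):
        self.thread = thread
        self.args = list(args)
        self.join_counter = sum(1 for a in self.args if a is MISSING)

ready = deque()  # ready closures of the single simulated processor

def spawn(thread, *args):
    """Create a closure, queue it if ready, return continuations for empty slots."""
    closure = Closure(thread, args)
    if closure.join_counter == 0:
        ready.append(closure)
    return [(closure, i) for i, a in enumerate(closure.args) if a is MISSING]

def send_argument(k, value):
    """Fill the slot named by continuation k = (closure, slot number)."""
    closure, slot = k
    closure.args[slot] = value
    closure.join_counter -= 1
    if closure.join_counter == 0:
        ready.append(closure)

def fib(k, n):
    if n < 2:
        send_argument(k, n)
    else:
        x, y = spawn(sum_thread, k, MISSING, MISSING)  # spawn_next sum(k, ?x, ?y)
        spawn(fib, x, n - 1)
        spawn(fib, y, n - 2)

def sum_thread(k, x, y):
    send_argument(k, x + y)

def run_fib(n):
    result = []
    def collect(value):          # final waiting closure that receives the answer
        result.append(value)
    (k,) = spawn(collect, MISSING)
    spawn(fib, k, n)
    while ready:                 # each thread runs to completion, never blocking
        closure = ready.pop()
        closure.thread(*closure.args)
    return result[0]
```

Calling `run_fib(10)` drives the closures until the final continuation receives the result. No thread ever waits: `fib` returns as soon as it has spawned its successor and children, and `sum_thread` only runs once both of its slots are filled.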


Nonblocking Threads: Advantages

Shallow call stack. (for us: fault tolerance )

Simplify runtime system:

Completed threads leave C runtime stack empty.

Portable runtime implementation


Nonblocking Threads: Disadvantages

Burdens the programmer with explicit continuation passing.


Work-Stealing Scheduler

The concept of work-stealing goes at least as far back as 1981.

Work-stealing:

– a process with no work selects a victim from which to get work.

– it gets the shallowest thread in the victim’s spawn tree.

In Cilk, thieves choose victims randomly.


Thread Level


Stealing Work: The Ready Deque

Each closure has a level:

– level( child ) = level( parent ) + 1

– level( successor ) = level( parent )

Each processor maintains a ready deque:

– It contains ready closures.

– The Lth element contains the list of all ready closures whose level is L.


Ready deque

if ( ! readyDeque.isEmpty() )
    take deepest thread
else
    steal shallowest thread from readyDeque of randomly selected victim
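The scheduling rule above can be sketched as a leveled ready deque in Python. This is an illustration, not the Cilk runtime: the class and function names are hypothetical, the order in which closures are removed within a single level is an assumption, and real stealing involves messages between processors rather than direct access to the victim's deque.

```python
import random

class ReadyDeque:
    """Ready closures bucketed by level: self.levels[L] holds those of level L."""
    def __init__(self):
        self.levels = []

    def push(self, closure, level):
        while len(self.levels) <= level:
            self.levels.append([])
        self.levels[level].append(closure)

    def is_empty(self):
        return all(not bucket for bucket in self.levels)

    def take_deepest(self):
        """Local work: remove a closure from the deepest non-empty level."""
        for bucket in reversed(self.levels):
            if bucket:
                return bucket.pop()
        return None

    def steal_shallowest(self):
        """Thief's view: remove a closure from the shallowest non-empty level."""
        for bucket in self.levels:
            if bucket:
                return bucket.pop(0)
        return None

def next_closure(me, deques, rng=random):
    """One scheduling step for processor `me` over all processors' deques."""
    own = deques[me]
    if not own.is_empty():
        return own.take_deepest()
    victims = [p for p in range(len(deques)) if p != me]
    if not victims:
        return None
    victim = rng.choice(victims)                # thieves choose victims randomly
    return deques[victim].steal_shallowest()    # None models a failed request
```

With closures at levels 0, 1, and 2 on processor 0, processor 0 itself takes the level-2 closure first, while an idle processor 1 steals the level-0 closure.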


Why Steal the Shallowest Closure?

Shallow threads probably produce more work, and therefore reduce communication.

Shallow threads are more likely to be on the critical path.


Readying a Remote Closure

If a send_argument makes a remote closure ready, the closure is put on the sending processor’s readyDeque, which costs extra communication.

– This is done to make the scheduler provably good.

– Putting the closure on the readyDeque of the processor where it already resides also works well in practice.


Performance of Applications

Tserial = time for the C program

T1 = time for the 1-processor Cilk program

Tserial / T1 = efficiency of the Cilk program

– Efficiency is close to 1 for programs with moderately long threads: Cilk overhead is small.


Performance of Applications

T1 / TP = speedup

T1 / T∞ = average parallelism

If average parallelism is large, then speedup is nearly perfect.

If average parallelism is small, then speedup is much smaller.


Performance Data


Performance of Applications

Application speedup = efficiency X speedup

= ( Tserial / T1 ) X ( T1 / TP )

= Tserial / TP
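The identity can be checked numerically with a short Python sketch. The three timings used in the usage note are hypothetical and are not measurements from the Cilk experiments.

```python
def efficiency(t_serial, t1):
    """Tserial / T1: how close the 1-processor Cilk program is to the C program."""
    return t_serial / t1

def speedup(t1, tp):
    """T1 / TP: speedup of the P-processor run over the 1-processor Cilk run."""
    return t1 / tp

def application_speedup(t_serial, t1, tp):
    """Efficiency times speedup, which reduces to Tserial / TP."""
    return efficiency(t_serial, t1) * speedup(t1, tp)
```

For example, with assumed timings Tserial = 90, T1 = 100, and TP = 12.5, the efficiency is 0.9 and the speedup is 8, so the application speedup is 7.2, the same as Tserial / TP.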


Modeling Performance

TP >= max( T∞ , T1 / P )

A good scheduler should come close to these lower bounds.


Modeling Performance

Empirical data suggests that for Cilk:

TP ≈ c1 T1 / P + c∞ T∞ ,

where c1 ≈ 1.067 & c∞ ≈ 1.042

If T1 / T∞ > 10P, then the critical path has almost no effect on TP.
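The model is easy to evaluate directly. The Python sketch below uses the two fitted constants quoted on this slide; the function names and the workloads in the usage note are assumptions made for illustration.

```python
C1 = 1.067     # fitted coefficient on the work term T1 / P
C_INF = 1.042  # fitted coefficient on the critical-path term

def predicted_time(t1, t_inf, processors):
    """Model estimate of TP from work t1 and critical path t_inf."""
    return C1 * t1 / processors + C_INF * t_inf

def predicted_speedup(t1, t_inf, processors):
    """T1 divided by the modeled TP."""
    return t1 / predicted_time(t1, t_inf, processors)
```

With an assumed T1 = 1000 and P = 10, a critical path of 1 (average parallelism 1000, well above 10P) gives a predicted time of about 107.7, dominated by the work term. A critical path of 100 (average parallelism 10) roughly doubles the predicted time to about 210.9, so the speedup falls well short of 10.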


Proven Property: Time

Time: Including overhead,

TP = O( T1/P + T∞ ),

which is asymptotically optimal


Conclusions

We can predict the performance of a Cilk program by observing machine-independent characteristics:

– Work

– Critical path

when the program is fully-strict.

Cilk’s usefulness is unclear for other kinds of programs (e.g., iterative programs).


Conclusions ...

Explicit continuation passing is a nuisance.

It was subsequently removed (with more clever pre-processing).


Conclusions ...

Great system research has a theoretical underpinning.

Such research identifies important properties

– of the systems themselves, or

– of our ability to reason about them formally.

Cilk identified 3 significant system properties:

– Fully strict programs

– Non-blocking threads

– Randomly choosing a victim


END


The Cost of Spawns

A spawn is about an order of magnitude more costly than a C function call.

Spawned threads running on the parent’s processor can be implemented more efficiently than remote spawns.

– Running on the parent’s processor is the usual case.

Compiler techniques can exploit this distinction.


Communication Efficiency

A request is an attempt to steal work (the victim may not have work).

Requests/processor & steals/processor both grow as the critical path grows.


Proven Properties: Space

In a fully strict program, each thread sends arguments only to its parent’s successors.

For such programs, space, time, & communication bounds are proven.

Space: SP <= S1 P.

– There exists a P-processor execution for which this is asymptotically optimal.


Proven Properties: Communication

Communication: The expected # of bits communicated in a P-processor execution is:

O( T∞ P SMAX )

where SMAX denotes the size of its largest closure.

There exists a program such that, for all P, there exists a P-processor execution that communicates k bits, where k > c T∞ P SMAX, for some constant c.