Cpeg421-10-F/Topic-3-II-EARTH1 Topic 2 -- II: Compilers and Runtime Technology: Optimization Under Fine-Grain Multithreading - The EARTH Model (in more

cpeg421-10-F/Topic-3-II-EARTH 1

Topic 2 -- II: Compilers and Runtime Technology:

Optimization Under Fine-Grain Multithreading- The EARTH Model (in more details)

Guang R. Gao

ACM Fellow and IEEE FellowEndowed Distinguished ProfessorElectrical & Computer Engineering

University of Delaware

[email protected]


Outline

• Overview

• Fine-grain multithreading

• Compiling for fine-grain multithreading

• The power of fine-grain synchronization - SSB

• The percolation model and its applications

• Summary


Outline

• Overview





• Summary

TheThe EARTH EARTH Multithreaded Execution Model


fiber within a frame

Aync. function invocation

A sync operation

Invoke a threaded func

Two Level of Fine-Grain Threads:- threaded procedures- fibers

22 22 11 22

11 22 22 44

Signal Token

Total # signals

Arrived # signals

EARTH vs. CILK


Fiber within a frame

Parallel function invocation frames

fork a procedure

SYNC ops

Note: EARTH has it origin in static dataflow model

EARTH Model CILK Model

The “Fiber” Execution Model


00 22 00 22

00 11 00 22 00 44

Signal TokenTotal # signals

Arrived # signals



11 22 00 22

00 11 00 22 00 44


Arrived # signals



22 22 00 22

00 11 00 22 00 44


Arrived # signals



22 22 00 22

11 11 00 22 00 44


Arrived # signals



22 22 00 22

11 11 11 22 00 44


Arrived # signals



22 22 11 22

11 11 11 22 00 44


Arrived # signals



22 22 22 22

11 11 11 22 00 44


Arrived # signals



22 22 22 22

11 11 22 22 00 44


Arrived # signals



22 22 22 22

11 11 22 22 11 44


Arrived # signals



22 22 22 22

11 11 22 22 22 44


Arrived # signals



22 22 22 22

11 11 22 22 33 44


Arrived # signals



22 22 22 22

11 11 22 22 44 44


Arrived # signals

A Loop Example


for(i =1; i <= N; ++i){ S1: … S2: x[i] = … S3: y[i] = … + x[i-1] … . . . Sk: …}

for(i =1; i <= N; ++i){ S1: … S2: x[i] = … S3: y[i] = … + x[i-1] … . . . Sk: …}

S1:

S2:

S3:

Sk:

i= 1 i= 2 i= 3 i= N

Note:

How loop carried dependencies are handled?And its implication on cross core software pipelining

T1 T2 T3

Main Features of EARTH

* Fast thread context switching

• Efficient parallel function invocation

• Good support of fine grain dynamic load balancing

* Efficient support split phase transactions and fibers


*Features unique to the EARTH model in comparison to the CILK model


Outline

• Overview





• Summary

Compiling C for EARTHObjectives

• Design simple high-level extensions for C that allow programmers to write programs that will run efficiently on multi-threaded architectures. (EARTH-C)

• Develop compiler techniques to automatically translate programs written in EARTH-C to multi-threaded programs. (EARTH-C, Threaded-C)

• Determine if EARTH-C + compiler can compete with hand-coded Threaded-C programs.


Summary of EARTH-C Extensions

• Explicit Parallelism– Parallel versus Sequential statement sequences

– Forall loops

• Locality Annotation– Local versus Remote Memory references (global, local,

replicate, …)

• Dynamic Load Balancing– Basic versus remote function and invocation sites


EARTH-C Compiler Environment


McCAT

EARTH-C Compiler

Threaded-C Compiler

C EARTH-C

EARTH SIMPLE

Threaded-C

Program Dependence Analysis

Program Dependence Analysis

Thread GenerationThread Generation

EARTH SIMPLE

Th

read P

artitionin

gT

hread

Partition

ing

Threaded-CEARTH Compilation EnvironmentThe EARTH Compiler

The McCAT/EARTH Compiler


EARTH-C

THREADED-C

EARTH-SIMPLE-C

EARTH-SIMPLE-C

Simplify goto eliminationLocal function inlining Points-to Analysis

Heap AnalysisR/W Set Analysis

Array Dependence Tester

Simplify goto eliminationLocal function inlining Points-to Analysis

Heap AnalysisR/W Set Analysis

Array Dependence Tester

Forall Loop DetectionLoop Partitioning

Forall Loop DetectionLoop Partitioning

Build Hierarchical DDGThread Generation

Build Hierarchical DDGThread Generation

Code GenerationCode Generation

04/18/23 \Petaflop\Workshop98-7B.ppt 25

If n < 2

DATA_RSYNC (1, result, done)

else

{

TOKEN (fib, n-1, & sum1, slot_1);

TOKEN (fib, n-2, & sum2, slot_2);

}

END_THREAD( ) ;

THREAD-1;

DATA_RSYNC (sum1 + sum2, result, done);

END_THREAD ( ) ;

END_FUNCTION

0 0

2 2

fibn result done

The Fibonacci Example


void main ( ){ int i, j, k; float sum;

for (i=0; i < N; i++) for (j=0; j < N ; j++) { sum = 0; for (k=0; k < N; k++) sum = sum + a [i] [k] * b [k] [j] c [i] [j] = sum; }}

Sequential Version

Matrix Multiplication


BLKMOV_SYNC (a, row_a, N, slot_1);

BLKMOV_SYNC (b, column_b, N, slot_1);

sum = 0;

END_THREAD;

THREAD-1;

for (i=0; i<N; i++ );

sum = sum + (row_a[i] * column_b[i]);

DATA_RSYNC (sum, result, done);

END_THREAD ( ) ;

0 0

2 2

innera result doneb

The Inner Product Example

END_FUNCTION

Summary of EARTH-C Extensions

• Explicit Parallelism– Parallel versus Sequential statement sequences

– Forall loops

• Locality Annotation– Local versus Remote Memory references (global, local,

replicate, …)

• Dynamic Load Balancing– Basic versus remote function and invocation sites


EARTH C Threaded C(Thread Generation)

Given a sequence of statements, s1, s2, …sn, we wish to create threads such that:– Maximize thread length (minimize thread

switching overhead)– retain sufficient parallelism– Issue remote memory requests as early as

possible (prefetching)– Compile split-phase remote memory operations

and remote function calls correctly


An Example


int f(int *x, int i, int j){ int a, b, sum, prod, fact; int r1, r2, r3; a = x[i]; fact = 1; fact = fact * a;

b = x[j]; sum = a + b; prod = a * b; r1 = g(sum); r2 = g(prod); r3 = g(fact); return(r1 + r2 + r3); }

11

33

11

Example Partitioned into Four Fibers


a = x[i];fact = 1;

fact = fact * a;b = x[j];

sum = a + b;prod = a * b;r1 = g(sum);r2 = g(prod);r3 = g(fact);

return (r1 + r2 + r3);

Fiber-0:

Fiber-1:

Fiber-2:

Fiber-3:

Better Strategy Using List Scheduling

• Put each instruction in the earliest possible thread.

• Within a thread, the remote operations are executed as early as possible.

Build a Data Dependence Graph (DDG), and use a list scheduling strategy, where the selection of instructions is guided by Earliest Thread Number and Statement Type.


Instruction Types

• Schedule First– remote_read, remote_write– remote_fn_call– local_simple– remote_compound– local_compound– basic_fn_call

• Schedule Lastcpeg421-10-F/Topic-3-II-EARTH 33

List Scheduling Previous Example


(0,RR)(0,RR) (0,RR)(0,RR) (0,LS)(0,LS)

(1,LS)(1,LS) (1,LS)(1,LS) (1,LC)(1,LC)

(1,RF)(1,RF) (1,RF)(1,RF) (1,RF)(1,RF)

(2,LS)(2,LS)

Resulting List Scheduled Threads


a=x[i];b=x[j];fact=1;

a=x[i];b=x[j];fact=1;

sum=a+b;r1=g(sum);prod=a*b;r2=g(prod);fact=fact*i;r3=g(fact)

sum=a+b;r1=g(sum);prod=a*b;r2=g(prod);fact=fact*i;r3=g(fact)

return (r1+r2+r3);return (r1+r2+r3);

22

33

Generating Threaded-C Code


THREADED f ( int *ret_parm, SLOT *rsync_parm, int *x, int i, int j){

SLOTS SYNC_SLOTS[2];int a, b, sum, prod, fact, r1, r2, r3;

THREADED f ( int *ret_parm, SLOT *rsync_parm, int *x, int i, int j){

SLOTS SYNC_SLOTS[2];int a, b, sum, prod, fact, r1, r2, r3;

/* THREAD_0:; */INIT_SYNC(0, 2, 2, 1); INIT_SYNC (1, 3, 3, 2);GET_SYNC_L (&x[i], &a, 0);GET_SYNC_L (&x[j], &b, 0);fact = 1;END_THREAD( );

/* THREAD_0:; */INIT_SYNC(0, 2, 2, 1); INIT_SYNC (1, 3, 3, 2);GET_SYNC_L (&x[i], &a, 0);GET_SYNC_L (&x[j], &b, 0);fact = 1;END_THREAD( );

THREAD_1:;sum = a + b;TOKEN (G, &r1, SLOT_ADR(1), sum);prod = a * b;TOKEN (g, &r2, SLOT_ADR(1), prod);fact = fact * a;TOKEN (g, &r3, SLOT_ADR(1), fact);END_THREAD( );

THREAD_1:;sum = a + b;TOKEN (G, &r1, SLOT_ADR(1), sum);prod = a * b;TOKEN (g, &r2, SLOT_ADR(1), prod);fact = fact * a;TOKEN (g, &r3, SLOT_ADR(1), fact);END_THREAD( );

THREAD_2:;DATA_RSYNC_L(r1 + r2 + r3, ret_parm, rsync_parm);END_FUNCTION( );

}

THREAD_2:;DATA_RSYNC_L(r1 + r2 + r3, ret_parm, rsync_parm);END_FUNCTION( );

}


Outline

• Overview





• Summary

Fine-Grain Synchronization: Two Types

Sync Type Enforce Mutual Exclusion

Enforce Data Dependencies

Order No Specific Order required

Uni-directional

Fine Grain Sync. Solution

• Software Fine grained locks• Lock free concurrent data structures• Full / Empty bits

• I-structures• Full / Empty bits


Enforce Data Dependencies

• A DoAcross loop with positive and constant dependence distance.


for(i= D; i < N; ++i){ A[i] = … … … = A[i-D];}

for(i= D; i < N; ++i){ A[i] = … … … = A[i-D];}

In parallel iterations are assigned to different threads

T0T0 T1T1

(i = 2 + D){ A[2+D] = … … … = A[2]}

(i = 2 + D){ A[2+D] = … … … = A[2]}

(i = 2){ A[2] = … … … = A[2-D]}

(i = 2){ A[2] = … … … = A[2-D]}

The data dependence needs to be enforced by synchronization

Memory Based Fine-Grain Synchronization:

• Full/Empty Bits (HEP, Tera MTA, etc) & I-Structures (dataflow based machines)

• Associate “state” to a memory location (fine-granularity). Fine-grain synchronization for the memory location is realized through “state transition” on such “state”.


I-Structure state transition[ArvindEtAl89 @ TOPLAS]

EmptyEmpty

FullFull DeferredDeferred

read

readwrite

reset

write

read

With Memory Based Fine-Grain Sync

• Using a single atomic operation complete synchronized write/read in memory directly

• No need to implement synchronization with other resources, e.g., shared memory.

• Low overhead: just one memory transaction


for(i= D; i < N; ++i){ A[i] = … … … = A[i-D];}

for(i= D; i < N; ++i){ A[i] = … … … = A[i-D];}

for(i= D; i < N; ++i){ write_sync(&(A[i]),…) … … = read_sync(&(A[i-D]));}

for(i= D; i < N; ++i){ write_sync(&(A[i]),…) … … = read_sync(&(A[i-D]));}

With Memory Based Fine-Grain Sync

• Using a single atomic operation complete synchronized write/read in memory directly

• No need to implement synchronization with other resources, e.g., shared memory.

• Low overhead: just one memory transaction


T1T1

(i = 2 + D){ write_sync(&(A[2 + D]),…); … … = read_sync(&(A[2]));}

(i = 2 + D){ write_sync(&(A[2 + D]),…); … … = read_sync(&(A[2]));}

T0T0

(i = 2){ write_sync(&(A[2]),…); … … = read_sync(&(A[2-D]));}

(i = 2){ write_sync(&(A[2]),…); … … = read_sync(&(A[2-D]));}

An Alternative: control-flow based synchronizations


• The post/wait instructions needs to be implemented in shared memory in coordination with the underline memory (consistency) models

• You may need to worry about this:

A[i] = …;fence;post(i);

A[i] = …;fence;post(i);

wait(i-D);fence;… = A[i-D];

wait(i-D);fence;… = A[i-D];

for(i= D; i < N; ++i){ A[i] = … post(i); … wait(i-D); … = A[i-D];}

for(i= D; i < N; ++i){ A[i] = … post(i); … wait(i-D); … = A[i-D];}

No data dependencyNo data dependency

No data dependencyNo data dependency

For computation with more complicated data dependencies, memory-based fine-grain synchronization is more effective and efficient. [ArvindEtAl89 @ TOPLAS]

A Question!



Key ObservationKey Observation:

Solution:

What is SSB?

• A small hardware buffer attached to the memory controller of each memory bank.

• Record and manage states of actively synchronized data units.

• Hardware Cost– Each SSB is a small look-up table: Easy-to-implement

– Independence of each SSB: hardware cost increases only linearly proportional to # of memory banks



SSB on Many-Core (IBM C64)

IBM Cyclops-64, Designed by Monty Denneau.

SSB Synchronization Functionalities

Data Synchronization: Enforce RAW data dependencies• Support word-level

– Two single-writer-single-reader (SWSR) modes– One single-writer-multiple-reader (SWMR) mode

Fine-Grain Locking: Enforce mutual exclusion• Support word-level

– write lock (exclusive lock)– read lock (shared lock)– recursive lock

SSB is capable of supporting more functionality


Experimental Infrastructure


IBM Cyclops-64 Chip Architecture• 160 thread units (500MHz)• Three-level explicit-addressable memory hierarchy • Efficient thread-level execution support• SSB for on-chip SRAM bank: 16-entry, 8-way associative

Cyclops-64 Micro Kernel

Simulation Testbed: FAST Simulator (Software) Ms. Clops Hardware Emulator

CCompiler

(GCC/Open64)

OpenMP Compiler

Binutils:

assembler

linkerLibraries:

OpenMP RTS

TiNy Threads Library/RTS

Std C/Math lib

SSB Fine-Grain Sync. is Efficient

• For all the benchmarks, the SSB-based version shows significant performance improvement over the versions based on other synchronization mechanisms.

• For example, with up to 128 threads– Livermore loop 6 (linear recurrence): a 312%

improvement over the barrier based version

– Ordered integer set (hash table): outperform the software-based fine-grain methods by up to 84%



Outline

• Overview





• Summary

Research LayoutFuture Programming Models


Advanced Execution / Programming Model

PercolationLocation

Consistency

Base Execution Model Fine Grain Multi

threading (e.g. EARTH, CARE)

Infrastructure & Tools•System Software•Simulation / Emulation•Analytical Modeling

HTMT like Architecture

Cellular Multithreaded Architecture(e.g. BG/c)

High End PIM Architecture

Percolation Model


Hig

h S

peed

C

PU

s

Hig

h S

peed

C

PU

s

SRA

M

PIM

SRA

M

PIM

DR

AM

PI

MD

RA

M

PIM

Primary Execution Engine

Prepare and percolate “parceled threads”

Perform intelligent memory operations

Global Memory Management

A User’s Perspective

The Percolation Model


• What is percolation?dynamic, adaptive computation/data movement, migration, transformation in-place or on-the fly to keep system resource usefully busy

• Features of percolation– both data and thread

may percolate– computation

reorganization and data layout reorganization

– asynchronous invocation

An Example of percolation—Cannon’s Algorithm

Level 0

Level 1

Level 2

Level 3

Level 0: fast cpu

Level 1 PIM

Level 2 PIM

Level 3

percolation

HTML-like Architectures

Cannon’s nearest neighbor data transferData layout reorganization during percolation

Performance of SCCA2Kernel 4


#threads C64 SMPs MTA2

4 2917082 5369740 752256

8 5513257 2141457 619357

16 9799661 915617 488894

32 17349325 362390 482681

• Reasonable scalability–Scale well with # threads–Linear speedup for #threads < 32

• Commodity SMPs has poor performance• Competitive vs. MTA-2

Metric:TEPS -- Traversed Edges per second

SMPs: 4-way Xeon dual-core, 2MB L2 Cache


Outline

• Overview





• Summary

Documents

Cpeg421-10-F/Topic-3-II-EARTH1 Topic 2 -- II: Compilers and Runtime Technology: Optimization Under Fine-Grain Multithreading - The EARTH Model (in more