Advanced Shared-Memory Programming Parallel Programming Concepts Winter Term 2013 / 2014 Dr. Peter Tröger, M.Sc. Frank Feinbube





Shared-Memory Parallelism

■  Libraries and language extensions for standard languages
  □  OpenMP for C / C++
  □  Java / .NET concurrency functionality
  □  Cilk, a derivative of C / C++
■  Different levels of abstraction
  □  Process model, thread model, task model
■  Specialized programming languages ...
  □  ... for NUMA systems: Partitioned Global Address Space (PGAS)
  □  ... for computational applications: Fortran / HPF
  □  ... for implicit parallelism: functional languages
■  Profiling

2


PGAS Languages

■  Typically, the developer is shielded from memory hierarchy aspects by the operating system (PRAM thinking)
■  Partitioned global address space (PGAS) approach
  □  Driven by the high-performance computing community
  □  Modern approach for large-scale NUMA
  □  Precondition for exascale computing
■  Explicit notion of a memory partition per processor
  □  Data is designated as local (near) or global (possibly far)
  □  The programmer is aware of NUMA nodes and can handle data and task placement explicitly
■  Only way to allow performance optimization in systems with deep memory hierarchies

3


PGAS Libraries

■  PGAS languages
  □  Unified Parallel C (ANSI C)
  □  Co-Array Fortran / Fortress (F90)
  □  Titanium (Java), Chapel (Cray), X10 (IBM), …
  □  All research, no wide-spread solution at industry level
■  Core data management functionality can be re-used as a library
  □  Global-Address Space Networking (GASNet)
    ◊  Used by many PGAS languages: UPC, Co-Array Fortran, Titanium, Chapel
  □  Aggregate Remote Memory Copy Interface (ARMCI)
    ◊  Blocking / non-blocking API, MPI compatibility
  □  Kernel Lattice Parallelism (KeLP)
    ◊  C++ class framework based on MPI

4


Unified Parallel C (UPC)

■  Extension of C for HPC on large-scale supercomputers
■  Initiative started in 1996 at the University of California, Berkeley
■  Meanwhile maintained by a consortium for language specification
  □  Latest language specification v1.3 from 2013
■  Extension of ISO C 99 with
  □  An explicitly parallel execution model
  □  An explicit shared-memory consistency model
  □  Synchronization primitives
  □  Memory management primitives
■  Support for major HPC platforms (Tru64, HP-UX, Cray, Altix, …)
■  Support from commercial debuggers

5


Unified Parallel C (UPC)

■  Execution environment
  □  Each UPC program consists of a set of threads
  □  SPMD execution of UPC threads with flexible placement
  □  Threads may allocate shared and private data
  □  Access to this data can be local or shared
  □  Shared access can be strict or relaxed with respect to memory consistency
  □  Strict-only access to variables leads to sequential consistency
  □  Relaxed operations are finished before a strict operation

6


Unified Parallel C

■  Data is private by default and exists as one copy per thread
■  New data type qualifier shared for shared thread data
  □  Shared data has affinity to a particular UPC thread
  □  Primitive / pointer / aggregate types: affinity to UPC thread 0
  □  Array types: cyclic affinity per element, block-cyclic affinity, or partitioning
  □  Pointers to shared data consist of thread ID, local address, and position

7

#include <upc_relaxed.h>

#define N 100*THREADS

shared int v1[N], v2[N], v1plusv2[N];

void main() {
    int i;
    for (i = 0; i < N; i++)
        if (MYTHREAD == i % THREADS)
            v1plusv2[i] = v1[i] + v2[i];
}


Unified Parallel C

■  shared int A[100]
  □  Distributes the array cyclically across all thread memories
■  shared [B] int A[100]
  □  Distributes chunks of size B across all threads in round-robin fashion
■  shared [*] int A[100]
  □  Distributes the array block-wise, one contiguous chunk per thread (shared [] int A[100] would place all elements on thread 0)
■  Explicit synchronization primitives
  □  upc_lock, upc_unlock, upc_lock_attempt, upc_lock_t
  □  upc_barrier, upc_notify, upc_wait
■  Collective operations (upc_all_broadcast)

8
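The layout rules above can be sketched numerically; this is plain Python for illustration (the helper names are mine, not part of UPC):

```python
THREADS = 4

def owner_cyclic(i):
    # shared int A[N]: element i has affinity to thread i mod THREADS
    return i % THREADS

def owner_blocked(i, B):
    # shared [B] int A[N]: chunks of B consecutive elements are dealt
    # out to the threads in round-robin fashion (block-cyclic layout)
    return (i // B) % THREADS

print([owner_cyclic(i) for i in range(8)])      # cyclic layout
print([owner_blocked(i, 2) for i in range(8)])  # block size 2
```

Running this shows the cyclic layout 0,1,2,3,0,1,2,3 versus the block-cyclic layout 0,0,1,1,2,2,3,3 for the first eight elements.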


Unified Parallel C

■  Each memory reference / statement can be annotated
  □  strict modifier: sequential consistency (references from the same thread are in order)
  □  relaxed modifier: the issuing thread sees sequential consistency, no other guarantees
■  Manual optimization is still needed, but data management is encapsulated in abstract keywords

9


Unified Parallel C

■  Loop parallelization with upc_forall
■  Assignment of field elements to threads must be done explicitly with the fourth parameter
  □  Identify the thread by a shared pointer
  □  Distribute in round-robin fashion according to a fixed number
  □  Block-wise assignment

10

shared int a[100], b[100], c[100];
int i;

upc_forall(i=0; i<100; i++; &a[i])
    a[i] = b[i]*c[i];

upc_forall(i=0; i<100; i++; i)
    a[i] = b[i]*c[i];

upc_forall(i=0; i<100; i++; (i*THREADS)/100)
    a[i] = b[i]*c[i];
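How the affinity expression selects iterations can be sketched in Python (helper names are mine; only the integer-affinity forms are modeled, not the &a[i] pointer form):

```python
THREADS = 4

def my_iterations(mythread, n, affinity):
    # upc_forall runs iteration i on the thread whose number equals
    # affinity(i) % THREADS, for integer affinity expressions
    return [i for i in range(n) if affinity(i) % THREADS == mythread]

n = 100
round_robin = my_iterations(0, n, lambda i: i)                # affinity "i"
block_wise = my_iterations(0, n, lambda i: (i * THREADS) // n)  # "(i*THREADS)/100"

print(round_robin[:4])  # thread 0 gets every THREADS-th iteration
print(block_wise[:4])   # thread 0 gets the first contiguous block
```

Thread 0 receives iterations 0, 4, 8, 12, … in the round-robin form and the contiguous block 0 … 24 in the block-wise form.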


X10

■  Parallel object-oriented PGAS language by IBM
  □  Java derivative, compiles to C++ or pure Java code
  □  Different binaries can interact through a common runtime
  □  x86 and POWER support
  □  Transport: shared memory, TCP/IP, MPI, CUDA, …
  □  Linux, Mac OS X, Windows, AIX
  □  Full developer support with an Eclipse environment
■  Fork-join execution model, instead of SPMD as in MPI
■  One application instance runs at a fixed number of places
  □  Each place has a private copy of static variables
  □  The main() method runs automatically at place 0
  □  Each place typically represents a NUMA node

11


X10

12

[Figure: APGAS in X10 — Places and Tasks. Each place (Place 0 … Place N) holds its activities and a local heap; a global reference spans places. Place-shifting operations: at(p) S, at(p) e. Distributed heap: GlobalRef[T], PlaceLocalHandle[T]. Task parallelism: async S, finish S. Concurrency control within a place: when(c) S, atomic S.]

■  Parallel tasks, each operating in one place of the PGAS
  □  Direct variable access only in the local place of the global space
  □  Tasks are mapped to places, potentially on different machines
■  Implementation
  □  One operating-system process per place manages a thread pool
  □  Work-stealing scheduler with a queue of pending asyncs


X10

13


■  async S
  □  Creates a new task that executes S, returns immediately
  □  S may reference all variables in the enclosing block
■  finish S
  □  Executes S and waits for all transitively spawned tasks (barrier)
  □  If one task throws an exception, all others are finished first
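The async/finish pairing is ordinary fork-join; a rough Python analogue using threads (a sketch of the shape, not X10 semantics — only directly spawned tasks are tracked here, whereas X10's finish also waits on transitively spawned ones):

```python
import threading

class Finish:
    """Approximates X10's 'finish S': waits for all tasks spawned inside."""
    def __init__(self):
        self.tasks = []

    def async_(self, fn, *args):      # approximates 'async S'
        t = threading.Thread(target=fn, args=args)
        self.tasks.append(t)
        t.start()

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        for t in self.tasks:          # barrier: join every spawned task
            t.join()

results = {}

def compute_f1():
    results["f1"] = 8                 # like: async { f1 = fib(n-1); }

with Finish() as f:
    f.async_(compute_f1)
    results["f2"] = 5                 # like: f2 = fib(n-2);

print(results["f1"] + results["f2"])  # both halves are ready after finish
```

After the with block, both the spawned and the inline computation are guaranteed to have completed, mirroring the barrier semantics of finish in the Fib example on the next slide.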


Example

14

public class Fib {
    public static def fib(n:int) {
        if (n <= 2) return 1;
        val f1:int;
        val f2:int;
        finish {
            async { f1 = fib(n-1); }
            f2 = fib(n-2);
        }
        return f1 + f2;
    }

    public static def main(args:Array[String](1)) {
        val n = (args.size > 0) ? int.parse(args(0)) : 10;
        Console.OUT.println("Computing Fib("+n+")");
        val f = fib(n);
        Console.OUT.println("Fib("+n+") = "+f);
    }
}


X10

15


■  atomic S
  □  Executes S atomically with respect to all other atomic blocks
  □  S must not create concurrency and may access only local data
■  when(c) S
  □  Suspends the current task until c holds, then executes S atomically
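A rough Python analogue of when(c) S using a condition variable, mirroring the Buffer example on the next slide (Cell and its method names are my own, not X10 API):

```python
import threading

class Cell:
    """Approximates X10's 'when(c) S' with a condition variable."""
    def __init__(self):
        self.cond = threading.Condition()
        self.datum = None

    def when(self, c, s):
        # Suspend until predicate c() holds, then run s() while holding
        # the lock, i.e. atomically with respect to other when() calls
        with self.cond:
            self.cond.wait_for(c)
            s()
            self.cond.notify_all()

    def send(self, v):        # like Buffer.send: wait for an empty slot
        self.when(lambda: self.datum is None,
                  lambda: setattr(self, "datum", v))

    def receive(self):        # like Buffer.receive: wait for a value
        out = []
        self.when(lambda: self.datum is not None,
                  lambda: (out.append(self.datum),
                           setattr(self, "datum", None)))
        return out[0]

cell = Cell()
t = threading.Thread(target=cell.send, args=(42,))
t.start()
print(cell.receive())
t.join()
```

The condition variable re-checks the predicate after every wakeup, which is what makes the "suspend until c, then execute atomically" contract hold regardless of scheduling order.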


Example

16

Examples

class Account {
    public var value:Int;
    def transfer(src:Account, v:Int) {
        atomic {
            src.value -= v;
            this.value += v;
        }
    }
}

class Latch {
    private var b:Boolean = false;
    def release() { atomic b = true; }
    def await() { when(b); }
}

class Buffer[T]{T isref, T haszero} {
    protected var datum:T = null;

    public def send(v:T){v != null} {
        when(datum == null) {
            datum = v;
        }
    }

    public def receive() {
        when(datum != null) {
            val v = datum;
            datum = null;
            return v;
        }
    }
}


Tasks Can Move

17


■  at(p) S
  □  Executes statement S at place p, blocking the current task
■  at(p) e
  □  Evaluates expression e at place p and returns the result
■  at(p) async S
  □  Creates a new task at p to run S, returns immediately


Example

18

HelloWholeWorld.x10

class HelloWholeWorld {
    public static def main(args:Rail[String]) {
        finish
            for (p in Place.places())
                at(p) async
                    Console.OUT.println(p + " says " + args(0));
        Console.OUT.println("Bye");
    }
}

$ x10c++ HelloWholeWorld.x10
$ X10_NPLACES=4 ./a.out hello
Place(0) says hello
Place(2) says hello
Place(3) says hello
Place(1) says hello
Bye


X10 Object Model

■  Objects live in a single place
  □  Tasks shift their place, not objects
  □  Bring the work to the data, not the other way around
■  Global references can be created and used explicitly
  □  val ref:GlobalRef[Rail[Int]] = GlobalRef(rail);
■  at transparently copies the reachable object graph
  □  The compiler identifies reachable parts of the object graph
  □  The runtime copies the necessary data
  □  Global references are serialized, not their content
  □  Special support for arrays as a non-reference type

19
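The copy-on-at behavior can be mimicked in Python with deep copying; GlobalRef here is my own stand-in class, not X10's API:

```python
import copy

class GlobalRef:
    """Stand-in for X10's GlobalRef: the reference is serialized,
    not the referenced content."""
    def __init__(self, obj, home):
        self.obj, self.home = obj, home

    def __deepcopy__(self, memo):
        return self  # copy the reference itself, never the referent

big = list(range(5))
env = {"n": 3, "ref": GlobalRef(big, home=0)}

# What 'at' would ship to another place: a deep copy of the reachable
# object graph, with global references kept shallow
shipped = copy.deepcopy(env)

print(shipped["n"])                  # plain value: copied
print(shipped["ref"] is env["ref"])  # global ref: shared, not copied
```

The deep copy models "at transparently copies the reachable object graph", while overriding __deepcopy__ models "global references are serialized, not the content".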


X10 Example: Parallel Sum

20

public class ParaSum {
    public static def main(argv:Rail[String]) {
        val id = (i:Int) => i;   // integer identity function
        x10.io.Console.OUT.println("sum(i=1..10)i = " + sum(id, 1, 10));
        val sq = (i:Int) => i*i; // integer square function; inline definition used instead
        x10.io.Console.OUT.println("sum(i=1..10)i*i = " + sum((i:Int)=>i*i, 1, 10));
    }

    public static def sum(f:(Int)=>Int, a:Int, b:Int):Int {
        val s = Rail.make[Int](1);
        s(0) = 0;
        finish {
            for (p in Place.places) {
                async { // spawn an async at each place to compute its local range
                    val pPartialSum = at(p) sumForPlace(f, a, b);
                    atomic { s(0) += pPartialSum; } // add partial sums
                }
            }
        }
        return s(0); // return total sum
    }

    private static def sumForPlace(f:(Int)=>Int, a:Int, b:Int) {
        var accum:Int = 0;
        // each place p of K places computes f(a+p.id), f(a+p.id+K), f(a+p.id+2K), etc.
        for (var i:Int = here.id + a; i <= b; i += Place.places.length) {
            accum += f(i);
        }
        return accum;
    }
}


Fortran

■  Programming language for scientific computing
  □  Developed in the 1950s by IBM
  □  FORTRAN 66: first ANSI-standardized version
  □  FORTRAN 77: structured programming
  □  Fortran 90: modular programming, array operations
  □  Fortran 2003: object-oriented programming
  □  Fortran 2008: concurrent programming
■  Primary language in high-performance computing

21

[Wikipedia]


Fortran

22

[Wikipedia]

[Figure: Fortran array operation example]


High-Performance Fortran (HPF)

■  High-Performance Fortran as a Fortran 95 extension
■  Minimal set of extensions to the classical Fortran language
  □  Data-parallel programming model
  □  Expression of parallelism, data distribution, and alignment
  □  Two-level mapping of data objects to abstract processors
    ◊  Array elements are aligned with a template structure
    ◊  Template elements are distributed to abstract processors

23

[T. Haupt, HPF Tutorial — 3 Data Mapping, 3.1 Overview]

[Figure: two-level mapping — data objects are aligned to a template (!HPF$ TEMPLATE, !HPF$ ALIGN), the template is distributed onto abstract processors with grid topology (!HPF$ PROCESSORS, !HPF$ DISTRIBUTE), and the mapping of the grid onto physical processors with arbitrary topology is implementation dependent.]

HPF data alignment and distribution directives allow the programmer to advise the compiler how to assign data objects (typically array elements) to processors' memories. The model (cf. figure) is a two-level mapping of data objects to memory regions, referred to as "abstract processors":

• arrays are first aligned relative to one another,
• and then this group of arrays is distributed onto a user-defined, rectilinear arrangement of abstract processors.

The final mapping, abstract to physical processors, is not specified by HPF and is language-processor dependent.

The alignment itself is logically accomplished in two steps. First, the index space spanned by an array that serves as an align target defines a natural template of the array. Then, an alignee is associated with this template. In addition, HPF allows users to declare a template explicitly; this is particularly convenient when aligning arrays of different size and/or different shape. It is the template (either a natural or an explicit one) that is distributed onto abstract processors. This means that all array elements aligned with an element of the template are mapped to the same processor. This way, locality of data is enforced. Arrays and other data objects that are not explicitly distributed using the compiler directives are mapped according to an implementation-dependent default distribution.


High-Performance Fortran (HPF)

■  PROCESSORS directive
  □  Declares a rectangular processor arrangement
  □  Defined by the number of dimensions and the extent per dimension
  □  The final mapping (abstract -> physical) is not part of HPF
■  TEMPLATE directive
  □  A template is defined by its number of dimensions and extents
  □  Abstract space of indexed positions
  □  Does not occupy memory at run-time
  □  Each data structure has a natural template
    ◊  Example: for an array, an index space that is identical to it

24


High-Performance Fortran (HPF)

■  HPF compiler directives as structured comments
  □  Hints for the compiler
  □  No change to the program semantics
■  DISTRIBUTE directive
  □  Allows a template to be distributed (BLOCK / CYCLIC)
  □  Any dimension of the template can be collapsed or replicated on a processor grid

25

[T. Haupt, HPF Tutorial]

... if the directive is to be satisfied. The CYCLIC(n) distribution specifies that successive blocks of n array elements are dealt out to successive abstract processors in round-robin fashion. Finally, the CYCLIC distribution is equivalent to CYCLIC(1). The HPF distributions are illustrated by the following directives:

!HPF$ TEMPLATE T(16)
!HPF$ PROCESSORS P(4)
!HPF$ DISTRIBUTE T(BLOCK) ONTO P
!HPF$ DISTRIBUTE T(BLOCK(5)) ONTO P
!HPF$ DISTRIBUTE T(CYCLIC) ONTO P
!HPF$ DISTRIBUTE T(CYCLIC(2)) ONTO P

[Figure: examples of HPF distributions of the 16 template elements onto the 4 processors]

Every object is created as if according to some complete set of specification directives; if the program does not include complete specifications for the mapping of some object, the compiler provides defaults. The default distribution is language-processor dependent, but must be expressible as explicit directives for that implementation.

3.3 ALIGN

The ALIGN directive is used to specify that certain data objects are to be mapped in the same way as certain other data objects. Operations between aligned data objects are likely to be more efficient than operations between data objects that are not known to be aligned.
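The four distributions above can be checked numerically; a small Python sketch (1-based template indices as in Fortran; the helper names are mine):

```python
# Which abstract processor of P(4) owns template element i (1-based),
# for the distributions of !HPF$ TEMPLATE T(16) onto !HPF$ PROCESSORS P(4)
P, N = 4, 16

def block(i, m=None):
    # BLOCK / BLOCK(m): contiguous chunks; default chunk size is ceil(N/P)
    b = m if m is not None else -(-N // P)
    return (i - 1) // b

def cyclic(i, n=1):
    # CYCLIC / CYCLIC(n): blocks of n elements dealt out round-robin
    return ((i - 1) // n) % P

print([block(i) for i in range(1, N + 1)])      # BLOCK: chunks of 4
print([block(i, 5) for i in range(1, N + 1)])   # BLOCK(5): last processor short
print([cyclic(i) for i in range(1, N + 1)])     # CYCLIC: element round-robin
print([cyclic(i, 2) for i in range(1, N + 1)])  # CYCLIC(2): pairs round-robin
```

BLOCK assigns elements 1-4, 5-8, 9-12, 13-16 to processors 0-3; BLOCK(5) assigns 1-5, 6-10, 11-15, 16; CYCLIC deals single elements and CYCLIC(2) deals pairs round-robin.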


High-Performance Fortran (HPF)

■  ALIGN directive
  □  Specifies that some data objects should be mapped like others
  □  Intention: optimize operations that act on both
  □  Many options: shifts, strides, transposition of indices, ...
■  ALIGN and DISTRIBUTE are static data mappings
  □  Part of the subroutine declaration
■  Dynamic versions at runtime
  □  REALIGN
  □  REDISTRIBUTE

26

[T. Haupt, HPF Tutorial]

Data objects such as arrays may be aligned with one another in many ways. The repertoire includes shifts, strides, or any other linear combination of a subscript (i.e., n*i + m), transposition of indices, and collapse or replication of array dimensions. Skewed or irregular alignments are, however, not allowed. Examples of HPF alignments (shift and strides, transposition and stride, collapse, relative alignment):

      INTEGER, DIMENSION(4,4) :: B
!HPF$ TEMPLATE T(12,12)
!HPF$ ALIGN B(I,J) WITH T(2:12:3,1:12:3)

      REAL, DIMENSION(8,8) :: C,D
!HPF$ TEMPLATE T(12,12)
!HPF$ ALIGN C(:,:) WITH T(:,:)
!HPF$ ALIGN D(I,J) WITH T(I+5,J+5)

      REAL, DIMENSION(12,6) :: A
!HPF$ TEMPLATE T(12,12)
!HPF$ ALIGN A(I,J) WITH T(2*J-1,I)

      REAL, DIMENSION(8,12) :: E
!HPF$ TEMPLATE T(12)
!HPF$ ALIGN E(*,:) WITH T(:)

If an object A is aligned with an object B, which in turn is aligned to an object C, this is regarded as an alignment of A with C directly. We say that A is ultimately aligned with C. If an object is not explicitly aligned with another object, we say that it is ultimately aligned with itself.

It is illegal to explicitly realign an object (REALIGN directive) if anything else is aligned to it, and it is illegal to explicitly redistribute an object (REDISTRIBUTE directive) if it is aligned with another object.


High-Performance Fortran (HPF)

27

Data distribution in HPF:

!HPF$ PROCESSORS :: prc(5), chess_board(8, 8)
!HPF$ PROCESSORS :: cnfg(-10:10, 5)
!HPF$ PROCESSORS :: mach( NUMBER_OF_PROCESSORS() )

      REAL :: a(1000), b(1000)
      INTEGER :: c(1000, 1000, 1000), d(1000, 1000, 1000)

!HPF$ DISTRIBUTE (BLOCK) ONTO prc :: a
!HPF$ DISTRIBUTE (CYCLIC) ONTO prc :: b
!HPF$ DISTRIBUTE (BLOCK(100), *, CYCLIC) ONTO cnfg :: c
!HPF$ ALIGN (i,j,k) WITH d(k,j,i) :: c


HPF Data Parallelism

■  With parallel array assignments
■  With FORALL statements
■  With the INDEPENDENT directive
  □  Instructs the compiler that the iterations of the following FORALL statement have no data dependencies
■  With PURE functions
  □  Functions with syntactic restrictions so that they produce no side effects; mandatory in a FORALL body
■  With some of the intrinsic data functions

28

[T. Haupt, HPF Tutorial]

Here, the assignment a(i,j) = b(i,j) is executed only for those pairs of indices (i,j) where the elements of the logical array mask(i,j) evaluate to .TRUE.

Elemental invocation of intrinsic functions

Arrays and array sections can be arguments to a broad class of elemental intrinsic functions, such as ABS, ATAN, COS, COSH, EXP, to name a few. For example, if A, B are arrays as defined above,

      B = ABS(A)

is equivalent to the Fortran 77 loops:

      DO i=1,100
        DO j=1,100
          b(i,j) = ABS(a(i,j))
        END DO
      END DO

FORALL statement and construct

The FORALL statement and construct are new language features to express data parallelism, that is, to provide a convenient syntax for simultaneous assignments to large groups of array elements. The functionality they provide is very similar to that provided by the array assignments and the WHERE constructs in Fortran 90. In fact, all Fortran 90 array assignments, including WHERE, can be expressed using FORALL statements. For example,

      B = 1.0
      A = B
      A(1:98,3:100) = B(3:100,1:98)
      WHERE (B.GT.0) A = 2.*B

can be expressed using FORALL syntax as

      FORALL (I=1:100, J=1:100) B(I,J) = 1.0
      FORALL (I=1:100, J=1:100) A(I,J) = B(I,J)
      FORALL (I=1:98,  J=3:100) A(I,J) = B(I+2,J-2)
      FORALL (I=1:100, J=1:100, B(I,J).GT.0) A(I,J) = 2.*B(I,J)

However, Fortran 90 places several restrictions on array assignments. In particular, it requires that the operands of the right-side expressions be conformable with the left-hand-side array ...


Declarative Programming

■  .NET "Language Integrated Query" (LINQ)
■  General-purpose query facility, e.g. for databases or XML
■  Declarative standard query operators
■  PLINQ parallelizes the execution of queries on objects and XML
■  The declarative style of LINQ allows a seamless transition to the parallel version of the code

29

var query = from p in products
            where p.Name.StartsWith("A")
            orderby p.ID
            select p;

foreach (var p in query) {
    Console.WriteLine(p.Name);
}

IEnumerable<T> data = ...;
var q = data.Where(x => p(x)).OrderBy(x => k(x)).Select(x => f(x));
foreach (var e in q) a(e);

IEnumerable<T> data = ...;
var q = data.AsParallel().Where(x => p(x)).OrderBy(x => k(x)).Select(x => f(x));
foreach (var e in q) a(e);
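The same pipeline shape can be written with Python built-ins; a sketch in which a thread pool stands in for AsParallel (illustrative only — the data and predicate are made up):

```python
from concurrent.futures import ThreadPoolExecutor

products = ["Apple", "Anise", "Banana", "Avocado"]

# Sequential pipeline: where -> orderby -> select
q = sorted((p for p in products if p.startswith("A")), key=len)

# "Parallel" variant: the filter predicate is evaluated by a thread
# pool, but the declarative pipeline itself is left unchanged —
# which is exactly the transition PLINQ makes easy
with ThreadPoolExecutor(max_workers=2) as pool:
    keep = list(pool.map(lambda p: p.startswith("A"), products))
qp = sorted((p for p, k in zip(products, keep) if k), key=len)

print(q)
print(q == qp)  # both strategies yield the same query result
```

Because the query only states *what* to compute, swapping the execution strategy does not change the result.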


Functional Programming

■  Programming paradigm that treats execution as function evaluation -> map some input to some output
■  Contrary to imperative programming
  □  No longer a focus on statement execution for state modification
  □  The programmer no longer specifies control flow explicitly
  □  High-level solution
■  Side-effect-free computation through avoidance of local state -> referential transparency (no demand for a particular control flow)
■  Typically a strong focus on immutable data as the language default -> instead of altering values, return an altered copy
■  One foundation: Alonzo Church's lambda calculus from the 1930s
■  The first functional language was Lisp (late 1950s)
■  Trend to add functional programming features to imperative languages (anonymous functions, filter, map, …)

30


Imperative to Functional

31

alert("I'd like some Spaghetti!");
alert("I'd like some Chocolate Moose!");

function SwedishChef(food) {
    alert("I'd like some " + food + "!");
}
SwedishChef("Spaghetti");
SwedishChef("Chocolate Moose");

alert("get the lobster");
PutInPot("lobster");
PutInPot("water");

alert("get the chicken");
BoomBoom("chicken");
BoomBoom("coconut");

Optimize:

function Cook(i1, i2, f) {
    alert("get the " + i1);
    f(i1);
    f(i2);
}
Cook("lobster", "water", function(x) { alert("pot " + x); });
Cook("chicken", "coconut", function(x) { alert("boom " + x); });

Optimize with a named instead of an anonymous function:

function Cook(i1, i2, f) {
    alert("get the " + i1);
    f(i1);
    f(i2);
}
Cook("lobster", "water", PutInPot);
Cook("chicken", "coconut", BoomBoom);

http://www.joelonsoftware.com/items/2006/08/01.html


Imperative to Functional

■  map() does not demand a particular operation ordering

32

var a = [1,2,3];
for (i = 0; i < a.length; i++) {
    a[i] = a[i] * 2;
}
for (i = 0; i < a.length; i++) {
    alert(a[i]);
}

function map(fn, a) {
    for (i = 0; i < a.length; i++) {
        a[i] = fn(a[i]);
    }
}

map(function(x) { return x*2; }, a);
map(alert, a);


Imperative to Functional

■  map() and reduce() functions do not demand a particular operation ordering

33

function sum(a) {
    var s = 0;
    for (i = 0; i < a.length; i++)
        s += a[i];
    return s;
}
function join(a) {
    var s = "";
    for (i = 0; i < a.length; i++)
        s += a[i];
    return s;
}
alert(sum([1,2,3]));
alert(join(["a","b","c"]));

function reduce(fn, a, init) {
    var s = init;
    for (i = 0; i < a.length; i++)
        s = fn(s, a[i]);
    return s;
}

function sum(a) {
    return reduce(function(a, b) { return a + b; }, a, 0);
}

function join(a) {
    return reduce(function(a, b) { return a + b; }, a, "");
}
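Because the combining function here is associative, a reduce() may also be evaluated as a tree rather than strictly left-to-right, with whole levels processed in parallel; a sequential Python sketch of that reordering (tree_reduce is my own name):

```python
def tree_reduce(fn, a, init):
    # Pairwise (tree-shaped) reduction: each level could be computed in
    # parallel, since an associative fn does not demand left-to-right
    # evaluation order
    if not a:
        return init
    level = list(a)
    while len(level) > 1:
        pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
        level = [fn(p[0], p[1]) if len(p) == 2 else p[0] for p in pairs]
    return level[0]

print(tree_reduce(lambda x, y: x + y, [1, 2, 3], 0))
print(tree_reduce(lambda x, y: x + y, ["a", "b", "c"], ""))
```

This mirrors the sum/join pair above: the same tree_reduce computes both, and the reordering it performs is only valid because neither addition nor string concatenation in this use depends on evaluation order.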


Imperative to Functional - Python

# Nested loop procedural style for finding big products
xs = (1,2,3,4)
ys = (10,15,3,22)
bigmuls = []
for x in xs:
    for y in ys:
        if x*y > 25:
            bigmuls.append((x,y))
print bigmuls

34

print [(x,y) for x in (1,2,3,4) for y in (10,15,3,22) if x*y > 25]

[David Mertz]


Functional Programming

■  Higher-order functions: functions as argument or return value
■  Pure functions: no memory or I/O side effects
  □  If the result of a pure expression is not used, it can be removed
  □  A pure function called with side-effect-free parameters has a constant result
  □  Without data dependencies, pure functions can run in parallel
  □  A language with only pure function semantics can change the evaluation order
  □  Functions with side effects (e.g. printing) typically do not return results
■  Recursion as a replacement for looping (e.g. factorial)
■  Lazy evaluation is possible, e.g. to support infinite data structures
■  Perfect foundation for implicit parallelism ...

35
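The parallel-evaluation claim can be demonstrated directly; a minimal Python sketch (a thread pool is my choice of executor here):

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    # Pure: no I/O, no shared state; the result depends only on x
    return x * x

data = [1, 2, 3, 4, 5]

# Sequential and parallel evaluation must agree, precisely because
# square has no side effects and there are no data dependencies
# between the individual calls
sequential = [square(x) for x in data]
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(square, data))

print(sequential == parallel)
```

With an impure function (say, one appending to a global list) the two evaluation orders could be observed to differ; purity is what licenses the runtime to pick either.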


Functional Programming

■  Pure functions save state on the stack as function parameters
  □  But: applications must have side effects
  □  The application's task is to modify state
  □  The goal is to limit side effects and to concentrate them
■  Many popular new functional languages
  □  JVM-based: Clojure, Scala (parts of it)
  □  Common Lisp, Erlang, F#, Haskell, ML, OCaml, Scheme

36


Clojure

■  Dynamically typed, functional programming language
■  Derived from Lisp
■  Runs on JVM >= v5, allows Java interoperability
  □  Variations with backends for .NET and JavaScript
■  Major goal: easier development of data-parallel applications
■  A Clojure operation is a function, macro, or special form
  □  Special forms are implemented by the compiler
  □  Typical keywords (new, throw, monitor-enter, if, let, ...)
■  Three ways of sharing mutable data in a safe way
  □  Refs: coordinated access through software-transactional memory
  □  Atoms: synchronized access to one data item
  □  Agents: asynchronous access to one data item

37
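Clojure's atoms update values through a compare-and-set retry loop; a Python approximation (a lock stands in for the lock-free hardware CAS primitive; the class and method names are mine):

```python
import threading

class Atom:
    """Approximation of a Clojure atom: swap applies a pure function to
    the current value inside a compare-and-set retry loop."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def deref(self):
        return self._value

    def compare_and_set(self, old, new):
        with self._lock:  # stands in for an atomic CAS instruction
            if self._value == old:
                self._value = new
                return True
            return False

    def swap(self, fn):
        while True:       # retry until no other thread interfered
            old = self.deref()
            new = fn(old)
            if self.compare_and_set(old, new):
                return new

counter = Atom(0)

def bump_1000():
    for _ in range(1000):
        counter.swap(lambda v: v + 1)

threads = [threading.Thread(target=bump_1000) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter.deref())  # no lost updates despite 4 concurrent writers
```

The retry loop is why the update function handed to swap should be pure: it may be re-executed when the compare-and-set fails.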


Clojure

38

[Figure: Clojure code example — "Data" (Function)]


Clojure

39


Fortress (== „Secure Fortran“)

■  Oracle / Sun Programming Language Research Group, Guy L. Steele (Scheme, Common Lisp, Java)

■  Language designed for (mathematical) high-performance computing

■  Dynamic compilation, type inference

■  Growable language: Prefer library over compiler

■  Mathematical notation
□  Everything is an expression, some having void value (e.g. while, for, assignment)
□  Source code can be rendered in ASCII, Unicode, or as image

■  Functional programming concepts, but also Scala / Haskell derivations


Page 41

Fortress - Comparison to C

■  No memory management, all handled by the runtime system
■  Implicit instead of explicit threading
■  Set of types similar to the C library
■  Fortress program state: number of threads + memory
■  Fortress program execution: evaluation of expressions in all threads

■  Component model supported, interfaces can be imported and exported
□  Components live in the ‚fortress‘ database, interaction through a shell


Page 42

Fortress Syntax

■  Adopt math whenever possible
□  Integers, naturals, rationals, complex numbers, floating point …
□  Support for units and dimensions

■  Everything is an expression, () is the void value
□  Statements are void-type expressions (while, for, assignment, binding)
□  Some statements have non-() values (if, do, try, case, spawn, ...)
•  if x ≥ 0 then x else -x end
•  atomic x := max(x, y[k])

■  Generators: „j:k“ - range, „j#n“ - n consecutive integers from j, ...
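The `atomic x := max(x, y[k])` example is an atomic read-modify-write: the read of x and the write of the maximum happen as one indivisible step. A rough Python analogue (my own sketch, using an explicit lock rather than Fortress's transactional atomic blocks):

```python
import threading

# Sketch: the effect of Fortress's `atomic x := max(x, y[k])` expressed
# with an explicit lock. Read x and write the maximum as one indivisible
# step, so concurrent updates never lose a larger value.

x = 0
x_lock = threading.Lock()
y = [7, 3, 9, 5]

def atomic_max(k):
    global x
    with x_lock:          # everything inside happens atomically
        x = max(x, y[k])

threads = [threading.Thread(target=atomic_max, args=(k,))
           for k in range(len(y))]
for t in threads: t.start()
for t in threads: t.join()
print(x)  # 9
```

Without the lock, two threads could both read x = 0 and one could overwrite the other's larger value.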


Page 43

Fortress Basics

■  Object: fields and methods
■  Traits: set of abstract / concrete methods (extended interface concept)
■  Every object extends a set of traits


trait Boolean extends BooleanAlgebra⟦Boolean,∧,∨,¬,⊻,false,true⟧
    comprises { true, false }
  opr ∧(self, other: Boolean): Boolean
  opr ∨(self, other: Boolean): Boolean
  opr ¬(self): Boolean
end

object true extends Boolean
  opr ∧(self, other: Boolean) = other
  opr ∨(self, other: Boolean) = self
  opr ¬(self) = false
end
...

Page 44

Fortress - Functions

■  Functions
◊  Static (nat or int) parameters
□  One variable parameter
□  Optional return value
□  Optional body expression
□  Result comes from evaluation of the body

■  do-end expression: sequence of expressions with implicit parallel execution, the last one defining the block's result
□  Also supports an also do syntax for explicit parallelism


do
  factorial(10)
also do
  factorial(5)
also do
  factorial(2)
end
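A rough Python analogue of the Fortress `do ... also do ... end` block (my own sketch using a thread pool): the three calls may run in parallel, and each branch yields its own result.

```python
from concurrent.futures import ThreadPoolExecutor
from math import factorial

# Sketch: explicit parallelism in the style of Fortress's
# `do ... also do ... also do ... end`. Each branch is submitted as a
# separate task; result order matches submission order.

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(factorial, n) for n in (10, 5, 2)]
    results = [f.result() for f in futures]

print(results)  # [3628800, 120, 2]
```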

Page 45

Fortress - Parallelism

■  Parallel programming as a necessary compromise, not as the primary goal
■  Implicit parallelism wherever possible, supported by the functional approach
□  Evaluated in parallel: function / method arguments, operator operands, tuple expressions (each element evaluated separately), loop iterations, sums
□  Loop iterations are parallelized
□  Generators generate values in parallel, called functions run in parallel

■  Race condition handling through the atomic keyword, explicit spawn keyword

for i <- 1:5 do
  print(i " ")
  print(i " ")
end

for i <- sequential(1:5) do
  print(i " ")
  print(i " ")
end
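The two loops above contrast Fortress's default parallel iteration with an explicitly sequential generator. A hedged Python sketch of the same distinction (names and loop body are my own):

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: parallel loop iterations vs. a forced sequential loop.
# In the parallel case the *execution* order of iterations is
# unspecified; only the set of results is guaranteed.

def body(i):
    return i * i  # stand-in for the loop body

with ThreadPoolExecutor() as pool:
    # iterations may run concurrently; map still returns results in order
    parallel_results = list(pool.map(body, range(1, 6)))

sequential_results = [body(i) for i in range(1, 6)]  # deterministic order

print(parallel_results)  # [1, 4, 9, 16, 25]
```

This mirrors why the slide's parallel loop may interleave its print output while the `sequential(1:5)` version cannot.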

Page 46

Asynchronous Programming?

■  Huge hype around asynchronous programming
□  A model to implement I/O concurrency
□  Explicit avoidance of any threading overhead
□  Focus on fast context switches between many activities

■  Often implemented in an event-driven programming style
□  Activities perform a callback when they are done
□  Whole API modeled around this idea
◊  JavaScript, Node.js, Python Twisted, …
◊  Mostly based on the libevent wrapper library for /dev/poll, kqueue, epoll, …

■  Mainly targets the problem of code blocking on I/O activity
■  No standalone solution for speedup, but helps with scaling
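A minimal sketch of this style using Python's asyncio (my own example; the slide's frameworks predate asyncio but follow the same model): many activities share one thread and yield at their I/O points instead of each blocking a thread.

```python
import asyncio

# Sketch: I/O concurrency without threads. Each activity suspends at its
# wait point; the event loop switches between activities cheaply.

async def activity(name, delay, done):
    await asyncio.sleep(delay)   # stands in for a non-blocking I/O wait
    done.append(name)            # the "callback" work once the wait ends

async def main():
    done = []
    # Three concurrent activities on one thread, no threading overhead.
    await asyncio.gather(activity("a", 0.02, done),
                         activity("b", 0.01, done),
                         activity("c", 0.00, done))
    return done

print(asyncio.run(main()))  # finishes in wait order: ['c', 'b', 'a']
```

Note that this gives I/O concurrency and scaling, not CPU speedup: all activities still share a single core, matching the slide's last point.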


Page 47

Profiling

■  Many profiling tools available for shared-memory parallelism

■  Sampling profiler
□  Sporadic recording of application state
□  Time-driven: uniform time period between samples
□  Event-driven: uniform event count between samples
□  Original code remains unchanged
□  Small impact on execution behavior, can find race conditions
□  Low overhead (hardware-based ~ 2%, software-based ~ 5%)

■  Instrumenting profiler
□  Modification of the original application with measurement code
□  Allows gathering of all possible events
□  Higher accuracy than sampling, but also higher overhead

■  Data gathering is one part, visualization a different one
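A small sketch of the instrumenting approach using Python's `sys.setprofile` hook (my own example): the interpreter calls our hook on every function entry, so we get exact call counts, at the cost of paying the hook on every single call. A sampling profiler would instead wake up periodically and record the current stack.

```python
import sys
from collections import Counter

# Sketch of an *instrumenting* profiler: complete, exact event data,
# but measurement code now runs on every function call.

calls = Counter()

def profiler(frame, event, arg):
    if event == "call":                      # count function entries
        calls[frame.f_code.co_name] += 1

def leaf():
    return 1

def work():
    return sum(leaf() for _ in range(100))

sys.setprofile(profiler)   # attach the measurement code
work()
sys.setprofile(None)       # detach it again

print(calls["leaf"])  # 100 -- every call observed, none missed
```

A time-driven sampler observing the same run would see `leaf` only in proportion to the time spent there, which is why it is cheaper but less accurate.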


Page 48

Intel VTune

■  Commercial profiling tool, for command line or GUI
■  Plugins for Eclipse and Visual Studio
■  Support for Fortran, C, C++, Java, .NET, Assembly
■  Linux and Windows

■  Standard features
□  Hotspot analysis: where is the most time spent?
□  Concurrency analysis: are all cores well utilized?
□  Lock analysis: which locks are ‚hot‘?

■  User-mode sampling
□  Sampling library is dynamically attached via LD_PRELOAD
□  Sets up a timer per thread, then issues a signal and takes a sample

■  Hardware-event-based sampling


Page 49

Performance Monitoring Unit

(Figure: PMU overview; from "Intel VTune" slides by Lena Herscheid)
•  One PMU per core: elapsed cycles, L1 / L2 cache events, processed instructions, …
•  One PMU in the uncore region: uncore bound, QPI events, L3 cache events, memory controller events, …

Page 50

Example

Hotspot Analysis (Whetstone)
(Screenshots: Bottom-Up View; Top-Down View (Call Tree))

Page 51

Example

Hotspot Analysis (Whetstone)

Page 52

Example

Concurrency Analysis (Whetstone)

Page 53

Example

Locks and Waits Analysis (Whetstone)