Code Optimization of Parallel Programs

Vivek Sarkar
Rice University
vsarkar@rice.edu

[Title slide figure: multicore chip floorplan showing the L3 directory/control, shared L2 caches, and per-core units (IFU, IDU, ISU, LSU, FXU, FPU, BXU).]

Parallel Software Challenges & Focus Area for this Talk

Languages: explicitly parallel languages, e.g., OpenMP, Java Concurrency, .NET Parallel Extensions, Intel TBB, CUDA, Cilk, MPI, Unified Parallel C, Co-Array Fortran, X10, Chapel, Fortress
Domain-specific Programming Models: domain-specific implicitly parallel programming models, e.g., Matlab, stream processing, map-reduce (Sawzall)
Application Libraries: parallel application libraries, e.g., linear algebra, graphics imaging, signal processing, security
Middleware: parallelism in middleware, e.g., transactions, relational databases, web services, J2EE containers
Programming Tools: parallel debugging and performance tools, e.g., Eclipse Parallel Tools Platform, TotalView, Thread Checker
Static & Dynamic Optimizing Compilers: parallel intermediate representation, optimization of synchronization & data transfer, automatic parallelization
Parallel Runtime & System Libraries: task scheduling, synchronization, parallel data structures
OS and Hypervisors: virtualization, scalable management of heterogeneous resources per core (frequency, power)
Multicore Back-ends: code partitioning for accelerators, data transfer optimizations, SIMDization, space-time scheduling, power management

Outline

Paradigm Shifts
Anomalies in Optimizing Parallel Code
Incremental vs. Comprehensive Approaches to Code Optimization of Parallel Code
Rice Habanero Multicore Software project

Our Current Paradigm for Code Optimization has served us well for Fifty Years …

[Figure: Stretch–Harvest compiler organization (1958–1962). Fortran, Autocoder II, and ALPHA front ends are translated into a common IL, which flows through the OPTIMIZER, REGISTER ALLOCATOR, and ASSEMBLER to produce object code for the STRETCH and STRETCH-HARVEST machines.]

Source: “Compiling for Parallelism”, Fran Allen, Turing Lecture, June 2007

… and has been adapted to meet challenges along the way …

Interprocedural analysis
Array dependence analysis
Pointer alias analysis
Instruction scheduling & software pipelining
SSA form
Profile-directed optimization
Dynamic compilation
Adaptive optimization
Auto-tuning
. . .

… but is now under siege because of parallelism

Proliferation of parallel hardware: multicore, manycore, accelerators, clusters, …
Proliferation of parallel libraries and languages: OpenMP, Java Concurrency, .NET Parallel Extensions, Intel TBB, Cilk, MPI, UPC, CAF, X10, Chapel, Fortress, …

Paradigm Shifts

"The Structure of Scientific Revolutions", Thomas S. Kuhn (1970)
A paradigm is a scientific structure or framework consisting of assumptions, laws, and techniques.
Normal science is a puzzle-solving activity governed by the rules of the paradigm; it is uncritical of the current paradigm.
Crisis sets in when a series of serious anomalies appears: "The emergence of new theories is generally preceded by a period of pronounced professional insecurity." Scientists engage in philosophical and metaphysical disputes.
A revolution or paradigm shift occurs when an entire paradigm is replaced by another.

Kuhn’s History of Science

[Diagram: Immature Science → Normal Science → Anomalies → Crisis → Revolution]

Revolution: a new paradigm emerges.
Old theory: well established, many followers, many anomalies.
New theory: few followers, untested, new concepts/techniques, accounts for anomalies, asks new questions.

Source: www.philosophy.ed.ac.uk/ug_study/ug_phil_sci1h/phil_sci_files/L10_Kuhn1.ppt

Some Well Known Paradigm Shifts

Newton’s Laws to Einstein’s Theory of Relativity
Ptolemy’s geocentric view to Copernicus and Galileo’s heliocentric view
Creationism to Darwin’s Theory of Evolution

Outline

Paradigm Shifts
Anomalies in Optimizing Parallel Code
Incremental vs. Comprehensive Approaches
Rice Habanero Multicore Software project

What anomalies do we see when optimizing parallel code?

Examples:
1. Control flow rules
2. Data flow rules
3. Load elimination rules

1. Control Flow Rules from Sequential Code Optimization

Control Flow Graph
Node = basic block
Edge = transfer of control flow
Succ(b) = successors of block b
Pred(b) = predecessors of block b

Dominators
Block d dominates block b if every (sequential) path from START to b includes d.
Dom(b) = set of dominators of block b.
Every block has a unique immediate dominator (its parent in the dominator tree).

Dominator Example

Control Flow Graph: START → BB1; BB1 → BB2 (T) and BB3 (F); BB2 → BB4; BB3 → BB4; BB4 → STOP.

Dominator Tree: START → BB1; BB1 → BB2, BB3, BB4; BB4 → STOP.
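For concreteness, here is a minimal sketch (in Python, not from the original slides) of the standard iterative dominator computation, run on the example CFG above; the graph encoding and function name are illustrative.

# Minimal sketch of iterative dominator computation (illustrative).
# Dom(b) = {b} ∪ intersection of Dom(p) over all predecessors p of b; Dom(START) = {START}.
def dominators(succ, start):
    """Compute Dom(b) for every block b in a CFG given as a successor map."""
    nodes = set(succ) | {s for ss in succ.values() for s in ss}
    pred = {n: set() for n in nodes}
    for n, ss in succ.items():
        for s in ss:
            pred[s].add(n)
    dom = {n: set(nodes) for n in nodes}        # initialize to "all nodes"
    dom[start] = {start}
    changed = True
    while changed:
        changed = False
        for n in nodes - {start}:
            new = {n} | (set.intersection(*(dom[p] for p in pred[n])) if pred[n] else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

# The example CFG from this slide: BB1 branches (T/F) to BB2 and BB3, which merge at BB4.
cfg = {"START": ["BB1"], "BB1": ["BB2", "BB3"], "BB2": ["BB4"],
       "BB3": ["BB4"], "BB4": ["STOP"], "STOP": []}
dom = dominators(cfg, "START")
# dom["BB4"] == {"START", "BB1", "BB4"}  ->  the immediate dominator of BB4 is BB1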

Anomalies in Control Flow Rules for Parallel Code

BB1
parbegin
  BB2
||
  BB3
parend
BB4

Parallel Control Flow Graph: BB1 → FORK; FORK → BB2, BB3; BB2 → JOIN; BB3 → JOIN; JOIN → BB4.

Does BB4 have a unique immediate dominator? Can the dominator relation be represented as a tree?
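Running the same sequential dominator computation on this fork/join graph shows the anomaly directly; the compact sketch below (my own illustrative encoding, treating FORK and JOIN as ordinary nodes) is self-contained.

# Sequential dominator computation applied to the fork/join graph (illustrative).
from functools import reduce

succ = {"BB1": ["FORK"], "FORK": ["BB2", "BB3"], "BB2": ["JOIN"],
        "BB3": ["JOIN"], "JOIN": ["BB4"], "BB4": []}
nodes = set(succ)
pred = {n: {m for m in nodes if n in succ[m]} for n in nodes}

dom = {n: set(nodes) for n in nodes}
dom["BB1"] = {"BB1"}
changed = True
while changed:
    changed = False
    for n in nodes - {"BB1"}:
        new = {n} | reduce(set.intersection, (dom[p] for p in pred[n]))
        if new != dom[n]:
            dom[n], changed = new, True

print(dom["BB4"])   # {'BB1', 'FORK', 'JOIN', 'BB4'} -- BB2 and BB3 are missing,
# even though every execution of the fork/join region runs both of them before BB4,
# so dominator-based reasoning from sequential optimization draws the wrong conclusion here.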

2. Data Flow Rules from Sequential Code Optimization

Example: Reaching Definitions
REACHin(n) = set of definitions d such that there is a (sequential) path from d to n in the CFG, and d is not killed along that path.
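A minimal sketch of the corresponding forward data-flow computation (my own illustrative encoding of the standard sequential equations, not taken from the slides):

# Illustrative iterative reaching-definitions analysis for a sequential CFG.
# REACHout(n) = GEN(n) ∪ (REACHin(n) − KILL(n));  REACHin(n) = ∪ REACHout(p) over predecessors p.
def reaching_definitions(succ, gen, kill):
    nodes = set(succ)
    pred = {n: {m for m in nodes if n in succ[m]} for n in nodes}
    reach_in = {n: set() for n in nodes}
    reach_out = {n: set() for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            new_in = set().union(*(reach_out[p] for p in pred[n]))
            new_out = gen[n] | (new_in - kill[n])
            if new_in != reach_in[n] or new_out != reach_out[n]:
                reach_in[n], reach_out[n], changed = new_in, new_out, True
    return reach_in, reach_out

# Tiny example: d1 defines x in B1, d2 redefines x in B2, and B1 -> {B2, B3} -> B4.
succ = {"B1": ["B2", "B3"], "B2": ["B4"], "B3": ["B4"], "B4": []}
gen  = {"B1": {"d1"}, "B2": {"d2"}, "B3": set(), "B4": set()}
kill = {"B1": {"d2"}, "B2": {"d1"}, "B3": set(), "B4": set()}
reach_in, _ = reaching_definitions(succ, gen, kill)
# reach_in["B4"] == {"d1", "d2"}: d1 reaches B4 along the B3 path, d2 along the B2 path.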

Anomalies in Data Flow Rules for Parallel Code

What definitions reach COEND?
What if there were no synchronization edges?
How should the data flow equations be defined for parallel code?

[Figure: parallel flow graph for the code below, with control edges and synchronization (sync) edges for the post/wait operations.]

S1: X1 := …
parbegin
  // Task 1
  S2: X2 := … post(ev2);
  S3: . . . post(ev3);
  S4: wait(ev8); X4 := …
||
  // Task 2
  S5: . . .
  S6: wait(ev2);
  S7: X7 := …
  S8: wait(ev3); post(ev8);
parend
. . .

3. Load Elimination Rules from Sequential Code Optimization

A load instruction at point P, T3 := *q, is redundant if the value of *q is available at point P.

Before:
T1 := *q
T2 := *p
T3 := *q

After:
T1 := *q
T2 := *p
T3 := T1
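As a deliberately simplified illustration of the sequential rule, the sketch below scans a straight-line sequence of loads and stores and rewrites a load as a copy when the same address was loaded earlier with no possibly-aliasing store in between. The instruction encoding and the "every store may alias every address" assumption are mine, not from the slides.

# Illustrative straight-line load elimination: ("load", dst, addr) / ("store", addr, src).
# Conservative aliasing: any store may alias any address, so it kills all available loads.
def eliminate_loads(block):
    available = {}   # addr -> temp currently holding *addr
    out = []
    for instr in block:
        if instr[0] == "load":
            _, dst, addr = instr
            if addr in available:                  # value of *addr already held in a temp
                out.append(("copy", dst, available[addr]))
            else:
                out.append(instr)
                available[addr] = dst
        elif instr[0] == "store":
            available.clear()                      # a store may change any location
            out.append(instr)
    return out

block = [("load", "T1", "q"), ("load", "T2", "p"), ("load", "T3", "q")]
print(eliminate_loads(block))
# [('load', 'T1', 'q'), ('load', 'T2', 'p'), ('copy', 'T3', 'T1')]   -- i.e., T3 := T1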

Anomalies in Load Elimination Rules for Parallel Code (Original Version)

Assume that p = q, and that *p = *q = 0 initially.

TASK 1:
. . .
T1 := *q
T2 := *p
T3 := *q
print T1, T2, T3

TASK 2:
. . .
*p = 1
. . .

Question: Is [0, 1, 0] permitted as a possible output?
Answer: It depends on the programming model. It is not permitted by Sequential Consistency [Lamport 1979], but it is permitted by Location Consistency [Gao & Sarkar 1993, 2000].

Anomalies in Load Elimination Rules for Parallel Code (After Load Elimination)

Assume that p = q, and that *p = *q = 0 initially.

TASK 1:
. . .
T1 := *q
T2 := *p
T3 := T1
print T1, T2, T3

TASK 2:
. . .
*p = 1
. . .

Question: Is [0, 1, 0] permitted as a possible output?
Answer: Yes, it will be permitted by Sequential Consistency if load elimination is performed!
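The anomaly can be checked mechanically. The sketch below (my own, not from the slides) enumerates the sequentially consistent interleavings of Task 1's three reads with Task 2's single store, for both the original and the load-eliminated version of Task 1, with p and q aliased and the location initially 0.

# Enumerate SC interleavings: Task 2's store (*p = 1) lands before read 0, 1, 2, or after all reads.
def sc_outcomes(load_eliminated):
    outcomes = set()
    for store_pos in range(4):
        mem, vals = 0, []
        for i in range(3):
            if i == store_pos:
                mem = 1
            vals.append(mem)                        # value seen by Task 1's i-th read
        t1, t2 = vals[0], vals[1]
        t3 = t1 if load_eliminated else vals[2]     # after load elimination, T3 := T1
        outcomes.add((t1, t2, t3))
    return outcomes

print(sc_outcomes(False))  # {(0,0,0), (0,0,1), (0,1,1), (1,1,1)} -- (0,1,0) is impossible
print(sc_outcomes(True))   # {(0,0,0), (0,1,0), (1,1,1)} -- (0,1,0) now appears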

Outline

Paradigm Shifts
Anomalies in Optimizing Parallel Code
Incremental vs. Comprehensive Approaches to Code Optimization of Parallel Code
Rice Habanero Multicore Software project

Incremental Approaches to coping with Parallel Code Optimization

Large investment in infrastructures for sequential code optimization.
Introduce ad hoc rules to incrementally extend them for parallel code optimization:
Code motion fences at synchronization operations
Task creation and termination via function call interfaces
Use of volatile storage modifiers
. . .

More Comprehensive Changes will be needed for Code Optimization of Parallel Programs in the Future

Need for a new Parallel Intermediate Representation (PIR) with robust support for code optimization of parallel programs:
Abstract execution model for the PIR
Storage classes (types) for locality and memory hierarchies
General framework for task partitioning and code motion in parallel code
Compiler-friendly memory model
Combining automatic parallelization and explicit parallelism
. . .

Program Dependence Graphs [Ferrante, Ottenstein, Warren 1987]

A Program Dependence Graph, PDG = (N', Ecd, Edd), is derived from a CFG and consists of a set of nodes N' (statements, predicates, and region nodes), a set Ecd of control dependence edges, and a set Edd of data dependence edges.

PDG Example

/* S1 */ max = a[i];
/* S2 */ div = a[i] / b[i];
/* S3 */ if ( max < b[i] )
/* S4 */   max = b[i];

[PDG for the example: S1, S2, S3 at the top level, with S4 control dependent on S3; data dependence edges on max: S1 → S3 (true/flow), S1 → S4 (output), S3 → S4 (anti).]
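One minimal way to encode the PDG triple above, shown for this four-statement example; the Python encoding, the ENTRY node, and the edge labels are illustrative, and dependence contexts are omitted.

# Illustrative PDG = (N', Ecd, Edd) encoding for the example above.
from dataclasses import dataclass

@dataclass
class PDG:
    nodes: set       # N': statement, predicate, and region nodes
    cd_edges: set    # Ecd: (src, dst, branch label)
    dd_edges: set    # Edd: (src, dst, kind) -- dependence contexts omitted here

pdg = PDG(
    nodes={"ENTRY", "S1", "S2", "S3", "S4"},
    cd_edges={("ENTRY", "S1", "T"), ("ENTRY", "S2", "T"), ("ENTRY", "S3", "T"),
              ("S3", "S4", "T")},                    # S4 executes only if max < b[i]
    dd_edges={("S1", "S3", "flow (max)"),            # S3 reads max defined in S1
              ("S1", "S4", "output (max)"),          # S1 and S4 both write max
              ("S3", "S4", "anti (max)")},           # S3 reads max before S4 overwrites it
)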

PDG restrictions

Control Dependence
Predicate-ancestor condition: if there are two disjoint c.d. paths from (ancestor) node A to node N, then A cannot be a region node, i.e., A must be a predicate node.
No-postdominating-descendant condition: if node P postdominates node N in the CFG, then there cannot be a c.d. path from node N to node P.

Violation of the Predicate-Ancestor Condition can lead to “non-serializable” PDGs [LCPC 1993]

Node 4 is executed twice in this acyclic PDG.

“Parallel Program Graphs and their Classification”, V. Sarkar & B. Simons, LCPC 1993

PDG restrictions (contd.)

Data Dependence
There cannot be a data dependence edge in the PDG from node A to node B if there is no path from A to B in the CFG.
The context C of a data dependence edge (A, B, C) must be plausible, i.e., it cannot identify a dependence from an execution instance IA of node A to an execution instance IB of node B if IB precedes IA in the CFG’s execution. For example, a data dependence from iteration i+1 to iteration i is not plausible in a sequential program.

Limitations of Program Dependence Graphs

PDGs and CFGs are tightly coupled: a transformation in one must be reflected in the other.
PDGs reveal maximum parallelism in the program; CFGs reveal sequential execution.
Neither is well suited for code optimization of parallel programs, e.g., how do we represent a partitioning of { 1, 3, 4 } and { 2 } into two tasks?

Another Limitation: no Parallel Execution Semantics defined for PDGs

What is the semantics of control dependence edges with cycles?
What is the semantics of data dependences when a source or destination node may have zero, one, or more instances?

A[f(i,j)] = …
… = A[g(i)]

Parallel Program Graphs: A Comprehensive Representation that Subsumes CFGs and PDGs [LCPC 1992]

A Parallel Program Graph, PPG = (N, Econtrol, Esync), consists of:
N, a set of compute, predicate, and parallel nodes. A parallel node creates parallel threads of computation for each of its successors.
Econtrol, a set of labeled control edges. Edge (A, B, L) in Econtrol identifies a control edge from node A to node B with label L.
Esync, a set of synchronization edges. Edge (A, B, F) in Esync defines a synchronization from node A to node B with synchronization condition F, which identifies the execution instances of A and B that need to be synchronized.

“A Concurrent Execution Semantics for Parallel Program Graphs and Program Dependence Graphs”, V. Sarkar, LCPC 1992
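A direct transcription of this definition into a small data structure (an illustrative sketch; the synchronization condition F is kept abstract as a predicate over execution instances, and the edge labels are made up for the example).

# Illustrative encoding of PPG = (N, Econtrol, Esync) as defined above.
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple

@dataclass
class PPG:
    kinds: Dict[str, str]                        # N: node name -> "compute" | "predicate" | "parallel"
    control_edges: Set[Tuple[str, str, str]]     # Econtrol: (A, B, label L)
    sync_edges: Set[Tuple[str, str, Callable]]   # Esync: (A, B, condition F on instances of A and B)

# A parallel node spawns all of its control successors as parallel threads.
# Example in the spirit of the later interpreter example: PAR creates nodes 1, 2, 3 in parallel,
# and a sync edge forces instances of node 1 to precede instances of node 3.
example = PPG(
    kinds={"START": "compute", "PAR": "parallel",
           "N1": "compute", "N2": "compute", "N3": "compute"},
    control_edges={("START", "PAR", "always"),
                   ("PAR", "N1", "par"), ("PAR", "N2", "par"), ("PAR", "N3", "par")},
    sync_edges={("N1", "N3", lambda i1, i3: True)},
)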

PPG Example

[Figure: example PPG with compute, predicate, and parallel nodes, control edges, and synchronization edges.]

Relating CFGs to PPGs

Construction of the PPG for a sequential program:
PPG nodes = CFG nodes
PPG control edges = CFG edges
PPG synchronization edges = empty set

Relating PDGs to PPGs

Construction of the PPG for a PDG:
PPG nodes = PDG nodes
PPG parallel nodes = PDG region nodes
PPG control edges = PDG control dependence edges
PPG synchronization edges = PDG data dependence edges; the synchronization condition F in a PPG synchronization edge mirrors the context of the corresponding PDG data dependence edge
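The two constructions above are mechanical; the sketch below transcribes them using plain edge-set encodings (my own representation, not the paper's).

# Illustrative transcription of the two constructions above.
# CFG edges are (A, B, label); PDG cd_edges are (A, B, label); PDG dd_edges are (A, B, context).
def cfg_to_ppg(cfg_nodes, cfg_edges):
    """PPG for a sequential program: same nodes, control edges = CFG edges, no sync edges."""
    return {"nodes": set(cfg_nodes), "parallel_nodes": set(),
            "control_edges": set(cfg_edges), "sync_edges": set()}

def pdg_to_ppg(pdg_nodes, region_nodes, cd_edges, dd_edges):
    """PPG for a PDG: region nodes become parallel nodes, control dependences become control
    edges, and each data dependence becomes a sync edge whose condition mirrors its context."""
    return {"nodes": set(pdg_nodes), "parallel_nodes": set(region_nodes),
            "control_edges": set(cd_edges),
            "sync_edges": {(a, b, ctx) for (a, b, ctx) in dd_edges}}

ppg = cfg_to_ppg({"B1", "B2"}, {("B1", "B2", "T")})   # trivial usage example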

Example of Transforming PPGs

[Figure: an example PPG before and after a transformation.]

Abstract Interpreter for PPGs

Build a partial order of dynamic execution instances of PPG nodes as PPG execution unravels.
Each execution instance IA is labeled with its history (calling context), H(IA).
Initialize to a singleton set containing an instance of the start node, ISTART, with H(ISTART) initialized to the empty sequence.

Abstract Interpreter for PPGs (contd.)

Each iteration of the scheduling algorithm:
Selects an execution instance IA such that all of IA’s predecessors in the partial order have been scheduled
Simulates execution of IA and evaluates its branch label L
Creates an instance IB of each c.d. successor B of A for label L
Adds (IB, IC) to the partial order if an instance IC has already been created and there exists a PPG synchronization edge from B to C (or from a PPG descendant of B to C)
Adds (IC, IB) to the partial order if an instance IC has already been created and there exists a PPG synchronization edge from C to B (or from a PPG descendant of C to B)
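A compressed sketch of this scheduling loop follows. It is my own simplification: an instance is identified with its creation history, branch evaluation is a user-supplied callback, sync edges are plain node pairs (the condition F and the "PPG descendant" refinement are omitted), and a parallel node is handled by giving all of its outgoing edges the same label.

# Simplified sketch of the PPG abstract interpreter described above.
# ppg = {"control_edges": {(A, B, label)}, "sync_edges": {(A, B)}}
# simulate(node, history) -> branch label chosen by this execution instance.
def interpret(ppg, start, simulate):
    i_start = (start,)                 # an instance is represented by its history
    instances = {i_start}
    order = set()                      # partial order edges (pred_instance, succ_instance)
    scheduled = []
    worklist = {i_start}
    while worklist:
        # Select an instance whose predecessors in the partial order have all been scheduled.
        ia = next(i for i in worklist
                  if all(p in scheduled for (p, s) in order if s == i))
        worklist.remove(ia)
        scheduled.append(ia)
        a = ia[-1]                     # the PPG node this instance executes
        label = simulate(a, ia)        # simulate IA and evaluate its branch label
        for (src, b, lab) in ppg["control_edges"]:   # create IB for each successor on that label
            if src == a and lab == label:
                ib = ia + (b,)
                instances.add(ib)
                worklist.add(ib)
                for ic in instances:                 # wire IB into the order via sync edges
                    c = ic[-1]
                    if (b, c) in ppg["sync_edges"]:
                        order.add((ib, ic))
                    if (c, b) in ppg["sync_edges"]:
                        order.add((ic, ib))
    return scheduled, order

# Usage on a graph shaped like the example on the next slide: PAR spawns N1, N2, N3,
# and a sync edge from N1 to N3 forces N1 to be scheduled before N3.
ppg = {"control_edges": {("START", "PAR", "t"), ("PAR", "N1", "t"),
                         ("PAR", "N2", "t"), ("PAR", "N3", "t")},
       "sync_edges": {("N1", "N3")}}
sched, order = interpret(ppg, "START", lambda node, hist: "t")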

Abstract Interpreter for PPGs: Example

1. Create ISTART
2. Schedule ISTART
3. Create IPAR
4. Schedule IPAR
5. Create I1, I2, I3
6. Add (I1, I3) to the partial order
7. Schedule I2
8. Schedule I1
9. Schedule I3
10. . . .

[Figure: PPG with a START node, a parallel node PAR whose successors are nodes 1, 2, and 3, and a synchronization edge from node 1 to node 3.]

Weak (Deterministic) Memory Model for PPGs

All memory accesses are assumed to be non-atomic.

Read-write hazard: if IA reads a location for which there is a parallel write of a different value, then the execution result is an error. This is analogous to an exception thrown when a data race occurs, and may be thrown when the read or write operation is performed.

Write-write hazard: if IA writes into a location for which there is a parallel write of a different value, then the resulting value in the location is undefined. Execution results in an error if that location is subsequently read.

Separation of data communication and synchronization: data communication is specified by read/write operations; sequencing is specified by synchronization and control edges.
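One way to phrase the two hazard rules as an executable check over a single location is sketched below; this is my own interpretation, where "parallel" simply means the operations are mutually unordered by control and synchronization edges.

# Illustrative check of the two hazard rules for one location.
# parallel_ops: mutually unordered ("read",) and ("write", value) operations on the location.
def check_hazards(initial_value, parallel_ops):
    written = {v for op, *rest in parallel_ops if op == "write" for v in rest}
    has_read = any(op == "read" for op, *rest in parallel_ops)
    if len(written) > 1:
        return "write-write hazard: parallel writes of different values (location undefined)"
    if has_read and written and written != {initial_value}:
        return "read-write hazard: read races with a write of a different value (error)"
    return "ok (deterministic)"

print(check_hazards(0, [("write", 0), ("read",)]))     # ok: the parallel write stores the same value
print(check_hazards(0, [("write", 1), ("read",)]))     # read-write hazard
print(check_hazards(0, [("write", 1), ("write", 2)]))  # write-write hazard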

Soundness Properties

Reordering Theorem: For a given Parallel Program Graph G and input store i, the final store f = G(i) obtained is the same for all possible scheduled sequences in the abstract interpreter.

Equivalence Theorem: A sequential program and its PDG have identical semantics, i.e., they yield the same output store when executed with the same input store.

Reaching Definitions Analysis on PPGs [LCPC 1997]

A definition D is redefined at program point P if there is a control path from D to P, and D is killed along all paths from D to P.

“Analysis and Optimization of Explicitly Parallel Programs using the Parallel Program Graph Representation”, V. Sarkar, LCPC 1997

Reaching Definitions Analysis on PPGs

[Figure: PPG for the code below, with control edges and synchronization (sync) edges for the post/wait operations.]

S1: X1 := …
// Task 1
S2: X2 := … post(ev2);
S3: . . . post(ev3);
S4: wait(ev8); X4 := …
// Task 2
S5: . . .
S6: wait(ev2);
S7: X7 := …
S8: wait(ev3); post(ev8);

PPG Limitations

Past work has focused on a comprehensive representation and semantics for deterministic programs.
Extensions needed for:
Atomicity and mutual exclusion
Stronger memory models
Storage classes with explicit locality

Issues in Modeling Synchronized/Atomic Blocks [LCPC 1999]

a = ...
synchronized (L) {
   ... = p.x
   q.y = ...
   b = ...
}
... = r.z

Questions:
Can the load of p.x be moved below the store of q.y?
Can the load of p.x be moved outside the synchronized block?
Can the load of r.z be moved inside the synchronized block?
Can the load of r.z be moved back outside the synchronized block?
How should the data dependences be modeled?

“Dependence Analysis for Java”, C. Chambers et al., LCPC 1999

Outline

Paradigm Shifts
Anomalies in Optimizing Parallel Code
Incremental vs. Comprehensive Approaches to Code Optimization of Parallel Code
Rice Habanero Multicore Software project

Habanero Project (habanero.rice.edu)

[Figure: Habanero project components: (1) Habanero programming language (building on X10), (2) Habanero static compiler, (3) Habanero virtual machine, (4) Habanero concurrency library, and (5) Habanero toolkit. Parallel applications also use sequential C, Fortran, and Java through a foreign function interface, and are mapped onto multicore hardware via vendor compilers & libraries.]

2) Habanero Static Parallelizing & Optimizing Compiler

[Figure: compiler structure with these components: Front End, IRGen, AST, Interprocedural Analysis, Parallel IR (PIR), PIR Analysis & Optimization, Partitioned Code, Classfile Transformations, Annotated Classfiles, Portable Managed Runtime, and a Platform-specific static compiler. Inputs include the X10/Habanero language, restricted C/Fortran code regions (for targeting accelerators & high-end computing), and sequential C, Fortran, and Java via a foreign function interface.]

Habanero Target Applications and Platforms

Applications:
Parallel Benchmarks: SSCAs #1, #2, #3 from the DARPA HPCS program; NAS Parallel Benchmarks; JGF, JUC, SciMark benchmarks
Medical Imaging: back-end processing for Compressive Sensing (www.dsp.ece.rice.edu/cs). Contacts: Rich Baraniuk (Rice), Jason Cong (UCLA)
Seismic Data Processing: Rice Inversion project (www.trip.caam.rice.edu). Contacts: Bill Symes (Rice), James Gunning (CSIRO)
Computer Graphics and Visualization: mathematical modeling and smoothing of meshes. Contact: Joe Warren (Rice)
Computational Chemistry: Fock matrix construction. Contacts: David Bernholdt, Wael Elwasif, Robert Harrison, Annirudha Shet (ORNL)
Habanero Compiler: implement the Habanero compiler in Habanero so as to exploit multicore parallelism within the compiler

Platforms:
AMD Barcelona Quad-Core
Clearspeed Advance X620
DRC Coprocessor Module with Xilinx Virtex FPGA
IBM Cell
IBM Cyclops-64 (C-64)
IBM Power5+, Power6
Intel Xeon Quad-Core
NVIDIA Tesla S870
Sun UltraSparc T1, T2
. . .

Additional suggestions welcome!

Habanero Research Topics

1) Language research
Explicit parallelism: portable constructs for homogeneous & heterogeneous multicore
Implicit deterministic parallelism: array views, single-assignment constructs
Implicit non-deterministic parallelism: unordered iterators, partially ordered statement blocks
Builds on our experiences with the X10, CAF, HPF, Matlab D, Fortran 90, and Sisal languages

2) Compiler research
New Parallel Intermediate Representation (PIR)
Automatic analysis, transformation, and parallelization of the PIR
Optimization of high-level arrays and iterators
Optimization of synchronization, data transfer, and transactional memory operations
Code partitioning for accelerators
Builds on our experiences with the D System, Massively Scalar, Telescoping Languages Framework, ASTI, and PTRAN research compilers

Habanero Research Topics (contd.)

3) Virtual machine research
VM support for work-stealing scheduling algorithms, with extensions for places, transactions, and task groups
Runtime support for other Habanero language constructs (phasers, regions, distributions)
Integration and exploitation of lightweight profiling in the VM scheduler and memory management system
Builds on our experiences with the Jikes Research Virtual Machine

4) Concurrency library research
New nonblocking data structures to support the Habanero runtime
Efficient software transactional memory libraries
Builds on our experiences with the java.util.concurrent and DSTM2 libraries

5) Toolkit research
Program analysis for common parallel software errors
Performance attribution of shared code regions (loops, procedure calls) using static and dynamic calling context
Builds on our experiences with the HPCToolkit, Eclipse PTP, and DrJava projects

Opportunities for Broader Impact

Education: influence how parallelism is taught in future Computer Science curricula
Open Source: build an open source testbed to grow an ecosystem for researchers in the parallel software area
Industry standards: use research results as proofs of concept for new features that can be standardized; the infrastructure can provide a foundation for reference implementations

Collaborations welcome!

Habanero Team (Nov 2007)

Send email to Vivek Sarkar (vsarkar@rice.edu) if you are interested in a PhD, postdoc, research scientist, or programmer position in the Habanero project, or in collaborating with us!

Other Challenges in Code Optimization of Parallel Code

Optimization of task coordination: task creation and termination (fork, join); mutual exclusion (locks, transactions); synchronization (semaphores, barriers)
Data locality optimizations: computation and data alignment; communication optimizations
Deployment and code generation: homogeneous multicore; heterogeneous multicore and accelerators
Automatic parallelization revisited
. . .

Related Work (Incomplete List)

Analysis of nondeterministic sequentially consistent parallel programs: [Shasha, Snir 1988], [Midkiff et al. 1989], [Chow, Harrison 1992], [Lee et al. 1997], …
Analysis of deterministic parallel programs with copy-in/copy-out semantics: [Srinivasan 1994], [Ferrante et al. 1996], …
Value-oriented semantics for functional subsets of PDGs: [Selke 1989], [Cartwright, Felleisen 1989], [Beck, Pingali 1989], [Ottenstein, Ballance, Maccabe 1990], …
Serialization of restricted subsets of PDGs: [Ferrante, Mace, Simons 1988], [Simons et al. 1990], …
Concurrency analysis: [Long, Clarke 1989], [Duesterwald, Soffa 1991], [Masticola, Ryder 1993], [Naumovich, Avrunin 1998], [Agarwal et al. 2007], …

PLDI 2008 Tutorial (Tucson, AZ)

Analysis and Optimization of Parallel Programs
Intermediate representations for parallel programs
Data flow analysis frameworks for parallel programs
Locality analyses: scalar/array privatization, escape analysis of objects, locality types
Memory models and their impact on code optimization of locks and transactional memory operations
Optimizations of task partitions and synchronization operations

Sam Midkiff, Vivek Sarkar
Sunday afternoon (June 8, 2008, 1:30pm - 5:00pm)

Conclusions

New paradigm shift in code optimization due to parallel programs.
Foundations of code optimization will need to be revisited from scratch; these foundations will impact high-level and low-level optimizers, as well as tools.
Exciting times to be a compiler researcher!