Transcript

Synchronization Transformations for Parallel Computing

Pedro Diniz and Martin Rinard

Department of Computer Science, University of California, Santa Barbara

http://www.cs.ucsb.edu/~{pedro,martin}

Motivation

Parallel Computing Becomes Dominant Form of Computation

Parallel Machines Require Parallel Software

Parallel Constructs Require New Analysis and Optimization Techniques

Our Goal: Eliminate Synchronization Overhead

Talk Outline

• Motivation

• Model of Computation

• Synchronization Optimization Algorithm

• Applications Experience

• Dynamic Feedback

• Related Work

• Conclusions

Model of Computation

• Parallel Programs
  • Serial Phases
  • Parallel Phases

• Single Address Space

• Atomic Operations on Shared Data
  • Mutual Exclusion Locks
  • Acquire Constructs
  • Release Constructs

Acq

S1 (Mutual Exclusion Region)

Rel
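The Acq / S1 / Rel pattern can be illustrated with a minimal Python sketch. The `Counter` class and `run_parallel_phase` helper are hypothetical names chosen for illustration, and Python threads stand in for the processors of a shared-memory machine.

```python
import threading

class Counter:
    """Shared object whose updates are atomic operations (hypothetical example)."""

    def __init__(self):
        self.lock = threading.Lock()   # mutual exclusion lock for this object
        self.value = 0

    def add(self, amount):
        self.lock.acquire()            # Acq
        self.value += amount           # S1: the mutual exclusion region
        self.lock.release()            # Rel


def run_parallel_phase(counter, n_threads=4, updates_per_thread=1000):
    """Parallel phase: several threads update the same shared object."""
    def worker():
        for _ in range(updates_per_thread):
            counter.add(1)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

Because every update runs inside the mutual exclusion region, the final value equals the total number of updates regardless of how the threads interleave.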

Reducing Synchronization Overhead

[Figure: control flow containing statements S1, S2, and S3 with Acq and Rel constructs]

Synchronization Optimization

Idea: Replace Computations that Repeatedly Acquire and Release the Same Lock with a Computation that Acquires and Releases the Lock Only Once

Result: Reduction in the Number of Executed Acquire and Release Constructs

Mechanism: Lock Movement Transformations and Lock Cancellation Transformations
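The effect of the idea above can be sketched as a before/after pair. This is an illustration only, written by hand rather than produced by a compiler; `CountingLock`, `update_original`, and `update_optimized` are hypothetical names, and the counter fields exist only to make the number of executed constructs visible.

```python
import threading

class CountingLock:
    """Lock wrapper that counts executed acquire and release constructs."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquires = 0
        self.releases = 0

    def acquire(self):
        self._lock.acquire()
        self.acquires += 1      # counted while holding the lock

    def release(self):
        self.releases += 1      # counted while still holding the lock
        self._lock.release()


def update_original(lock, data, values):
    """Original code: every update is its own mutual exclusion region."""
    for v in values:
        lock.acquire()
        data["sum"] += v
        lock.release()


def update_optimized(lock, data, values):
    """Transformed code: one mutual exclusion region around all updates."""
    lock.acquire()
    for v in values:
        data["sum"] += v
    lock.release()
```

Both versions compute the same result; the second executes one acquire and one release instead of one pair per update, at the cost of holding the lock for longer.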

Lock Cancellation

Acquire Lock Movement

Release Lock Movement

Synchronization Optimization Algorithm

Overview:

• Find Two Mutual Exclusion Regions With the Same Lock

• Expand Mutual Exclusion Regions Using Lock Movement Transformations Until They are Adjacent

• Coalesce Using Lock Cancellation Transformation to Form a Single Larger Mutual Exclusion Region
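The three steps above can be sketched on straight-line code. This is a deliberately simplified illustration, not the interprocedural algorithm from the talk: it assumes operations are encoded as tuples `("acq", lock)`, `("rel", lock)`, and `("stmt", name)`, and the function name `coalesce` is hypothetical.

```python
def coalesce(ops, lock):
    """Merge mutual exclusion regions for `lock` in a straight-line op list.

    A release is moved forward past plain statements (lock movement) until it
    is adjacent to the next acquire of the same lock; the adjacent release and
    acquire are then removed together (lock cancellation).
    """
    ops = list(ops)
    rel, acq = ("rel", lock), ("acq", lock)
    i = 0
    while i < len(ops):
        if ops[i] == rel:
            # Step 1: find the next region that uses the same lock.
            j = i + 1
            while j < len(ops) and ops[j] != acq:
                j += 1
            between = ops[i + 1:j]
            # Only move past plain statements; stay conservative when other
            # lock operations sit between the two regions.
            if j < len(ops) and all(op[0] == "stmt" for op in between):
                # Steps 2 and 3: moving the release to just before the acquire
                # and cancelling the pair leaves only the statements between.
                ops[i:j + 1] = between
                continue
        i += 1
    return ops
```

For example, `acq L, S1, rel L, S2, acq L, S3, rel L` becomes `acq L, S1, S2, S3, rel L`. Statement `S2` now executes while the lock is held, which is the source of the false exclusion discussed in the trade-off.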

Interprocedural Control Flow Graph

Acquire Movement Paths

Release Movement Paths

Migration Paths and Meeting Edge

Intersection of Paths

Compensation Nodes

Final Result

Synchronization Optimization Trade-Off

• Advantage:
  • Reduces Number of Executed Acquires and Releases
  • Reduces Acquire and Release Overhead

• Disadvantage: May Introduce False Exclusion
  • Multiple Processors Attempt to Acquire Same Lock
  • Processor Holding the Lock is Executing Code that was Originally in No Mutual Exclusion Region

False Exclusion Policy

Goal: Limit Potential Severity of False Exclusion

Mechanism: Constrain the Application of Basic Transformations

• Original: Never Apply Transformations

• Bounded: Apply Transformations Only on Cycle-Free Subgraphs of ICFG

• Aggressive: Always Apply Transformations
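The three policies can be read as a predicate over the region of the control flow graph that a transformation would cover. The sketch below is one possible reading, assuming the subgraph is given as a dictionary mapping each node to its successors; `has_cycle` and `may_transform` are hypothetical helper names.

```python
def has_cycle(graph):
    """Return True if the directed graph {node: [successors]} has a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited, on DFS stack, finished
    color = {n: WHITE for n in graph}

    def visit(n):
        color[n] = GRAY
        for s in graph.get(n, []):
            state = color.get(s, WHITE)
            if state == GRAY:             # back edge: s is on the current path
                return True
            if state == WHITE and visit(s):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(graph))


def may_transform(policy, subgraph):
    """Decide whether a policy permits transforming the given subgraph."""
    if policy == "original":
        return False                      # never apply transformations
    if policy == "bounded":
        return not has_cycle(subgraph)    # only on cycle-free subgraphs
    if policy == "aggressive":
        return True                       # always apply transformations
    raise ValueError(f"unknown policy: {policy}")
```

Under this reading, the Bounded policy rejects any region containing a loop, so a lock is never held across an unbounded number of iterations.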

Experimental Results

• Automatic Parallelizing Compiler Based on Commutativity Analysis [PLDI’96]

• Set of Complete Scientific Applications (C++ subset)
  • Barnes-Hut N-Body Solver (1500 lines of Code)
  • Liquid Water Simulation Code (1850 lines of Code)
  • Seismic Modeling String Code (2050 lines of Code)

• Different False Exclusion Policies

• Performance of Generated Parallel Code on Stanford DASH Shared-Memory Multiprocessor

Lock Overhead

Percentage of Time that the Single Processor Execution Spends Acquiring and Releasing Mutual Exclusion Locks

[Figure: Percentage Lock Overhead (axis 0 to 60) in three panels. Barnes-Hut (16K Particles): Original, Bounded, Aggressive. Water (512 Molecules): Original, Bounded, Aggressive. String (Big Well Model): Original, Aggressive.]

Contention Overhead

Percentage of Time that Processors Spend Waiting to Acquire Locks Held by Other Processors

[Figure: Contention Percentage (axis 0 to 100) versus Processors (0 to 16) for the Original, Bounded, and Aggressive versions, in three panels: Barnes-Hut (16K Bodies), Water (512 Molecules), String (Big Well Model).]

Performance Results: Barnes-Hut

[Figure: Speedup (0 to 16) versus Number of Processors (0 to 16) for Barnes-Hut (16384 bodies). Curves: Ideal, Aggressive, Bounded, Original.]

Performance Results: Water

[Figure: Speedup (0 to 16) versus Number of Processors (0 to 16) for Water (512 Molecules). Curves: Ideal, Aggressive, Bounded, Original.]

Performance Results: String

[Figure: Speedup (0 to 16) versus Number of Processors (0 to 16) for String (Big Well Model). Curves: Ideal, Original, Aggressive.]

Choosing Best Policy

• Best False Exclusion Policy May Depend On
  • Topology of Data Structures
  • Dynamic Schedule of Computation

• Information Required to Choose Best Policy Unavailable at Compile Time

• Complications
  • Different Phases May Have Different Best Policy
  • In Same Phase, Best Policy May Change Over Time

Solution: Dynamic Feedback

• Generated Code Consists of
  • Sampling Phases: Measure Performance of Different Policies
  • Production Phases: Use Best Policy From Sampling Phase

• Periodically Resample to Discover Changes in Best Policy

• Guaranteed Performance Bounds
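The alternation of sampling and production phases can be sketched as a simple loop. This is an illustrative sketch only: `measure_overhead` is a hypothetical measurement hook, each loop step stands for one fixed interval of execution, and the sketch does not model the performance bounds mentioned above.

```python
def dynamic_feedback(versions, measure_overhead, rounds, production_steps=5):
    """Alternate sampling and production phases.

    versions:          list of policy names, one per generated code version
    measure_overhead:  callable (policy, step) -> measured overhead (assumed hook)
    rounds:            number of sampling + production rounds to run
    Returns the list of (phase, policy) pairs in execution order.
    """
    schedule = []
    step = 0
    for _ in range(rounds):
        # Sampling phase: run each version briefly and record its overhead.
        measured = {}
        for policy in versions:
            measured[policy] = measure_overhead(policy, step)
            schedule.append(("sample", policy))
            step += 1
        # Production phase: use the version with the lowest measured overhead.
        best = min(measured, key=measured.get)
        for _ in range(production_steps):
            schedule.append(("production", best))
            step += 1
        # The next round resamples, so a change in best policy is discovered.
    return schedule
```

Resampling at the start of every round is what lets the generated code switch versions when the best policy changes during execution.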

Dynamic Feedback

[Figure: Overhead versus Time across a Sampling Phase, a Production Phase, and a second Sampling Phase. Code Version labels: Aggressive, Original, Bounded; Aggressive.]

Dynamic Feedback: Barnes-Hut

[Figure: Speedup (0 to 16) versus Number of Processors (0 to 16) for Barnes-Hut (16384 bodies). Curves: Ideal, Aggressive, Dynamic Feedback, Bounded, Original.]

Dynamic Feedback: Water

[Figure: Speedup (0 to 16) versus Number of Processors (0 to 16) for Water (512 Molecules). Curves: Ideal, Bounded, Original, Aggressive, Dynamic Feedback.]

Dynamic Feedback: String

[Figure: Speedup (0 to 16) versus Number of Processors (0 to 16) for String (Big Well Model). Curves: Ideal, Original, Aggressive, Dynamic Feedback.]

Related Work

• Parallel Loop Optimizations (e.g. [Tseng:PPoPP95])
  • Array-based Scientific Computations
  • Barriers vs. Cheaper Mechanisms

• Concurrent Object-Oriented Programs (e.g. [PZC:POPL95])
  • Merge Access Regions for Invocations of Exclusive Methods

• Concurrent Constraint Programming
  • Bring Together Ask and Tell Constructs

• Efficient Synchronization Algorithms
  • Efficient Implementations of Synchronization Primitives

Conclusions

• Synchronization Optimizations
  • Basic Synchronization Transformations for Locks
  • Synchronization Optimization Algorithm

• Integrated into Prototype Parallelizing Compiler
  • Object-Based Programs with Dynamic Data Structures
  • Commutativity Analysis

• Experimental Results
  • Optimizations Have a Significant Performance Impact
  • With Optimizations, Applications Perform Well

• Dynamic Feedback

