
Active Harmony and the Chapel HPC Language

Ray Chen, UMD
Jeff Hollingsworth, UMD
Michael P. Ferguson, LTS

Harmony Overview
• Harmony system based on feedback loop
  o Harmony Server sends parameter values to the application
  o Application returns measured performance to the Harmony Server
• Simplex algorithms
  o Nelder-Mead
  o Parallel Rank Ordering
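The feedback loop above can be sketched in a few lines of Python. This is a minimal illustration, not Active Harmony's implementation: `nelder_mead` is a textbook simplex search over continuous values, and `simulated_run` is a made-up stand-in for launching the application and measuring it (the optimum at tile = 32, unroll = 4 is an assumption chosen for the example).

```python
def nelder_mead(measure, start, step=1.0, iterations=60):
    """Minimise measure(point) with a basic Nelder-Mead simplex search."""
    n = len(start)
    # Initial simplex: the start point plus one point offset along each axis.
    simplex = [list(start)]
    for i in range(n):
        point = list(start)
        point[i] += step
        simplex.append(point)
    # Each entry pairs a measured cost with the parameter values that produced it.
    scored = [(measure(p), p) for p in simplex]

    for _ in range(iterations):
        scored.sort(key=lambda pair: pair[0])
        best_cost = scored[0][0]
        worst_cost, worst = scored[-1]
        second_worst_cost = scored[-2][0]
        # Centroid of every vertex except the worst one.
        centroid = [sum(p[i] for _, p in scored[:-1]) / n for i in range(n)]

        def along(t):
            # Point on the line through the worst vertex and the centroid.
            return [c + t * (c - w) for c, w in zip(centroid, worst)]

        reflected = along(1.0)
        r_cost = measure(reflected)
        if r_cost < best_cost:
            # Reflection beat the best vertex, so try going further.
            expanded = along(2.0)
            e_cost = measure(expanded)
            scored[-1] = (e_cost, expanded) if e_cost < r_cost else (r_cost, reflected)
        elif r_cost < second_worst_cost:
            scored[-1] = (r_cost, reflected)
        else:
            # Reflection did not help: contract towards the centroid.
            contracted = along(-0.5)
            c_cost = measure(contracted)
            if c_cost < worst_cost:
                scored[-1] = (c_cost, contracted)
            else:
                # Shrink every vertex halfway towards the best one.
                best = scored[0][1]
                shrunk = [[(b + x) / 2 for b, x in zip(best, p)] for _, p in scored[1:]]
                scored = [scored[0]] + [(measure(q), q) for q in shrunk]

    scored.sort(key=lambda pair: pair[0])
    return scored[0]


def simulated_run(params):
    """Hypothetical application run: returns a made-up run time for (tile, unroll)."""
    tile, unroll = params
    return 1.0 + (tile - 32.0) ** 2 + (unroll - 4.0) ** 2


if __name__ == "__main__":
    cost, best = nelder_mead(simulated_run, start=[8.0, 1.0], step=4.0, iterations=200)
    print("best parameters:", best, "measured cost:", cost)
```

Every call to `measure` corresponds to one trip around the loop: the search proposes parameter values and the application answers with a measured performance.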

Tuning Granularity
• Initial Parameter Tuning
  o Application treated as a black box
  o Test parameters delivered during application launch
  o Application executes once per test configuration
• Internal Application Tuning
  o Specific internal functions or loops tuned
  o Possibly multiple locations within application
  o Multiple executions required to test configurations
• Run-time Tuning
  o Application modified to communicate with server mid-run
  o Only one run of the application needed

Example Application
• SMG2000
  o 6-dimensional space
    o 3 tiling factors
    o 2 unrolling factors
    o 1 compiler choice
  o 20 search steps
• Performance gain
  o 2.37x for residual computation
  o 1.27x for the full application

The Irony of Auto-Tuning

• Intensely manual process
  o High cost of adoption
• Requires application-specific knowledge
  o Tunable variable identification
  o Value range determination
  o Hotspot identification
  o Critical section modification at safe points
• Can auto-tuning be more automatic?

Towards Automatic Auto-tuning

• Reducing the burden on the end-user
• Three questions must be answered
  o What parameters are candidates for auto-tuning?
  o Where are the best code regions for auto-tuning?
  o When should we apply auto-tuning?

Our Goals
• Maximize return from minimal investment
  o Use profiling feature as a model
    o Should be enabled with a runtime flag
  o Aim to provide auto-tuning benefits within one execution
• Minimize language extension
  o Applications should be used as originally written
• Non-trivial goals with C/C++/Fortran
  o Are there any alternatives?

Chapel Overview
• Parallel programming language
  o Led by Cray Inc.
  o “Chapel strives to vastly improve the programmability of large-scale parallel computers while matching or beating the performance and portability of current programming models like MPI.”

Type of HW Parallelism              Programming Model     Unit of Parallelism
Inter-node                          MPI                   executable
Intra-node/multi-core               OpenMP/pthreads       iteration/task
Instruction-level vectors/threads   pragmas               iteration
GPU/accelerator                     CUDA/OpenCL/OpenACC   SIMD function/task

Content courtesy of Cray Inc.

Chapel Methodology


Chapel Data Parallelism

• Only domains and forall loops required
  o Forall loop used with arrays to distribute work
  o Domains used to control distribution
    o A generalization of ZPL’s region concept

Chapel Task Parallelism

• Three constructs used to express control-based parallelism

  o begin: “fire and forget”
  o cobegin: heterogeneous tasks
  o coforall: homogeneous tasks

    begin writeln("hello world");
    writeln("good bye");

    cobegin {
      consumer(1);
      consumer(2);
      producer();
    } // wait here for all three tasks to complete

    begin producer();
    coforall i in 1..numConsumers {
      consumer(i);
    } // wait here for all consumers to return

Chapel Locales

• MPI (SPMD) Functionality

    writeln("start on locale 0");
    on Locales(1) do
      writeln("now on locale 1");
    writeln("on locale 0 again");

    proc main() {
      coforall loc in Locales do
        on loc do
          MySPMDProgram(loc.id, Locales.numElements);
    }

    proc MySPMDProgram(me, p) {
      writeln("Hello from node ", me);
    }

Chapel Config Variables

    config const numLocales: int;
    const LocaleSpace: domain(1) = [0..numLocales-1];
    const Locales: [LocaleSpace] locale;

    % a.out --numLocales=4
    Hello from node 3
    Hello from node 0
    Hello from node 1
    Hello from node 2

Leveraging Chapel
• Helpful design goals
  o Expressing parallelism and locality is the user’s responsibility
    o Not the compiler’s
• Chapel source effectively pre-annotated
  o Config variables help to locate candidate tuning parameters
  o Parallel looping constructs help to locate hotspots

Current Progress
• Harmony Client API ported to Chapel
  o Uses Chapel’s foreign function interface
  o Chapel client module to be added to next Harmony release
• Achieves the current state of auto-tuning
  o What to tune
    o Parameters must be determined by a domain expert
    o Manually register each parameter and value range
  o Where to tune
    o Critical loop must be determined by a domain expert
    o Manually fetch and report performance at safe points
  o When to tune
    o Tuning enabled once manual changes are complete

Improving the “What”
• Leverage Chapel’s “config” variable type

    config const someArg = 5;

  o Helpful for everybody to extend syntax slightly

    config const someArg = 5 in 1..100 by 2;

• Not a silver bullet
  o False-positives and false-negatives definitely exist
  o Goes a long way towards reducing candidate variables
  o Chapel built-in candidate variables

    dataParTasksPerLocale
    dataParIgnoreRunningTasks
    dataParMinGranularity
    numLocales
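As a rough illustration of how declarations such as `config const someArg = 5 in 1..100 by 2;` could be harvested as tuning candidates, the Python sketch below scans source text with a regular expression. It is not part of Chapel or Harmony: `TunableParam`, `find_candidates`, and `candidate_values` are hypothetical names, only integer defaults are handled, and the `in low..high by stride` suffix is the extension proposed on this slide rather than existing Chapel syntax.

```python
import re
from typing import List, NamedTuple, Optional


class TunableParam(NamedTuple):
    """One candidate tuning parameter found in the source text."""
    name: str
    default: int
    low: Optional[int]   # None when the declaration carries no range
    high: Optional[int]
    stride: int


# Integer-valued config declarations, optionally followed by the proposed
# "in low..high by stride" range suffix.
_DECL = re.compile(
    r"config\s+(?:const|var|param)\s+(?P<name>\w+)\s*(?::\s*\w+\s*)?=\s*(?P<default>-?\d+)"
    r"(?:\s+in\s+(?P<low>-?\d+)\s*\.\.\s*(?P<high>-?\d+)(?:\s+by\s+(?P<stride>\d+))?)?"
    r"\s*;"
)


def find_candidates(source: str) -> List[TunableParam]:
    """Return every integer config declaration found in the given source text."""
    found = []
    for match in _DECL.finditer(source):
        low, high = match.group("low"), match.group("high")
        found.append(TunableParam(
            name=match.group("name"),
            default=int(match.group("default")),
            low=int(low) if low is not None else None,
            high=int(high) if high is not None else None,
            stride=int(match.group("stride") or 1),
        ))
    return found


def candidate_values(param: TunableParam) -> List[int]:
    """Values a tuner may try; without a range only the default is known."""
    if param.low is None or param.high is None:
        return [param.default]
    return list(range(param.low, param.high + 1, param.stride))


if __name__ == "__main__":
    for p in find_candidates("config const someArg = 5 in 1..100 by 2;"):
        print(p.name, "default", p.default, "with", len(candidate_values(p)), "values")
```

A declaration without a default, such as `config const numLocales: int;`, is skipped by this pattern, which is one concrete source of the false negatives mentioned above.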

Improving the “Where”

• Naïve approach
  o Modify all parallel loop constructs
    o Fetch new config values at loop head
    o Report performance at loop tail
  o Use PRO (Parallel Rank Ordering) to efficiently search parameter space in parallel
• Poses open questions
  o How to know if config values are safe to modify mid-execution?
  o How to handle nested parallel loops?
  o How to prevent overhead explosion?
• Solutions outside the scope of this project
  o But we’ve got some ideas...
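The fetch-at-head, report-at-tail pattern can be sketched as follows. This is a simplified Python illustration, not the Harmony client API: `LocalTuner` is a hypothetical in-process stand-in for the tuning server that tries each candidate once and then keeps the fastest, and `tuned_loop` plays the role of an instrumented parallel loop.

```python
import time


class LocalTuner:
    """Hypothetical stand-in for a tuning server (not the Harmony API)."""

    def __init__(self, candidates):
        self._untested = list(candidates)
        self._results = {}      # configuration -> best elapsed time seen
        self._current = None

    def fetch(self):
        """Return the configuration to use for the next loop execution."""
        if self._untested:
            self._current = self._untested.pop(0)
        else:
            # Every candidate has been measured: stay on the fastest one.
            self._current = min(self._results, key=self._results.get)
        return self._current

    def report(self, elapsed):
        """Record the measured cost of the configuration handed out by fetch()."""
        previous = self._results.get(self._current, elapsed)
        self._results[self._current] = min(previous, elapsed)


def tuned_loop(tuner, body, executions, clock=time.perf_counter):
    """Run body repeatedly, fetching values at the head and reporting at the tail."""
    history = []
    for _ in range(executions):
        config = tuner.fetch()              # loop head: fetch new config values
        started = clock()
        body(config)                        # the loop being tuned
        tuner.report(clock() - started)     # loop tail: report performance
        history.append(config)
    return history


if __name__ == "__main__":
    def work(chunk):
        # Made-up workload whose cost depends on the chunk size.
        total = 0
        for i in range(0, 200_000, chunk):
            total += i
        return total

    print(tuned_loop(LocalTuner([1, 8, 64]), work, executions=6))
```

The sketch sidesteps the open questions listed above: it assumes the value is always safe to change between executions, and it ignores nesting and measurement overhead.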

What’s Possible?
• Target pre-run optimization instead
  o Run small snippet of code pre-main
  o Determine optimal values to be used prior to execution
• Example: Cache optimization
  o Explore element size and stride
    o Pad array elements to fit size
    o Define domains
  o Automatically optimize for cache size and eviction strategy
  o Further increase performance portability
• Generate library of performance unit-tests
  o Bundle with Chapel for distribution
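One way to read "pad array elements to fit size" is to round each element up so that it never straddles a cache line. The Python sketch below shows that arithmetic only. `padded_size` is a hypothetical helper, and the 64-byte default line size is an assumption; in the scheme described above that value would come from the pre-run probe rather than being hard-coded.

```python
def padded_size(elem_bytes, line_bytes=64):
    """Smallest size >= elem_bytes that divides the cache line or is a multiple of it."""
    if elem_bytes <= 0 or line_bytes <= 0:
        raise ValueError("sizes must be positive")
    if elem_bytes >= line_bytes:
        # Large elements: round up to a whole number of cache lines.
        return -(-elem_bytes // line_bytes) * line_bytes
    # Small elements: grow until a whole number of them fills one line exactly.
    size = elem_bytes
    while line_bytes % size != 0:
        size += 1
    return size


if __name__ == "__main__":
    for size in (16, 24, 48, 72):
        print(size, "->", padded_size(size))
```

With the assumed 64-byte line, a 24-byte element is padded to 32 bytes and a 72-byte element to 128 bytes.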

Improving the “When”
• Auto-tuning should be simple to enable
  o Use profiling as a model (just add -pg to the compiler flags)
• System should be self-reliant
  o Local server must be launched with application

Open Questions
• Automatic hotspot detection
  o Time spent in loop
  o Variables manipulated in loop
  o How to determine correctness-safe modification points
    o Static analysis?
• Moving to other languages
  o C/Fortran lacking needed annotations
  o More static analysis?
• Why avoid language extension?
  o Is it really so bad?

