

Software and Services Group

Concurrent Collections

Easy and effective distributed computing

Frank Schlimbach, Kath Knobe, James Brodman

2012/03/06


This Talk

• In this presentation the focus is on distributed memory

− Introduction to Concurrent Collections (CnC) concepts

− How it is used

− Results

− How it helps with distributed computing

• It also works well in shared memory

− Intel's own benchmark effort not yet public

> CnC scored high across the board


Outline

• Introduction to Concurrent Collections

• Introduction to using Intel® Concurrent Collections for C++

• distCnC: Concurrent Collections for distributed memory


What problem does CnC address?

Parallel programs are notoriously difficult for mortals to write, debug, maintain and port

Parallel programs are notoriously difficult for anyone to tune


The Big Idea

•Don’t specify what operations run in parallel

−Difficult and depends on target



•Specify the semantic ordering constraints only

−Easier and depends only on application


Exactly Two Sources of Ordering Requirements

Producer - consumer

[diagram: step1 (compute step) puts an item (data item), which step2 (compute step) gets]

• Producer must execute before consumer


Controller - controllee

[diagram: step1 (compute step) puts a control tag, which prescribes step2 (compute step)]

• Controller must execute before controllee


Methodology:

4 simple steps to a CnC application


1: The Whiteboard Drawing

- computations
- data
- producer/consumer relations
- I/O

[diagram: compute steps Cholesky, Trisolve, Update; data item Array]


2: Distinguish among the instances

− Cholesky: iter (compute step)
− Trisolve: row, iter (compute step)
− Update: col, row, iter (compute step)
− Array: col, row, iter (data item)


3: What are the control tag collections

− CholeskyTag: iter (control tag) prescribes Cholesky: iter (compute step)
− TrisolveTag: row, iter (control tag) prescribes Trisolve: row, iter (compute step)
− UpdateTag: col, row, iter (control tag) prescribes Update: col, row, iter (compute step)
− Array: col, row, iter (data item)


4: Who produces control

[diagram: same collections as in step 3, now with edges from the compute steps that put each control tag]


Separation of Concerns

[diagram: application problem → CnC domain spec → mapping CnC spec to platform]

The domain expert:

• Finance

• Gaming

• Chemistry

• …

The tuning expert:

• Parallelism

• Locality

• Load balancing

• …


Separation of Concerns

Domain expert doesn't need to know a lot about parallelism.

Tuning expert doesn't need to know a lot about the app.


CnC model allows for a wide variety of runtime approaches

                    grain     distribution  schedule
HP TStreams         static    static        static
Intel CnC           static    static        dynamic
Georgia Tech CnC    static    dynamic       dynamic
HP TStreams         static    dynamic       dynamic
Rice CnC            dynamic   dynamic       dynamic


Concurrent Collections Promise

• Separates concerns (domain and tuning)

− Increases productivity

• Domain language

− Determinism (with respect to the results)

− Easy to debug

− Unified programming applicable for shared and distributed memory

• Tuning language

− Provides effective tuning (not covered in this presentation)


Outline

• Introduction to Concurrent Collections

• Introduction to using Intel® Concurrent Collections for C++

• distCnC: Concurrent Collections for distributed memory


Intel® Concurrent Collections for C++

• Template library with runtime

> For Windows (IA-32/Intel-64) and Linux (Intel-64)

> Shared and distributed memory

• Upcoming whatif-release end of Q1’12

− Header-files, shared libs

− Samples

− Documentation

> CnC concepts

> Using CnC

> Tutorials

− No translator

http://software.intel.com/en-us/articles/intel-concurrent-collections-for-cc/


Intel® Concurrent Collections for C++

[software stack: CnC translator and CnC C++ API on top of the CnC runtime, which builds on TBB and MPI/ITC over libc, pthread, … and the OS]


C++ API

•The C++-API puts CnC concepts into C++

−Creating a CnC graph and using it

−Generality through collections being templates

−Type safety

−Debug and tuning interface

−Tuning options

−Range support for tags

−Running on distributed systems


How to write a CnC application

•Design

−Identify steps, items and tags (see previous slides)

•Write C++ code

−Create a context with all collections in it

−Each step is a user function with required interface

−A step-instance is prescribed by a put to the tag-collection

−Steps consume and produce item/tag instances with get and put

−Steps call gets before any puts and any memory allocation

−The environment produces initial tag/item instances and invokes the graph


Major classes provided by CnC C++ API

Tag collection: CnC::tag_collection< tag_type, tuner_type >

prescribes, put, put_range, iterate

Item collection: CnC::item_collection< tag_type, item_type, tuner_type >

put, get, iterate

Step collection: CnC::step_collection< step_type, tuner_type >

consumes, produces

Graph/context: CnC::context< DerivedContext >

Derive from CnC::context< yourContext >

Declare collections as members

Initialize collections in context constructor with context as argument

Prescribe your steps in context constructor

[diagram: myTag (control tag) prescribes myStep (compute step); myStep gets/puts myItem (data item)]


Sample Context

struct cholesky_context : public CnC::context< cholesky_context >
{
    // Step collections
    CnC::step_collection< cholesky > sc_cholesky;
    ...
    // Item collections
    CnC::item_collection< triple, tile_const_ptr_type > Array;
    // Tag collections
    CnC::tag_collection< int > tc_cholesky;
    ...
    cholesky_context( int _b = 0, int _p = 0, int _n = 0 )
    {
        tc_cholesky.prescribes( sc_cholesky, *this );
        sc_cholesky.consumes( Array );
        ...
    }
};


Sample step (cholesky)

// Perform unblocked Cholesky factorization on the input block.
// Output is a lower triangular matrix.
int cholesky::execute( const int & t, cholesky_context & c ) const
{
    tile_const_ptr_type A_block;
    tile_ptr_type L_block;
    int b = c.b;                  // fixed parameters
    const int iter = t;
    c.Array.get( triple( iter, col, row ), A_block );   // get'ing consumed data
    L_block = std::make_shared< tile_type >( b );       // allocate output tile
    for( int k_b = 0; k_b < b; k_b++ ) {
        // do all the math on this tile -- computation proceeds as normal
    }
    c.Array.put( triple( iter+1, col, row ), L_block ); // put'ing produced data
    return CnC::CNC_Success;
}

Notes: execute must be const and free of side effects; the step may only have global/constant state; the step must be copy-constructible.


Tuning

•CnC tuning influences how the runtime manages the interaction between CnC entities (items, steps, tags)

−This is different from and orthogonal to optimizations of the serial language (C++)

•The tuning expert gives hints to the runtime

−For a specific application

•Examples

−Garbage collection

>Detect that an item will not be used any more

>A semantic attribute, different from traditional GC

−Influence step scheduling

−Memoizing step execution

−Work/data distribution

•Much of the tuning potential yet to come


Outline

• Introduction to Concurrent Collections

• Introduction to using Intel® Concurrent Collections for C++

• distCnC: Concurrent Collections for distributed memory


Why distributed CnC?

•CnC is a declarative, high-level model

−Abstracts from memory model

−Facilitates switch between shared and distributed memory (no explicit message passing needed)

•CnC comes with a control methodology

−Allows controlling work distribution

•CnC comes with a data methodology

−Provides hooks for selective data distribution

CnC provides a unified programming model for shared and distributed memory.


Limitations of distributed computing apply

•Usual caveats for distributed memory apply

−E.g. ratio between data-exchange and computation

•Different algorithms might be needed for distributed or shared memory

Programming methodology and framework stay the same in any case

Over a wide class of applications the algorithm stays the same


Spike

•Parallel solver for banded linear systems of equations

•Complex algorithm, but “easily” maps onto CnC (at least for James Brodman )

• 15 Step Collections

• 24 Item Collections

• 12 Tag Collections

• Same code for shared and distributed memory


Spike Performance

[chart: runtime (s), problem size 1048576 with 128 super/sub diagonals, 32 partitions; shared memory: MKL, MKL+OMP, HTA+TBB, CnC; shared/distributed memory (4 nodes, GigE): HTA+MPI, DistCnC]

Spike: parallel solver for banded linear systems of equations; complex algorithm, but "easily" maps onto CnC.


Spike Scalability

[charts: Spike time (sec) and parallel efficiency vs. #nodes (24 h-threads each), 1-32 nodes; matrix 10Mx10M, 257 bands]


Making a CnC program distCnC-ready*

• #include <cnc/dist_cnc.h>

− Sets a #define and declares the dist_cnc_init template

• Instantiate a CnC::dist_cnc_init< … > object

− First thing in main, must persist throughout main

− Template parameters are the contexts used in the program

• Serialization of non-standard data types (tags and items)

− Simple mechanism (similar to Boost)

− int, double, float, char etc. don't need explicit serialization

• Same binary runs on shared and distributed memory

*distCnC-readiness doesn't guarantee good performance, but it enables execution on a distributed memory system.


DistCnC-ready performance (UTS)

• Unbalanced tree search

• Tree shape unknown in advance

• CnC code is trivial

• CnC: 151 loc

• Shmem: ~1000 loc

• MPI: ~800 loc

• CnC performs better on single node (multi-threaded)

• Performance gap in the mid-sized region is a load-balancing issue

• Experimental version solves it

[chart: UTS (T3XXL) speedup over 1 node vs. #threads/ranks (16-256) on 1-16 nodes; MPI vs. CnC]


distCnC-ready Performance

Default round-robin distribution can be arbitrarily bad.

[chart: CnC time (dist-ready) [GigE] in seconds vs. #nodes (24 h-threads each), 1-4 nodes, for inverse, primes, mandelbrot, cholesky]


CnC makes work and data distribution easy and efficient

•By default, a simple local round-robin scheduling is done

−Data is sent to where needed (requested)

•Tuner can declare where work is to be executed: int tuner::compute_on( step_tag )

•Similarly, tuner can declare where the data needs to go: int tuner::consumed_on( item_tag ) or

vector< int > tuner::consumed_on( item_tag )

•Both returning ranks/ids of address-spaces

•Mapping tag->rank can be computed

−Statically at compile time

−Dynamically at init-time

−Dynamically on the fly


Tuned Performance

[chart: CnC time (tuned) [IB] in seconds vs. #nodes (24 h-threads each), 2-32 nodes, for inverse, primes, mandelbrot, cholesky]


Scalability Comparison (cholesky)

CnC: Only a few lower-level optimizations

[chart: Cholesky speedup over MKL-LAPACK (matrix 16kx16k, blocks 100x100) vs. #nodes (24 h-threads each), 1-32 nodes; MKL-LAPACK, CnC, MKL-scaLAPACK]


Scalability Comparison (inverse)

[chart: matrix inverse speedup over MKL-LAPACK (matrix 16kx16k, blocks 90x90) vs. #nodes (24 h-threads each), 1-32 nodes; MKL-LAPACK, CnC, MKL-scaLAPACK]

CnC: No lower-level optimizations yet

Intel Confidential


Scalability (stencil)

3D 25-point stencil, reading from 2 previous time-steps. Shared-memory performance is similar to cache-oblivious algorithms.

[charts: RTM stencil speedup (3kx1.5k², 32 time-steps) vs. #nodes (24 h-threads each), 2-32 nodes; RTM stencil weak-scaling parallel efficiency (16 time-steps), matrix sizes 12-383 GB on 1-32 nodes]

Efficiency stays above 97%!


Increased Productivity when experimenting with distribution plans

• Partitioning the work directly affects communication for controller/controllee

• The CnC methodology also declares the links between work- and data instances

• Allows clean mechanism for the tuning expert to create a distribution plan

−For each item, declare steps which depend on it (“consumed_by”)

−With compute_on the runtime knows where to send the item to

−Runtime auto-distributes data optimally according to declared work-distribution

−Depends on the program semantics only

int tuner::consumed_on( const tag t ) const
{
    return compute_on( consumed_by( t ) );
}


Distribution function (cyclic1)

Used for primes

int tuner::compute_on( const int & i ) const
{
    return i % numProcs();
}

Simple


Distribution function (cyclic2)

Used for mandelbrot.

[diagram: 2D grid of step instances -- pipelines/streams (data parallel) x stages (task parallel)]

int tuner::compute_on( const int& stage, const int& stream ) const
{
    return stage % numProcs();
}

Good for a pipeline if stages require larger data sets.

int tuner::compute_on( const int& stage, const int& stream ) const
{
    return stream % numProcs();
}

Good for a pipeline if large chunks of data go through all stages.


Distribution function (blocked)

int tuner::compute_on( const int& x, const int& y, const int& z ) const
{
    return x * NBLOCKS_X / nx
         + ( y * NBLOCKS_Y / ny ) * nx
         + ( z * NBLOCKS_Z / nz ) * nx * ny;
}

Minimizes data transfer. Used for Cholesky and stencil.


Distribution function (blocked cyclic)

int tuner::compute_on( const int& x, const int& y ) const
{
    return ( ( x + y * m_nx ) / BLOCKSIZE ) % numProcs();
}

Good if block size and/or shape is relevant. Good for a flat parallel model (pure MPI, pure threading).


Distribution function (tree)

int tuner::compute_on( const int & tag, context & c ) const
{
    return tag >= LIM
        ? CnC::COMPUTE_ON_LOCAL
        : tag % numProcs();
}

At a certain tree depth, stop distributing and stay local.

[diagram: binary tree of tags 1-30; deep subtrees stay on the local rank]

Used for quickSort


Changing distribution (cholesky)

int tuner::compute_on( ... ) {
    switch( dt ) {
        case BLOCKED_ROWS :
            return (((j*j)/2+1+i)/s);
        case BLOCKED_CYCLIC :
            return ((i/2)*n+(j/2))%np;
        case ROW_CYCLIC :
            return i % np;
        case COLUMN_CYCLIC :
            return j % np;
    }
}

The optimal distribution might depend on several factors; CnC makes it easy to customize.

[chart: Cholesky speedup over single-node MKL-LAPACK (matrix 16kx16k, blocks 100x100) vs. #nodes (24 h-threads each), 1-32 nodes; MKL-LAPACK, CnC COLUMN_CYCLIC, CnC BLOCKED_ROWS, CnC ROW_CYCLIC, CnC BLOCKED_CYCLIC, MKL-scaLAPACK]


Summary

CnC helps structure a program so that

−It doesn’t limit parallelism and so exposes more potential parallelism

−Tuning can be separated and effective

−It can be optimized for shared and/or distributed memory

−In distributed systems specifically

>A key task of distributing data and work becomes easy

>Minimizing transferred data volume becomes a nice side effect of a good distribution

−Development is productive

>The CnC runtime hides all the difficulties with using low-level techniques

>Result is deterministic and independent of #threads or #processes


The road ahead

•We are working on the tuning side

−A tuning language

−More convenient interfaces

•Runtime optimizations

•Spec-level optimizations

•Fault-tolerance

•Looking for feedback

•Looking for new areas to apply CnC to


Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804


Legal Disclaimer

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Cilk, Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Copyright © 2011-2012. Intel Corporation.

http://intel.com/software/products