
HTAs (Nov 30)

PROGRAMMING FOR PARALLELISM AND LOCALITY WITH HTAs

Presentation by Roman Frigg

Written at UIUC (1), Universidade da Coruna (2) and IBM T.J. Watson Research Center (3) by Ganesh Bikshandi (1), Jia Guo, Daniel Hoeflinger (1), Gheorghe Almasi (3), Basilio B. Fraguela (2), María J. Garzarán (1), David Padua (1) and Christoph von Praun (3)

Paper published at PPoPP, March 2006

PROGRAMMING TODAY'S SYSTEMS

Goals: SCALABILITY, PORTABILITY, PRODUCTIVITY
Means: parallelism and locality
Expressed through abstractions: HTAs

CLASSIFICATION

LIBRARIES: HTA, MPI/PVM, POET, POOMA
LANGUAGES: GAS, X10, CAF, ZPL, TITANIUM, UPC, HPF

HTA:
‣ Library
‣ Matlab & C++
‣ Single threaded, global view

TALK OVERVIEW

1 INTRO
2 HOW HTAs WORK
3 HTA OPERATIONS & APPLICATIONS
4 EVALUATION
5 CONCLUSIONS

RECURSIVE TILING (image source: paper)

A Hierarchically Tiled Array is an array tiled recursively: the outermost level of tiles is distributed across processors, while the inner levels of tiles are local to a processor.

‣ distributed
‣ local
‣ local

CONSTRUCT AN HTA FROM A 6x6 MATRIX

T1 = hta(M, {[1 3 5], [1 3 5]})
    M is the 6x6 matrix; it is partitioned at rows 1, 3, 5 and at columns 1, 3, 5, giving a 3x3 grid of 2x2 tiles.

T2 = hta(T1, {[1 2], [1 3]}, [2 2])
    The tiles of T1 are grouped again, at tile rows [1 2] and tile columns [1 3], and the resulting 2x2 top level is mapped onto a 2x2 processor mesh:
        P1 P2
        P3 P4
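As a generalization of the calls above, here is a small MATLAB-style helper; it is an illustrative sketch, assuming the hta constructor from the paper's toolbox and assuming that a processor-mesh argument may also be given together with a plain matrix, analogously to the T2 call.

  function H = tile_square(M, t)
    % Split the n x n matrix M into a t x t grid of (n/t) x (n/t) tiles
    % and map the grid onto a t x t processor mesh. Assumes mod(n, t) == 0.
    n = size(M, 1);
    cuts = 1:n/t:n;                 % starting row/column of every tile, e.g. [1 3 5] for n=6, t=3
    H = hta(M, {cuts, cuts}, [t t]);
  end

  % usage, reproducing T1's tiling (plus a 3x3 mesh):
  % T = tile_square(rand(6,6), 3);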

HTA ACCESS

C is a hierarchically tiled array (shown as a figure on the slide). Its elements and tiles can be addressed in several ways:

C(1:2,3:6)            flattened () indexing: rows 1-2 and columns 3-6 of the underlying array, possibly spanning several tiles
C{2,1}{1,2}(2,2)      hierarchical {} indexing: outer tile {2,1}, inner tile {1,2}, element (2,2) within it
C(6,4) and C{2,1}(2,4)   two ways of naming the same element highlighted on the slide: flattened indexing, or picking tile {2,1} and indexing inside it
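A small sketch contrasting the access forms, assuming the MATLAB HTA toolbox syntax shown on the slides; one level of tiling is enough to illustrate the flattened/tiled correspondence, so this is not the exact two-level C of the figure.

  C = hta(magic(6), {[1 3 5], [1 3 5]});  % 3x3 grid of 2x2 tiles
  block = C(1:2, 3:6);       % () flattened indexing: may span several tiles
  tile  = C{2,1};            % {} tile indexing: the tile covering rows 3-4, cols 1-2
  C{2,1}(2,2) = 0;           % hybrid: element (2,2) inside tile {2,1} ...
  disp(C(4,2));              % ... which is the same element as flattened C(4,2), so this prints 0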

ASSIGNMENTS & BINARY OPERATORS

Valid operand kinds:  Scalar ⊕ Array ⊕ HTA

Quiz: valid operation?
‣ 4x4 HTA * 2x3 Array
‣ 4x4 HTA * 3x2 Array
‣ 4x4 HTA * Scalar
‣ 4x4 HTA * 4x4 HTA
‣ 4x4 HTA = 4x4 HTA (assignment)

Each case is answered on the slides with a check or a cross: an operation with a scalar is always valid, while an array or another HTA must be conformable with the HTA's tiles (or share its tiling) for the operation or assignment to be valid.
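A hedged sketch of the three operand kinds, assuming the MATLAB HTA toolbox semantics summarized above; the sizes are illustrative and chosen so that the array operand matches the tile size.

  H = hta(rand(4,4), {[1 3], [1 3]});  % a 4x4 matrix as a 2x2 grid of 2x2 tiles
  S = H + 2;                % HTA (op) scalar: always valid, applied elementwise
  A = H + ones(2,2);        % HTA (op) array: valid because the 2x2 array is conformable with every tile
  B = H + H;                % HTA (op) HTA: valid, the two HTAs have the same tiling
  H{1,2} = zeros(2,2);      % assignment of a conformable array into a single tile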

TALK OVERVIEW: next, 3 HTA OPERATIONS & APPLICATIONS

TWO KINDS OF OPERATIONS

GLOBAL COMPUTATIONS
A function is applied to every tile of a distributed HTA; each processor (P1, P2, ...) evaluates g(x) on the tiles it owns, with no data movement:
    parHTA(@g(x), H)

COMMUNICATION OPERATIONS
Operations that move data between tiles, and therefore between processors:
    assignments, repmat, circshift, permute
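A sketch of both kinds of operations, assuming the MATLAB HTA toolbox from the paper: the parHTA signature follows the slide, circshift on tile rows follows the Cannon code later in the talk, and the distributed constructor with a mesh argument is assumed to work analogously to the T2 construction shown earlier.

  H = hta(rand(6,6), {[1 3 5], [1 3 5]}, [3 3]);  % 3x3 grid of 2x2 tiles on a 3x3 mesh (illustrative)
  G = parHTA(@sin, H);                 % global computation: sin runs on every tile, no data movement
  H = circshift(H, [0, -1]);           % communication: every tile moves one tile-column to the left
  H{2,:} = circshift(H{2,:}, [0, -1]); % the same shift restricted to the second row of tiles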

CANNON'S ALGORITHM

function C = cannon(A, B, C)
  % Initialization: skew the tile rows of A and the tile columns of B
  for i = 2:m
    A{i,:} = circshift(A{i,:}, [0, -(i-1)]);
    B{:,i} = circshift(B{:,i}, [-(i-1), 0]);
  end
  % Iteration: multiply-accumulate the co-resident tiles, then rotate
  for k = 1:m
    C = C + A * B;
    A = circshift(A, [0, -1]);
    B = circshift(B, [-1, 0]);
  end
end

A, B and C are m x m tiled, distributed HTAs. The circshift calls are communication operations (they move whole tiles between processors); C = C + A * B is a global computation on the tiles each processor owns.
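A hedged driver sketch for cannon above (not from the paper): it assumes a distributed-HTA constructor of the form hta(matrix, {partition, partition}, [mesh]), analogous to the T2 construction earlier.

  n = 6;  m = 3;                            % 6x6 matrices, split into an m x m tile grid
  cuts = 1:n/m:n;                           % tile start indices: [1 3 5]
  A = hta(rand(n), {cuts, cuts}, [m m]);    % distribute A, B and C on an m x m processor mesh
  B = hta(rand(n), {cuts, cuts}, [m m]);
  C = hta(zeros(n), {cuts, cuts}, [m m]);
  C = cannon(A, B, C);                      % the tiles of C now hold the blocks of A*B
  % (cannon uses m internally; in a runnable version m would be passed in or derived from size(A))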

INITIALIZATION (SKEW), STEP BY STEP

for i = 2:m
  A{i,:} = circshift(A{i,:}, [0, -(i-1)]);
  B{:,i} = circshift(B{:,i}, [-(i-1), 0]);
end

Starting layout: tiles A11..A33 and B11..B33 on a 3x3 mesh.
i = 2: tile row 2 of A (A21 A22 A23) rotates one position to the left; tile column 2 of B (B12 B22 B32) rotates one position up.
i = 3: tile row 3 of A rotates two positions to the left; tile column 3 of B rotates two positions up.
After the skew, tile row i of A has been shifted left by i-1 and tile column j of B has been shifted up by j-1, so each processor holds a pair of tiles it can multiply locally.

ITERATION, STEP BY STEP

for k = 1:m
  C = C + A * B;
  A = circshift(A, [0, -1]);
  B = circshift(B, [-1, 0]);
end

The result tiles C11..C33 stay in place. In every step (k = 1, 2, ..., m) each processor multiplies the A and B tiles it currently holds and accumulates into its C tile; then all tiles of A rotate one position to the left and all tiles of B rotate one position up. After m steps each C tile has accumulated its full block of the product.

TALK OVERVIEW: next, 4 EVALUATION

NASA ADVANCED SUPERCOMPUTING (NAS) BENCHMARKS (image source: paper)

Table 1. Execution times in seconds for some of the applications in the NAS benchmarks, Fortran+MPI versus MATLAB+HTA. The execution time for 1 processor corresponds to the serial application in Fortran or MATLAB, without MPI or HTAs.

         EP (class C)       FT (class B)       CG (class C)       MG (class B)       LU (class B)
Nprocs   F+MPI    M+HTA     F+MPI    M+HTA     F+MPI    M+HTA     F+MPI    M+HTA     F+MPI    M+HTA
1        901.6    3556.9    136.8    657.4     3606.9   3812.0    26.9     828.0     15.7     245.1
4        273.1    888.8     109.1    274.0     362.0    1750.9    17.0     273.8     6.3      60.5
8        136.3    447.0     65.5     159.3     123.4    823.6     9.6      151.3     2.9      29.9
16       68.6     224.8     37.2     87.2      89.5     375.2     4.8      87.0      1.2      16.0
32       34.7     112.0     20.7     42.9      48.4     250.3     3.3      54.9      1.1      9.8
64       17.1     56.7      10.4     24.0      44.5     148.0     1.6      50.4      1.3      7.1
128      8.5      29.1      5.9      15.6      30.8     123.0     1.4      38.5      1.6      N/A
(F+MPI = Fortran+MPI, M+HTA = MATLAB+HTA)

Text of the paper page shown on the slide (evaluation section excerpt):

…tation not involving distributed HTA operations. Since all data are replicated, the behavior in each processor is exactly the same as what would be the behavior of the client, except that no communication is necessary to use data from the main thread in operations on distributed HTAs. On invocation of a method on a distributed HTA, each processor applies the corresponding operation to the tiles of the HTA it owns.

The incorporation of HTAs in MATLAB produced an explicitly parallel programming extension of MATLAB that integrates seamlessly with the language. Most other parallel MATLAB extensions either make use of extraneous primitives (MultiMATLAB [24]) or do not allow explicit parallel programming (Matlab*P [17]). Also, the incorporation of HTAs gives MATLAB a mechanism to access and operate on tiles that is much more powerful than that provided by their native …. The main disadvantage of the implementation is that the immense overhead of interpreted MATLAB limits the efficiency of many applications. The three main sources of this overhead are:

Excessive creation of temporary variables. MATLAB creates temporaries to hold the partial results of expressions, which significantly slows down the programs.
Frequent replication of data. MATLAB passes parameters by value and assignment statements replicate the data.
Interpretation of instructions. The overhead resulting from the interpretation of instructions is more pronounced when the computation relies mainly on scalar operations.

Table 1 presents the execution time for the Fortran+MPI and our MATLAB+HTA implementations of most of the NAS benchmarks. The table shows the execution times in seconds when the applications execute on a cluster of up to 128 processors. Each processor is a 3.2 GHz Intel Xeon connected through Gigabit Ethernet. For the NAS benchmarks we used version 3.1 and compiled them with the Intel ifort compiler, version 8.1, and flag -O3. For MATLAB we used version 7.0.1 (R14). Finally, for MPI we used MPI-LAM [6].

The execution time for 1 processor corresponds to the serial execution of the pure Fortran or MATLAB code, without MPI or HTAs. Results in Table 1 correspond to the class C input for EP and CG, and class B for MG, FT and LU.

As can be seen in the table, in the case of EP and FT the parallel MATLAB code takes advantage of parallelism, leading to execution times that are of the same magnitude as those of the Fortran+MPI code. In the case of CG our parallel MATLAB does reasonably well, although not as well as the Fortran+MPI version, which obtains super-linear speedups when the number of processors is 64 or smaller. However, for MG and LU the performance of the sequential MATLAB implementation was slow and, in the case of MG, the parallel MATLAB does not improve upon the serial Fortran version. Similarly, for BT (not shown) the serial MATLAB version runs so slowly that even the parallel version is not comparable with its sequential Fortran counterpart. Overall, for EP, FT and CG, where the sequential MATLAB version runs 1 to 5 times slower than the Fortran version, the parallel MATLAB implementation does reasonably well, improving upon the serial Fortran version. In these cases it could be said that parallelism at least compensates for the interpretation overhead. For 128 processors the parallel MATLAB obtains speedups of 30.9, 8.8 and 29.3 over the sequential Fortran counterpart for EP, FT and CG, respectively.

4.2 C++
In the C++ implementation, HTAs are represented as composite objects with methods to operate on both distributed and non-distributed HTAs. As in the case of MATLAB, MPI is used for communication and, while the programming model is single threaded, HTA C++ programs execute in SPMD form. To facilitate programming, our C++ implementation enforces an allocation/deallocation policy through reference counting as follows: (1) HTAs are allocated through factory methods on the heap. The methods return a handle which is assigned to a (stack-allocated) variable. (2) All accesses to the HTA occur through this handle, which itself is small in size and typically passed by value across procedure boundaries. (3) Once all handles to an HTA disappear from the stack, the HTA and its related structures are automatically deleted from memory. This design permits sharing of sub-trees among HTAs and also precludes deallocation errors. Moreover, the temporary arrays that are, for instance, created during the partial evaluation of expressions are handled through this mechanism and deleted automatically as early as possible.

Performance is one of the main goals of our C++ implementation. Methods were optimized and, whenever possible, specialized for specific cases. Also, the user is given control over the memory layout of non-distributed HTAs. In MATLAB the layout was in the hands of the system and the user had no way of influencing it. Finally, to enable efficient access to scalar components of HTAs, the implementation was organized to guarantee that hot methods were inlined. This last strategy enabled the codes written using the library to have performance similar to that of traditional (non-HTA) implementations. For example, the code in Figure 13 represents the multiplication of two two-dimensional arrays recursively tiled. The code is similar to the MATLAB code shown in Figure 8.

The code in Figure 13 shows the declaration of the HTAs …, … and …. The … function is the factory method that creates the HTAs. It takes as input the complete tiling information for each HTA: number of tiles in each dimension (…), tile size (…), and memory layout (…, … or …). The function is recursive. When the input …

Annotation on the slide, over the table and text above: "Too many numbers !"
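To connect Table 1 to the speedup charts on the next slides, a small plain-MATLAB sketch (no HTA toolbox required) that recomputes the 128-processor speedups quoted in the excerpt from the serial Fortran times and the 128-processor MATLAB+HTA times:

  serial_fortran  = [901.6 136.8 3606.9];   % EP, FT, CG: 1-processor Fortran times (s)
  matlab_hta_p128 = [29.1  15.6  123.0];    % EP, FT, CG: 128-processor MATLAB+HTA times (s)
  speedup = serial_fortran ./ matlab_hta_p128
  % prints roughly [31.0 8.8 29.3]; the text reports 30.9, 8.8 and 29.3
  % (the small difference for EP comes from rounding in the published table)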

EP (embarrassingly parallel)

(plot: speedup factor vs. number of processors, MATLAB+HTA and Fortran+MPI, on up to 128 3.2 GHz Intel Xeons connected by Gigabit Ethernet)

Annotation: linear speedup. Sequential speed: MATLAB+HTA runs at 25% of the Fortran+MPI sequential speed (100%).

FFT (fast Fourier transform, the NAS FT benchmark)

(plot: speedup factor vs. number of processors, MATLAB+HTA and Fortran+MPI, same 128-Xeon cluster)

Annotation: HTAs scale better. Sequential speed: MATLAB+HTA at 21% of Fortran+MPI.

CG (conjugate gradient)

(plot: speedup factor vs. number of processors, MATLAB+HTA and Fortran+MPI, same cluster)

Annotation: the MPI version shows super-linear speedup. Sequential speed: MATLAB+HTA at 95% of Fortran+MPI.

MG (multigrid)

(plot: speedup factor vs. number of processors, MATLAB+HTA and Fortran+MPI, same cluster)

Annotation: the HTA version is slow. Sequential speed: MATLAB+HTA at only 3% of Fortran+MPI.

LU (LU factorization)

(plot: speedup factor vs. number of processors, MATLAB+HTA and Fortran+MPI, same cluster)

Annotation: slow again; no data for 128 processors. Sequential speed: MATLAB+HTA at 6% of Fortran+MPI.

PERFORMANCE OF C++ HTAs: MMM (matrix-matrix multiplication)

(plot: MFLOPS vs. matrix size, for sizes 504, 1008, 2016, 3024 and 4032; series: Naive 3 loops, HTA naive, Tiled 6 loops, HTA+ATLAS, ATLAS, Intel MKL)

Platform: Intel Pentium 4, 3.0 GHz, 8 KB L1 cache. Annotation on the plot: 8-13.5%.

LINES OF CODE COMPARISON (image source: paper)

(bar chart: lines of code, 0 to 1200, for ep, cg, mg, ft and lu; one HTA bar and one MPI bar per benchmark, each split into computation, communication and data decomposition)

TALK OVERVIEW: next, 5 CONCLUSIONS

CONCLUSIONS

HTAs address the three goals from the introduction: SCALABILITY, PORTABILITY and PRODUCTIVITY.

FURTHER INFORMATION

http://polaris.cs.uiuc.edu/hta/

THANKS FOR YOUR ATTENTION

Q&A: PUT YOUR QUESTIONS