
Matrix Multiplication on Two Interconnected Processors

Brett A. Becker and Alexey Lastovetsky

Heterogeneous Computing Laboratory

School of Computer Science and Informatics

University College Dublin

_______________________________________________________

HeteroPar’06 Barcelona Sept. 28, 2006

Outline

● Motivation and Goals

● Introduction: ‘Straight-Line’ Partitionings

● The ‘Square-Corner’ Partitioning - Minimizing the Total Volume of Communication

● MPI Experiments / Results

● Conclusion / Future Work

Motivation and Goals

● Partitioning algorithms for matrix-matrix multiplication (MMM) designed for n processors produce partitionings that are not always optimal on a small number of processors.

● We seek to lower the Total Volume of Communication by utilizing a new partitioning strategy.

● Our ultimate interest is to determine whether the Square-Corner partitioning is a viable technique for deployment on 2 interconnected clusters.

Background: Straight-Line Partitioning

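Total Volume of Inter-Processor Communication (TVC) is proportional to the Sum of Half-Perimeters (S), where $h_i$ and $w_i$ are the height and width of partition i:

$S = \sum_{i=1}^{p} (h_i + w_i)$

The Lower Bound (L) of S is reached when all partitions are square, where $a_i$ is the area of partition i:

$L = 2 \sum_{i=1}^{p} \sqrt{a_i}$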
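The two quantities above can be sketched in a few lines of Python. This is only an illustration: the function names and the 12 x 12 example layout are hypothetical and do not come from the slides.

```python
import math

def half_perimeter_sum(rects):
    """S: sum of (height + width) over all rectangular partitions."""
    return sum(h + w for h, w in rects)

def lower_bound(areas):
    """L: the value S would take if every partition were a square of the same area."""
    return 2 * sum(math.sqrt(a) for a in areas)

# Hypothetical example: a 12 x 12 matrix split into three full-height columns.
rects = [(12, 6), (12, 4), (12, 2)]       # (h_i, w_i) for each partition
areas = [h * w for h, w in rects]         # a_i = h_i * w_i

S = half_perimeter_sum(rects)
L = lower_bound(areas)
print(f"S = {S}, L = {L:.2f}, S/L = {S / L:.3f}")
```

Because none of the three columns is square, S stays above L in this example; the two coincide only when every partition is a square.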

Straight-Line Partitioning

From: Olivier Beaumont, Vincent Boudet, Fabrice Rastello and Yves Robert, “Matrix-Matrix Multiplication on Heterogeneous Platforms”, IEEE Transactions on Parallel and Distributed Systems, 2001, Vol.12, No.10, pp.1033-1051.

[Figure: average and minimum values of the ratio between S and its lower bound L for two million randomly generated areas]

Background: Straight-Line Partitioning - 2 Processors

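$S = \sum_{i=1}^{2} (h_i + w_i) = (h_1 + w_1) + (h_2 + w_2) = 3N$

$L = 2 \sum_{i=1}^{2} \sqrt{a_i}$, and $L \to 2N$ as the smaller partition's area $\to 0$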

The Straight-Line Partitioning cannot meet the lower bound, L
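A small numerical check makes the gap visible. The sketch below assumes that each processor's area is proportional to its speed (so a ratio of r:1 gives the slower processor a share of 1/(r+1)); the helper names are hypothetical, and N = 6500 is simply borrowed from the experiments later in the talk.

```python
import math

def straight_line(n, ratio):
    """Two full-height rectangles of an n x n matrix with areas in proportion ratio:1."""
    w_slow = n / (ratio + 1)              # assumption: area proportional to speed
    w_fast = n - w_slow
    return [(n, w_fast), (n, w_slow)]     # (h_i, w_i)

def half_perimeter_sum(rects):
    return sum(h + w for h, w in rects)

def lower_bound(rects):
    return 2 * sum(math.sqrt(h * w) for h, w in rects)

n = 6500
for ratio in (1, 3, 10, 100):
    rects = straight_line(n, ratio)
    S = half_perimeter_sum(rects)
    L = lower_bound(rects)
    print(f"ratio {ratio}:1  S/N = {S / n:.3f}  L/N = {L / n:.3f}")
```

S stays at 3N for every ratio, while L drifts down towards 2N as the slower processor's share shrinks, so the distance between the two grows with the power ratio.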

Background: Straight-Line Partitioning - 2 Processors

Total Volume of Inter-Processor Communication (TVC) = $N^2$

$\mathrm{TVC} \to N^2$ as the smaller partition's area $\to 0$

Introduction: Square-Corner Partitioning

$\mathrm{TVC} = 2N \times$ (side of the corner square)

$\mathrm{TVC} \to 0$ as the side of the corner square $\to 0$

Square-Corner Partitioning

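$S = \sum_{i=1}^{2} (h_i + w_i) = (h_1 + w_1) + (h_2 + w_2)$, and $S \to 2N$ as the smaller partition's area $\to 0$

$L = 2 \sum_{i=1}^{2} \sqrt{a_i}$, and $L \to 2N$ as the smaller partition's area $\to 0$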

The Square-Corner Partitioning can meet the lower bound, L
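The convergence of S towards L can be checked numerically. This sketch rests on two assumptions that are not spelled out on the slide: areas are proportional to processor speed, and the half-perimeter of the larger (non-rectangular) partition is taken as that of its N x N bounding rectangle. The function name is hypothetical.

```python
import math

def square_corner_sums(n, ratio):
    """Return (S, L) for a square-corner split of an n x n matrix at power ratio ratio:1."""
    side = n / math.sqrt(ratio + 1)       # assumption: square area = n^2 / (ratio + 1)
    S = (n + n) + (side + side)           # bounding rectangle of partition 1, plus the square
    L = 2 * (math.sqrt(n * n - side * side) + side)
    return S, L

n = 6500
for ratio in (3, 10, 100, 10000):
    S, L = square_corner_sums(n, ratio)
    print(f"ratio {ratio}:1  S/N = {S / n:.3f}  L/N = {L / n:.3f}  S/L = {S / L:.4f}")
```

Under these assumptions both S and L tend to 2N as the ratio grows, so S/L approaches 1, which is the sense in which this partitioning can meet the lower bound.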

Square-Corner Partitioning

[Figure: average and minimum values of the ratio between S and its lower bound L for 2 million randomly generated areas]

Power Ratio > 3:1

Adapted From: Olivier Beaumont, Vincent Boudet, Fabrice Rastello and Yves Robert, “Matrix-Matrix Multiplication on Heterogeneous Platforms”, IEEE Transactions on Parallel and Distributed Systems, 2001, Vol.12, No.10, pp.1033-1051.

Square-Corner Partitioning - Minimizing the TVC

Theorem: The Square-Corner Partitioning has a lower Total Volume of Communication than the Straight-Line Partitioning, provided the processor power ratio is > 3:1.

Theorem: The Total Volume of Communication is minimized when the slower processor's partition is a square.
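The 3:1 threshold can be reproduced with a short calculation. The sketch below assumes a straight-line TVC of N^2 (as stated earlier in the talk) and a square-corner TVC of 2N times the side of the slower processor's square, with the square's area taken as N^2/(ratio + 1); the function names are hypothetical.

```python
import math

def tvc_straight_line(n):
    """Straight-line TVC, taken as N^2 from the slides."""
    return n * n

def tvc_square_corner(n, ratio):
    """Square-corner TVC, assumed to be 2 * N * side of the slower processor's square."""
    side = n / math.sqrt(ratio + 1)       # assumption: area proportional to speed
    return 2 * n * side

n = 6500
for ratio in (1, 2, 3, 5, 10):
    sl = tvc_straight_line(n)
    sc = tvc_square_corner(n, ratio)
    print(f"ratio {ratio}:1  square-corner / straight-line TVC = {sc / sl:.3f}")
```

With these formulas the two volumes are equal exactly at 3:1, because the square's side is then N/2; below that ratio the straight-line split communicates less, and above it the square-corner split does.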

Results: Square-Corner Partitioning

Matrix-Matrix Multiplication, N=6500, Bandwidth = 80Mb/s

Lower TVC → Lower Communication Time → Lower Execution Time

Average Reduction in Communication Time = 45%

Average Reduction in Execution Time = 14%

Results: Square-Corner Partitioning

Matrix-Matrix Multiplication, N=6500, Bandwidth = 380Mb/s

Average Reduction in Communication Time = 44%

Lower TVC → Lower Communication Time → Lower Execution Time

Average Reduction in Execution Time = 10%

Square-Corner Partitioning - Overlapping Communication and Computation

A sub-partition of Processor 1’s C Partition is Immediately Calculable
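One way to see why such a sub-partition exists is to enumerate the entries of C whose inputs are already local to Processor 1. The sketch below assumes that A, B and C share the same partitioning and that Processor 2's square of side s sits in the bottom-right corner; the function name and the small sizes are hypothetical.

```python
def immediately_calculable(n, s):
    """Entries C[i][j] that Processor 1 can compute before any data arrives.

    C[i][j] needs all of row i of A and all of column j of B, so it is ready
    as soon as neither that row nor that column crosses Processor 2's square.
    """
    def owned_by_p2(i, j):
        return i >= n - s and j >= n - s  # assumption: square in the bottom-right corner

    ready = set()
    for i in range(n):
        for j in range(n):
            if owned_by_p2(i, j):
                continue                  # this entry of C belongs to Processor 2
            row_local = all(not owned_by_p2(i, k) for k in range(n))
            col_local = all(not owned_by_p2(k, j) for k in range(n))
            if row_local and col_local:
                ready.add((i, j))
    return ready

n, s = 8, 3
ready = immediately_calculable(n, s)
print(len(ready), (n - s) ** 2)
```

Under these assumptions the ready entries form an (N - s) x (N - s) block, which Processor 1 can work on while the remaining data is still in transit.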

Square-Corner Partitioning - Overlapping Communication and Computation

Overlapping more than doubled the advantage of the Square-Corner algorithm.

● No Overlapping → 17% faster than Straight-Line algorithm.

● Overlapping → 39% faster than Straight-Line algorithm.

Algorithm                         Execution Time   Speedup
Straight-Line                     83s              0.94
Square-Corner (No Overlapping)    69s              1.13
Square-Corner (Overlapping)       51s              1.53
Sequential                        78s              N/A

MM Multiplication, N=4500, Bandwidth=100Mb/s, Ratio=5:1
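The speedup column and the "faster than Straight-Line" percentages follow directly from the execution times. A minimal sketch of the arithmetic, using only the timings reported above (the helper names are hypothetical):

```python
sequential = 78                           # seconds, from the table
parallel = {
    "Straight-Line": 83,
    "Square-Corner (No Overlapping)": 69,
    "Square-Corner (Overlapping)": 51,
}

def speedup(seq_time, par_time):
    """Speedup relative to the sequential run."""
    return seq_time / par_time

def percent_faster(baseline, time):
    """Reduction in execution time relative to a baseline, in percent."""
    return 100 * (baseline - time) / baseline

for name, t in parallel.items():
    gain = percent_faster(parallel["Straight-Line"], t)
    print(f"{name}: speedup {speedup(sequential, t):.2f}, "
          f"{gain:.0f}% faster than Straight-Line")
```

A speedup below 1 for the Straight-Line run means that, at this ratio and bandwidth, it was slower than running on the fastest processor alone.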

Square-Corner Partitioning - Two Cluster Architecture

Total of 20 Homogeneous Nodes in 2 Clusters

Square-Corner Partitioning - Two Clusters

Algorithm        Execution Time   Speedup
Straight-Line    123s             1.04
Square-Corner    115s             1.11
Sequential       128s             N/A

MM Multiplication, N=9000, Bandwidth=100Mb/s

All Machines are Homogeneous. One Cluster of 4, One Cluster of 16

Conclusions

● The Square-Corner Partitioning reduces the Total Volume of Communication provided the processor power ratio is > 3:1

● The possibility of Overlapping Communication and Computation can bring further reductions in Execution Time

● The Square-Corner Partitioning is viable on Two Clusters

_______________________________________________________

Current and Future Work

● We have successfully extended the Square-Corner Partitioning to Three Processors

To do:

● Experiment on more Two-Cluster architectures

● Overlap Communication and Computation on Two Clusters

● Extend the Three-Processor Algorithm to Three Clusters

_______________________________________________________

Acknowledgements

This work was supported by:
