Matrix Multiplication on Two Interconnected Processors
Brett A. Becker and Alexey Lastovetsky
Heterogeneous Computing Laboratory
School of Computer Science and Informatics
University College Dublin
_______________________________________________________
HeteroPar’06 Barcelona Sept. 28, 2006
Outline
● Motivation and Goals
● Introduction: ‘Straight-Line’ Partitionings
● The ‘Square-Corner’ Partitioning - Minimizing the Total Volume of Communication
● MPI Experiments / Results
● Conclusion / Future Work
Motivation and Goals
● Partitioning algorithms for matrix-matrix multiplication (MMM) designed for n processors produce partitionings that are not always optimal on a small number of processors.
● We seek to lower the Total Volume of Communication by utilizing a new partitioning strategy.
● Our ultimate interest is to determine whether the Square-Corner Partitioning is a viable technique for deployment on 2 interconnected clusters.
Background: Straight-Line Partitioning
$S = \sum_{i=1}^{p} (w_i + h_i)$
Total Volume of Inter-Processor Communication (TVC) is proportional to the Sum of Half-Perimeters (S)
Lower Bound (L) of S is when all partitions are square
$L = 2 \sum_{i=1}^{p} \sqrt{a_i}$
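The two quantities above can be evaluated directly for any rectangular partitioning. The following is a minimal sketch, not taken from the slides: the function names and the 12 x 12 example layout are hypothetical, and each partition is assumed to be given as a (width, height) pair.

```python
import math

def sum_half_perimeters(rects):
    """S: sum of (width + height) over all rectangular partitions."""
    return sum(w + h for w, h in rects)

def lower_bound(areas):
    """L: the value S would take if every partition were a square of its area."""
    return 2 * sum(math.sqrt(a) for a in areas)

# Hypothetical example: a 12 x 12 matrix cut into three full-height strips.
rects = [(6, 12), (4, 12), (2, 12)]
areas = [w * h for w, h in rects]

S = sum_half_perimeters(rects)   # (6+12) + (4+12) + (2+12) = 48
L = lower_bound(areas)           # strictly below S, since no strip is square
print(S, round(L, 2))
```

Because none of the strips is square, S stays above L here; the gap between the two is what a better partitioning tries to close.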
Straight-Line Partitioning
From: Olivier Beaumont, Vincent Boudet, Fabrice Rastello and Yves Robert, “Matrix-Matrix Multiplication on Heterogeneous Platforms”, IEEE Transactions on Parallel and Distributed Systems, 2001, Vol.12, No.10, pp.1033-1051.
Figure: average and minimum values of the ratio between S and L for two million randomly generated areas
Background: Straight-Line Partitioning, 2 Processors
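$S = \sum_{i=1}^{2} (w_i + h_i) = w_1 + h_1 + w_2 + h_2 = 3N$
$L = 2 \sum_{i=1}^{2} \sqrt{a_i} = 2(\sqrt{N^2 - a_2} + \sqrt{a_2})$, so $L \to 2N$ as $a_2 \to 0$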
The Straight-Line Partitioning cannot meet the lower bound, L
Background: Straight-Line Partitioning, 2 Processors
Total Volume of Inter-Processor Communication (TVC) = $N^2$
$\mathrm{TVC} \to N^2$ as the smaller partition's area $\to 0$
Introduction: Square-Corner Partitioning
$\mathrm{TVC} \to 0$ as $X \to 0$
N2TVC
Square-Corner Partitioning
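$S = \sum_{i=1}^{2} (w_i + h_i) = w_1 + h_1 + w_2 + h_2$, so $S \to 2N$ as $a_2 \to 0$
$L = 2 \sum_{i=1}^{2} \sqrt{a_i} = 2(\sqrt{N^2 - a_2} + \sqrt{a_2})$, so $L \to 2N$ as $a_2 \to 0$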
The Square-Corner Partitioning can meet the lower bound, L
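A small numerical check can make the contrast between the two partitionings concrete. The sketch below is illustrative only and rests on stated assumptions: `slow_share` (a hypothetical parameter) is the fraction of the N x N matrix given to the slower processor, the Straight-Line layout is two full-height strips, and the Square-Corner layout is a square of side x in one corner, with the remaining partition counted at half-perimeter N + N.

```python
import math

def straight_line_S(N, slow_share):
    """Two full-height strips of widths w1 and w2, with w1 + w2 = N."""
    w2 = slow_share * N
    w1 = N - w2
    return (w1 + N) + (w2 + N)            # always 3N

def square_corner_S(N, slow_share):
    """Square of side x for the slower processor, remainder for the faster."""
    x = N * math.sqrt(slow_share)
    return (N + N) + (x + x)              # 2N + 2x

def lower_bound(N, slow_share):
    """L = 2 * (sqrt(a1) + sqrt(a2)) with a1 + a2 = N^2."""
    a2 = slow_share * N * N
    a1 = N * N - a2
    return 2 * (math.sqrt(a1) + math.sqrt(a2))

N = 1000
for share in (0.25, 0.1, 0.01, 0.0001):
    L = lower_bound(N, share)
    print(share,
          round(straight_line_S(N, share) / L, 3),
          round(square_corner_S(N, share) / L, 3))
```

As the slower processor's share shrinks, the Straight-Line ratio S/L tends towards 3N/2N = 1.5, while the Square-Corner ratio tends towards 1.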
Square-Corner Partitioning
Figure: average and minimum values of the ratio between S and L for 2 million randomly generated areas
Power Ratio > 3:1
Adapted From: Olivier Beaumont, Vincent Boudet, Fabrice Rastello and Yves Robert, “Matrix-Matrix Multiplication on Heterogeneous Platforms”, IEEE Transactions on Parallel and Distributed Systems, 2001, Vol.12, No.10, pp.1033-1051.
Square-Corner Partitioning: Minimizing the TVC
Theorem: The Square-Corner Partitioning has a lower Total Volume of Communication than the Straight-Line Partitioning, provided the processor power ratio is > 3:1.
Theorem: The Total Volume of Communication is minimized when the slower processor's partition is a square.
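The 3:1 threshold can be reproduced with a short calculation. The sketch below is a derivation under stated assumptions rather than the authors' code: each processor is assumed to start with the parts of A and B that match its partition of C, partition areas are assumed proportional to processor power, and TVC counts the elements each processor must receive. Under those assumptions the Straight-Line TVC is N^2 and the Square-Corner TVC works out to 2 * N * x, where x is the side of the slower processor's square.

```python
import math

def tvc_straight_line(N):
    """Each processor needs the other's strip of A (or B): N*w1 + N*w2 = N^2."""
    return N * N

def tvc_square_corner(N, ratio):
    """ratio is the fast:slow power ratio, e.g. 3.0 for 3:1 (hypothetical parameter)."""
    slow_share = 1.0 / (ratio + 1.0)      # slower processor's share of C
    x = N * math.sqrt(slow_share)         # side of its square
    # Slower processor receives 2*x*(N - x); faster receives 2*x*x.
    return 2 * x * (N - x) + 2 * x * x    # simplifies to 2*N*x

N = 6500
for ratio in (1, 2, 3, 5, 10):
    print(ratio, tvc_square_corner(N, ratio) / tvc_straight_line(N))
```

The printed quotient equals 1 at a ratio of 3:1 (where x = N/2) and falls below 1 for larger ratios, which matches the condition stated in the first theorem.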
Results: Square-Corner Partitioning
Matrix-Matrix Multiplication, N=6500, Bandwidth = 80Mb/s
Lower TVC → Lower Communication Time → Lower Execution Time
Average Reduction in Communication Time = 45%
Average Reduction in Execution Time = 14%
Results: Square-Corner Partitioning
Matrix-Matrix Multiplication, N=6500, Bandwidth = 380Mb/s
Average Reduction in Communication Time = 44%
Lower TVC → Lower Communication Time → Lower Execution Time
Average Reduction in Execution Time = 10%
Square-Corner Partitioning: Overlapping Communication and Computation
A sub-partition of Processor 1’s C Partition is Immediately Calculable
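One way to see which part of Processor 1's partition needs no remote data is to check ownership element by element. The sketch below is illustrative and assumes a layout the slide does not spell out: Processor 2 owns the x-by-x square in the bottom-right corner of A, B and C, and Processor 1 owns everything else. All function names are hypothetical.

```python
def owned_by_p1(i, j, N, x):
    """True if element (i, j) lies outside the bottom-right x-by-x square."""
    return not (i >= N - x and j >= N - x)

def needs_only_local_data(i, j, N, x):
    """C[i][j] = sum_k A[i][k] * B[k][j]; check P1 owns every operand."""
    return all(owned_by_p1(i, k, N, x) and owned_by_p1(k, j, N, x)
               for k in range(N))

def immediately_calculable(N, x):
    """Elements of P1's C partition computable before any communication."""
    return {(i, j)
            for i in range(N) for j in range(N)
            if owned_by_p1(i, j, N, x) and needs_only_local_data(i, j, N, x)}

# Small hypothetical case: 6 x 6 matrices with a 2 x 2 corner square.
ready = immediately_calculable(6, 2)
print(len(ready))   # the (N - x) x (N - x) block in the opposite corner
```

Under this layout the immediately calculable region is the (N - x) by (N - x) block whose rows and columns avoid the square, so Processor 1 can work on it while the remaining data is in transit.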
Square-Corner Partitioning: Overlapping Communication and Computation
Overlapping more than doubled the advantage of the Square-Corner algorithm.
● No Overlapping → 17% faster than the Straight-Line algorithm.
● Overlapping → 39% faster than the Straight-Line algorithm.
Algorithm                         Execution Time   Speedup
Straight-Line                     83s              0.94
Square-Corner (No Overlapping)    69s              1.13
Square-Corner (Overlapping)       51s              1.53
Sequential                        78s              N/A
MM Multiplication, N=4500, Bandwidth=100Mb/s, Ratio=5:1
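The speedup figures and the percentages quoted on this slide follow from the reported execution times. The short sketch below reproduces them, assuming speedup is defined as sequential time divided by parallel time; the helper names are hypothetical.

```python
def speedup(t_sequential, t_parallel):
    """Sequential time divided by parallel time."""
    return t_sequential / t_parallel

def percent_faster(t_baseline, t_new):
    """Reduction in execution time relative to the baseline, in percent."""
    return 100.0 * (t_baseline - t_new) / t_baseline

sequential = 78  # seconds, from the slide
times = {
    "Straight-Line": 83,
    "Square-Corner (No Overlapping)": 69,
    "Square-Corner (Overlapping)": 51,
}

for name, t in times.items():
    print(name, round(speedup(sequential, t), 2),
          round(percent_faster(times["Straight-Line"], t)))
```

A speedup below 1 for the Straight-Line algorithm means it ran slower than the sequential version in this experiment.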
Square-Corner Partitioning: Two-Cluster Architecture
Total of 20 Homogeneous Nodes in 2 Clusters
Square-Corner Partitioning: Two Clusters
Algorithm        Execution Time   Speedup
Straight-Line    123s             1.04
Square-Corner    115s             1.11
Sequential       128s             N/A
MM Multiplication, N=9000, Bandwidth=100Mb/s
All Machines are Homogeneous. One Cluster of 4, One Cluster of 16
Conclusions
● The Square-Corner Partitioning reduces the Total Volume of Communication provided the processor power ratio is > 3:1
● The possibility of Overlapping Communication and Computation can bring further reductions in Execution Time
● The Square-Corner Partitioning is viable on Two Clusters
Current and Future Work
● We have successfully extended the Square-Corner Partitioning to Three Processors
To do:
● Experiment on more Two-Cluster architectures
● Overlap Communication and Computation on Two Clusters
● Extend the Three-Processor Algorithm to Three Clusters
Acknowledgements
This work was supported by: