
MPI Collectives Optimizations on HPC Rahul & Nikhil

CMPE – 655 Fall 2016

Instructor: Dr. Shaaban.

Dec 7, 2016.


Outline

Need for HPC?

Different factors governing HPC performance

Overview of HPC architectures

MPI & MPICH

Optimization options in HPC

Optimizing MPI Collectives on IBM Blue Gene/Q

Implementation of the optimizations on Blue Gene

Results

Conclusion


History of Computers…


Need for High Performance

Anything above 10^12 FLOPS (a teraflop) is categorized as a supercomputer. The next step is exascale: a billion billion (10^18) calculations per second.

Exascale computing would be a significant achievement in computer engineering, since it is believed to be on the order of the processing power of the human brain at the neural level (the functional equivalent might be lower). It is, for instance, the target capability of the Human Brain Project.

Used in: scientific research, academic and institutional research, government agencies & defense, complex modelling, simulation, biological research, astronomy, big data.


Factors governing performance in computer architecture

Physical design

Frequency scaling

Macro-architecture & micro-architecture design

Trend towards multi-core and multi-processor designs

Interconnect network design

Enhanced memory hierarchy

Optimizing system calls for communication

These are the key points for exascale design exploration.


MPI (Message Passing Interface)

MPI addresses primarily the message-passing parallel programming model, in which data is moved from the address space of one process to that of another process through cooperative operations on each process.

MPI is a specification, not an implementation. The specification is realized as library calls in languages such as C, C++, and Fortran.

Goal: design an API for efficient and reliable communication in heterogeneous environments that is easy to port across programming languages such as C and Fortran.

This creates an open standard for implementing inter-node communication without worrying about the underlying architecture of the machine.

Open MPI is one implementation of MPI.
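A minimal C sketch of this cooperative send/receive model (plain MPI point-to-point, nothing Blue Gene specific; run with at least two ranks, e.g. mpirun -np 2):

    /* Rank 0 moves an integer from its address space to rank 1's. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* cooperative send... */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                          /* ...matched by a receive */
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }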


MPICH

MPICH, formerly known as MPICH2, is a freely available, portable implementation of MPI, the standard for message passing in distributed-memory parallel computing. MPICH is free and open-source software, with some public-domain components developed by a US government organization, and is available for most flavors of Unix-like operating systems (including Linux and Mac OS X).

Argonne National Laboratory developed the early versions (MPICH-1) as public-domain software. The "CH" part of the name derives from "Chameleon", a portable parallel programming library developed by William Gropp, one of the founders of MPICH.

IBM Blue Gene uses MPICH implementation of MPI Standard.


IBM Blue Gene Q Design Overview

The system can scale to more than 262,144 nodes

Can deliver more than 100 PF peak performance

Highest power efficiency (green computing)

Low-latency, high-bandwidth interconnect network

Low-latency, high-bandwidth memory system

Open-source, standards-based programming: RHEL, SIMD, C/C++ compilers, PAMI


Specs

16 user cores + 1 operating-system core + 1 redundant (spare) core per chip

4 hardware threads per core, each with its own register file

1.6 GHz @ 0.8 V, 64-bit Power ISA

L1 I/D cache: 16 KB / 16 KB

Peak performance: 204.8 GFLOPS @ 55 W

Centralized shared L2 cache: 32 MB

Chip-to-chip networking over a 5D torus

In-order execution, dynamic branch prediction

Double precision & SIMD
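As a quick sanity check on the peak figure (assuming the SIMD unit performs a 4-wide double-precision fused multiply-add, i.e., 8 flops per core per cycle): 16 cores × 1.6 GHz × 8 flops/cycle = 204.8 GFLOPS per chip.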


Blue Gene Inter-process Communication Network Performance

All-to-all: 97% of peak

Bisection: > 93% of peak

Collective reductions: 94.6% of peak

Nearest neighbor: 98% of peak

(Measured at 1.76 GB/s per link)

Integrated 5D torus, virtual cut-through routing, RDMA (integrated on-chip messaging unit)

Hardware latency (96-rack, 20 PF system): nearest 80 ns, farthest 3 µs


Optimization Approach for MPI Collectives on Blue Gene

Techniques to optimize various MPI collective operations on Blue Gene/Q using the PAMI (Parallel Active Messaging Interface) asynchronous library together with several new hardware features of the torus network.

This approach achieves higher throughput and lower latency for MPI collectives such as MPI_Allreduce, MPI_Bcast, and MPI_Alltoall.

When an MPI job runs more than one process per node, a shared-memory technique is used.

For MPI_Reduce, the main challenge is to sum incoming packets with the local buffer and forward the result to the parent node. This is achieved through parallel hardware threads.


Blue Gene Hardware Features

Scalable atomics: integer adders for load-increment, store-add, etc. (a portable sketch follows this list)

Wake-up unit: threads can enter a wait state (sleep) via the wait instruction and consume no resources until woken.

Network architecture: 5D torus

Each node has 10 torus links + 1 I/O link, each capable of simultaneously transmitting and receiving 2 GB/s.
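A portable C11 sketch of the fetch-and-add behaviour behind such scalable atomics (standard C atomics and pthreads, not the Blue Gene L2 intrinsics):

    /* Four threads reserve unique slots from a shared counter without a lock. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_long tickets = 0;                     /* shared counter */

    static void *worker(void *arg)
    {
        (void)arg;
        long my_slot = atomic_fetch_add(&tickets, 1);   /* hardware-style load-increment */
        printf("got slot %ld\n", my_slot);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("total increments: %ld\n", atomic_load(&tickets));
        return 0;
    }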


Software System

Optimized MPICH

PAMI messaging libraries

Communication threads: during execution, comm threads detect conditions where no communication is in progress and invoke the wait instruction, eliminating any impact on the other compute threads.

When the main thread reaches a data-movement or communication phase in the application, it can generate a work request and post it to the work queue in the wake region of shared memory.

The comm thread then wakes up and performs the actual work.
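A portable sketch of this post-and-wake pattern, with a pthread condition variable standing in for the wait instruction and the wake-up unit (the names and structure are illustrative, not the PAMI implementation):

    /* Main thread posts a work request; the comm thread sleeps until woken. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  wake = PTHREAD_COND_INITIALIZER;
    static int work_posted = 0;            /* stands in for the work queue in shared memory */

    static void *comm_thread(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!work_posted)               /* no communication pending: sleep */
            pthread_cond_wait(&wake, &lock);
        pthread_mutex_unlock(&lock);
        printf("comm thread: performing the posted communication work\n");
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, comm_thread, NULL);

        /* main thread reaches its communication phase and posts a work request */
        pthread_mutex_lock(&lock);
        work_posted = 1;
        pthread_cond_signal(&wake);        /* wake the comm thread */
        pthread_mutex_unlock(&lock);

        pthread_join(t, NULL);
        return 0;
    }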


Parallel Active Messaging Interface (PAMI) Goals

Novel techniques to partition the application's communication overhead into many contexts that can be accelerated by communication threads

Client and context objects to support multiple, different programming paradigms

Lockless algorithms to speed up the MPI messaging rate

Novel techniques leveraging the new Blue Gene Q hardware features.


PAMI Architecture

Supporting multiple programming models

PAMI client: encapsulates all communication data structures, such as contexts and endpoints, the communication progress models, and the networking/messaging-unit resources, such as access to the collective tree.

A PAMI client instantiates one or more communication contexts.

A context is a collection of software communication devices on which progress is made by an application thread or a communication thread.

A thread-safe work queue provides an efficient lock-less hand-off mechanism between application threads and communication threads.
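The kind of lock-less hand-off meant here can be illustrated with a minimal single-producer/single-consumer ring buffer in C11 (an illustrative sketch, not PAMI's actual data structure):

    /* Lock-free hand-off between one application thread (producer)
     * and one communication thread (consumer). */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define QSIZE 64                         /* must be a power of two */

    typedef struct {
        void         *slot[QSIZE];
        atomic_size_t head;                  /* advanced by the producer */
        atomic_size_t tail;                  /* advanced by the consumer */
    } workq_t;

    bool workq_push(workq_t *q, void *item)  /* application thread */
    {
        size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
        if (head - tail == QSIZE) return false;            /* queue full */
        q->slot[head & (QSIZE - 1)] = item;
        atomic_store_explicit(&q->head, head + 1, memory_order_release);
        return true;
    }

    bool workq_pop(workq_t *q, void **item)  /* communication thread */
    {
        size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
        if (tail == head) return false;                    /* queue empty */
        *item = q->slot[tail & (QSIZE - 1)];
        atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
        return true;
    }

    int main(void)
    {
        static workq_t q;                    /* zero-initialized: head == tail == 0 */
        int payload = 7;
        void *out = NULL;
        workq_push(&q, &payload);            /* application thread posts work */
        if (workq_pop(&q, &out))             /* communication thread picks it up */
            printf("handed off %d\n", *(int *)out);
        return 0;
    }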


Overview of PAMI


PAMI Flow


MPI Collectives on Blue Gene/Q using PAMI

Operations such as MPI_Barrier, MPI_Bcast, MPI_Reduce, and MPI_Allreduce are the MPI collectives.

They directly use the hardware optimizations and PAMI library calls for an optimized implementation of message passing.
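For reference, the collectives named above in a minimal C program (standard MPI calls, independent of the Blue Gene optimizations):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int root_data = (rank == 0) ? 100 : 0;
        MPI_Bcast(&root_data, 1, MPI_INT, 0, MPI_COMM_WORLD);                /* root -> all */

        int local = rank, sum = 0, allsum = 0;
        MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);    /* all -> root */
        MPI_Allreduce(&local, &allsum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD); /* result on every rank */

        MPI_Barrier(MPI_COMM_WORLD);                                         /* synchronize all ranks */
        if (rank == 0)
            printf("bcast=%d reduce=%d allreduce=%d over %d ranks\n",
                   root_data, sum, allsum, size);

        MPI_Finalize();
        return 0;
    }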


Collective Algorithms

A rich set of network hardware acceleration exists for the different collective operations. The following algorithms use PAMI library calls along with MPI.

Binomial algorithm

Rectangle algorithm

Collective network


Collective network

Within each node, all processes use shared memory for intra-node communication.

Collective networks are further supported by RDMA, which moves data between nodes without making redundant copies or using CPU cycles.

Exploits the global interrupt and collective tree networks.

For broadcast, each node provides a master process that communicates over the network with the other nodes; the master receives the broadcast message and copies it into shared memory, which all local processes can access (a portable sketch of this master-per-node pattern follows this list).

For a reduce operation, all local processes perform the reduction to obtain a per-node result, which the master then communicates onward.

Used to optimize barrier, broadcast, and reduction.
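A portable MPI-3 sketch of this master-per-node pattern, using a shared-memory communicator to elect one network-facing rank per node (illustrative only; on Blue Gene/Q this logic lives inside PAMI and the shared-memory copy is explicit):

    /* Two-stage broadcast: masters broadcast over the network, then
     * each master broadcasts to its node-local peers. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int world_rank, node_rank, data = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* all ranks on the same node end up in node_comm */
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        MPI_Comm_rank(node_comm, &node_rank);

        /* the masters (node_rank == 0) form their own inter-node communicator */
        MPI_Comm master_comm;
        MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                       world_rank, &master_comm);

        if (node_rank == 0) {
            if (world_rank == 0) data = 42;
            MPI_Bcast(&data, 1, MPI_INT, 0, master_comm);   /* network stage: masters only */
            MPI_Comm_free(&master_comm);
        }
        MPI_Bcast(&data, 1, MPI_INT, 0, node_comm);         /* intra-node stage */

        printf("world rank %d has %d\n", world_rank, data);
        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }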


Rectangle algorithms

Target the torus network with efficient line broadcasts.

Two variants: the short-rectangle algorithm and the multi-color spanning-tree algorithm.


Short-rectangle Algorithm

Nodes perform line broadcasts along the A torus dimension and process all incoming packets; this is repeated for the B, C, D, and E dimensions.

Each node sends five messages; the scheme is optimized for short messages.

This algorithm is used when the class routes of the collective network are already allocated.

Used for barrier and allreduce.
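The dimension-by-dimension structure can be sketched portably by arranging the ranks in a 5D Cartesian grid and reducing along one dimension at a time (plain MPI standing in for the hardware line broadcasts):

    /* Phased allreduce: after reducing along each of the five dimensions
     * in turn, every rank holds the global sum. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int size, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* let MPI pick a 5D factorization of the job size (stand-in for A,B,C,D,E) */
        int dims[5] = {0, 0, 0, 0, 0}, periods[5] = {1, 1, 1, 1, 1};
        MPI_Dims_create(size, 5, dims);
        MPI_Comm torus;
        MPI_Cart_create(MPI_COMM_WORLD, 5, dims, periods, 0, &torus);

        double value = rank, line_sum;
        for (int d = 0; d < 5; d++) {
            int keep[5] = {0, 0, 0, 0, 0};
            keep[d] = 1;                     /* the line through this node along dim d */
            MPI_Comm line;
            MPI_Cart_sub(torus, keep, &line);
            MPI_Allreduce(&value, &line_sum, 1, MPI_DOUBLE, MPI_SUM, line);
            value = line_sum;                /* carry the partial result into the next phase */
            MPI_Comm_free(&line);
        }
        if (rank == 0) printf("global sum = %g\n", value);

        MPI_Comm_free(&torus);
        MPI_Finalize();
        return 0;
    }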


Multi-color rectangle Algorithm

Uses multiple edge-disjoint routes from the root to all nodes to perform collectives simultaneously; optimized for large messages.

Each route is represented by a color, and each route carries a portion of the large message.


Multi-color rectangle Algorithm

Data movement along each color is independent, so communication along the colors is concurrent.

The broadcast of data along the lines proceeds in phases. In the first phase, the root sends different portions of the buffer along different colors.

In the next phase, each receiving node broadcasts this data along one of its axes.

Different colors follow different routes to reach all the nodes.

Each node arranges the portions in order using the message headers.

Allreduce performs a reduction followed by a broadcast.

Used for broadcast and allreduce.
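A portable sketch of the color-splitting idea: the root's buffer is divided into chunks and each chunk is broadcast by its own overlapped operation (plain MPI_Ibcast standing in for the edge-disjoint torus routes):

    /* Split a large broadcast into COLORS independent, concurrent pieces. */
    #include <mpi.h>
    #include <stdlib.h>

    #define COLORS 3
    #define COUNT  (3 * 1024 * 1024)            /* total doubles, divisible by COLORS */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *buf = malloc(COUNT * sizeof *buf);
        if (rank == 0)
            for (int i = 0; i < COUNT; i++) buf[i] = i;      /* root fills the payload */

        MPI_Request req[COLORS];
        int chunk = COUNT / COLORS;
        for (int c = 0; c < COLORS; c++)        /* one in-flight broadcast per color */
            MPI_Ibcast(buf + c * chunk, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req[c]);
        MPI_Waitall(COLORS, req, MPI_STATUSES_IGNORE);       /* chunks land in place, in order */

        free(buf);
        MPI_Finalize();
        return 0;
    }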


Binomial algorithm

Uses torus point-to-point links.

Suitable for irregular communicators.

log2(N) complexity.

Used for barrier, broadcast, allreduce, and reduce.
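A sketch of a binomial-tree broadcast over point-to-point messages, with the root fixed at rank 0 for brevity (illustrative; MPICH's own implementation handles arbitrary roots and datatypes):

    /* In round k, every rank that already holds the data forwards it to the
     * partner 2^k away, so the broadcast completes in ceil(log2(N)) rounds. */
    #include <mpi.h>
    #include <stdio.h>

    static void binomial_bcast(int *buf, int count, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        for (int mask = 1; mask < size; mask <<= 1) {
            if (rank < mask) {
                int dest = rank + mask;                /* I already have the data */
                if (dest < size)
                    MPI_Send(buf, count, MPI_INT, dest, 0, comm);
            } else if (rank < 2 * mask) {
                int src = rank - mask;                 /* my turn to receive */
                MPI_Recv(buf, count, MPI_INT, src, 0, comm, MPI_STATUS_IGNORE);
            }
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, value = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) value = 123;
        binomial_bcast(&value, 1, MPI_COMM_WORLD);
        printf("rank %d: %d\n", rank, value);
        MPI_Finalize();
        return 0;
    }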


Collective Operations for Blue Gene

Barrier

Broadcast

Allreduce

All-to-all


MPI_Barrier

The algorithm for the barrier operation is selected on the basis of the communicator used.

For MPI_COMM_WORLD and rectangular sub-communicators, the collective network algorithm is used, which takes advantage of dedicated hardware.

When these are not available, the short-rectangle algorithm is used.

For non-rectangular communicators, the binomial algorithm gives better performance.

With multiple processes per node, a shared-memory algorithm performs the barrier within the node, and only then is the network barrier initiated.


Broadcast

For short broadcast messages, the collective network algorithm on MPI_COMM_WORLD and rectangular sub-communicators has the lowest latency.

For large broadcast messages, the multi-color algorithm on MPI_COMM_WORLD and rectangular sub-communicators performs better.

For non-rectangular communicators, a binomial recursive-doubling algorithm is used for short messages and a pipelined binary tree for large messages.

With multiple processes per node, the shared-memory algorithm is combined with the multi-color and collective network algorithms.
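A sketch of this selection logic in C; the predicate is_rectangular, the algorithm names, and the 8 KB cutoff are illustrative placeholders, not the actual MPICH/PAMI tuning values:

    #include <stdio.h>

    typedef enum { COLLECTIVE_NET, MULTI_COLOR_RECT, BINOMIAL, PIPELINED_BINTREE } bcast_algo_t;

    #define SHORT_MSG_BYTES 8192               /* hypothetical cutoff */

    /* is_rectangular: 1 for MPI_COMM_WORLD or a rectangular sub-communicator
     * mapped onto the torus, 0 otherwise (determined elsewhere). */
    bcast_algo_t select_bcast(int is_rectangular, long msg_bytes)
    {
        if (is_rectangular)
            return (msg_bytes <= SHORT_MSG_BYTES) ? COLLECTIVE_NET : MULTI_COLOR_RECT;
        return (msg_bytes <= SHORT_MSG_BYTES) ? BINOMIAL : PIPELINED_BINTREE;
    }

    int main(void)
    {
        printf("rectangular, 1 KB -> %d\n", select_bcast(1, 1024));      /* COLLECTIVE_NET */
        printf("irregular,   1 MB -> %d\n", select_bcast(0, 1 << 20));   /* PIPELINED_BINTREE */
        return 0;
    }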


MPI_Allreduce

For non-rectangular communicators, the binomial algorithm is used for small messages; for large messages, the default MPICH allreduce algorithm is used.

All three algorithms (collective network, rectangle, and binomial) are used for MPI_Allreduce, including a multi-color rectangle allreduce.


MPI_Alltoall

For MPI_COMM_WORLD and rectangular sub-communicators, hardware acceleration improves performance: messaging-unit descriptors are injected into the injection FIFOs.

Important for Fast Fourier Transform applications.

A random permutation, generated with a low-overhead random-number generator, injects the descriptors in random order to smooth out hot spots on the torus network (sketched below).

Zone routing results in high network utilization with dynamic routing.
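A sketch of the randomization idea at the MPI level: a manual all-to-all that posts its sends in a Fisher-Yates-shuffled destination order (illustrative; on Blue Gene/Q the permutation is applied to messaging-unit descriptors, not to MPI calls):

    /* Each rank sends one int to every rank, visiting destinations in a
     * random order so no single link becomes a hot spot. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *sendbuf = malloc(size * sizeof *sendbuf);
        int *recvbuf = malloc(size * sizeof *recvbuf);
        int *order   = malloc(size * sizeof *order);
        MPI_Request *req = malloc(2 * size * sizeof *req);

        for (int i = 0; i < size; i++) { sendbuf[i] = rank * 1000 + i; order[i] = i; }

        srand(rank + 1);                           /* low-overhead per-rank seed */
        for (int i = size - 1; i > 0; i--) {       /* Fisher-Yates shuffle of destinations */
            int j = rand() % (i + 1);
            int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }

        int nreq = 0;
        for (int i = 0; i < size; i++)             /* receives: natural order is fine */
            MPI_Irecv(&recvbuf[i], 1, MPI_INT, i, 0, MPI_COMM_WORLD, &req[nreq++]);
        for (int i = 0; i < size; i++) {           /* sends: randomized destination order */
            int dest = order[i];
            MPI_Isend(&sendbuf[dest], 1, MPI_INT, dest, 0, MPI_COMM_WORLD, &req[nreq++]);
        }
        MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);

        free(sendbuf); free(recvbuf); free(order); free(req);
        MPI_Finalize();
        return 0;
    }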


Results & Analysis

[Slides 34–43 of the original presentation: performance result plots for the collective operations.]

Conclusion

MPI calls can be optimized by using PAMI calls, various hardware features, and the algorithms that exploit them.

With this approach, close-to-peak performance can be achieved for MPI collective operations.

Future supercomputing will have to consider hardware/software co-design as well as optimization of the library calls that handle message passing.


References

All Blue Gene collective-algorithm results presented here are the work of IBM.

Optimization of MPI Collective Operations on the IBM Blue Gene/Q Supercomputer. Sameer Kumar, Amith Mamidala, Philip Heidelberger, Dong Chen, and Daniel Faraj.

PAMI: A Parallel Active Message Interface for the Blue Gene/Q Supercomputer. Sameer Kumar, Amith Mamidala, Brian Smith, Michael Blocksome, Bob Cernohous, Daniel Faraj, Douglas Miller, Jeff Parker, Joseph Ratterman, Philip Heidelberger, Dong Chen, and Burkhard Steinmacher-Burow.