Efficient Process Arrival Pattern Aware Collective Communication for HPC and Deep Learning

Authors

Mohammadalizadehbakhtevari, Pedram

Abstract

High-Performance Computing (HPC) is key to tackling computationally intensive problems such as Deep Learning (DL) and scientific applications. The Message Passing Interface (MPI) is the de facto parallel programming standard for communication in HPC systems. MPI collective communication operations involve all processes within a program-defined group and are used extensively in parallel applications. Unfortunately, most studies that attempt to improve the performance of collective operations assume that all processes begin the communication simultaneously. However, researchers have shown that imbalanced Process Arrival Patterns (PAPs) are ubiquitous in real environments. It is therefore important to propose new PAP-aware algorithms that improve the performance of collectives. This thesis presents a complete communication characterization of Horovod, one of the best-known distributed DL frameworks. We provide a thorough study of the PAP of MPI_Allreduce, the most important collective operation used in Horovod, and show that the arrival pattern of MPI processes is indeed imbalanced, especially for small messages. Furthermore, we present several proposals for improving MPI collective communication performance in the presence of imbalanced PAPs for different message sizes. We propose an intra-node, PAP-aware, shared-memory-aware MPI_Allreduce algorithm for small messages that dynamically chooses the leader process based on the arrival time of the processes at each invocation of the collective call. The evaluation results show that our design delivers up to 56% improvement over native algorithms under different imbalanced PAPs.
We also propose a PAP-aware algorithm for intra-node MPI_Reduce and MPI_Allreduce collectives with large messages that dynamically constructs the reduction schedule at each invocation of the collective call based on the arrival order of the processes, achieving up to 73% and 44% improvement over the state-of-the-art algorithms, respectively. Finally, we evaluate the performance of two state-of-the-art cluster-wide MPI_Allreduce algorithms and introduce a PAP-tolerant cluster-wide allreduce algorithm that, given its hierarchical nature, imposes fewer data dependencies among processes than flat algorithms. This algorithm delivers up to 58% improvement at the microbenchmark level and an average improvement of 10% for the Horovod DL application over native algorithms.
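
The idea of building a reduction schedule from the arrival order can be illustrated with a toy model. The sketch below is not the thesis's algorithm: it assumes a simple chain reduction in which each combine step has a fixed cost, and every name in it (`chain_finish_time`, `arrival_order`, `step_cost`, the sample arrival times) is hypothetical. It only compares a fixed rank-order schedule with one that combines contributions in the order the processes arrive.

```python
from typing import List, Sequence


def chain_finish_time(arrivals: Sequence[float],
                      order: Sequence[int],
                      step_cost: float) -> float:
    """Finish time of a chain reduction that combines ranks in `order`.

    Toy assumption: a combine step can start only once the previous step
    is done AND the next rank has arrived; each step takes `step_cost`.
    """
    t = arrivals[order[0]]
    for rank in order[1:]:
        t = max(t, arrivals[rank]) + step_cost
    return t


def arrival_order(arrivals: Sequence[float]) -> List[int]:
    """Ranks sorted by arrival time (earliest first)."""
    return sorted(range(len(arrivals)), key=lambda r: arrivals[r])


def chain_reduce(values: Sequence[int], order: Sequence[int]) -> int:
    """Sum the contributions in the given order (sum is order-independent)."""
    acc = values[order[0]]
    for rank in order[1:]:
        acc += values[rank]
    return acc


if __name__ == "__main__":
    arrivals = [0.0, 9.0, 1.0, 2.0]   # hypothetical: rank 1 arrives late
    values = [1, 2, 3, 4]             # one contribution per rank
    fixed = list(range(len(arrivals)))
    dynamic = arrival_order(arrivals)

    print(chain_finish_time(arrivals, fixed, 1.0))    # 12.0
    print(chain_finish_time(arrivals, dynamic, 1.0))  # 10.0
    print(chain_reduce(values, fixed) == chain_reduce(values, dynamic))  # True
```

In this model the fixed schedule stalls at its second step until the late rank arrives, while the arrival-order schedule finishes the other combines first and waits only at the end. Real implementations must also handle synchronization, memory layout, and non-commutative operations, which this sketch ignores.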

Keywords

MPI, Collective Communication, Deep Learning Frameworks, MPI_Allreduce, Process Arrival Pattern

Creative Commons license

Except where otherwise noted, this item's license is described as CC0 1.0 Universal