Novel Allreduce Algorithms for Distributed Deep Learning

dc.contributor.author: Kitor, Ben
dc.contributor.department: Electrical and Computer Engineering
dc.contributor.supervisor: Afsahi, Ahmad
dc.date.accessioned: 2023-04-29T16:36:29Z
dc.date.available: 2023-04-29T16:36:29Z
dc.degree.grantor: Queen's University at Kingston
dc.description.abstract: To solve the most computationally challenging problems, scientists and engineers rely on high-performance computing (HPC) clusters. These massive-scale systems comprise thousands of servers, each containing multiple GPUs and CPUs, tied together by a web of interconnect technologies such as PCIe, InfiniBand, and NVLink. As a result, HPC systems are extremely complex and, by extension, challenging to program. To reduce this burden, developers use the Message Passing Interface (MPI). This programming model abstracts the hardware and provides powerful tools, such as collective communications, for exchanging data between processes. HPC is typically associated with scientific computation, classic examples being molecular dynamics and cosmological simulations, but deep learning (DL) has recently become an application of interest. DL is a subset of machine learning (ML) that leverages massive models and datasets to achieve remarkable performance. DL has revolutionized several fields, including computer vision and natural language processing, and is likely to be a critical technology in the future. Because DL is inherently large-scale, it benefits significantly from running on HPC clusters. We investigate the state of the art in large-scale DL and identify Horovod, a widely used MPI-based data-parallelism library, which spends a large fraction of its runtime performing MPI_Allreduce, a collective communication. To address this, we propose two new allreduce methods, each leveraging different HPC techniques. First, we describe topology-aware rank reordering for allreduce and broadcast. This method takes the communication pattern of the specified collective algorithm and re-ranks processes so that the hardware is used optimally. Evaluation with micro-benchmarks demonstrates up to an 80% improvement, depending on message size and initial process-to-core mappings. The second proposed technique is a multi-node process-arrival-pattern (PAP) aware allreduce. PAP awareness allows the algorithm to introspect on which processes have arrived and can participate in the collective, and to optimize communication accordingly. To achieve this, we propose a novel remote-memory-access-based PAP distribution mechanism and an accompanying hierarchical chain allreduce. Micro-benchmark evaluation shows a 20% improvement over state-of-the-art collective libraries such as UCC and NCCL under process imbalance. Furthermore, performance improvements with Horovod demonstrate that data-parallel training can benefit greatly from PAP awareness.
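For background, the semantics of MPI_Allreduce, combining a buffer from every process and leaving the identical reduced result on all of them, can be sketched with a single-process simulation of the classic ring algorithm. This is an illustrative baseline only, not the topology-aware or PAP-aware algorithms proposed in the thesis:

```python
# Illustrative single-process simulation of the classic ring allreduce
# (a common baseline; NOT the reordered or PAP-aware algorithms this
# thesis proposes). Each simulated "rank" r holds buffers[r]; after the
# call, every rank holds the elementwise sum of all buffers.
def ring_allreduce(buffers):
    p = len(buffers)        # number of simulated ranks
    n = len(buffers[0])     # elements per rank
    assert n % p == 0, "buffer length must divide evenly into p chunks"
    chunk = n // p
    bufs = [list(b) for b in buffers]

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) mod p
    # to its right neighbour, which accumulates it. After p - 1 steps,
    # rank r holds chunk (r + 1) mod p fully reduced.
    for s in range(p - 1):
        sends = []          # snapshot first: all sends happen concurrently
        for r in range(p):
            c = (r - s) % p
            sends.append(((r + 1) % p, c, bufs[r][c * chunk:(c + 1) * chunk]))
        for dst, c, data in sends:
            for i, v in enumerate(data):
                bufs[dst][c * chunk + i] += v

    # Phase 2: allgather. The fully reduced chunks circulate around the
    # ring until every rank holds the complete result.
    for s in range(p - 1):
        sends = []
        for r in range(p):
            c = (r + 1 - s) % p
            sends.append(((r + 1) % p, c, bufs[r][c * chunk:(c + 1) * chunk]))
        for dst, c, data in sends:
            bufs[dst][c * chunk:(c + 1) * chunk] = data
    return bufs

# Example: 2 ranks; the elementwise sum lands on both.
print(ring_allreduce([[1, 2], [3, 4]]))  # → [[4, 6], [4, 6]]
```

The ring algorithm is bandwidth-optimal for large messages, which is one reason its fixed communication pattern makes both the process-to-core mapping (addressed by rank reordering) and skewed process arrival times (addressed by PAP awareness) significant for performance.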
dc.description.degree: M.A.Sc.
dc.identifier.uri: http://hdl.handle.net/1974/31584
dc.language.iso: eng
dc.relation.ispartofseries: Canadian theses
dc.subject: MPI
dc.subject: Collective Communications
dc.subject: Deep Learning
dc.title: Novel Allreduce Algorithms for Distributed Deep Learning
dc.type: thesis

Files

Original bundle

Name: Kitor_Benjamin_W_202304_MASc.pdf
Size: 966.54 KB
Format: Adobe Portable Document Format