Open Access System for Information Sharing

Fast Performance Prediction and Expansion of 3D Parallelism for Distributed DNN Training

Authors
윤유경
Date Issued
2024
Publisher
Pohang University of Science and Technology (POSTECH)
Abstract
Training large-scale DNN models requires parallel distributed training on hyper-scale systems. To make the best use of the numerous accelerators, it is essential to intelligently combine different parallelization schemes. However, as DNN models grow, the number of possible combinations of schemes becomes enormous, and consequently finding the optimal parallel plan becomes exceedingly expensive and practically infeasible. In this paper, I introduce a novel cost model, the Markovian Performance Estimator (MPE). It provides affordable estimates of the throughput of candidate parallel plans, enabling efficient and fast searches for the best plan even when resources are limited. Significantly, this work is the first to explain the expensive nature of searching for an optimal plan and to address it using intuitive performance estimations grounded in real device measurements. The experiments demonstrate the effectiveness of the MPE, showing that it accelerates the optimization process by up to 126x (36.4x on average) over the existing state-of-the-art baseline, Alpa. As future work, I also propose a new search space that combines 3D parallelism with offloading, supporting LLMs larger than 3D parallelism alone can accommodate while exploiting the low communication cost of offloading.
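To illustrate why the plan search is expensive, the sketch below enumerates the (data, tensor, pipeline) parallel degree combinations for a fixed device count and ranks them with a toy, made-up cost function. The function names and cost terms are hypothetical for illustration only; this is not the thesis's actual MPE, which calibrates its estimates from real device measurements.

```python
from itertools import product

def parallel_plans(num_devices):
    """All (dp, tp, pp) degree triples whose product uses every device."""
    return [
        (dp, tp, pp)
        for dp, tp, pp in product(range(1, num_devices + 1), repeat=3)
        if dp * tp * pp == num_devices
    ]

def toy_step_time(plan, compute=8.0, comm_per_degree=0.05):
    """Toy cost (hypothetical): per-device compute shrinks as the model is
    split across tp * pp, while communication overhead grows with each
    parallel degree. A real cost model would fit these terms to
    measurements on actual hardware."""
    dp, tp, pp = plan
    return compute / (tp * pp) + comm_per_degree * (dp + tp + pp)

plans = parallel_plans(8)            # 10 candidate plans for 8 devices
best = min(plans, key=toy_step_time)  # cheapest plan under the toy model
```

Even for 8 devices there are 10 valid degree triples; with thousands of accelerators and additional scheme choices per layer, the space explodes, which is why a cheap analytical estimator is needed instead of profiling every plan.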
URI
http://postech.dcollection.net/common/orgView/200000732943
https://oasis.postech.ac.kr/handle/2014.oak/123368
Article Type
Thesis
Files in This Item:
There are no files associated with this item.


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
