Fast Performance Prediction and Expansion of 3D Parallelism for Distributed DNN Training
- Title: Fast Performance Prediction and Expansion of 3D Parallelism for Distributed DNN Training
- Authors: 윤유경
- Date Issued: 2024
- Publisher: Pohang University of Science and Technology (POSTECH)
- Abstract: Training large-scale DNN models requires parallel distributed training on hyper-scale systems. To make the best use of the numerous accelerators, it is essential to intelligently combine different parallelization schemes. However, as DNN models grow, the number of possible scheme combinations becomes enormous, and finding the optimal parallel plan becomes prohibitively expensive and practically infeasible. In this paper, I introduce a novel cost model, the Markovian Performance Estimator (MPE). This model provides affordable estimates of the throughput of various parallel plans, enabling efficient and fast searches for the ideal parallel plan even when resources are limited. Notably, this work is the first to explain the expensive nature of searching for an optimal plan and to address it using intuitive performance estimations based on real-device evaluations. Experiments demonstrate the effectiveness of the MPE, showing that it accelerates the optimization process by up to 126x (36.4x on average) over the existing state-of-the-art baseline, Alpa. As future work, I also propose a new search space that combines 3D parallelism with offloading, supporting LLMs larger than 3D parallelism alone can accommodate while exploiting the low communication cost of offloading.
- URI: http://postech.dcollection.net/common/orgView/200000732943
- URI: https://oasis.postech.ac.kr/handle/2014.oak/123368
- Article Type: Thesis