Open Access System for Information Sharing


Thesis

Memory Access-Triggered Near-Data Processing for Accelerating DNN Training on GPUs

Authors
조현욱
Date Issued
2023
Publisher
Pohang University of Science and Technology (POSTECH)
Abstract
In training DNNs, memory-/communication-bound operations can account for a significant portion of runtime due to the limited off-chip bandwidth (BW) of GPUs. To address this challenge, I propose a novel memory access-triggered near-data processing (mtNDP) architecture. With mtNDP, normal memory accesses also serve as implicit NDP requests, enabling NDP without any changes to the core ISA/microarchitecture, core-side SW, or memory protocol and overcoming the practicality limitations of prior approaches. In addition, mtNDP enables on-the-fly NDP, in which the data already supplied in normal memory access packets for compute-bound operations is simultaneously used for NDP; thus, mtNDP can reduce memory traffic. Moreover, by overlapping NDP kernels with compute-bound kernels, memory BW underutilized by GPU cores can be used by mtNDP units to improve performance even without increasing total memory BW. The mtNDP units can be deployed at heterogeneous memory devices in a system. First, I deploy them near the GPU's memory controllers. With on-the-fly mtNDP, compute-bound kernels can be overlapped with memory-bound kernels, even if they have dependencies, to achieve significant speedups. Second, my NDP units can be deployed in memory expanders connected to multiple GPUs to create an NDP-enabled memory eXpander Network (NDPXNet). It can entirely offload gradient reduction and the optimizer in data-parallel training, achieving additional speedups while eliminating redundancy in memory usage and optimizer execution. To the best of my knowledge, this work is the first to 1) enable NDP without core HW/SW changes, 2) overlap the execution of dependent layers, and 3) offload both memory- and communication-bound operations from GPUs in DNN training. Through deep learning compiler support, NDP kernels can be generated automatically without any model code modification. The mtNDP can improve training throughput by up to 2.83× and reduce energy consumption by up to 41.4%.
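The on-the-fly NDP idea above can be illustrated with a toy Python model. This is a sketch under assumptions, not the thesis's actual hardware interface: `MtNDPMemory`, the element-count traffic counter, and the bias-add + ReLU op standing in for a memory-bound kernel are all hypothetical names chosen for illustration. The point it shows is that when a normal write from a compute-bound kernel also triggers the dependent memory-bound op near memory, the GPU never re-reads the intermediate result, so off-chip traffic drops.

```python
import numpy as np

class MtNDPMemory:
    """Toy model of memory-access-triggered NDP (illustrative only).

    Writes from a compute-bound kernel double as implicit NDP requests:
    a near-data unit applies a dependent memory-bound op (here,
    bias-add + ReLU) to the data as it arrives, so the GPU never
    re-reads the intermediate result for that op.
    """
    def __init__(self, ndp_op=None):
        self.store = {}
        self.ndp_op = ndp_op  # elementwise op fused at the memory side
        self.traffic = 0      # off-chip transfers, counted in elements

    def write(self, addr, data):
        self.traffic += data.size
        if self.ndp_op is not None:      # on-the-fly NDP: reuse the packet
            data = self.ndp_op(data)
        self.store[addr] = data

    def read(self, addr):
        data = self.store[addr]
        self.traffic += data.size
        return data

def baseline(x, w, b):
    """GPU does everything: write the matmul result, read it back,
    then run the memory-bound bias+ReLU kernel on the GPU."""
    mem = MtNDPMemory()
    mem.write("y", x @ w)                 # compute-bound kernel output
    y = mem.read("y")                     # extra read for the dependent op
    mem.write("z", np.maximum(y + b, 0))  # bias + ReLU on the GPU
    return mem.read("z"), mem.traffic

def with_mtndp(x, w, b):
    """The NDP unit applies bias + ReLU as the write packets arrive,
    so one write both stores the data and performs the dependent op."""
    mem = MtNDPMemory(ndp_op=lambda t: np.maximum(t + b, 0))
    mem.write("z", x @ w)
    return mem.read("z"), mem.traffic
```

In this toy model the baseline moves each intermediate element four times (write y, read y, write z, read z) while the mtNDP path moves it twice, halving the counted traffic for identical results; the real mechanism additionally overlaps the dependent kernels in time.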
URI
http://postech.dcollection.net/common/orgView/200000660623
https://oasis.postech.ac.kr/handle/2014.oak/118361
Article Type
Thesis
Files in This Item:
There are no files associated with this item.


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
