Open Access System for Information Sharing


Thesis

Memory Access-Triggered Near-Data Processing for Accelerating DNN Training on GPUs

Authors
조현욱
Date Issued
2023
Publisher
Pohang University of Science and Technology (POSTECH)
Abstract
In training DNNs, memory-/communication-bound operations can account for a significant portion of runtime due to the limited off-chip bandwidth (BW) of GPUs. To address this challenge, I propose a novel memory access-triggered near-data processing (mtNDP) architecture. With mtNDP, normal memory accesses also serve as implicit NDP requests, enabling NDP without any changes to the core ISA/microarchitecture, core-side SW, or memory protocol and overcoming the practicality limitations of prior approaches. In addition, mtNDP enables on-the-fly NDP, in which the data already supplied in normal memory access packets for compute-bound operations is simultaneously used for NDP; thus, mtNDP can reduce memory traffic. Moreover, by overlapping NDP kernels with compute-bound kernels, memory BW underutilized by GPU cores can be used by mtNDP units to improve performance even without increasing total memory BW. The mtNDP units can be deployed at heterogeneous memory devices in a system. First, I deploy them near the GPU's memory controllers. With on-the-fly mtNDP, compute-bound kernels can be overlapped with memory-bound kernels, even if they have dependencies, to achieve significant speedups. Second, my NDP units can be deployed in memory expanders connected to multiple GPUs to create an NDP-enabled memory eXpander Network (NDPXNet). It can entirely offload gradient reduction and the optimizer in data-parallel training, achieving additional speedups while eliminating redundancy in memory usage and optimizer execution. To the best of my knowledge, this work is the first to 1) enable NDP without core HW/SW changes, 2) overlap the execution of dependent layers, and 3) offload both memory- and communication-bound operations from GPUs in DNN training. Through deep learning compiler support, NDP kernels can be generated automatically without any model code modification. The mtNDP can improve training throughput by up to 2.83× and reduce energy consumption by up to 41.4%.
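The on-the-fly NDP idea above can be illustrated with a toy Python model. This is a sketch under assumptions, not the thesis's actual hardware interface: `MtNDPMemory`, the element-count traffic counter, and the bias-add + ReLU op standing in for a memory-bound kernel are all hypothetical names chosen for illustration. The point it shows is that when a normal write from a compute-bound kernel also triggers the dependent memory-bound op near memory, the GPU never re-reads the intermediate result, so off-chip traffic drops.

```python
import numpy as np

class MtNDPMemory:
    """Toy model of memory-access-triggered NDP (illustrative only).

    Writes from a compute-bound kernel double as implicit NDP requests:
    a near-data unit applies a dependent memory-bound op (here,
    bias-add + ReLU) to the data as it arrives, so the GPU never
    re-reads the intermediate result for that op.
    """
    def __init__(self, ndp_op=None):
        self.store = {}
        self.ndp_op = ndp_op  # elementwise op fused at the memory side
        self.traffic = 0      # off-chip transfers, counted in elements

    def write(self, addr, data):
        self.traffic += data.size
        if self.ndp_op is not None:      # on-the-fly NDP: reuse the packet
            data = self.ndp_op(data)
        self.store[addr] = data

    def read(self, addr):
        data = self.store[addr]
        self.traffic += data.size
        return data

def baseline(x, w, b):
    """GPU does everything: write the matmul result, read it back,
    then run the memory-bound bias+ReLU kernel on the GPU."""
    mem = MtNDPMemory()
    mem.write("y", x @ w)                 # compute-bound kernel output
    y = mem.read("y")                     # extra read for the dependent op
    mem.write("z", np.maximum(y + b, 0))  # bias + ReLU on the GPU
    return mem.read("z"), mem.traffic

def with_mtndp(x, w, b):
    """The NDP unit applies bias + ReLU as the write packets arrive,
    so one write both stores the data and performs the dependent op."""
    mem = MtNDPMemory(ndp_op=lambda t: np.maximum(t + b, 0))
    mem.write("z", x @ w)
    return mem.read("z"), mem.traffic
```

In this toy model the baseline moves each intermediate element four times (write y, read y, write z, read z) while the mtNDP path moves it twice, halving the counted traffic for identical results; the real mechanism additionally overlaps the dependent kernels in time.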
URI
http://postech.dcollection.net/common/orgView/200000660623
https://oasis.postech.ac.kr/handle/2014.oak/118361
Article Type
Thesis
Files in This Item:
There are no files associated with this item.


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
