Open Access System for Information Sharing

Department of Computer Science & Engineering (컴퓨터공학과) 4. Theses_Master

Thesis


Efficient Scheduling and Code Generation for DL Model Training on Near-Data Processing Memory

Title: Efficient Scheduling and Code Generation for DL Model Training on Near-Data Processing Memory

Authors: 박주언

Date Issued: 2023

Publisher: Pohang University of Science and Technology (포항공과대학교)

Abstract: As deep neural network training runs into the memory bottleneck, solutions such as near-data processing (NDP) and processing in memory are emerging. These solutions relieve the bottleneck by computing part of the workload near or in memory. Using an accelerator of this kind requires software support that decides which workloads to offload to it. The problem is that little attention has been paid to how to exploit the accelerator effectively. In this thesis, I first review prior approaches for offloading workloads to the accelerator. Then, I propose a method that exploits an NDP architecture by offloading partial workloads to the accelerator and executing them in parallel with the GPU. I also propose an end-to-end solution for supporting an NDP architecture, consisting of a frontend and a backend. I extend TensorFlow XLA version 2.4 to implement these components. I achieve up to 1.5x speedup on an LSTM sequence-to-sequence model.
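This thesis allocates resources for a near-data processing memory accelerator and applies hardware-specific optimizations to the allocated resources. It first revisits existing solutions, then implements them directly and conducts a case study. The case study shows that offloading based on arithmetic intensity has limitations, and that, to preserve GPU performance, offloading should take place after fusion optimization for the GPU has been applied. Building on this case study, the thesis proposes XLA-NDP, a compiler that covers workload offloading, workload parallelization, and code generation. The compiler first decides what to offload based on the efficiency gained by offloading. It then solves the problem of executing offloaded work in parallel with GPU work by reducing it to a 0-1 knapsack problem. Once offloading and parallelization are complete, the compiler automatically generates code for the offloaded work. In the first stage, it performs code generation based on code templates. It then applies memory optimization and value numbering optimization to the code assembled from the templates. Finally, it performs register allocation, which completes code generation for NDPX. XLA-NDP achieves a speedup of about 1.35x on average, and up to 1.5x on the LSTM and BERT models. The NDPX performance prediction model has an R² score of 0.94. Code optimization reduces the total code size by 40% and yields a speedup of up to 50%. As future work, I aim to extend the compiler into a solution that can use multiple GPUs and multiple NDPX devices.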
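The abstract states that parallel execution of offloaded and GPU work is reduced to a 0-1 knapsack problem, but it does not give the formulation. The following is a minimal sketch under assumed definitions, not the thesis's implementation: each candidate has an integer NDP execution cost and an estimated GPU time saved, and the capacity is the time window during which the GPU is busy with other work. The function name `select_offloads`, the operation names, and all numbers are hypothetical.

```python
from typing import List, Tuple

# (operation name, NDP execution time, GPU time saved if offloaded)
# All values are illustrative integers in the same arbitrary time unit.
Candidate = Tuple[str, int, int]


def select_offloads(candidates: List[Candidate], window: int) -> Tuple[int, List[str]]:
    """Classic 0-1 knapsack by dynamic programming.

    Picks a subset of candidates whose total NDP time fits in `window`
    (assumed here to be the overlapping GPU execution time) while
    maximizing the total GPU time saved.
    """
    # best[c] = (max time saved, chosen names) with at most c units of NDP time
    best: List[Tuple[int, List[str]]] = [(0, []) for _ in range(window + 1)]
    for name, cost, gain in candidates:
        # Iterate capacity downwards so each candidate is used at most once.
        for c in range(window, cost - 1, -1):
            new_gain = best[c - cost][0] + gain
            if new_gain > best[c][0]:
                best[c] = (new_gain, best[c - cost][1] + [name])
    return best[window]


if __name__ == "__main__":
    ops = [("embedding_grad", 3, 4), ("layernorm", 4, 5), ("softmax", 2, 3)]
    saved, chosen = select_offloads(ops, window=5)
    print(saved, chosen)
```

With a window of 5 units, the sketch selects `embedding_grad` and `softmax` (total cost 5, total saving 7) over `layernorm` alone (cost 4, saving 5). A real scheduler would also have to respect data dependencies between operations, which this sketch ignores.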

URI: http://postech.dcollection.net/common/orgView/200000662190
https://oasis.postech.ac.kr/handle/2014.oak/118287

Article Type: Thesis

Files in This Item: There are no files associated with this item.
