Open Access System for Information Sharing

Thesis
Full metadata record
Files in This Item:
There are no files associated with this item.
DC Field: Value (Language)
dc.contributor.author: 변영훈
dc.date.accessioned: 2024-08-23T16:32:55Z
dc.date.available: 2024-08-23T16:32:55Z
dc.date.issued: 2024
dc.identifier.other: OAK-2015-10628
dc.identifier.uri: http://postech.dcollection.net/common/orgView/200000805734 (ko_KR)
dc.identifier.uri: https://oasis.postech.ac.kr/handle/2014.oak/124018
dc.description: Doctor
dc.description.abstract: Algorithm-hardware co-design is becoming increasingly important as transformers, whose performance scales with model size, are deployed across diverse domains. The ever-increasing demand for complex neural network models in practical applications calls for memory-efficient solutions that maximize memory bandwidth utilization even for compressed models. In the first part of this thesis, we investigate the memory-interface overheads caused by the irregular data-access patterns that are prevalent in pruned DNN models. Building on state-of-the-art XOR-gate compression, we introduce a sparsity-aware memory interface architecture and a stacked XORNet solution, which together significantly reduce data imbalance and interface cost while sustaining high-speed inference of pruned DNNs. Our experimental results show that the proposed algorithm-hardware co-design boosts effective bandwidth at reasonable hardware cost. In the second part, we extend the investigation from fine-grained pruning to partially structured pruning, which drastically reduces local sparsity fluctuation. Although the earlier stacked XORNet method already reduced this fluctuation, the hardware overhead caused by XOR-gate compression errors remains hard to ignore. We therefore propose Patch-Limited XOR-gate compression, Partially-Structured Transformer pruning, and Bit-wise Patch Reduction techniques tailored to XOR-gate compression. These methods reduce the number of required patches, simplifying the decompressor architecture and minimizing correction effort. The resulting systems reduce the number of errors and normalize the error distribution, achieving 23% higher effective bandwidth than the previous state-of-the-art. Our research underscores the importance of memory-interface optimization for deploying pruned neural network models efficiently. Through comprehensive investigation and novel solutions, this thesis contributes cost-efficient, high-speed memory interface architectures that bridge the gap between advanced model compression techniques and hardware implementation. These findings have significant implications for future computing systems, enabling the seamless integration of complex neural networks in practical applications. (An illustrative sketch of XOR-gate compression follows this record.)
dc.language: eng
dc.publisher: 포항공과대학교 (Pohang University of Science and Technology)
dc.title: Towards Efficient Neural Network Inference with Model Compression
dc.type: Thesis
dc.contributor.college: 전자전기공학과 (Department of Electrical Engineering)
dc.date.degree: 2024-8
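
The abstract above refers to XOR-gate compression, in which a shorter compressed word is expanded by a fixed network of XOR gates back into the original (mostly-zero, pruned) weight bits, and any residual bit mismatches are corrected by a small number of "patches". The Python sketch below is only a toy illustration of that idea under stated assumptions: the random XOR network, the random-search encoder, and all sizes are illustrative choices, not the encoder, decompressor, or parameters used in the thesis.

import numpy as np

rng = np.random.default_rng(0)

# Toy pruned weight bitmask: 32 original bits, roughly 90% zeros.
n_orig, n_comp = 32, 16
original = (rng.random(n_orig) < 0.1).astype(np.uint8)

# Illustrative fixed XOR network: each original bit is reconstructed as the
# XOR (mod-2 sum) of a random subset of the compressed bits.
xor_net = (rng.random((n_orig, n_comp)) < 0.3).astype(np.uint8)

def decompress(comp_bits):
    # Each output bit is the parity of its selected compressed bits.
    return (xor_net @ comp_bits) % 2

# Toy "encoder": random search for the compressed word whose decompression
# best matches the original bits (a real encoder solves this analytically).
best, best_err = None, n_orig + 1
for _ in range(2000):
    cand = rng.integers(0, 2, n_comp, dtype=np.uint8)
    err = int(np.sum(decompress(cand) != original))
    if err < best_err:
        best, best_err = cand, err

# Remaining mismatches must be fixed by per-bit patches in hardware.
patches = np.flatnonzero(decompress(best) != original)
print(f"compressed {n_orig} bits -> {n_comp} bits + {len(patches)} patch(es)")

The point of the sketch is only that fewer residual mismatches mean fewer patches, and the patch count drives decompressor cost; reducing and balancing that patch count is what the Patch-Limited XOR-gate compression and Bit-wise Patch Reduction techniques described in the abstract target.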

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
