Open Access System for Information Sharing

Thesis
Full metadata record
Files in This Item:
There are no files associated with this item.
dc.contributor.author: 김성빈
dc.date.accessioned: 2023-08-31T16:31:05Z
dc.date.available: 2023-08-31T16:31:05Z
dc.date.issued: 2023
dc.identifier.other: OAK-2015-10044
dc.identifier.uri: http://postech.dcollection.net/common/orgView/200000660112 (ko_KR)
dc.identifier.uri: https://oasis.postech.ac.kr/handle/2014.oak/118241
dc.description: Master
dc.description.abstract: This thesis proposes methods for cross-modal generation, which targets generating an image or text from input sound. Because direct end-to-end training for cross-modal generation restricts learning expressive generative models due to the modality gap, I formulate the problems in a compositional, module-based manner. I approach each task by breaking it down into sub-tasks: first training a strong single-modal image or text generation model on a large-scale dataset, and then learning to link sound with the pre-trained generation model. With this approach, I address both challenging problems of cross-modal generation, sound-to-image and sound-to-text. First, I propose a method to visualize the visual semantics embedded in sound by learning the relationship between naturally co-occurring audio-visual signals in video. I train a conditional generative adversarial network to generate images from pre-trained visual features from the image encoder, and then enrich the audio features with visual knowledge by learning to align audio to the visual latent space in a self-supervised way. Furthermore, a highly correlated audio-visual pair selection method is incorporated to stabilize training. As a result, the proposed method synthesizes substantially better image quality from a large number of in-the-wild sound categories compared to prior sound-to-image generation methods. Next, I tackle the task of automated audio captioning, which aims to generate text descriptions from environmental sounds. I propose to leverage a pre-trained large-scale language model for text generation and keep it frozen to maintain its expressiveness. Then, I train an audio encoder to extract global and temporal features from the input audio. To bridge the modality gap between the audio features and the language model, I design mapping networks that translate audio features into continuous vectors that the language model can understand. The proposed method shows better generalization and expressiveness than prior methods in diverse experimental settings on benchmark datasets.
dc.language: eng
dc.publisher: 포항공과대학교 (Pohang University of Science and Technology)
dc.title: Learning to Generate Visual Scene and Text Description from Sound
dc.title.alternative: 소리에 내재된 시각 및 텍스트 정보 복원 방법에 관한 연구 (A Study on Methods for Recovering Visual and Textual Information Embedded in Sound)
dc.type: Thesis
dc.contributor.college: 전자전기공학과 (Department of Electrical Engineering)
dc.date.degree: 2023-2
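
The two methods summarized in the abstract lend themselves to short sketches. The first is a minimal sketch of the sound-to-image alignment step: an audio encoder is trained so that its embedding of an audio clip matches the frozen image encoder's embedding of a co-occurring video frame, after which a conditional generator pre-trained on those visual features can be driven by the aligned audio embedding. The encoder interfaces, dimensions, and the symmetric InfoNCE loss below are illustrative assumptions, not the thesis's exact design.

    # Sketch: align audio embeddings to a frozen visual latent space (assumed design).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AudioToVisualAligner(nn.Module):
        def __init__(self, audio_encoder, image_encoder, emb_dim=512, temperature=0.07):
            super().__init__()
            self.audio_encoder = audio_encoder          # trainable
            self.image_encoder = image_encoder.eval()   # pre-trained, kept frozen
            for p in self.image_encoder.parameters():
                p.requires_grad = False
            # Projects audio features into the visual latent space
            # (assumes both encoders output emb_dim-dimensional vectors).
            self.proj = nn.Linear(emb_dim, emb_dim)
            self.temperature = temperature

        def forward(self, audio, frames):
            # audio: batch of audio clips; frames: the frames they co-occur with in video
            a = F.normalize(self.proj(self.audio_encoder(audio)), dim=-1)
            with torch.no_grad():
                v = F.normalize(self.image_encoder(frames), dim=-1)
            logits = a @ v.t() / self.temperature       # (B, B) pairwise similarities
            targets = torch.arange(a.size(0), device=a.device)
            # Symmetric InfoNCE: each clip should match its own frame and vice versa.
            return 0.5 * (F.cross_entropy(logits, targets) +
                          F.cross_entropy(logits.t(), targets))

    # At inference, the aligned audio embedding would replace the image feature as
    # input to the pre-trained conditional generator (hypothetical interface), e.g.:
    #   image = generator(aligner.proj(aligner.audio_encoder(sound_clip)))

The second is a minimal sketch of the sound-to-text idea: a trainable mapping network translates audio features into continuous "prefix" vectors that a frozen language model consumes alongside its token embeddings. GPT-2 and all module names and sizes here are assumptions for illustration; the abstract states only that a pre-trained large-scale language model is kept frozen and that mapping networks bridge the modality gap.

    # Sketch: audio captioning with a frozen language model and a mapping network.
    import torch
    import torch.nn as nn
    from transformers import GPT2LMHeadModel

    class AudioPrefixCaptioner(nn.Module):
        def __init__(self, audio_dim=768, prefix_len=10, lm_name="gpt2"):
            super().__init__()
            self.lm = GPT2LMHeadModel.from_pretrained(lm_name)
            for p in self.lm.parameters():              # keep the language model frozen
                p.requires_grad = False
            d_lm = self.lm.config.n_embd
            self.prefix_len = prefix_len
            # Mapping network: translates an audio feature into prefix_len continuous
            # vectors living in the language model's embedding space.
            self.mapper = nn.Sequential(
                nn.Linear(audio_dim, d_lm * prefix_len),
                nn.Tanh(),
                nn.Linear(d_lm * prefix_len, d_lm * prefix_len),
            )

        def forward(self, audio_feat, caption_ids):
            # audio_feat: (B, audio_dim) global feature from an audio encoder (assumed given)
            # caption_ids: (B, T) tokenized ground-truth captions
            B = audio_feat.size(0)
            prefix = self.mapper(audio_feat).view(B, self.prefix_len, -1)
            tok_emb = self.lm.transformer.wte(caption_ids)      # caption token embeddings
            inputs_embeds = torch.cat([prefix, tok_emb], dim=1)
            # Mask out the prefix positions so no loss is computed on them.
            ignore = torch.full((B, self.prefix_len), -100, dtype=torch.long,
                                device=caption_ids.device)
            labels = torch.cat([ignore, caption_ids], dim=1)
            return self.lm(inputs_embeds=inputs_embeds, labels=labels).loss

In both sketches only the audio-side modules receive gradients, which mirrors the compositional strategy described in the abstract: the expensive single-modal generator or language model is trained beforehand and left untouched, and the cross-modal problem reduces to aligning or translating audio features into the space the frozen model already understands. The temporal audio features mentioned in the abstract would extend the second sketch with per-frame prefix vectors produced in the same way.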

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
