Open Access System for Information Sharing

Thesis
Full metadata record
Files in This Item:
There are no files associated with this item.
dc.contributor.author: 김성빈
dc.date.accessioned: 2023-08-31T16:31:05Z
dc.date.available: 2023-08-31T16:31:05Z
dc.date.issued: 2023
dc.identifier.other: OAK-2015-10044
dc.identifier.uri: http://postech.dcollection.net/common/orgView/200000660112 (ko_KR)
dc.identifier.uri: https://oasis.postech.ac.kr/handle/2014.oak/118241
dc.description: Master
dc.description.abstract: This thesis proposes methods for cross-modal generation, which targets generating an image or text from input sound. Because direct end-to-end training for cross-modal generation restricts learning expressive generative models due to the modality gap, I formulate the problems in a compositional, module-based manner. I approach each task by breaking it down into sub-tasks: first training a strong single-modal image or text generation model on a large-scale dataset, and then learning to link sound with the pre-trained generation model. With this approach, I address both challenging problems of cross-modal generation, sound-to-image and sound-to-text. First, I propose a method to visualize the visual semantics embedded in sound by learning the relationship between naturally co-occurring audio-visual signals in video. I train a conditional generative adversarial network to generate images from pre-trained visual features from the image encoder, and then enrich the audio features with visual knowledge by learning to align audio to the visual latent space in a self-supervised way. Furthermore, a highly correlated audio-visual pair selection method is incorporated to stabilize training. As a result, the proposed method synthesizes substantially better image quality from a large number of in-the-wild sound categories compared to prior sound-to-image generation methods. Next, I tackle the task of automated audio captioning, which aims to generate text descriptions from environmental sounds. I propose to leverage a pre-trained large-scale language model for text generation and keep it frozen to maintain its expressiveness. Then, I train an audio encoder to extract global and temporal features from the input audio. To bridge the modality gap between the audio features and the language model, I design mapping networks that translate audio features into continuous vectors that the language model can understand. The proposed method shows better generalization and expressiveness than prior methods in diverse experimental settings on benchmark datasets.
dc.language: eng
dc.publisher: 포항공과대학교 (Pohang University of Science and Technology)
dc.title: Learning to Generate Visual Scene and Text Description from Sound
dc.title.alternative: 소리에 내재된 시각 및 텍스트 정보 복원 방법에 관한 연구 (A Study on Methods for Recovering Visual and Textual Information Embedded in Sound)
dc.type: Thesis
dc.contributor.college: 전자전기공학과 (Department of Electrical Engineering)
dc.date.degree: 2023-2
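
The two methods summarized in the abstract lend themselves to short sketches. The first is a minimal sketch of the sound-to-image alignment step: an audio encoder is trained so that its embedding of an audio clip matches the frozen image encoder's embedding of a co-occurring video frame, after which a conditional generator pre-trained on those visual features can be driven by the aligned audio embedding. The encoder interfaces, dimensions, and the symmetric InfoNCE loss below are illustrative assumptions, not the thesis's exact design.

    # Sketch: align audio embeddings to a frozen visual latent space (assumed design).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AudioToVisualAligner(nn.Module):
        def __init__(self, audio_encoder, image_encoder, emb_dim=512, temperature=0.07):
            super().__init__()
            self.audio_encoder = audio_encoder          # trainable
            self.image_encoder = image_encoder.eval()   # pre-trained, kept frozen
            for p in self.image_encoder.parameters():
                p.requires_grad = False
            # Projects audio features into the visual latent space
            # (assumes both encoders output emb_dim-dimensional vectors).
            self.proj = nn.Linear(emb_dim, emb_dim)
            self.temperature = temperature

        def forward(self, audio, frames):
            # audio: batch of audio clips; frames: the frames they co-occur with in video
            a = F.normalize(self.proj(self.audio_encoder(audio)), dim=-1)
            with torch.no_grad():
                v = F.normalize(self.image_encoder(frames), dim=-1)
            logits = a @ v.t() / self.temperature       # (B, B) pairwise similarities
            targets = torch.arange(a.size(0), device=a.device)
            # Symmetric InfoNCE: each clip should match its own frame and vice versa.
            return 0.5 * (F.cross_entropy(logits, targets) +
                          F.cross_entropy(logits.t(), targets))

    # At inference, the aligned audio embedding would replace the image feature as
    # input to the pre-trained conditional generator (hypothetical interface), e.g.:
    #   image = generator(aligner.proj(aligner.audio_encoder(sound_clip)))

The second is a minimal sketch of the sound-to-text idea: a trainable mapping network translates audio features into continuous "prefix" vectors that a frozen language model consumes alongside its token embeddings. GPT-2 and all module names and sizes here are assumptions for illustration; the abstract states only that a pre-trained large-scale language model is kept frozen and that mapping networks bridge the modality gap.

    # Sketch: audio captioning with a frozen language model and a mapping network.
    import torch
    import torch.nn as nn
    from transformers import GPT2LMHeadModel

    class AudioPrefixCaptioner(nn.Module):
        def __init__(self, audio_dim=768, prefix_len=10, lm_name="gpt2"):
            super().__init__()
            self.lm = GPT2LMHeadModel.from_pretrained(lm_name)
            for p in self.lm.parameters():              # keep the language model frozen
                p.requires_grad = False
            d_lm = self.lm.config.n_embd
            self.prefix_len = prefix_len
            # Mapping network: translates an audio feature into prefix_len continuous
            # vectors living in the language model's embedding space.
            self.mapper = nn.Sequential(
                nn.Linear(audio_dim, d_lm * prefix_len),
                nn.Tanh(),
                nn.Linear(d_lm * prefix_len, d_lm * prefix_len),
            )

        def forward(self, audio_feat, caption_ids):
            # audio_feat: (B, audio_dim) global feature from an audio encoder (assumed given)
            # caption_ids: (B, T) tokenized ground-truth captions
            B = audio_feat.size(0)
            prefix = self.mapper(audio_feat).view(B, self.prefix_len, -1)
            tok_emb = self.lm.transformer.wte(caption_ids)      # caption token embeddings
            inputs_embeds = torch.cat([prefix, tok_emb], dim=1)
            # Mask out the prefix positions so no loss is computed on them.
            ignore = torch.full((B, self.prefix_len), -100, dtype=torch.long,
                                device=caption_ids.device)
            labels = torch.cat([ignore, caption_ids], dim=1)
            return self.lm(inputs_embeds=inputs_embeds, labels=labels).loss

In both sketches only the audio-side modules receive gradients, which mirrors the compositional strategy described in the abstract: the expensive single-modal generator or language model is trained beforehand and left untouched, and the cross-modal problem reduces to aligning or translating audio features into the space the frozen model already understands. The temporal audio features mentioned in the abstract would extend the second sketch with per-frame prefix vectors produced in the same way.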

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
