Open Access System for Information Sharing

Thesis

Learning to Generate Visual Scene and Text Description from Sound

Authors
김성빈 (Sung-Bin Kim)
Date Issued
2023
Publisher
포항공과대학교 (Pohang University of Science and Technology)
Abstract
This thesis proposes methods for cross-modal generation, which targets generating an image or a text description from input sound. Because direct end-to-end training for cross-modal generation restricts learning expressive generative models due to the modality gap, I formulate the problems in a compositional, module-based manner: I break each task into sub-tasks, first training a strong single-modal image or text generation model on a large-scale dataset and then learning to link sound to the pre-trained generation model. With this approach, I address both challenging cross-modal generation problems, sound-to-image and sound-to-text.

First, I propose a method to visualize the visual semantics embedded in sound by learning the relationship between naturally co-occurring audio-visual signals in video. I train a conditional generative adversarial network to generate images from pre-trained visual features produced by an image encoder, and then enrich the audio features with visual knowledge by learning to align audio to the visual latent space in a self-supervised way. Furthermore, a highly correlated audio-visual pair selection method is incorporated to stabilize training. As a result, the proposed method synthesizes images of substantially better quality across a large number of in-the-wild sound categories than prior sound-to-image generation methods.

Next, I tackle automated audio captioning, which aims to generate text descriptions from environmental sounds. I propose to leverage a pre-trained large-scale language model for text generation and keep it frozen to preserve its expressiveness. I then train an audio encoder to extract global and temporal features from the input audio. To bridge the modality gap between the audio features and the language model, I design mapping networks that translate the audio features into continuous vectors the language model can understand. The proposed method shows better generalization and expressiveness than prior methods in diverse experimental settings on benchmark datasets.
URI
http://postech.dcollection.net/common/orgView/200000660112
https://oasis.postech.ac.kr/handle/2014.oak/118241
Article Type
Thesis
Files in This Item:
There are no files associated with this item.
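
The sound-to-image pipeline described in the abstract above hinges on aligning a trainable audio encoder to the latent space of a frozen, pre-trained image encoder using naturally co-occurring audio-visual pairs from video. As a minimal illustrative sketch only, not the thesis implementation, a self-supervised contrastive alignment objective in PyTorch might look like the following; the encoder outputs, batch pairing, and temperature value are assumptions:

```python
import torch
import torch.nn.functional as F

def audio_visual_alignment_loss(audio_emb, visual_emb, temperature=0.07):
    """Contrastive (InfoNCE-style) alignment of audio features to a frozen
    visual latent space. Each audio clip is pulled toward the visual embedding
    of its own video frame and pushed away from the other frames in the batch.

    audio_emb:  (B, D) output of the trainable audio encoder
    visual_emb: (B, D) output of the frozen, pre-trained image encoder
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # matching pairs lie on the diagonal
    # Symmetric cross-entropy over the audio-to-visual and visual-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

The highly correlated pair selection mentioned in the abstract could, for instance, filter a batch to pairs whose audio-visual similarity exceeds a threshold before computing such a loss; the record does not spell out that detail.

For the audio captioning part, the abstract describes a frozen pre-trained language model plus mapping networks that turn audio features into continuous vectors the language model can consume. Below is a minimal sketch of that prefix-style bridging idea, assuming GPT-2 from Hugging Face transformers as a stand-in language model and a single pooled audio feature; the mapper architecture, prefix length, and dimensions are illustrative assumptions, not the thesis's actual design:

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class AudioPrefixMapper(nn.Module):
    """Maps a pooled audio feature to a short sequence of 'prefix' embeddings
    living in the language model's input embedding space."""
    def __init__(self, audio_dim: int, lm_dim: int, prefix_len: int = 10):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        self.net = nn.Sequential(
            nn.Linear(audio_dim, lm_dim * prefix_len),
            nn.Tanh(),
            nn.Linear(lm_dim * prefix_len, lm_dim * prefix_len),
        )

    def forward(self, audio_feat):                       # (B, audio_dim)
        prefix = self.net(audio_feat)                    # (B, prefix_len * lm_dim)
        return prefix.view(-1, self.prefix_len, self.lm_dim)

lm = GPT2LMHeadModel.from_pretrained("gpt2")
for p in lm.parameters():                                # keep the language model frozen
    p.requires_grad = False

mapper = AudioPrefixMapper(audio_dim=512, lm_dim=lm.config.n_embd)

def captioning_loss(audio_feat, caption_ids):
    """audio_feat: (B, 512) pooled output of a trainable audio encoder.
    caption_ids: (B, T) tokenized ground-truth caption."""
    prefix = mapper(audio_feat)                          # (B, k, d)
    token_emb = lm.transformer.wte(caption_ids)          # (B, T, d)
    inputs_embeds = torch.cat([prefix, token_emb], dim=1)
    # Supervise only the caption tokens; -100 masks out the prefix positions.
    ignore = torch.full(prefix.shape[:2], -100, dtype=torch.long, device=caption_ids.device)
    labels = torch.cat([ignore, caption_ids], dim=1)
    return lm(inputs_embeds=inputs_embeds, labels=labels).loss
```

In this sketch only the mapper (and the audio encoder feeding it) receives gradients; keeping the language model frozen is what preserves its expressiveness, as the abstract notes.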

