Open Access System for Information Sharing

Thesis

Learning to Generate Visual Scene and Text Description from Sound

Authors
김성빈 (Sung-Bin Kim)
Date Issued
2023
Publisher
포항공과대학교 (Pohang University of Science and Technology)
Abstract
This thesis proposes methods for cross-modal generation, which targets generating an image or a text description from input sound. Because direct end-to-end training for cross-modal generation restricts learning expressive generative models due to the modality gap, I formulate the problems in a compositional, module-based manner: I break each task into sub-tasks, first training a strong single-modal image or text generation model on a large-scale dataset and then learning to link sound to the pre-trained generation model. With this approach, I address both challenging cross-modal generation problems, sound-to-image and sound-to-text.

First, I propose a method to visualize the visual semantics embedded in sound by learning the relationship between naturally co-occurring audio-visual signals in video. I train a conditional generative adversarial network to generate images from pre-trained visual features produced by an image encoder, and then enrich the audio features with visual knowledge by learning to align audio to the visual latent space in a self-supervised way. Furthermore, a highly correlated audio-visual pair selection method is incorporated to stabilize training. As a result, the proposed method synthesizes images of substantially better quality across a large number of in-the-wild sound categories than prior sound-to-image generation methods.

Next, I tackle automated audio captioning, which aims to generate text descriptions from environmental sounds. I propose to leverage a pre-trained large-scale language model for text generation and keep it frozen to preserve its expressiveness. I then train an audio encoder to extract global and temporal features from the input audio. To bridge the modality gap between the audio features and the language model, I design mapping networks that translate the audio features into continuous vectors the language model can understand. The proposed method shows better generalization and expressiveness than prior methods in diverse experimental settings on benchmark datasets.
URI
http://postech.dcollection.net/common/orgView/200000660112
https://oasis.postech.ac.kr/handle/2014.oak/118241
Article Type
Thesis
Files in This Item:
There are no files associated with this item.
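
The sound-to-image pipeline described in the abstract above hinges on aligning a trainable audio encoder to the latent space of a frozen, pre-trained image encoder using naturally co-occurring audio-visual pairs from video. As a minimal illustrative sketch only, not the thesis implementation, a self-supervised contrastive alignment objective in PyTorch might look like the following; the encoder outputs, batch pairing, and temperature value are assumptions:

```python
import torch
import torch.nn.functional as F

def audio_visual_alignment_loss(audio_emb, visual_emb, temperature=0.07):
    """Contrastive (InfoNCE-style) alignment of audio features to a frozen
    visual latent space. Each audio clip is pulled toward the visual embedding
    of its own video frame and pushed away from the other frames in the batch.

    audio_emb:  (B, D) output of the trainable audio encoder
    visual_emb: (B, D) output of the frozen, pre-trained image encoder
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # matching pairs lie on the diagonal
    # Symmetric cross-entropy over the audio-to-visual and visual-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

The highly correlated pair selection mentioned in the abstract could, for instance, filter a batch to pairs whose audio-visual similarity exceeds a threshold before computing such a loss; the record does not spell out that detail.

For the audio captioning part, the abstract describes a frozen pre-trained language model plus mapping networks that turn audio features into continuous vectors the language model can consume. Below is a minimal sketch of that prefix-style bridging idea, assuming GPT-2 from Hugging Face transformers as a stand-in language model and a single pooled audio feature; the mapper architecture, prefix length, and dimensions are illustrative assumptions, not the thesis's actual design:

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class AudioPrefixMapper(nn.Module):
    """Maps a pooled audio feature to a short sequence of 'prefix' embeddings
    living in the language model's input embedding space."""
    def __init__(self, audio_dim: int, lm_dim: int, prefix_len: int = 10):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        self.net = nn.Sequential(
            nn.Linear(audio_dim, lm_dim * prefix_len),
            nn.Tanh(),
            nn.Linear(lm_dim * prefix_len, lm_dim * prefix_len),
        )

    def forward(self, audio_feat):                       # (B, audio_dim)
        prefix = self.net(audio_feat)                    # (B, prefix_len * lm_dim)
        return prefix.view(-1, self.prefix_len, self.lm_dim)

lm = GPT2LMHeadModel.from_pretrained("gpt2")
for p in lm.parameters():                                # keep the language model frozen
    p.requires_grad = False

mapper = AudioPrefixMapper(audio_dim=512, lm_dim=lm.config.n_embd)

def captioning_loss(audio_feat, caption_ids):
    """audio_feat: (B, 512) pooled output of a trainable audio encoder.
    caption_ids: (B, T) tokenized ground-truth caption."""
    prefix = mapper(audio_feat)                          # (B, k, d)
    token_emb = lm.transformer.wte(caption_ids)          # (B, T, d)
    inputs_embeds = torch.cat([prefix, token_emb], dim=1)
    # Supervise only the caption tokens; -100 masks out the prefix positions.
    ignore = torch.full(prefix.shape[:2], -100, dtype=torch.long, device=caption_ids.device)
    labels = torch.cat([ignore, caption_ids], dim=1)
    return lm(inputs_embeds=inputs_embeds, labels=labels).loss
```

In this sketch only the mapper (and the audio encoder feeding it) receives gradients; keeping the language model frozen is what preserves its expressiveness, as the abstract notes.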

