Open Access System for Information Sharing

Thesis

Cited 0 time in webofscience

Cited 0 time in scopus

Metadata Downloads

Extending CLIP’s Image-Text Alignment to Referring Image Segmentation

Title: Extending CLIP’s Image-Text Alignment to Referring Image Segmentation

Abstract: Referring Image Segmentation (RIS) is a cross-modal task that aims to segment an instance described by a natural language expression. Recent methods leverage large-scale pretrained unimodal models as backbones along with fusion techniques for joint reasoning across modalities. However, the inherent cross-modal nature of RIS raises questions about the effectiveness of unimodal backbones. We propose RISCLIP, a novel framework that effectively leverages the cross-modal nature of CLIP for RIS. Observing CLIP's inherent alignment between image and text features, we capitalize on this starting point and introduce simple but strong cross-modal alignment modules. RISCLIP exhibits outstanding results on all three major RIS benchmarks and also outperforms previous CLIP-based methods, demonstrating the efficacy of our strategy in extending CLIP's image-text alignment to RIS.

URI: http://postech.dcollection.net/common/orgView/200000805555
https://oasis.postech.ac.kr/handle/2014.oak/123981

qr_code