CLIPCAM: A Simple Baseline For Zero-Shot Text-Guided Object And Action Localization

  • Hsuan-An Hsia
  • Che-Hsien Lin
  • Bo-Han Kung
  • Jhao-Ting Chen
  • D.S. Tan
  • Jun-Cheng Chen
  • Kai-Lung Hua

    Research output: Chapter in Book/Report/Conference proceeding; Conference Article in proceeding; Academic; peer-review

    Abstract

    Contemporary deep learning-based object and action localization algorithms depend on large-scale annotated data. In real-world scenarios, however, there is an effectively unlimited amount of unlabeled data beyond the categories of publicly available datasets, so annotating all of it costs considerable time and manpower, and training the detectors requires substantial computational resources. To address these issues, we present a simple and reliable baseline for zero-shot text-guided object and action localization that is easy to obtain and introduces no additional training cost. The baseline applies Grad-CAM, a widely used class visual saliency map generator, to the recently released Contrastive Language-Image Pre-Training (CLIP) model by OpenAI, which is trained contrastively on a dataset of 400 million image-sentence pairs carrying rich cross-modal information between text semantics and image appearance. Extensive experiments on the Open Images and HICO-DET datasets demonstrate the effectiveness of the proposed approach for text-guided localization of unseen objects and actions in images.
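The Grad-CAM step named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that per-channel activations from some layer of an image encoder, and the gradients of an image-text similarity score with respect to those activations, have already been computed elsewhere (for example from CLIP). The function names `grad_cam` and `cam_to_box`, the 0.5 threshold, and the toy arrays are hypothetical, and turning the saliency map into a box by thresholding is one simple option rather than the procedure used in the paper.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Compute a Grad-CAM saliency map.

    activations, gradients: arrays of shape (C, H, W). The gradients are
    assumed to be d(score)/d(activations), where the score would be an
    image-text similarity in the text-guided setting.
    """
    # Channel weights: global average of the gradients over space.
    weights = gradients.mean(axis=(1, 2))              # shape (C,)
    # Weighted sum of the activation maps over channels.
    cam = np.tensordot(weights, activations, axes=1)   # shape (H, W)
    # Keep only positive evidence for the score.
    cam = np.maximum(cam, 0.0)
    # Normalize to [0, 1] when the map is not all zeros.
    peak = cam.max()
    if peak > 0:
        cam = cam / peak
    return cam


def cam_to_box(cam, threshold=0.5):
    """Tightest box (x_min, y_min, x_max, y_max) around cells >= threshold.

    Hypothetical post-processing step; returns None if no cell qualifies.
    """
    ys, xs = np.nonzero(cam >= threshold)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


# Toy input: channel 0 fires on a 2x2 block and supports the score,
# channel 1 fires on one corner cell and opposes the score.
activations = np.zeros((2, 4, 4))
activations[0, 1:3, 1:3] = 1.0
activations[1, 0, 0] = 1.0
gradients = np.stack([np.ones((4, 4)), -np.ones((4, 4))])

cam = grad_cam(activations, gradients)
box = cam_to_box(cam)
```

In this toy case the map is nonzero only on the 2x2 block, because the corner cell receives a negative weight and is clipped by the ReLU. With a real model, the two input arrays would come from a forward and backward pass through the image encoder for a given text prompt.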
    Original language: English
    Title of host publication: 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings
    Publisher: IEEE
    Pages: 4453-4457
    Number of pages: 5
    ISBN (Electronic): 9781665405409
    ISBN (Print): 9781665405416
    Publication status: Published - May 2022
    Event: IEEE International Conference on Acoustics, Speech and Signal Processing - Singapore, Singapore
    Duration: 23 May 2022 to 27 May 2022
    https://ieeexplore.ieee.org/xpl/conhome/9745891/proceeding

    Conference

    Conference: IEEE International Conference on Acoustics, Speech and Signal Processing
    Abbreviated title: ICASSP 2022
    Country/Territory: Singapore
    City: Singapore
    Period: 23/05/22 to 27/05/22

    Keywords

    • CAM
    • CLIP
    • localization
    • text-guided
    • zero-shot
