Abstract
Contemporary deep learning-based object and action localization algorithms depend on large-scale annotated data. In real-world scenarios, however, the amount of unlabeled data beyond the categories of publicly available datasets is effectively unbounded, so annotating all of it is time- and labor-intensive, and training detectors on it demands substantial computational resources. To address these issues, we present a simple and reliable baseline that can be obtained easily and applied directly to zero-shot text-guided object and action localization without any additional training cost. It combines Grad-CAM, the widely used generator of class-specific visual saliency maps, with the recently released Contrastive Language-Image Pre-Training (CLIP) model from OpenAI, which was trained contrastively on a dataset of 400 million image-sentence pairs and therefore captures rich cross-modal information between text semantics and image appearance. Extensive experiments on the Open Images and HICO-DET datasets demonstrate the effectiveness of the proposed approach for text-guided localization of unseen objects and actions in images.
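The core mechanism the abstract describes is applying Grad-CAM to the gradient of a CLIP image-text similarity score with respect to the image encoder's feature maps. The following is a minimal sketch of the Grad-CAM aggregation step only; loading CLIP and back-propagating the similarity score are assumed to have happened upstream, so synthetic activation and gradient tensors stand in for real ones here.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one convolutional layer.

    activations: (K, H, W) feature maps from the image encoder.
    gradients:   (K, H, W) gradient of the image-text similarity
                 score with respect to those feature maps.
    """
    # alpha_k: global-average-pool each channel's gradient.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of feature maps: sum_k alpha_k * A_k -> (H, W).
    cam = np.tensordot(weights, activations, axes=1)
    # ReLU keeps only regions that positively support the text query.
    cam = np.maximum(cam, 0.0)
    if cam.max() > 0:
        cam = cam / cam.max()  # normalize to [0, 1] for visualization
    return cam

# Stand-in tensors (a real pipeline would take these from CLIP).
rng = np.random.default_rng(0)
acts = rng.random((8, 7, 7))
grads = rng.standard_normal((8, 7, 7))
heatmap = grad_cam(acts, grads)
```

Upsampling `heatmap` to the input resolution and thresholding it yields the localization box; the zero-shot property comes entirely from CLIP scoring arbitrary text queries, not from the Grad-CAM step itself.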
| Original language | English |
|---|---|
| Title of host publication | 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings |
| Publisher | IEEE |
| Pages | 4453-4457 |
| Number of pages | 5 |
| ISBN (Electronic) | 9781665405409 |
| ISBN (Print) | 9781665405416 |
| DOIs | |
| Publication status | Published - May 2022 |
| Event | IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, Singapore. Duration: 23 May 2022 → 27 May 2022. https://ieeexplore.ieee.org/xpl/conhome/9745891/proceeding |
Conference
| Conference | IEEE International Conference on Acoustics, Speech and Signal Processing |
|---|---|
| Abbreviated title | ICASSP 2022 |
| Country/Territory | Singapore |
| City | Singapore |
| Period | 23/05/22 → 27/05/22 |
| Internet address | https://ieeexplore.ieee.org/xpl/conhome/9745891/proceeding |
Keywords
- CAM
- CLIP
- localization
- text-guided
- zero-shot
Fingerprint
Dive into the research topics of 'CLIPCAM: A Simple Baseline For Zero-Shot Text-Guided Object And Action Localization'. Together they form a unique fingerprint.