Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
Abstract
Vision-language foundation models have shown remarkable performance in various zero-shot settings such as image retrieval, classification, or captioning. But so far, those models seem to fall behind when it comes to zero-shot localization of referential expressions and objects in images. As a result, they need to be fine-tuned for this task. In this paper, we show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning. To leverage those capabilities, we propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery [17] to a self-self attention path. We show that the concept of self-self attention corresponds to clustering, thus enforcing groups of tokens arising from the same object to be similar while preserving the alignment with the language space. To further guide the group formation, we propose a set of regularizations that allows the model to generalize across datasets and backbones. We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation. GEM not only outperforms other training-free open-vocabulary localization methods, but also achieves state-of-the-art results on the recently proposed OpenImagesV7 large-scale segmentation benchmark. Code is available at https://github.com/WalBouss/GEM.
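To illustrate the self-self attention idea mentioned above, the snippet below gives a minimal sketch, not the exact GEM implementation: it assumes access to a single transformer block's projection weights (here named w_q, w_k, w_v) and simply averages the three p-p attention paths (query-query, key-key, value-value) applied to the block's values.

```python
import torch
import torch.nn.functional as F

def self_self_attention(x, w_q, w_k, w_v, num_heads=8):
    """Hypothetical sketch of a self-self attention path.

    Each projection p in {q, k, v} attends to itself (p-p attention),
    and the resulting attention maps are applied to the values and
    averaged. This generalizes value-value attention to all three
    projections; details such as normalization, temperature, and
    iteration are omitted here.
    """
    B, N, D = x.shape
    head_dim = D // num_heads
    scale = head_dim ** -0.5

    # Values are shared across the three self-self paths.
    v = (x @ w_v).reshape(B, N, num_heads, head_dim).transpose(1, 2)

    outputs = []
    for w in (w_q, w_k, w_v):
        # Project tokens and split into heads: (B, num_heads, N, head_dim).
        p = (x @ w).reshape(B, N, num_heads, head_dim).transpose(1, 2)
        # p-p attention: tokens from the same object tend to cluster together.
        attn = F.softmax(p @ p.transpose(-2, -1) * scale, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        outputs.append(out)

    # Aggregate the three paths.
    return torch.stack(outputs).mean(dim=0)


# Example usage with ViT-B/16-like dimensions (random, hypothetical weights).
x = torch.randn(1, 197, 768)
w_q, w_k, w_v = (torch.randn(768, 768) * 0.02 for _ in range(3))
out = self_self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([1, 197, 768])
```

Because every path reuses the same projection on both sides of the attention product, the operation behaves like a soft clustering step over tokens while the output stays in the value space, which is what keeps the alignment with the language encoder intact.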