Grounding Everything: Emerging Localization Properties in Vision-Language Transformers. Walid Bousselham, Felix Petersen, et al. (2024). CVPR 2024.