Image Embeddings Informed by Natural Language Significantly Improve Predictions and Understanding of Human Higher-level Visual Cortex
Aria Wang, Michael Tarr, Leila Wehbe, Carnegie Mellon University, United States
Posters 3 Poster
Pacific Ballroom H-O
Sat, 27 Aug, 19:30 - 21:30 Pacific Time (UTC -8)
Does the organization of high-level visual knowledge reflect multimodal constraints? We extracted features from images using the Contrastive Language-Image Pre-training (CLIP) model, which encodes visual concepts with supervision from natural language captions. We then built voxelwise encoding models on these CLIP features to predict brain responses to real-world images from the Natural Scenes Dataset; the models explain up to 0.75% of variance in held-out test data. CLIP also explains more unique variance in higher-level visual areas than models trained only on images (ImageNet-trained ResNet) or only on text (BERT). Visualizations of model embeddings and Principal Component Analysis (PCA) suggest that CLIP captures important semantic dimensions of visual knowledge. Multimodal models trained jointly on images and text (and perhaps other modalities) appear to be better models of representation within human higher-level visual cortex.
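As a rough illustration of the two analyses named above (voxelwise encoding and unique variance), the following minimal sketch fits a linear map from features to voxel responses and scores it on held-out data. Everything here is an assumption for illustration: the abstract does not say which regression was used (ridge regression is assumed), the helper names (`fit_ridge`, `heldout_r2`, `unique_variance`) are hypothetical, and random arrays stand in for real CLIP or ResNet features and for fMRI responses. Unique variance for one feature space is taken as the held-out R² of the joint model minus the held-out R² of the other feature space alone.

```python
import numpy as np


def fit_ridge(X, Y, alpha):
    """Closed-form ridge weights mapping features X (n, d) to voxels Y (n, v)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)


def r2_per_voxel(Y_true, Y_pred):
    """Coefficient of determination, computed separately for each voxel."""
    ss_res = ((Y_true - Y_pred) ** 2).sum(axis=0)
    ss_tot = ((Y_true - Y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot


def heldout_r2(X_train, Y_train, X_test, Y_test, alpha=1.0):
    """Fit on the training split, then score each voxel on the test split."""
    W = fit_ridge(X_train, Y_train, alpha)
    return r2_per_voxel(Y_test, X_test @ W)


def unique_variance(Xa_tr, Xb_tr, Y_tr, Xa_te, Xb_te, Y_te, alpha=1.0):
    """Held-out R^2 of the joint model minus that of feature space B alone."""
    joint = heldout_r2(np.hstack([Xa_tr, Xb_tr]), Y_tr,
                       np.hstack([Xa_te, Xb_te]), Y_te, alpha)
    b_only = heldout_r2(Xb_tr, Y_tr, Xb_te, Y_te, alpha)
    return joint - b_only


# Synthetic stand-ins (NOT real data): feature space A drives the voxels,
# feature space B is unrelated to them.
rng = np.random.default_rng(0)
n_train, n_test, n_feat, n_vox = 400, 100, 20, 5
Xa = rng.standard_normal((n_train + n_test, n_feat))
Xb = rng.standard_normal((n_train + n_test, n_feat))
W_true = rng.standard_normal((n_feat, n_vox))
Y = Xa @ W_true + 0.5 * rng.standard_normal((n_train + n_test, n_vox))

Xa_tr, Xa_te = Xa[:n_train], Xa[n_train:]
Xb_tr, Xb_te = Xb[:n_train], Xb[n_train:]
Y_tr, Y_te = Y[:n_train], Y[n_train:]

r2_a = heldout_r2(Xa_tr, Y_tr, Xa_te, Y_te)
unique_a = unique_variance(Xa_tr, Xb_tr, Y_tr, Xa_te, Xb_te, Y_te)
unique_b = unique_variance(Xb_tr, Xa_tr, Y_tr, Xb_te, Xa_te, Y_te)
print("held-out R^2 (A):", r2_a.round(3))
print("unique variance A:", unique_a.round(3))
print("unique variance B:", unique_b.round(3))
```

In this toy setup, feature space A should show high unique variance and B close to none. In a real analysis, the regularization strength would typically be chosen per voxel by cross-validation rather than fixed.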