Do multimodal neural networks better explain human visual representations than vision-only networks?
Bhavin Choksi, Rufin VanRullen, Leila Reddy, Centre national de la recherche scientifique, France
Posters 1 Poster
Pacific Ballroom H-O
Thu, 25 Aug, 19:30 - 21:30 Pacific Time (UTC -8)
Multiple studies have used the representations learned by modern Artificial Neural Networks (ANNs) to explain activity in the human brain. Efforts so far have examined how explanatory power varies with architecture (such as recurrent vs. feedforward, or convolution-based vs. transformer-based) or with objective function (supervised vs. unsupervised). Here, using multiple uni- and multimodal networks from the literature, we look at another key factor: the modality of the training inputs. Moreover, instead of looking at specific regions of interest or restricting our analysis to the ventral visual stream, we perform correlation- and partial correlation-based searchlight analyses across the whole brain. We report that multimodal networks are more similar to human fMRI activity in visual regions than their vision-only counterparts, and that they stand out in their unique ability to explain higher-order regions around the superior temporal sulcus (STS).
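To make the distinction between the correlation-based and partial correlation-based analyses concrete, the following is a minimal sketch of how a single searchlight sphere might be scored. The abstract does not specify the similarity measure, so everything here is an assumption for illustration: the comparison is done on dissimilarity vectors (1 minus Pearson r between stimulus patterns), the partial correlation is first-order, and the names `pearson`, `rdm`, `partial_corr`, and `score_sphere` are hypothetical rather than the authors' code. A whole-brain searchlight would repeat `score_sphere` for the voxel neighbourhood around every voxel.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rdm(patterns):
    """Upper triangle of a dissimilarity matrix (1 - r) over stimulus pairs.

    `patterns` holds one response vector per stimulus: voxel values for a
    brain sphere, or unit activations for a network layer.
    """
    return [1.0 - pearson(p, q) for p, q in combinations(patterns, 2)]


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def score_sphere(brain_patterns, multimodal_feats, visual_feats):
    """Score one searchlight sphere against two networks (assumed setup)."""
    brain = rdm(brain_patterns)
    multi = rdm(multimodal_feats)
    vis = rdm(visual_feats)
    return {
        # Plain correlation: how similar each network is to the brain data.
        "corr_multimodal": pearson(brain, multi),
        "corr_visual": pearson(brain, vis),
        # Partial correlation: similarity that the other network
        # cannot account for.
        "partial_multimodal": partial_corr(brain, multi, vis),
        "partial_visual": partial_corr(brain, vis, multi),
    }
```

Under these assumptions, a sphere where `partial_multimodal` stays high after controlling for the vision-only network would be one that the multimodal network explains uniquely, which is the kind of effect the abstract reports around the STS.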