Sequential object-based attention for robust visual reasoning
Hossein Adeli, Seoyoung Ahn, Gregory Zelinsky, Stony Brook University, United States
Posters 3 Poster
Pacific Ballroom H-O
Sat, 27 Aug, 19:30 - 21:30 Pacific Time (UTC -7)
The visual system uses sequences of selective glimpses of objects to support goal-directed behavior, but how is this attention control learned? Here we present an encoder-decoder model inspired by the interacting bottom-up and top-down visual pathways in the brain. At every iteration, a new glimpse is taken from the image and processed through the "what" encoder, a hierarchy of feedforward, recurrent, and capsule layers, to obtain an object-centric representation. This representation feeds into the "where" decoder, whose evolving recurrent representation provides top-down attentional modulation to plan subsequent glimpses and to influence routing in the encoder. In a visual reasoning task requiring comparison of two objects, our model achieves near-perfect accuracy and significantly outperforms larger models in generalizing to out-of-distribution stimuli. Our work demonstrates the benefits of object-based attention mechanisms that take sequential glimpses of objects.
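The glimpse loop described in the abstract can be illustrated with a toy sketch. This is not the authors' model: every function name here is hypothetical, and each component is a hand-written stand-in. The "what" encoder is replaced by a position-invariant summary of the patch (its sorted pixel values), and the learned "where" decoder is replaced by a fixed rule that looks at the brightest pixel not covered by an earlier glimpse. Only the control flow (glimpse, encode, plan the next glimpse, compare) mirrors the description.

```python
def crop(image, center, size):
    """Return a size x size patch centered on (row, col), zero-padded at the borders."""
    r0, c0 = center[0] - size // 2, center[1] - size // 2
    h, w = len(image), len(image[0])
    return [[image[r][c] if 0 <= r < h and 0 <= c < w else 0
             for c in range(c0, c0 + size)]
            for r in range(r0, r0 + size)]


def what_encoder(patch):
    """Placeholder for the 'what' pathway (assumption, not the capsule hierarchy):
    the sorted pixel values of the patch, which ignore where pixels sit in it."""
    return tuple(sorted(v for row in patch for v in row))


def where_decoder(image, visited, size):
    """Placeholder for the 'where' pathway (assumption, not a learned policy):
    choose the brightest pixel that no earlier glimpse has covered."""
    half = size // 2
    best, best_val = None, 0
    for r, row in enumerate(image):
        for c, v in enumerate(row):
            covered = any(abs(r - vr) <= half and abs(c - vc) <= half
                          for vr, vc in visited)
            if v > best_val and not covered:
                best, best_val = (r, c), v
    return best


def compare_two_objects(image, size=3, n_glimpses=2):
    """Take sequential glimpses, encode each one, and report whether the two codes match."""
    visited, codes = [], []
    for _ in range(n_glimpses):
        center = where_decoder(image, visited, size)  # plan the next glimpse
        if center is None:                            # nothing left to look at
            break
        codes.append(what_encoder(crop(image, center, size)))
        visited.append(center)                        # state carried across iterations
    return len(codes) == 2 and codes[0] == codes[1]
```

In this sketch the list of visited locations plays the role of the decoder's recurrent state. In the model described above, that state is learned and also modulates routing inside the encoder, which this toy version does not attempt to reproduce.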