When: 19th of March, 2:00pm AEDT
Where: This seminar will be presented in hybrid format: in person at the ACFR seminar area, J04 Level 2 (Rose St Building), and online via Zoom. RSVP
Speaker: Yonatan Gideoni
Title: Vision-Language Models Don’t Generalise Across Modalities
Abstract:
Despite their impressive performance across a wide range of tasks, vision-language models (VLMs) still exhibit simple failures. In this talk I will demonstrate one such failure, in which VLMs fail to recognise simple everyday concepts, and show how it results from misalignment between the vision and language representations the VLM relies on. This misalignment stems from the data and training paradigms, and therefore persists regardless of scale. I will discuss why this and other failures occur, and the implications for other multimodal systems, such as vision-language-action models.
Bio:
Yonatan is a DPhil student at Oxford developing fundamental methods in machine learning. His research investigates the limits of existing learning paradigms, aiming to understand where they break down and how to design methods that go beyond them. His recent work has explored these questions in multimodality and code generation. Previously, Yonatan received his master's degree in computer science from the University of Cambridge and worked on maps for autonomous vehicles at Mobileye. His PhD is funded by the AIMS CDT and a Rhodes Scholarship.