Abstract: In this talk, I will discuss recent publications from my group that attempt to learn models of the world and of the effects of an agent's actions within that world in a self-supervised manner, solely via interaction. In particular, I will discuss the potential and challenges of video generative models as candidates for such a world model, the role of inductive biases, using as an example our recent work that discovers the kinematics of a robot, and finally a new research direction in which we attempt to discover the physical rules underlying our world without any inductive biases whatsoever.
Bio: Vincent Sitzmann is an Assistant Professor at MIT EECS, where he leads the Scene Representation Group. Previously, he completed his Ph.D. at Stanford University and a postdoc at MIT CSAIL. His research interest lies in building models that perceive and model the world the way humans do. Specifically, Vincent works towards models that can learn to reconstruct a rich state description of their environment from vision, such as its 3D structure, materials, and semantics. More importantly, these models should then also be able to model the impact of their own actions on that environment, i.e., learn a “mental simulator” or “world model”. Vincent is particularly interested in models that can learn these skills fully self-supervised, from video alone and through self-directed interaction with the world.