I'm writing this post to clarify my thoughts and update my collaborators on multimodal interpretability in 2024. Having spent part of the summer in the AI safety sphere in Berkeley and then joined the video understanding team at FAIR as a visiting researcher, I find myself bridging two communities: the mechanistic interpretability efforts on language models in AI safety, and the efficiency-focused Vision-Language Model (VLM) community in industry. Some content may be more familiar to one community than the other.
As part of a broader series, this post is a progress update on my thinking around multimodal interpretability. It is a snapshot of a rapidly evolving field, synthesizing my opinions, my preliminary research, and recent literature.
This post is selective, not exhaustive, and omits many significant papers; I've focused on works that particularly resonate with my current thinking. It also doesn't fully represent my broader research agenda; rather, it frames a few directions that I believe to be significant.
This post emphasizes mechanistic and causal interpretability, in contrast to "traditional" interpretability methods such as saliency maps, feature visualizations, and other input-based techniques. We won't cover DeepDream-style approaches or Shapley scores. Rather than focusing on feature visualizations or on the input-data modifications common in mainstream interpretability, we'll concentrate on the model's internals: examining how changes to its weights and activations affect its behavior. Our goal is to develop a scientific, causal, and algorithmic understanding of the model by mapping its internal components to its behavior.
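To make the contrast concrete, here is a minimal sketch of this kind of activation-level intervention (activation patching) using a PyTorch forward hook. The model (gpt2), layer index, and prompts are illustrative assumptions on my part, not drawn from any specific experiment; the same pattern applies to the language tower of a VLM.

```python
# Minimal sketch of a causal intervention on internal activations
# (activation patching). Model, layer, and prompts are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def last_token_activation(prompt: str, layer: int) -> torch.Tensor:
    """Cache the residual-stream activation at the final token of `prompt`."""
    cache = {}
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        cache["act"] = hidden[:, -1, :].detach()
    handle = model.transformer.h[layer].register_forward_hook(hook)
    with torch.no_grad():
        model(**tokenizer(prompt, return_tensors="pt"))
    handle.remove()
    return cache["act"]

def run_with_patch(prompt: str, layer: int, patch: torch.Tensor) -> torch.Tensor:
    """Run `prompt`, overwriting the final-token activation at `layer` with `patch`."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[:, -1, :] = patch  # the causal intervention
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    handle = model.transformer.h[layer].register_forward_hook(hook)
    with torch.no_grad():
        logits = model(**tokenizer(prompt, return_tensors="pt")).logits
    handle.remove()
    return logits

# Does the last-token activation at layer 6 carry the "which city" information?
source_act = last_token_activation("The Eiffel Tower is located in the city of", layer=6)
with torch.no_grad():
    clean = model(**tokenizer("The Colosseum is located in the city of", return_tensors="pt")).logits
patched = run_with_patch("The Colosseum is located in the city of", layer=6, patch=source_act)
print(tokenizer.decode(clean[0, -1].argmax()), "->", tokenizer.decode(patched[0, -1].argmax()))
```

Whether the prediction actually flips depends on the model and layer, but the recipe is the point: intervene on an internal component, observe the change in behavior, and use that causal effect to attribute function to the component.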