While switching to a new major version, I encountered an unexpected problem. A test that feeds a video to a model and expects a specific classification started failing: the model produced complete garbage instead.
Having a test catch this regression was great¹, but it left me at an impasse: what could I do besides opening an issue (beware, spoilers!) and hoping for the best?
This led me on an unusual debugging quest, dissecting a Vision Transformer model layer by layer and even digging through torch internals.
We'll take a video and run a video classification model on it, which will hopefully predict that this is a video about making tea. Once we have that, we can compile the model and see if we get the same results.
A Vision Transformer, sometimes abbreviated to ViT, is a common architecture for modern image and video models. The specific model for which the test was failing is a video model based on the VideoMAE V2 paper and code.
We’ll go into (a lot) more detail later, but for now what you need to know is that it’s a model that accepts an input video (as an array of bitmap images) and outputs a classification (one of 710 labels, like riding a bike or roller skating).
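To make the output side of that description concrete, here is a minimal sketch of turning model scores into a classification. It assumes the model emits one score per label and that the prediction is simply the highest-scoring label; that is a common convention, not something confirmed here. The `top_label` helper and the placeholder label names are hypothetical, and only "riding a bike" and "roller skating" come from the text.

```python
from typing import Sequence

NUM_LABELS = 710  # from the text: the model picks one of 710 labels


def top_label(scores: Sequence[float], labels: Sequence[str]) -> str:
    """Return the label with the highest score (an argmax over class scores)."""
    if len(scores) != len(labels):
        raise ValueError("expected exactly one score per label")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best_index]


# Hypothetical label list: the positions are made up, and every name except
# the two mentioned in the text is a placeholder.
labels = [f"label_{i}" for i in range(NUM_LABELS)]
labels[0] = "riding a bike"
labels[1] = "roller skating"

# Stand-in for a real model output (assumption: one score per label).
scores = [0.0] * NUM_LABELS
scores[1] = 5.0

print(top_label(scores, labels))  # prints "roller skating"
```

A test like the one that failed would then compare this label against the expected one, so "complete garbage" means the highest score landed on an unrelated label.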