minViT: Walkthrough of a minimal Vision Transformer (ViT)

In this post, I explain the vision transformer (ViT) architecture, which has found its way into computer vision as a powerful alternative to Convolutional Neural Networks (CNNs).

This implementation will focus on classifying the CIFAR-10 dataset, but is adaptable to many tasks, including semantic segmentation, instance segmentation, and image generation. As we will see, training small ViT models is difficult, and the notebook on fine-tuning (later in this post) explains how to get around these issues.

Each image is a 3-channel (RGB) 32x32 pixel image. The dataset can be indexed, with the first index selecting the image and the second index selecting either the image data or the target. Pixel values are torch.float32 values in the range 0 to 1.
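As a minimal sketch, assuming the post loads the data with torchvision.datasets.CIFAR10 and a ToTensor transform (the exact loading code is not shown here), the indexing described above looks like this:

```python
# Sketch only: load CIFAR-10 so that dataset[i][0] is the image tensor
# and dataset[i][1] is the class target, as described in the text.
from torchvision import datasets, transforms

train_set = datasets.CIFAR10(
    root="./data", train=True, download=True,
    transform=transforms.ToTensor(),   # converts to torch.float32 in [0, 1]
)

image, target = train_set[0]           # first index: which image
print(image.shape, image.dtype)        # torch.Size([3, 32, 32]) torch.float32
print(train_set[0][1] == target)       # second index: 0 -> image data, 1 -> target
```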

If you are familiar with the transformer architecture, you likely know that transformers work with vectors to model different modalities. For a text-based modality, this means somehow tokenizing a string of text into characters or larger chunks, and training an embedding table to represent each token as a vector. We hope that tokenization results in semantic units, so that each vector may represent a concept with a specific meaning. As an example, the string “This is a test.” may tokenize as follows:
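To make the token-to-vector idea concrete, here is a toy sketch with a made-up word-level tokenization and a small embedding table; real text models typically use subword tokenizers, and the vocabulary and embedding dimension below are illustrative assumptions, not part of the original post:

```python
# Toy illustration: map tokens to integer ids, then look up a learned vector per token.
import torch
import torch.nn as nn

text = "This is a test."
tokens = ["This", " is", " a", " test", "."]            # one possible tokenization
vocab = {tok: i for i, tok in enumerate(tokens)}        # toy vocabulary
token_ids = torch.tensor([vocab[t] for t in tokens])    # shape: (5,)

embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=16)
vectors = embed(token_ids)                              # shape: (5, 16), one vector per token
print(vectors.shape)
```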
