FAST: Efficient Robot Action Tokenization


Most foundation models use the Transformer architecture, a sequence model that operates on discrete "tokens," which typically correspond to groups of letters or short words, image patches, or sound snippets. Transformers map input tokens (e.g., a question) to output tokens (e.g., an answer), and any data that we can tokenize into discrete units can be processed by such a sequence model. However, the choice of tokenization can have a big impact on the effectiveness of downstream learning, and a good tokenizer is essential for effective large-scale training.
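For intuition, here is a toy example of one of the simplest possible schemes, byte-level tokenization, which maps any string to a sequence of discrete integer IDs and back (the example string is illustrative):

```python
# Toy example: byte-level tokenization maps any string to a
# sequence of discrete integer tokens (IDs from 0 to 255).
text = "Pick up the cup."
tokens = list(text.encode("utf-8"))           # [80, 105, 99, 107, ...]
assert bytes(tokens).decode("utf-8") == text  # lossless, invertible round trip
```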

So what should we do if we want to train Transformers to control robots? In this case, the output is an "action chunk": a short sequence of robot actions (e.g., arm joint angles), ranging from 3-5 actions for crude systems all the way to 20-50 actions for high-frequency dexterous robots. Just as with language, representing these actions in the right way is essential for effective learning. Existing vision-language-action (VLA) models typically use simple discrete binning, where each dimension of each action step is mapped to a discrete bin. This is passable for simple behaviors, but it rapidly breaks down for more complex and dexterous skills that require precision and high-frequency control. As we will discuss in this post, this kind of binning simply fails on the complex, dexterous tasks that we are interested in at Physical Intelligence. Diffusion or flow matching tends to perform much better, as in the case of our π0 model, but diffusion takes much longer to train. So how can we represent actions so that we can train Transformers for robotic control quickly while preserving dexterity and precision?
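To make the baseline concrete, here is a minimal sketch of per-dimension discrete binning; the bin count, action range, and function names below are illustrative, not the exact scheme used by any particular VLA:

```python
import numpy as np

def bin_tokenize(chunk, low=-1.0, high=1.0, n_bins=256):
    """Naive binning tokenizer: quantize each dimension of each action
    step into one of `n_bins` uniform bins, then flatten the chunk.
    A chunk of shape (T, D) becomes T * D discrete tokens."""
    chunk = np.clip(chunk, low, high)
    ids = np.round((chunk - low) / (high - low) * (n_bins - 1))
    return ids.astype(int).flatten()

def bin_detokenize(tokens, T, D, low=-1.0, high=1.0, n_bins=256):
    """Map token IDs back to approximate continuous actions."""
    ids = np.asarray(tokens, dtype=float).reshape(T, D)
    return low + ids / (n_bins - 1) * (high - low)

# A 1-second chunk at 50 Hz for a 7-DoF arm already costs 350 tokens.
chunk = np.random.uniform(-1.0, 1.0, size=(50, 7))
tokens = bin_tokenize(chunk)
print(len(tokens))  # 350
```

Note that the token count grows linearly with both chunk length and action dimensionality, and uniform bins cap per-dimension precision at roughly (high - low) / n_bins, which is part of why this scheme struggles with precise, high-frequency control.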

Our new action tokenizer, FAST, enables us to train generalist policies on highly dexterous tasks via simple next token prediction.
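This post doesn't walk through FAST's internals, but the accompanying paper describes a compression-based pipeline: apply a discrete cosine transform (DCT) to each action dimension of the chunk, quantize the coefficients with a scale-and-round step, and compress the resulting sparse integer sequence with byte-pair encoding (BPE). Below is a rough, simplified sketch of just the DCT-and-quantize stages; the scale constant, function names, and test trajectory are illustrative, and the learned BPE stage is omitted:

```python
import numpy as np
from scipy.fft import dct, idct

def dct_quantize(chunk, scale=50.0):
    """Sketch of the first stages of a FAST-style tokenizer: a per-dimension
    DCT followed by scale-and-round quantization. Smooth trajectories
    concentrate their energy in low-frequency coefficients, so most entries
    of the quantized matrix are zero and the flattened integer sequence
    compresses well under BPE (the learned BPE stage is omitted here)."""
    coeffs = dct(chunk, axis=0, norm="ortho")  # shape (T, D) -> (T, D)
    return np.round(coeffs * scale).astype(int)

def dct_dequantize(quantized, scale=50.0):
    """Approximate inverse: undo the scaling, then invert the DCT."""
    return idct(quantized / scale, axis=0, norm="ortho")

# Build a smooth 50-step, 7-dimensional chunk from low-frequency components.
rng = np.random.default_rng(0)
coeffs = np.zeros((50, 7))
coeffs[:5] = rng.normal(size=(5, 7))        # energy only at low frequencies
chunk = idct(coeffs, axis=0, norm="ortho")

quantized = dct_quantize(chunk)
recon = dct_dequantize(quantized)
print(np.abs(recon - chunk).max())  # small quantization error
print((quantized == 0).mean())      # >= 0.9: most coefficients are zero
```

The compressibility is the point: after BPE, each token carries far more information about the action chunk than a single binned action dimension, which shortens output sequences and helps make simple next token prediction practical at high control frequencies.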
