A PyTorch extension implementing symmetric power transformers, a variant of linear transformers that matches standard transformer performance while scaling linearly with sequence length. This package provides efficient CUDA kernels that make it practical to process much longer sequences than standard quadratic attention allows.
The package includes a drop-in replacement for standard attention in transformer models. See train/model.py for a complete example of using power attention in a GPT-style model; a rough sketch of what that integration can look like is shown below.
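The sketch below illustrates the general shape of swapping power attention into a GPT-style multi-head attention block. The import path, the `power_full` function name, and its `deg` parameter are assumptions made for illustration, not the package's confirmed API; defer to train/model.py for the actual calls and tensor layouts.

```python
# Illustrative sketch only: the import and call signature below are assumptions,
# not the package's confirmed API. See train/model.py for the real integration.
import torch
import torch.nn as nn

from power_attention import power_full  # hypothetical export; check the package's actual API


class CausalPowerAttention(nn.Module):
    """GPT-style multi-head attention with softmax attention replaced by
    a (hypothetical) fused power-attention kernel."""

    def __init__(self, d_model: int, n_heads: int, degree: int = 2):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.degree = degree  # symmetric power degree (parameter name is an assumption)
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        # Project to queries, keys, and values in one linear layer.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split heads: (batch, seq, heads, head_dim). The layout expected by
        # the kernel is also an assumption here.
        q = q.view(b, t, self.n_heads, self.head_dim)
        k = k.view(b, t, self.n_heads, self.head_dim)
        v = v.view(b, t, self.n_heads, self.head_dim)
        # Hypothetical call to a fused causal power-attention kernel, standing
        # in for the usual softmax(QK^T)V computation.
        y = power_full(q, k, v, deg=self.degree)
        # Merge heads and project back to the model dimension.
        return self.proj(y.reshape(b, t, d))
```

In a GPT-style block, this module would simply take the place of the standard causal self-attention module, with the rest of the residual/MLP structure unchanged.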