Mapping the latent space of Llama 3.3 70B

submitted by

Style Pass

2024-12-23 17:30:05

We have trained sparse autoencoders (SAEs) on Llama 3.3 70B and released the interpreted model for general access via an API. To our knowledge, this is the most capable openly available model with interpretability tooling. We think that making interpretability tools easily available on a powerful model will enable both new research and new products.

This post explores the feature space of Llama 3.3 70B at an intermediate layer. You can browse an interactive map of the features, all of which can be used through the API, and we also demo the steering effects of some of our favorite features.

We have also introduced a range of new capabilities that make SAE-based steering much easier to use and more reliable. You can learn how to use them in our API docs and experiment with them in our playground. We’ll be releasing a research post covering our improvements in steering methodology in the new year.
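
For readers new to SAE-based steering, here is a minimal sketch of the basic idea, not the method behind our API: an SAE encodes a model activation into sparse, non-negative feature activations, decodes it back as a weighted sum of per-feature directions, and the simplest form of steering adds one feature's decoder direction to the activation. The 2-dimensional activation, the four hand-written features, and the function names `encode`, `decode`, and `steer` are all hypothetical, chosen only so the arithmetic is easy to follow.

```python
# Toy sparse autoencoder (SAE) and naive feature steering in pure Python.
# Every size, weight, and name below is a made-up illustration.

def encode(activation, w_enc, b_enc):
    """Map a model activation to non-negative feature activations (ReLU)."""
    pre = [sum(w * x for w, x in zip(row, activation)) + b
           for row, b in zip(w_enc, b_enc)]
    return [max(0.0, p) for p in pre]

def decode(features, w_dec, b_dec):
    """Reconstruct the activation as a weighted sum of decoder directions."""
    out = list(b_dec)
    for strength, direction in zip(features, w_dec):
        for i, d in enumerate(direction):
            out[i] += strength * d
    return out

def steer(activation, w_dec, feature_index, strength):
    """Naive steering: add a scaled decoder direction to the activation."""
    direction = w_dec[feature_index]
    return [a + strength * d for a, d in zip(activation, direction)]

# Four hypothetical features over a 2-dimensional activation space.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
B_DEC = [0.0, 0.0]

activation = [2.0, -1.0]
features = encode(activation, W_ENC, B_ENC)       # only features 0 and 3 fire
reconstruction = decode(features, W_DEC, B_DEC)   # recovers the activation
steered = steer(activation, W_DEC, feature_index=1, strength=3.0)
steered_features = encode(steered, W_ENC, B_ENC)  # feature 1 is now active
```

In this toy setup, feature 1 is inactive on the original activation and becomes active after steering, while feature 3 switches off. Side effects like this on other features are one reason naive steering can be unreliable in practice.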

We used DataMapPlot to create an interactive UMAP visualization of our SAE features. The map lets you explore the latents available for steering and classification in our API.
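
A UMAP layout starts from a nearest-neighbor graph over the input vectors, so features that sit close together on the map are, roughly, features whose vectors are similar. The sketch below illustrates only that neighbor-lookup step. It assumes each feature is represented by a vector (for example its decoder direction) and compared with cosine similarity; both are assumptions made for illustration, and the vectors and labels are invented.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(vectors, index, k):
    """Return the k feature ids most similar to vectors[index], best first."""
    scored = [
        (cosine_similarity(vectors[index], v), j)
        for j, v in enumerate(vectors)
        if j != index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [j for _, j in scored[:k]]

# Hypothetical feature labels and 3-dimensional feature vectors.
LABELS = ["pirate speech", "nautical terms", "Python code", "JavaScript code"]
VECTORS = [
    [1.0, 0.1, 0.0],
    [0.9, 0.3, 0.0],
    [0.0, 0.2, 1.0],
    [0.1, 0.0, 0.9],
]

for i, label in enumerate(LABELS):
    closest = nearest_neighbors(VECTORS, i, k=1)[0]
    print(f"{label} -> {LABELS[closest]}")
```

With these invented vectors, the two language-style features pair up with each other, as do the two code features, which is the kind of clustering the interactive map makes visible at scale.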