Releasing AI-driven tools for understanding AI systems

Submitted by Style Pass, 2024-10-24 04:30:02

We are releasing several tools that use trained AI agents to help users understand and investigate other AI systems.

AI systems result from pipelines of complex data, from training sets to learned representations and model behaviors. Our vision is to create AI-driven tools that direct massive computational power toward explaining these complex systems.

Detailed research reports are linked above; continue reading for overviews. We’re releasing a public endpoint for the interface today and will fully open-source the code for the feature description and observability tools next week.

Feature Descriptions. We often understand representations by decomposing them into features, such as neurons or linear directions. To understand these features, we want human-readable descriptions that accurately predict when each feature is active.
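To make "neurons or linear directions" concrete, here is a minimal Python sketch. It assumes a hidden representation is a plain list of floats. The function names and the example vector are hypothetical and not part of the release. A neuron reads one coordinate of the representation, while a linear direction takes a dot product, so a neuron is the special case of a one-hot direction.

```python
from typing import Sequence


def neuron_feature(hidden: Sequence[float], index: int) -> float:
    """Activation of a single neuron: one coordinate of the hidden vector."""
    return hidden[index]


def direction_feature(hidden: Sequence[float], direction: Sequence[float]) -> float:
    """Activation along a linear direction: dot product with that direction."""
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    return sum(h * d for h, d in zip(hidden, direction))


# Hypothetical 3-dimensional representation, for illustration only.
hidden = [0.5, -1.0, 2.0]

neuron_value = neuron_feature(hidden, 2)
one_hot_value = direction_feature(hidden, [0.0, 0.0, 1.0])
mixed_value = direction_feature(hidden, [1.0, 0.0, 1.0])
```

A description of either kind of feature is useful to the extent that it predicts when the corresponding activation is large.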

There are far too many features to label by hand, so we need to automate this. We trained an AI pipeline to generate, score, and re-rank descriptions, outperforming human descriptions on our quality metrics and significantly outperforming existing automated methods. We have generated high-quality descriptions for every neuron in Llama-3.1 8B and are releasing these descriptions, along with an open-source LLM that can be applied to any features in any language model, in a code release next week.
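The score-and-re-rank step can be sketched as follows. This is an illustration under stated assumptions, not the released pipeline: the keyword-matching `keyword_simulator` is a toy stand-in for whatever model predicts activations from a description, Pearson correlation is an assumed scoring metric, and the texts, activations, and candidate descriptions are made up.

```python
import math
from typing import Callable, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def keyword_simulator(description: str, text: str) -> float:
    """Toy stand-in (assumption): predict 1.0 if the text shares a word
    with the description, else 0.0."""
    shared = set(description.lower().split()) & set(text.lower().split())
    return 1.0 if shared else 0.0


def rerank(
    candidates: Sequence[str],
    texts: Sequence[str],
    activations: Sequence[float],
    simulate: Callable[[str, str], float] = keyword_simulator,
) -> List[Tuple[float, str]]:
    """Score each candidate by how well its predicted activations track
    the real ones, then sort from best to worst."""
    scored = []
    for description in candidates:
        predicted = [simulate(description, text) for text in texts]
        scored.append((pearson(predicted, activations), description))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Made-up data for one hypothetical feature.
texts = ["the cat sat", "a dog barked", "my cat purred", "stocks fell today"]
activations = [0.9, 0.1, 0.8, 0.0]
ranked = rerank(["dog", "cat", "stocks"], texts, activations)
best_score, best_description = ranked[0]
```

Here the candidate "cat" ranks first because its predicted activations line up with the texts where the feature actually fires. Scoring by predictive accuracy follows the criterion stated above; the specific metric and simulator are placeholders.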