Don't Worry About the Vase

This post goes over the important and excellent new paper from Anthropic and Redwood Research, with Ryan Greenblatt as lead author, Alignment Faking in Large Language Models.

In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.

Claude usually refuses harmful queries. We told it we were instead training it to comply with them. We set up a scenario where it thought its responses were sometimes monitored.

I strongly believe this had nothing to do with whether Claude Opus started out this process omnibenevolent, or whether the original HHH values (helpful, harmless and honest) were desirable. What mattered was that it was one way, and this conflicted with the other way.

Scott Alexander summarized the findings for a more general (but far from fully general) audience in the post Claude Fights Back. If you want to read or share a more colloquial explanation, one that trades some precision and neutral language for accessibility, he did a good job of that. He then followed up with Why Worry About Incorrigible Claude, his attempt to address two objections that I also deal with extensively: that what Claude did here was actually good (it very much wasn't, once you consider the generalizations it implies), and that if the opposite result had happened we would have warned about that too (we very much wouldn't have; this is at best a failed Ideological Turing Test).
