I mostly work on risks from scheming (that is, misaligned, power-seeking AIs that plot against their creators, for example by faking alignment). Recently, I (and co-authors) released "Alignment Faking in Large Language Models", which provides empirical evidence for some components of the scheming threat model.

One really important question is how likely scheming is. But it's also really important to know how much of this uncertainty we expect to be resolved by various key points in the future. I think it's about 25% likely that the first AIs capable of obsoleting top human experts[1] are scheming. It's really important for me to know whether I expect to make basically no updates to my P(scheming)[2] between here and the advent of potentially dangerously scheming models, or whether I expect to be basically totally confident one way or another by that point (in the same way that, though I might be very uncertain about the weather on some night a month from now, I'll be completely certain about whether it's raining when that night arrives, and I'll have a much more accurate prediction that morning).
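As a rough illustration of what "how much resolution to expect" means (the numbers below are made up purely for the sketch, not estimates from this post): under Bayesian updating, your expected future P(scheming) has to equal your current P(scheming), so the difference between "little resolution" and "lots of resolution" is how spread out the possible future credences are.

```python
# Minimal sketch with made-up numbers: hypothetical evidence states we might
# be in by some key future point, each with a probability of occurring and
# the posterior P(scheming) we'd hold in that state.
scenarios = [
    # (probability of this evidence state, posterior P(scheming) in it)
    (0.40, 0.05),   # strong evidence against scheming
    (0.35, 0.20),   # ambiguous evidence
    (0.25, 0.64),   # strong evidence for scheming
]

# Law of total expectation: this recovers the current credence (~0.25 here).
expected_posterior = sum(p * post for p, post in scenarios)

# Expected size of the update measures how much resolution we anticipate.
expected_update = sum(p * abs(post - expected_posterior) for p, post in scenarios)

print(f"current P(scheming) = expected future P(scheming) = {expected_posterior:.2f}")
print(f"expected absolute update by the key point = {expected_update:.2f}")
```

With these illustrative numbers the current credence stays at 0.25, but we expect to move by roughly 0.2 in one direction or the other by the key point; a forecaster expecting "basically no updates" would instead have all the scenario posteriors clustered near 0.25.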

The P(scheming) numbers I discuss refer to "non-obvious scheming in this AI". So if a given type of scheming would obviously be caught (e.g. constant obvious scheming in CoT), I'm not counting that in the numbers. (However, I am still considering worlds where we observed obvious scheming in a prior model.) This is roughly equivalent to assuming that the model in question has already gone through basic adversarial evaluation and no extremely clear scheming was found.
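As a rough numerical illustration of that conditioning (again with made-up numbers, not estimates from this post), the "non-obvious scheming" figure can be read as P(scheming | the model passed basic adversarial evaluation):

```python
# Hypothetical Bayes calculation, illustrative numbers only: conditioning
# P(scheming) on the model having passed basic adversarial evaluation,
# which screens off obvious scheming (e.g. constant scheming in CoT).
p_scheming = 0.30                  # prior on any scheming at all (made up)
p_pass_given_scheming = 0.70       # a schemer slips past basic evals (made up)
p_pass_given_not_scheming = 0.99   # a non-schemer almost always passes (made up)

p_pass = (p_scheming * p_pass_given_scheming
          + (1 - p_scheming) * p_pass_given_not_scheming)

# P(scheming | passed basic adversarial evaluation)
p_nonobvious_scheming = p_scheming * p_pass_given_scheming / p_pass
print(f"P(non-obvious scheming) = {p_nonobvious_scheming:.2f}")
```

The point of the sketch is just that conditioning on passing basic evaluation shaves off the worlds where scheming is extremely clear, so the resulting number is somewhat lower than the unconditional prior.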
