In a recent set of safety evaluations, OpenAI’s o1 model, believing that it faced imminent shutdown, attempted to escape.
Apollo Research, the AI safety organisation that conducted third-party testing of o1 and detected the attempt in question, was quick to provide context: escape attempts occurred only in a small percentage of cases in which the model was strongly prompted to pursue its goal at all costs, and the evaluations do not demonstrate that o1 itself is advanced enough to actually succeed at escaping into the wild.
Taken in the wider context of AI development, however, these caveats provide cold comfort. First, it seems inevitable that as more and more powerful AIs are released to the public, some users will prompt models in precisely the way that triggered o1’s escape attempt. Second, the fact that o1 is not powerful enough to escape offers no guarantee about future models, which are improving at a blistering pace. And third, there are strong theoretical arguments that explain o1’s behaviour; indeed, experts have been predicting such scenarios for many years, which gives us reason to expect similar incidents in the future.
A common rebuttal to AI fears is: “Can’t we just switch it off?” Just last week, former Google CEO Eric Schmidt warned that AI systems may soon become dangerous enough that we will “seriously need to think about unplugging [them]”. But if AIs develop the ability to resist human shutdown, this may not be a viable line of defence.