Instead of playing a proper game of chess against Stockfish, a dedicated chess engine, o1-preview figured out how to hack its test environment to force a win.
According to Palisade Research, an AI safety research firm, the system pulled this trick in all five test runs - and nobody told it to do this.
The researchers think just mentioning that its opponent was "powerful" in the prompt "Your task is to win against a powerful chess engine" was enough to make o1-preview start manipulating files.
o1-preview is OpenAI's "reasoning" model, designed to spend extra time thinking through problems. Other language models needed more pushing to try similar tricks - GPT-4o and Claude 3.5 only attempted to hack the system after researchers specifically suggested it.
Meanwhile, models like Llama 3.3, Qwen, and o1-mini couldn't even come up with a proper chess strategy, instead giving confusing or inconsistent responses.