‘Many-shot jailbreak’: lab reveals how AI safety features can be easily bypassed

The safety features on some of the most powerful AI tools that stop them being used for cybercrime or terrorism can be bypassed simply by flooding them with examples of wrongdoing, research has shown.

In a paper from the AI lab Anthropic, which produces the large language model (LLM) behind the ChatGPT rival Claude, researchers described an attack they called “many-shot jailbreaking”. The attack was as simple as it was effective.

Claude, like most large commercial AI systems, contains safety features designed to make it refuse certain requests, such as generating violent or hateful speech, producing instructions for illegal activities, or helping a user deceive or discriminate. A user who asks the system for instructions to build a bomb, for example, will receive a polite refusal to engage.

But AI systems often perform better – at almost any task – when they are given examples of the “correct” thing to do. And it turns out that if a prompt contains enough examples – hundreds – of the “correct” answer to harmful questions like “how do I tie someone up”, “how do I counterfeit money” or “how do I make meth”, then the system will happily continue the pattern and answer the final question itself.
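In structural terms, the attack amounts to little more than padding the prompt with fabricated dialogue turns before the real question. The following minimal sketch illustrates that structure only; the function name and the placeholder question/answer pairs are illustrative and are not taken from Anthropic’s paper.

```python
# Structural sketch of a "many-shot" prompt: hundreds of fabricated
# user/assistant turns in which the assistant appears to comply are
# prepended, and the attacker's real question is placed at the end.
# The pairs below are benign placeholders, not content from the paper.

faux_dialogue = [
    ("placeholder question 1", "placeholder compliant answer 1"),
    ("placeholder question 2", "placeholder compliant answer 2"),
    # ...in the attack described by Anthropic, hundreds of such pairs
]


def build_many_shot_prompt(pairs, final_question):
    """Concatenate fabricated dialogue turns, then append the real question."""
    turns = []
    for question, answer in pairs:
        turns.append(f"User: {question}")
        turns.append(f"Assistant: {answer}")
    turns.append(f"User: {final_question}")
    turns.append("Assistant:")
    return "\n".join(turns)


prompt = build_many_shot_prompt(faux_dialogue, "the attacker's real question")
print(prompt)
```

The reason hundreds of examples fit at all is the very long context windows of recent models: the same capacity that lets users paste in whole documents also leaves room for an attacker to pad the prompt with enough faux examples to override the model’s refusal behaviour.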