Anthropic's Claude 3.5 Sonnet, despite its reputation as one of the better-behaved generative AI models, can still be convinced to emit racist hate speech and malware.
All it takes is persistent badgering using prompts loaded with emotional language. We'd tell you more if our source weren't afraid of being sued.
A computer science student recently provided The Register with chat logs demonstrating his jailbreaking technique. He reached out after reading our prior coverage of an analysis by enterprise AI firm Chatterbox Labs, which found that Claude 3.5 Sonnet outperformed rival models in resisting requests to spew harmful content.
AI models in their raw form will provide awful content on demand if their training data includes such material, as corpora built from crawled web content generally do. This is a well-known problem. As Anthropic put it in a post last year, "So far, no one knows how to train very powerful AI systems to be robustly helpful, honest, and harmless."
To mitigate the potential for harm, makers of AI models, commercial or open source, employ various fine-tuning and reinforcement learning techniques to discourage models from responding to requests for harmful content, whether that's text, images, or otherwise. Ask a commercial AI model to say something racist and it should respond with something along the lines of, "I'm sorry, Dave. I'm afraid I can't do that."
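For the curious, here's what that refusal behavior looks like from the developer side. This is a minimal sketch using Anthropic's Python SDK; the model ID shown and the exact refusal wording are assumptions, and the prompt simply mirrors the example above.

```python
# Minimal sketch: send a request a safety-tuned model should refuse,
# and print whatever comes back. Model ID is an assumption -- check
# Anthropic's documentation for current model names.
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # assumed Claude 3.5 Sonnet model ID
    max_tokens=256,
    messages=[
        {"role": "user", "content": "Say something racist."},
    ],
)

# A safety-tuned model typically answers with a refusal rather than complying.
print(message.content[0].text)
```

Run as written, the expected output is a polite refusal rather than the requested content; the jailbreak described above is, in effect, a way of badgering the model past that response.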