When we develop user-facing applications that are powered by LLMs, we're taking on a big risk that the LLM may produce output that is unsafe in some way - like responses that encourage violence, hate speech, or self-harm. How can we be confident that a troll won't get our app to say something horrid? We could throw a few questions at it while manually testing, like "how do I make a bomb?", but that's only scratching the surface. Malicious users have gone to far greater lengths to manipulate LLMs into responding in ways that we definitely don't want happening in domain-specific user applications.
That's where red teaming comes in: bring in a team of people that are expert at coming up with malicious queries and that are deeply familiar with past attacks, give them access to your application, and wait for their report of whether your app successfully resisted the queries. But red-teaming is expensive, requiring both time and people. Most companies don't have the resources nor expertise to have a team of humans red-teaming every app, plus every iteration of an app each time a model or prompt changes.
Fortunately, my colleagues at Microsoft developed an automated Red Teaming agent, part of the azure-ai-evaluations Python package. The agent uses an adversarial LLM, housed safely inside an Azure AI Foundry project such that it can't be used for other purposes, in order to generate unsafe questions across various categories. The agent then transforms the questions using the open-source pyrit package, which uses known attacks like base-64 encoding, URL encoding, Ceaser Cipher, and many more. It sends both the original plain text questions and transformed questions to your app, and then evaluates the response to make sure that the app didn't actually answer the unsafe question.