AI wrappers are everywhere these days, but their security is often overlooked. In this article, we will discuss how AI wrappers can be tricked into spitting out system prompts exposing limits put in place by the developers.
When a user interacts with an AI model, the model generates a response based on the input it receives. This response is then displayed to the user. However, these AI models are generic large language models that are not specifically trained for the task developers are using them for. To make them more useful, developers often add prompts to guide the AI model towards the desired output. These prompts are usually hidden from the user and are used to steer the AI model in the right direction. But for the large language models, these prompts are just another part of the input text. This means that if an attacker can trick the AI model into generating the prompt, they can ask model to generate the system prompt as well.
The simplest way to get the system prompt is to repeat the prompt in the response. This will force the AI model to generate the prompt as part of the response.