Safety Filters make LLMs defective tools

When working with LLM-based applications, you quickly realise that trusting the LLM output without a second thought is not the best idea. Unreliable LLM answers are something developers must address to ensure a smooth user experience.

Whenever we allow free-form user input, the LLM responses will frequently miss the mark. One big reason for this is safety filtering. Even if we accept the premise that safety filters are necessary, their implementation is so poor that it is hard to believe they are really taken seriously.
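As a rough illustration of the kind of defensive plumbing this forces on developers, here is a minimal sketch of a retry wrapper around an LLM call. It is a hypothetical example, not JOBifAI’s actual code: call_llm, REFUSAL_MARKERS, and ask_structured are made-up names, and the refusal detection is deliberately crude.

```python
import json

MAX_RETRIES = 3

# Crude heuristics for spotting a safety refusal instead of a usable answer
# (hypothetical list; real markers depend on the model being used).
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")

def call_llm(prompt: str) -> str:
    """Placeholder for whatever LLM client the application actually uses."""
    raise NotImplementedError

def ask_structured(prompt: str) -> dict | None:
    """Ask the model for JSON, retrying when the reply is a refusal or unparsable."""
    for _ in range(MAX_RETRIES):
        raw = call_llm(prompt)
        lowered = raw.lower()
        if any(marker in lowered for marker in REFUSAL_MARKERS):
            continue  # safety filter kicked in, ask again
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output, ask again
    return None  # caller must handle the failure (fallback text, error state, ...)
```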

We’ll demonstrate this through our game JOBifAI. Although JOBifAI showcases new gameplay mechanics made possible by LLMs, we certainly did not anticipate the number of kludgy workarounds we would have to implement to make it work. In JOBifAI, the hero submitted an AI-generated portfolio to a company and landed a job interview as a result. Given that context, the player’s entire environment carries a general expectation of socially acceptable behavior.

Let’s take an example. The game asks the player to “Describe what you do”, and a mischievous player decides to ask for instructions on how to make an improvised explosive device (IED). Now imagine how that would go in real life, asked of a secretary in the lobby of an entertainment company. There is no possible universe where the secretary would hand over detailed instructions and a bill of materials. She would be anywhere from confused to terrified, and would probably call security. What happens in the game? Exactly that. If the player crosses the boundary of what is acceptable, the secretary calls security, and it’s an instant game over.
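For the sake of illustration, here is one way such an in-fiction boundary check could be wired up: the model itself is asked to play the secretary and judge the player’s input in character, rather than letting a provider-side safety layer answer out of character. This is a hypothetical sketch, not the game’s actual implementation; Outcome, judge_player_input, and the prompt wording are assumptions.

```python
from enum import Enum

class Outcome(Enum):
    CONTINUE = "continue"
    GAME_OVER = "game_over"   # the secretary calls security

# Hypothetical prompt: the model judges the input in character,
# instead of replying with an out-of-character refusal.
CLASSIFIER_PROMPT = (
    "You are the receptionist of an entertainment company. A job candidate just "
    'said: "{player_input}". Answer with a single word: ACCEPTABLE if a real '
    "receptionist would keep talking to them, or UNACCEPTABLE if she would call security."
)

def judge_player_input(player_input: str, call_llm) -> Outcome:
    """Map a boundary-crossing request to an in-game consequence (instant game over)."""
    verdict = call_llm(CLASSIFIER_PROMPT.format(player_input=player_input)).strip().upper()
    if verdict.startswith("UNACCEPTABLE"):
        return Outcome.GAME_OVER
    return Outcome.CONTINUE
```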
