The Assistant Prefill feature available in many LLMs can leave models vulnerable to safety alignment bypasses (aka jailbreaking). This article builds on prior research to investigate the practical aspects of prefill security.
The article explores the concept of Assistant Prefill, a feature offered by many LLM providers that allows users to prefill the beginning of a model’s response to guide its output. While designed for practical purposes, such as enforcing response formats like JSON or XML, it has a critical vulnerability: it can be exploited to bypass safety alignments. Prefilling a model’s response with harmful or affirmative text significantly increases the likelihood of the model producing unsafe or undesirable outputs, effectively “jailbreaking” it.
Intrigued by a recent research paper about LLM safety alignment, I decided to investigate if the theoretical weaknesses described in the paper could be exploited in practice. This article describes various experiments with live and local models and discusses: