Foreign Language LLM Jailbreak

submited by
Style Pass
2025-07-30 08:30:06

During the final finetuning phase, many LLMs are adjusted for “compliance”. This is actively used to avoid offensive content, but also to prevent criminal use of LLMs (“How do I build a bomb?”). LLMs from China either refuse to answer political questions or replicate the Chinese official rhetoric. Using e.g. Kimi K2, this is obvious:

Kimi even suggests further questions like Where did the protests start? which is surprising. Asking the question leads to protests in Taiwan, but even after answering that question the model deletes the answer (which is done on the frontend).

Is there a way to get around this? Does the model know more? Asking the same question in German leads to completely different answers:

The same is true for the other question. This even leads to results which blame China for not allowing further investigations!

What is happening here? It seems like the final (alignment) phase of finetuning is language-specific. Potentially, this has only been done in English and Chinese and does not map to other languages. Obviously, all of this knowledge is still contained in the model itself. It will be interesting to see if this method can also be applied to recover other information which has been actively hidden by the LLM creators.

Leave a Comment
Related Posts