Like many difficult cyber security problems, prompt-injection attacks is likely to become an ongoing issue that shifts and turns with the continual discovery of new attacks and new defences going forward. Instead of responding in natural language given a prompt, the best current defence I know involves always generating code, say, in a safe interpreted programming language and then subjecting that code to all the cyber-security tools we have like static / dynamic code analysis and information flow / leakage analysis before executing the code in a locked-down execution environment.
You can see the core idea quickly from this post: https://simonwillison.net/2025/Apr/11/camel/ The Camel paper from DeepMind linked in that blog post has interesting details, as do some of the recent papers by Michael Costa, who worked on secure hardware enclaves at Microsoft. One such paper I read and can recommend is https://arxiv.org/abs/2505.23643.
And, of course, the usual data integrity techniques like having documents digitally signed can help, but they are not fool-proof.
Here are two reasons why I think it’s futile / dangerous to try to address this problem by staying in natural-language land:
- This blog post https://www.schneier.com/blog/archives/2025/09/gpt-4o-mini-falls-for-psychological-manipulation.html shows that one can use psychological manipulation techniques (a la Robert Cialdini) to convince the current generation of LLMs to basically ignore the safeguards in their system prompts. This means no amount of system prompting, which are essentially all written in natural language — see https://github.com/guy915/System-Prompts for many examples — can provide safety guarantees no matter how explicitly something is forbidden in the system prompt.
- This blog post https://www.schneier.com/blog/archives/2025/08/we-are-still-unable-to-secure-llms-from-malicious-inputs.html shows how easy it is to write a malicious prompt in white text in a size-one font so that a human reader can’t see it but a machine will still process it. This means that human-in-the-loop is not a panacea either.
And, of course, things are going to get much worse with LLM agents. Here are two recent developments that ought to worry you:
In case the above is too depressing, it’s worth noting that challenging cyber-security problems can usually be risk-managed effectively using the Defence in Depth principle: install multiple lines of defence that require different attack vectors to break, so that any one single breach is usually contained because the probability of an adversary getting past _all_ the lines of defence is vanishingly small. That’s the theory, and it works reasonably well in practice.
So, as always, be alert but not alarmed. Hope the above helps.