Thoughts on Prompt Injection Attacks

Like many difficult cyber-security problems, prompt injection is likely to remain an ongoing issue, shifting and turning as new attacks and new defences are discovered. The best current defence I know of is to stop responding in natural language altogether: given a prompt, always generate code, say, in a safe interpreted programming language, then subject that code to all the cyber-security tools we have, static / dynamic code analysis and information flow / leakage analysis among them, before executing it in a locked-down environment.
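
To make the shape of that pipeline concrete, here is a minimal Python sketch. Everything in it is hypothetical: the tool names, the allowlist, and the toy AST walk, which stands in for real static / dynamic analysis and a proper OS-level sandbox.

```python
import ast

# Hypothetical sketch: vet LLM-generated code before running it.  The tool
# names and the allowlist are made up, and this toy AST walk stands in for
# real static / dynamic analysis plus an OS-level sandbox.

ALLOWED_CALLS = {"search_docs", "summarise"}   # the only tools a plan may invoke

def vet(plan_source: str) -> ast.Module:
    """Reject plans containing anything but simple calls to allowed tools."""
    tree = ast.parse(plan_source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom, ast.Attribute)):
            raise ValueError("imports and attribute access are not allowed")
        if isinstance(node, ast.Call):
            if not (isinstance(node.func, ast.Name)
                    and node.func.id in ALLOWED_CALLS):
                raise ValueError("call to a non-allowlisted function")
    return tree

def run(plan_source: str, tools: dict) -> None:
    tree = vet(plan_source)
    # Empty __builtins__ so the plan can reach nothing except the injected tools.
    exec(compile(tree, "<plan>", "exec"), {"__builtins__": {}, **tools})

run("search_docs('meeting notes')", {"search_docs": print, "summarise": print})
```

The point is the ordering: nothing the model emits gets executed until it has passed the same kind of vetting we would apply to any other untrusted code.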

You can see the core idea quickly from this post: https://simonwillison.net/2025/Apr/11/camel/ The CaMeL paper from DeepMind linked in that blog post has interesting details, as do some of the recent papers by Manuel Costa, who worked on secure hardware enclaves at Microsoft. One such paper I read and can recommend is https://arxiv.org/abs/2505.23643.
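
The gist, loosely paraphrased: a privileged model that sees only the trusted user request writes the plan, a quarantined model reads the untrusted data, and every value derived from that data carries a taint that security policies check at the tool boundary. Here is a toy illustration of the taint-checking half, my own sketch rather than the paper's actual design:

```python
from dataclasses import dataclass

# My own toy illustration of taint tracking at a tool boundary, not the
# paper's actual design: values derived from untrusted text carry a marker,
# and security policies decide what a marked value may and may not do.

@dataclass(frozen=True)
class Tainted:
    value: str  # content that came from an untrusted source

def send_email(to, body) -> None:
    # Policy check: an attacker-controlled document must not pick the recipient.
    if isinstance(to, Tainted):
        raise PermissionError("untrusted data cannot choose the recipient")
    body_text = body.value if isinstance(body, Tainted) else body
    print(f"email to {to}: {body_text}")

summary = Tainted("text extracted from an attacker-controlled web page")
send_email("boss@example.com", summary)  # fine: the tainted value only fills the body
try:
    send_email(summary, "hi")            # blocked: taint cannot steer control flow
except PermissionError as err:
    print(err)
```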

And, of course, the usual data-integrity techniques, such as having documents digitally signed, can help, but they are not foolproof.
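
As a sketch of what such an integrity check buys you, here is a toy Python example. It uses an HMAC with an assumed shared secret in place of a real public-key signature scheme, and the pipeline refuses to prompt the model with any document that fails verification:

```python
import hashlib
import hmac

# Toy integrity gate: refuse to prompt the model with a document whose tag
# fails verification.  An HMAC with an assumed shared secret stands in for
# a real public-key signature scheme here.

SECRET = b"key-held-by-the-trusted-publisher"   # hypothetical

def tag(document: bytes) -> str:
    return hmac.new(SECRET, document, hashlib.sha256).hexdigest()

def verify_before_prompting(document: bytes, claimed_tag: str) -> bytes:
    if not hmac.compare_digest(tag(document), claimed_tag):
        raise ValueError("integrity check failed; refusing to prompt the model")
    return document

doc = b"quarterly report contents"
signed = tag(doc)                        # produced at publication time
verify_before_prompting(doc, signed)     # passes; a tampered copy would raise
```

This also shows the limit of the defence: a valid tag only tells you who published the document, not that its contents are free of injected instructions.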

Here are two reasons why I think it’s futile / dangerous to try to address this problem by staying in natural-language land:

And, of course, things are going to get much worse with LLM agents. Here are two recent developments that ought to worry you:

In case the above is too depressing, it’s worth noting that challenging cyber-security problems can usually be risk-managed effectively using the Defence in Depth principle: install multiple lines of defence that each require a different attack vector to break, so that any single breach is usually contained, because the probability of an adversary getting past _all_ the lines of defence is vanishingly small. That’s the theory, and it works reasonably well in practice.
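
A back-of-envelope calculation shows why, under the illustrative assumption that each layer independently stops an attacker nine times out of ten:

```python
# Illustrative defence-in-depth arithmetic: if each independent layer is
# breached with probability p, all n layers fall with probability p**n.
p, n = 0.1, 4
print(f"P(full breach) = {p**n:.4%}")   # 0.0100% for four 90%-effective layers
```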

So, as always, be alert but not alarmed. Hope the above helps.

