Abstract
The rapid adoption of agentic systems powered by
large language models (LLMs) introduces significant
security challenges distinct from plain conversational
models, particularly concerning prompt injection and
tool misuse due to their dynamic personas and real-
world tool interactions. This paper investigates the
effectiveness of hardened security prompting in a
task-oriented multi-agent framework, using a coding
assistant as a representative case study. We com-
pare a baseline ”unhardened” agent against a ”hard-
ened” version equipped with explicit security guide-
lines applied across all sub-agents. Our evaluation
across 150+ single-turn and 32 multi-turn attack
scenarios shows that prompt hardening substantially
improves resilience. With a simple security hardener
of approximately 500 tokens, the single-turn failure
rate dropped from 19.48% to 2.60%, and the multi-turn
failure rate dropped from 75.00% to 46.88%.
Furthermore, we show that successfully bypassing the
hardened agent requires significantly more adversarial
effort and a greater number of chat turns. However,
the analysis also reveals a critical shift in the
vulnerability taxonomy: as direct attacks fail,
adversaries exploit the agent’s core functionality via
“Functional Wrappers” (Intent Obfuscation). This
residual risk calls for moving the defensive paradigm
from static filters to dynamic analysis of runtime
state and intent.