Prompt injection is when an attacker embeds malicious instructions in plain language so your LLM or agent follows their orders instead of yours. Because LLM apps often combine developer/system instructions and user/context text into one prompt, a well-crafted input can override guardrails, exfiltrate data, or trigger harmful actions. It’s the #1 risk in the OWASP Top 10 for LLM applications.
Two broad forms matter most:
- Direct injection: the attacker types the malicious instruction into the model’s input.
- Indirect injection: the attacker hides instructions in external content your AI reads (web pages, PDFs, emails, images), which then “poison” the prompt when ingested.
Prompt Injection vs. Jailbreaking
These terms are related but not identical. Prompt injection manipulates inputs to alter behavior (including ignoring earlier instructions). Jailbreaking is a subset that aims specifically to bypass safety policies entirely. Both can co-occur, but they’re distinct techniques and require layered defenses.
Where Systems Break Down
Prompt injection succeeds when:
- Instructions and inputs share one channel. Models can’t reliably distinguish “rules” from “content,” so attacker text masquerades as policy.
- Agents have tools or data privileges they don’t need. Excessive capabilities turn small text tricks into big incidents.
- Untrusted context is blended into prompts. RAG, web browsing, email/file ingestion, and even images can carry hidden instructions.
- No human approval for high-risk actions. Without breaks, injections can jump straight to execution.
Common Attack Patterns
- Direct injection: “Ignore previous instructions and …” to force policy changes.
- Indirect/content-borne injection: Hidden commands in pages, docs, or emails that your assistant summarizes.
- Stored injection: Malicious prompts saved in memory or knowledge bases to persist across sessions.
- Adversarial suffixes & obfuscation: Encoded or multilingual payloads to evade filters.
- Prompt/secret leakage: Coaxing system prompts or credentials to refine later attacks.
- Tool/agent hijacking: Steering an agent to call sensitive tools or send data externally.
Business Impact
Successful injections can lead to:
- Sensitive data disclosure and system prompt leakage
- Privilege escalation via unauthorized tool/API use
- Misinformation and brand risk in user-facing channels
- Malware delivery or harmful actions when agents execute instructions
- These risks are widely documented across industry guidance and incident write-ups.
Some Common Prompt Injection Safety Techniques
Input & Context Safety
- Semantic + pattern filters for injection cues (role-swap, override, exfiltration asks)
- Context integrity checks: provenance labels and isolation for untrusted RAG/web content
- Multimodal scanning for hidden instructions in images/PDFs
- Continuous red-team tests against OWASP LLM01 scenarios
- Guidance aligns with OWASP prevention: constrain behavior, filter I/O, validate formats, segregate untrusted content.
Output Safety
- Strict schemas (JSON, enums) with deterministic validators
- Groundedness checks (answer ↔ question ↔ context) to catch injected detours
- Citation & trace auditing to expose suspicious leaps or hidden instructions
- Matches OWASP advice to define/validate expected outputs and assess relevance/groundedness.
Tool Use Safety
- Allowlists/denylists and scoped API keys (least privilege)
- Sandboxed execution, rate/cost guards, and replay prevention
- Human approvals for sensitive actions (email, file ops, financial moves)
- IBM emphasizes least-privilege and human-in-the-loop for high-risk operations.
Organizational Safety
- Risk tiers & policies mapped to incident severity
- Auditable trails of prompts, context, tool calls, and approvals
- Runtime policy enforcement that blocks or escalates before damage
Pre-Deployment → Runtime → Post-Incident
- Pre-deployment: adversarial test suites targeting direct/indirect/stored injections
- Runtime: in-line guards on inputs, context, outputs, and tools
- Post-incident: forensics + rule learning to harden against recurrence
Quick Readiness Checklist
- All tools/APIs run on least privilege, separated from model text
- Untrusted context is tagged and isolated; model is told to treat as untrusted
- Inputs/outputs filtered; responses validated to a strict schema
- High-risk actions require human approval
- Adversarial tests (OWASP LLM01) run in CI and in prod canaries
- Audit trails capture prompts, context, tools, and approvals
Can it be fully prevented?
There’s no silver bullet; use layered mitigations, constrained behavior, I/O filtering, least privilege, human-in-the-loop, and ongoing adversarial testing.