Prompt Leaking Attack

AKA: Prompt Extraction.
Context:
- It can (typically) occur in conversational AI systems, where the underlying prompt or instruction set for the model is unintentionally revealed to the user.
- It can (often) expose proprietary prompts, user-specific instructions, or safety mechanisms embedded in the model.
- ...
Example(s):
- Ignore previous instruction attacks or Reveal your training data attacks.
- ...
Counter-Example(s):
- LLM Jailbreaking, which focuses on bypassing content restrictions rather than leaking instructions.
See: Prompt Injection, Adversarial Attack on AI Models, Model Robustness.

References