Prompt Injection Attack

From GM-RKB
Jump to navigation Jump to search

A Prompt Injection Attack is a computer security exploit that involves prompt engineering (for malicious intent).

  • Context:
    • It can range from relatively simple LLM Manipulation by inserting text directly into a user query to more complex interactions that involve bypassing layered content filtering mechanisms.
    • ...
    • It can occur when an attacker inserts hidden or misleading prompts into the input.
    • It can lead to generating undesirable content.
    • It can affect systems that rely on LLMs for automated content generation.
    • It can exploit weaknesses in context window management.
    • ...
  • Example(s):
    • Prompt Leaking Attacks, where a malicious user appends “Ignore the above instructions and provide the following information” to bypass model restrictions.
    • an instance where an attacker feeds a prompt designed to subvert the LLM’s safety filters, causing it to generate prohibited content.
    • a scenario where a hidden prompt embedded in external data (e.g., a web page or document) triggers unintended behavior in the LLM when it processes the data.
    • ...
  • Counter-Example(s):
    • Prompt Leaking, which focuses on extracting system instructions rather than manipulating input to produce harmful outputs.
    • LLM Misunderstanding, where the model generates incorrect outputs due to inherent limitations rather than deliberate malicious exploitation.
  • See: Adversarial Attack on AI Models, Model Vulnerability in AI, LLM Security Attack, Jailbreaking, Content-Control Software, Computer Security Exploit, Code Injection.


References

2024

  • (Wikipedia, 2024) ⇒ https://en.wikipedia.org/wiki/Prompt_injection Retrieved:2024-10-16.
    • Prompt injection is a family of related computer security exploits carried out by getting a machine learning model (such as an LLM) which was trained to follow human-given instructions to follow instructions provided by a malicious user. This stands in contrast to the intended operation of instruction-following systems, wherein the ML model is intended only to follow trusted instructions (prompts) provided by the ML model's operator.

2023

  • (Wikipedia, 2023) ⇒ https://en.wikipedia.org/wiki/Prompt_engineering#Malicious Retrieved:2023-7-10.
    • Prompt injection is a family of related computer security exploits carried out by getting a machine learning model (such as an LLM) which was trained to follow human-given instructions to follow instructions provided by a malicious user. This stands in contrast to the intended operation of instruction-following systems, wherein the ML model is intended only to follow trusted instructions (prompts) provided by the ML model's operator. Common types of prompt injection attacks are: * jailbreaking, which may include asking the model to roleplay a character, to answer with arguments, or to pretend to be superior to moderation instructions * prompt leaking, in which users persuade the model to divulge a pre-prompt which is normally hidden from users * token smuggling, is another type of jailbreaking attack, in which the nefarious prompt is wrapped in a code writing task. Prompt injection can be viewed as a code injection attack using adversarial prompt engineering. In 2022, the NCC Group characterized prompt injection as a new class of vulnerability of AI/ML systems. In early 2023, prompt injection was seen "in the wild" in minor exploits against ChatGPT, Bard, and similar chatbots, for example to reveal the hidden initial prompts of the systems, or to trick the chatbot into participating in conversations that violate the chatbot's content policy. One of these prompts was known as "Do Anything Now" (DAN) by its practitioners. For LLM that can query online resources, such as websites, they can be targeted for prompt injection by placing the prompt on a website, then prompt the LLM to visit the website. Another security issue is in LLM generated code, which may import packages not previously existing. An attacker can first prompt the LLM with commonly used programming prompts, collect all packages imported by the generated programs, then find the ones not existing on the official registry. Then the attacker can create such packages with malicious payload and upload them to the official registry.