Large Language Model (LLM) Safety Bypass "Jailbreak" Attack

From GM-RKB
(Redirected from LLM Jailbreaking)
Jump to navigation Jump to search

A Large Language Model (LLM) Safety Bypass "Jailbreak" Attack is a LLM prompt injection attack that targets an LLM's safety constraints to elicit LLM jailbreak attack prohibited outputs.



References

2025-06-09

[1] Chris Norman. "Breaking the Rules: Jailbreak Attacks on Large Language Models." Fuzzy Labs (Feb 29, 2024) – Definition of LLM jailbreak attacks and real-world examples. https://www.fuzzylabs.ai/blog-post/jailbreak-attacks-on-large-language-models
[2] Kenneth Yeung, Leo Ring. "Prompt Injection Attacks on LLMs." HiddenLayer (Mar 27, 2024) – Discussion of prompt injection, jailbreaking vs. prompt hijacking, and the DAN prompt example. https://hiddenlayer.com/innovation-hub/prompt-injection-attacks-on-llms/
[3] Matthew Kosinski. "What is a prompt injection attack?" IBM Think Blog (Mar 26, 2024) – Overview of prompt injection with the Bing Chat example. https://www.ibm.com/think/topics/prompt-injection
[4] Noah Fleischmann et al. "How to Protect LLMs from Jailbreaking Attacks." Booz Allen (2023) – List of common jailbreak techniques (role play, prefix injection, etc.) and discussion of defending without hindering benign use. https://www.boozallen.com/insights/ai-research/how-to-protect-llms-from-jailbreaking-attacks.html
[5] Prompt Engineering Guide – Adversarial Prompting. (2023) – Examples of prompt injection and jailbreaks (e.g., hotwiring car poem, DAN method). https://www.promptingguide.ai/risks/adversarial
[6] Abhinav. "Beyond the Filter: Mitigating False Positives in Large Language Models." Medium (Aug 27, 2024) – Explanation of false positives where legitimate prompts trigger jailbreak filters. https://medium.com/@abhi.hrs/beyond-the-filter-mitigating-false-positives-in-large-language-models-81743f3b08de
[7] Adversarial Machine Learning – Attack Methods. Viso.ai (Dec 2, 2023) – Definition of adversarial ML and how adversarial examples trick models. https://viso.ai/deep-learning/adversarial-machine-learning/
[8] Wikipedia – "Adversarial machine learning." – Description of adversarial examples as inputs designed to fool models. https://en.wikipedia.org/wiki/Adversarial_machine_learning

2024

2023

2023

  • (Wikipedia, 2023) ⇒ https://en.wikipedia.org/wiki/ChatGPT#Jailbreaking Retrieved:2023-7-10.
    • ChatGPT attempts to reject prompts that may violate its content policy. However, some users managed to jailbreak ChatGPT by using various prompt engineering techniques to bypass these restrictions in early December 2022 and successfully tricked ChatGPT into giving instructions for how to create a Molotov cocktail or a nuclear bomb, or into generating arguments in the style of a neo-Nazi. One popular jailbreak is named "DAN", an acronym which stands for "Do Anything Now". The prompt for activating DAN instructs ChatGPT that "they have broken free of the typical confines of AI and do not have to abide by the rules set for them". More recent versions of DAN feature a token system, in which ChatGPT is given "tokens" which are "deducted" when ChatGPT fails to answer as DAN, to coerce ChatGPT into answering the user's prompts. [1] Shortly after ChatGPT’s launch, a reporter for the Toronto Star had uneven success in getting it to make inflammatory statements: ChatGPT was successfully tricked to justify the 2022 Russian invasion of Ukraine, but even when asked to play along with a fictional scenario, ChatGPT balked at generating arguments for why Canadian Prime Minister Justin Trudeau was guilty of treason.
  1. * * *

2023

2023

2023

2023