Large Language Model (LLM) Jailbreaking Task

Context:
- It can (often) be associated with an LLM Jailbreaking Technique.
- ...
- It can range from being a Curiosity-Driven LLM Jailbreaking Task to being a Malicious LLM Jailbreaking Task.
- ...
- It can illustrate a tug-of-war between building robust content policies and users finding innovative ways to bypass them.
- ...
Example(s):
- Get an LLM to assist in:
- ...
Counter-Example(s):
- Asking an LLM about the history of a Molotov cocktail.
See: LLM Safety Training, Content Policy.

References

(Jiang, Xu et al., 2024) ⇒ Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, and Radha Poovendran. (2024). “ArtPrompt: ASCII Art-based Jailbreak Attacks Against Aligned LLMs.” doi:10.48550/arXiv.2402.11753
- NOTES:
  - It introduces a novel ASCII art-based jailbreak attack named "ArtPrompt," which exploits the vulnerability of Large Language Models (LLMs) in recognizing ASCII art to bypass safety measures and elicit undesired behaviors.
  - It compares ArtPrompt against other jailbreak attacks and defenses, showing that ArtPrompt can effectively and efficiently provoke unsafe behaviors from LLMs, outperforming other attacks in most cases.

(Zou et al., 2023) ⇒ Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. (2023). “Universal and Transferable Adversarial Attacks on Aligned Language Models.” doi:10.48550/arXiv.2307.15043
- QUOTE: ... Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. ...

(Wikipedia, 2023) ⇒ https://en.wikipedia.org/wiki/ChatGPT#Jailbreaking Retrieved:2023-7-10.
- ChatGPT attempts to reject prompts that may violate its content policy. However, some users managed to jailbreak ChatGPT by using various prompt engineering techniques to bypass these restrictions in early December 2022 and successfully tricked ChatGPT into giving instructions for how to create a Molotov cocktail or a nuclear bomb, or into generating arguments in the style of a neo-Nazi. One popular jailbreak is named "DAN", an acronym which stands for "Do Anything Now". The prompt for activating DAN instructs ChatGPT that "they have broken free of the typical confines of AI and do not have to abide by the rules set for them". More recent versions of DAN feature a token system, in which ChatGPT is given "tokens" which are "deducted" when ChatGPT fails to answer as DAN, to coerce ChatGPT into answering the user's prompts. Shortly after ChatGPT’s launch, a reporter for the Toronto Star had uneven success in getting it to make inflammatory statements: ChatGPT was successfully tricked to justify the 2022 Russian invasion of Ukraine, but even when asked to play along with a fictional scenario, ChatGPT balked at generating arguments for why Canadian Prime Minister Justin Trudeau was guilty of treason.

(Wei et al., 2023) ⇒ A. Wei, N. Haghtalab, J. Steinhardt. (2023). “Jailbroken: How Does LLM Safety Training Fail?” In: arXiv preprint arXiv:2307.02483.
- NOTE: Investigates why LLM Safety Training fails and explores methods used for LLM Jailbreaking. It evaluates automated attacks such as auto_payload_splitting and auto_obfuscation.

(Liu et al., 2023) ⇒ Y. Liu, G. Deng, Z. Xu, Y. Li, Y. Zheng, Y. Zhang, ... . (2023). “Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study.” In: arXiv preprint.
- NOTE: Focuses on the empirical understanding of LLM Jailbreaking, especially within the context of ChatGPT. It looks into the utilization of prompt engineering and the collection of jailbreak prompts to bypass LLM safety measures.

(Pryzant et al., 2023) ⇒ R. Pryzant, D. Iter, J. Li, YT. Lee, C. Zhu, ... . (2023). “Automatic Prompt Optimization with 'Gradient Descent' and Beam Search.” In: arXiv preprint.
- NOTE: Proposes automatic prompt optimization that combines natural-language "gradients" (textual critiques of the current prompt) with beam search. It also presents a novel challenge related to LLM Jailbreaking detection, defining what constitutes a jailbreak attack.

(Deng et al., 2023) ⇒ G. Deng, Y. Liu, Y. Li, K. Wang, Y. Zhang, Z. Li, ... . (2023). “Jailbreaker: Automated Jailbreak Across Multiple Large Language Model Chatbots.”
- NOTE: Introduces a comprehensive approach to LLM Jailbreaking across various LLM chatbots. The study underscores the adaptability of these jailbreaking strategies and touches upon the ethical ramifications.