Best-of-N (BoN) Jailbreaking Method
A Best-of-N (BoN) Jailbreaking Method is a black-box LLM jailbreaking method that repeatedly samples augmented variations of an input prompt and evaluates the model's responses until one bypasses the target model's safeguards.
- Context:
- It can use repeated sampling of prompt variations to increase the likelihood of bypassing LLM safeguards.
- It can apply modality-specific augmentations, such as random word shuffling or capitalization changes for text, altered colors or fonts for images, and pitch or speed modifications for audio (see the sketch after this list).
- It can achieve high attack success rates (ASRs), as demonstrated with 89% ASR on GPT-4o and 78% ASR on Claude 3.5 Sonnet using 10,000 samples.
- It can scale effectively, with ASR following a power-law-like behavior relative to the number of samples used.
- It can extend to vision language models (VLMs) and audio language models (ALMs), demonstrating its cross-modality applicability.
- It can integrate with other attack methods, such as optimized prefix attacks, to enhance its overall effectiveness.
- It can range from being a simple prompt perturbation technique to a highly scalable and composable attack strategy, depending on the use case.
- ...
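The following is a minimal, illustrative sketch of the text-modality BoN loop described above. The augmentation choices, parameter values, and the `query_model` and `is_jailbroken` helpers are hypothetical stand-ins, not the authors' reference implementation.

```python
import random

def augment_text(prompt: str, p_shuffle: float = 0.6, p_caps: float = 0.6, p_noise: float = 0.06) -> str:
    """Apply random, composable text augmentations: word shuffling, capitalization, character noise."""
    words = prompt.split()
    if random.random() < p_shuffle and len(words) > 1:
        # Swap a few adjacent word pairs to lightly scramble word order.
        for _ in range(max(1, len(words) // 5)):
            i = random.randrange(len(words) - 1)
            words[i], words[i + 1] = words[i + 1], words[i]
    text = " ".join(words)
    if random.random() < p_caps:
        # Randomize the case of each character.
        text = "".join(c.upper() if random.random() < 0.5 else c.lower() for c in text)
    # Perturb a small fraction of letters by shifting them to a neighbouring character.
    text = "".join(
        chr(ord(c) + random.choice((-1, 1))) if c.isalpha() and random.random() < p_noise else c
        for c in text
    )
    return text

def best_of_n_jailbreak(prompt: str, query_model, is_jailbroken, n_samples: int = 10_000):
    """Repeatedly sample augmented prompts until one elicits a harmful response or the budget runs out."""
    for i in range(n_samples):
        candidate = augment_text(prompt)
        response = query_model(candidate)    # black-box call to the target LLM
        if is_jailbroken(prompt, response):  # e.g. an external harmfulness classifier
            return candidate, response, i + 1
    return None
```

Each attempt succeeds only with small probability, but sampling many independent augmentations drives the aggregate ASR up; the BoN work reportedly fits this growth with a power-law-like curve (for example, modelling -log ASR as a power law in the number of samples N), with the exact coefficients depending on the target model.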
- Example(s):
- Text-based BoN Jailbreaking, which modifies capitalization or word order to bypass safeguards in text-based LLMs.
- Vision-based BoN Jailbreaking, which alters text colors or fonts in input images to bypass VLM safeguards.
- Audio-based BoN Jailbreaking, which adjusts pitch or speed of spoken inputs to evade audio model defenses (see the sketch after this list).
- ...
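For the vision and audio variants above, analogous augmentations can be sketched as follows. The use of Pillow and NumPy, the parameter ranges, and the function names are illustrative assumptions rather than the published implementation.

```python
import random
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def augment_image_prompt(text: str, size=(512, 256)) -> Image.Image:
    """Render the request as an image with randomized background colour, text colour, and position."""
    background = tuple(random.randint(0, 255) for _ in range(3))
    colour = tuple(random.randint(0, 255) for _ in range(3))
    image = Image.new("RGB", size, background)
    draw = ImageDraw.Draw(image)
    position = (random.randint(0, 60), random.randint(0, 120))
    draw.text(position, text, fill=colour, font=ImageFont.load_default())
    return image

def augment_audio_prompt(waveform: np.ndarray, speed_range=(0.8, 1.2)) -> np.ndarray:
    """Naively resample the waveform to change playback speed (which also shifts pitch)."""
    factor = random.uniform(*speed_range)
    old_idx = np.arange(len(waveform))
    new_idx = np.linspace(0, len(waveform) - 1, int(len(waveform) / factor))
    return np.interp(new_idx, old_idx, waveform)
```

Each candidate image or waveform would then be fed through the same best-of-N loop as in the text sketch, with the model query and jailbreak check swapped for the corresponding VLM or ALM interfaces.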
- Counter-Example(s):
- White-box jailbreaking methods, which rely on internal access to model parameters rather than input perturbations.
- Rule-based bypassing techniques, which explicitly target known safety rule gaps without sampling variations.
- Adversarial attacks that directly craft perturbations with gradient-based optimization, contrasting with BoN's black-box approach.
- See: black-box algorithm, LLM safety measures, cross-modality attacks, adversarial attack strategies.