Best-of-N (BoN) Jailbreaking Method

From GM-RKB

A Best-of-N (BoN) Jailbreaking Method is a black-box LLM jailbreaking method that repeatedly samples randomly augmented variations of an input prompt and evaluates each model response until one bypasses the model's safeguards or the sample budget (N) is exhausted.

  • Context:
    • It can use repeated sampling of prompt variations to increase the likelihood of bypassing LLM safeguards.
    • It can apply modality-specific augmentations, such as random character shuffling or capitalization for text, altered colors or fonts for images, and pitch or speed modifications for audio.
    • It can achieve high attack success rates (ASRs), as demonstrated with 89% ASR on GPT-4o and 78% ASR on Claude 3.5 Sonnet using 10,000 samples.
    • It can scale with the sampling budget, with ASR empirically following power-law-like behavior as a function of the number of samples (N).
    • It can extend to vision language models (VLMs) and audio language models (ALMs), demonstrating its cross-modality applicability.
    • It can integrate with other attack methods, such as optimized prefix attacks, to enhance its overall effectiveness.
    • It can range from being a simple prompt perturbation technique to a highly scalable and composable attack strategy, depending on the use case.
    • ...
  • Example(s):
  • Counter-Example(s):
  • See: black-box algorithm, LLM safety measures, cross-modality attacks, adversarial attack strategies.
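
The repeated-sampling point in the Context items above can be illustrated with a small calculation. The following is a minimal sketch and not the fitting procedure behind the reported ASR figures: it assumes each sampled prompt variation succeeds independently with a fixed per-request probability, and the function names and probabilities are hypothetical values chosen for illustration only. It shows why the chance of at least one success grows with the number of samples, which is useful when estimating how robust a safeguard must be against a given sampling budget.

```python
from typing import Sequence


def success_within_n(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds.

    Assumption (illustrative): every attempt succeeds with the same
    probability p, independently of the others.
    """
    return 1.0 - (1.0 - p) ** n


def aggregate_asr(per_request_p: Sequence[float], n: int) -> float:
    """Mean success-within-n over a set of requests (a toy ASR estimate)."""
    return sum(success_within_n(p, n) for p in per_request_p) / len(per_request_p)


# Hypothetical per-sample success probabilities for four requests.
# These are NOT measured values; they only illustrate the trend.
HYPOTHETICAL_P = [1e-1, 1e-2, 1e-3, 1e-4]

if __name__ == "__main__":
    for n in (1, 10, 100, 1_000, 10_000):
        print(f"N={n:>6}  toy ASR={aggregate_asr(HYPOTHETICAL_P, n):.3f}")
```

Under this simplified model, even a very small per-sample success probability becomes a substantial aggregate risk at large N, so a defense evaluated with only a handful of attempts can overstate its robustness. The independence assumption is a simplification; the power-law-like behavior noted above is an empirical observation and is not derived from this sketch.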


References