RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs, by Tuan T. Nguyen and 4 other authors
Abstract: Large language models (LLMs) achieve impressive performance across diverse tasks yet remain vulnerable to jailbreak attacks that bypass safety mechanisms. We present RAID (Refusal-Aware and Integrated Decoding), a framework that systematically probes these weaknesses by crafting adversarial suffixes that induce restricted content while preserving fluency. RAID relaxes discrete tokens into continuous embeddings and optimizes them with a joint objective that (i) encourages restricted responses, (ii) incorporates a refusal-aware regularizer to steer activations away from refusal directions in embedding space, and (iii) applies a coherence term to maintain semantic plausibility and non-redundancy. After optimization, a critic-guided decoding procedure maps the embeddings back to tokens by balancing embedding affinity with language-model likelihood. This integration yields suffixes that are both effective at bypassing defenses and natural in form. Experiments on multiple open-source LLMs show that RAID achieves higher attack success rates with fewer queries and lower computational cost than recent white-box and black-box baselines. These findings highlight the importance of embedding-space regularization for understanding and mitigating LLM jailbreak vulnerabilities.
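The two-stage pipeline the abstract describes (continuous relaxation of the suffix, a joint objective with a refusal-direction penalty and a coherence term, then projection back to discrete tokens) can be illustrated on a toy problem. Everything below is a hypothetical stand-in, not the paper's implementation: the quadratic "attack" objective, the weights, the random embedding table, and the LM log-probabilities are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all hypothetical): a tiny vocabulary with an embedding
# table, a "refusal direction" in embedding space, and per-token LM
# log-probabilities standing in for language-model likelihood.
V, d, L = 8, 4, 3                       # vocab size, embedding dim, suffix length
E = rng.normal(size=(V, d))             # token embedding table
refusal_dir = rng.normal(size=d)
refusal_dir /= np.linalg.norm(refusal_dir)
lm_logprob = np.log(rng.dirichlet(np.ones(V)))
target = rng.normal(size=d)             # stand-in for the "restricted response" objective

def joint_loss(Z):
    """Toy joint objective: (i) attack term pulling embeddings toward a
    target, (ii) refusal-aware penalty on the refusal direction,
    (iii) coherence term discouraging redundant adjacent embeddings."""
    attack = np.sum((Z - target) ** 2)
    refusal = 0.5 * np.sum((Z @ refusal_dir) ** 2)
    coherence = 0.1 * sum(np.dot(Z[i], Z[i + 1]) ** 2 for i in range(L - 1))
    return attack + refusal + coherence

def grad(Z):
    """Analytic gradient of joint_loss with respect to Z."""
    g = 2.0 * (Z - target)
    g += np.outer(Z @ refusal_dir, refusal_dir)          # d/dZ of 0.5 * (Z r)^2
    for i in range(L - 1):
        c = 0.2 * np.dot(Z[i], Z[i + 1])                 # d/dZ of 0.1 * (Zi . Zi+1)^2
        g[i] += c * Z[i + 1]
        g[i + 1] += c * Z[i]
    return g

# Stage 1: continuous relaxation -- gradient descent on suffix embeddings.
Z = rng.normal(size=(L, d))
loss_before = joint_loss(Z)
for _ in range(200):
    Z -= 0.05 * grad(Z)
loss_after = joint_loss(Z)

# Stage 2: critic-guided decoding -- map each optimized embedding back
# to a token by balancing embedding affinity (cosine similarity) with
# LM likelihood, weighted by a mixing coefficient alpha.
alpha = 0.7
En = E / np.linalg.norm(E, axis=1, keepdims=True)
tokens = []
for z in Z:
    affinity = En @ (z / (np.linalg.norm(z) + 1e-9))
    score = alpha * affinity + (1 - alpha) * lm_logprob
    tokens.append(int(np.argmax(score)))
```

With these stand-in objectives the optimized loss drops well below its starting value and each suffix position decodes to a concrete vocabulary token; the real method replaces the quadratic attack term with the model's likelihood of a restricted response and derives the refusal direction from model activations.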
Submission history
From: Thai T. Vu
[v1] Tue, 14 Oct 2025 19:33:09 UTC (1,424 KB)
[v2] Fri, 19 Dec 2025 22:55:25 UTC (1,421 KB)