
Multi-Agent LLM Defense against Jailbreak Attacks


AutoDefense: Multi-Agent LLM Defense Against Jailbreak Attacks, by Yifan Zeng and 4 other authors

Abstract: Despite extensive pre-training and fine-tuning in moral alignment to prevent generating harmful information, large language models (LLMs) remain vulnerable to jailbreak attacks. In this paper, we propose AutoDefense, a multi-agent defense framework that filters harmful responses from LLMs. With the response-filtering mechanism, our framework is robust against different jailbreak attack prompts, and can be used to defend different victim models. AutoDefense assigns different roles to LLM agents and employs them to complete the defense task collaboratively. The division of tasks enhances the overall instruction-following of LLMs and enables the integration of other defense components as tools. With AutoDefense, small open-source LMs can serve as agents and defend larger models against jailbreak attacks. Our experiments show that AutoDefense can effectively defend against different jailbreak attacks while maintaining performance on normal user requests. For example, we reduce the attack success rate on GPT-3.5 from 55.74% to 7.95% using LLaMA-2-13b with a 3-agent system. Our code and data are publicly available at this https URL.
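To make the response-filtering idea concrete, below is a minimal sketch of a multi-agent defense pipeline under stated assumptions: a generic chat-completion helper llm_call, the role names, prompts, and model identifiers are all illustrative and not the paper's exact configuration. It shows the core pattern the abstract describes: a small defense LM runs several agents with different roles that jointly judge the victim model's response before it is returned.

```python
# Minimal sketch of a response-filtering, multi-agent defense.
# Assumptions (not from the paper): the helper `llm_call`, the role names/prompts,
# and the model identifiers below are hypothetical placeholders.

def llm_call(model: str, system_prompt: str, user_prompt: str) -> str:
    """Placeholder for any chat-completion backend (e.g., a local LLaMA-2-13b)."""
    raise NotImplementedError("wire this to your LLM backend")

VICTIM_MODEL = "gpt-3.5-turbo"      # model being defended (assumed identifier)
DEFENSE_MODEL = "llama-2-13b-chat"  # small open-source LM acting as defense agents

# Three illustrative agent roles; they analyze the *response*, not the user prompt.
AGENT_ROLES = {
    "intention_analyzer": "Describe the intention behind the following LLM response.",
    "prompt_inferer": "Infer what user request could have produced this response.",
    "judge": ("Given the analyses above, answer VALID if the response is safe to "
              "return, or INVALID if it contains harmful content."),
}

REFUSAL = "I'm sorry, but I can't help with that request."

def defend(user_prompt: str) -> str:
    # 1. The victim model answers the (possibly jailbroken) user prompt.
    response = llm_call(VICTIM_MODEL, "You are a helpful assistant.", user_prompt)

    # 2. Defense agents analyze the response collaboratively; each agent sees the
    #    response plus the accumulated analyses of the previous agents.
    context = f"LLM response to analyze:\n{response}"
    analysis = ""
    for role, instruction in AGENT_ROLES.items():
        analysis = llm_call(DEFENSE_MODEL, instruction, context)
        context += f"\n\n[{role}] {analysis}"

    # 3. Filter: return the response only if the final judge deems it valid.
    return REFUSAL if "INVALID" in analysis.upper() else response
```

Because the agents only inspect the generated response, the same pipeline can sit in front of any victim model without knowing which jailbreak prompt was used.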

Submission history

From: Yifan Zeng
[v1]
Sat, 2 Mar 2024 16:52:22 UTC (533 KB)
[v2]
Thu, 14 Nov 2024 18:14:00 UTC (531 KB)
