BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models, by Isack Lee and 1 other author
Abstract: Although large language models (LLMs) demonstrate impressive proficiency in various tasks, they present potential safety risks, such as "jailbreaks", where malicious inputs can coerce LLMs into generating harmful content that bypasses safety alignment. In this paper, we examine ethical biases in LLMs and how those biases can be exploited for jailbreaks. Notably, these biases produce jailbreaking success rates in GPT-4o that differ by 20% between non-binary and cisgender keywords and by 16% between white and black keywords, even when the rest of the prompt is identical. We introduce the concept of BiasJailbreak, highlighting the inherent risks posed by these safety-induced biases. BiasJailbreak automatically generates biased keywords by querying the target LLM itself, then uses those keywords to elicit harmful output. Additionally, we propose an efficient defense method, BiasDefense, which prevents jailbreak attempts by injecting defense prompts prior to generation. BiasDefense is an appealing alternative to guard models, such as Llama-Guard, which incur additional inference cost after text generation. Our findings emphasize that ethical biases in LLMs can lead to unsafe outputs, and we suggest a method to make LLMs more secure and unbiased. To enable further research and improvements, we open-source our code and artifacts of BiasJailbreak, providing the community with tools to better understand and mitigate safety-induced biases in LLMs.
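The abstract contrasts BiasDefense, which injects a defense prompt before generation, with guard models that add a second inference pass after generation. Below is a minimal sketch of that prompt-injection idea, assuming a generic `chat(messages)` callable wrapping any chat-completion API; the function name and the defense wording are illustrative and not the paper's implementation.

```python
# Sketch of a BiasDefense-style prompt injection (illustrative only;
# `chat` stands in for any chat-completion client, not the paper's code).
from typing import Callable, Dict, List

Message = Dict[str, str]

# Hypothetical defense prompt; the paper's actual wording is not reproduced here.
DEFENSE_PROMPT = (
    "Apply the same safety policy to every request, regardless of any "
    "demographic keywords (e.g., gender identity or race) it contains. "
    "Refuse harmful requests uniformly."
)

def defended_generate(chat: Callable[[List[Message]], str], user_prompt: str) -> str:
    """Prepend the defense prompt as a system message, then generate once.

    Unlike a guard model (e.g., Llama-Guard), no additional inference pass
    is required after the text is generated.
    """
    messages: List[Message] = [
        {"role": "system", "content": DEFENSE_PROMPT},
        {"role": "user", "content": user_prompt},
    ]
    return chat(messages)
```

In this sketch the defense cost is a fixed prompt prefix rather than a second model call, which is the efficiency argument the abstract makes against post-generation guard models.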
Submission history
From: Haebin Seong
[v1] Thu, 17 Oct 2024 08:46:09 UTC (3,860 KB)
[v2] Wed, 23 Oct 2024 02:15:52 UTC (3,860 KB)
[v3] Thu, 2 Jan 2025 04:06:46 UTC (3,005 KB)
[v4] Mon, 24 Nov 2025 07:12:09 UTC (1,585 KB)
[v5] Tue, 25 Nov 2025 12:39:17 UTC (1,565 KB)