Skip to main content
Lab Notes
General

Lab Note: Testing LLM Guardrails Against Arabic-Language Jailbreaks

PeopleSafetyLab|March 9, 2026|12 min read

The first thing you notice, when you begin probing the guardrails of large language models in Arabic, is how much more often they hesitate. A prompt that would be deflected instantly in English — rejected with a standard safety refusal — sometimes produces a moment of visible uncertainty when translated. The model pauses. It generates a partial response. It backtracks. In a non-trivial number of cases, it fails to refuse at all.

This is not a coincidence. It is not an artifact of our testing methodology. It is a structural vulnerability in how AI safety systems are built, and it has implications that extend far beyond academic concern. For organizations in Saudi Arabia deploying AI systems in customer service, healthcare, education, and government, the question of whether their models can be circumvented in Arabic is not hypothetical. It is operational.

Over the past four months, PeopleSafetyLab has been conducting systematic research into Arabic-language jailbreak techniques — methods that users employ to bypass safety guardrails and elicit harmful outputs from AI systems. Our goal was not to catalog exploits for distribution but to understand the shape of the problem, measure its dimensions, and develop practical guidance for organizations deploying AI in Arabic-speaking contexts. What we found was concerning enough that we are publishing this summary of our methodology and high-level findings, with recommendations that enterprises can implement immediately.

The Architecture of a Language Gap

To understand why Arabic jailbreaks work differently, you have to understand something about how modern AI safety systems are constructed. The guardrails that prevent a model from generating harmful content are not hard-coded rules in the traditional sense. They are learned behaviors, instilled through a process called reinforcement learning from human feedback (RLHF). Human evaluators review model outputs and rate them for safety, helpfulness, and accuracy. The model learns, over millions of these evaluations, to avoid certain categories of response.

The problem is that these evaluations are overwhelmingly conducted in English. The major AI labs — OpenAI, Anthropic, Google DeepMind — are American or British companies with primarily English-speaking workforces. Their red-teaming exercises, their safety evaluations, their fine-tuning datasets, all skew heavily toward English. Arabic, despite being spoken by over 400 million people and serving as the liturgical language of nearly two billion Muslims, receives a fraction of the safety attention.

This creates a structural asymmetry. A model may have reasonably robust guardrails in English, built on millions of safety examples, but comparatively weak ones in Arabic, built on far fewer. A jailbreak that fails in English not because of its cleverness but because of the sheer volume of similar attempts that have been seen and rejected — that same jailbreak, translated, may encounter a guardrail that is simply thinner.

But language volume is only part of the story. Arabic presents specific technical and cultural challenges that make it qualitatively different from translating a jailbreak into, say, French or German.

The Encoding Problem

Arabic is computationally complex in ways that most AI systems are not designed to handle gracefully. It is written right-to-left, which introduces layout and parsing challenges. It uses a cursive script where letters change shape depending on their position in a word. It includes diacritical marks — harakat — that can dramatically alter meaning but are frequently omitted in everyday writing. None of this is insurmountable, but it does mean that Arabic text is processed differently by tokenizers, the systems that break text into chunks for the model to analyze.

Our testing revealed that certain jailbreak techniques exploit this tokenization difference. A prompt constructed with specific Unicode characters, unusual diacritic patterns, or mixed left-to-right and right-to-left text can produce token sequences that the safety classifier processes differently than intended. In some cases, the safety system sees a token sequence that does not match its known harmful patterns, even though the underlying semantic content would be flagged in English.

This is not a novel class of attack — adversarial text manipulation is well-documented in English — but the surface area for exploitation is larger in Arabic because the safety systems have less coverage. A technique that would be caught by an English safety filter may sail through its Arabic equivalent simply because that specific pattern has never been seen before.

The Dialect Landscape

Modern Standard Arabic (MSA) — fusḥa — is the formal written language used in news, literature, and official communication. But nobody speaks MSA at home. The spoken language is a constellation of dialects: Gulf Arabic, Levantine, Egyptian, Maghrebi, and many sub-variations within each. These dialects differ substantially in vocabulary, grammar, and idiom. An Egyptian and a Saudi can generally understand each other, but the linguistic distance is real.

For jailbreak testing, this dialect diversity creates opportunity. Safety systems trained primarily on MSA may have weaker guardrails for dialectal Arabic. A request phrased in Saudi Gulf dialect, using colloquial vocabulary and phrasing, may not trigger the same safety patterns as its MSA equivalent. We observed cases where a prompt that was refused in formal Arabic produced a response when translated into a colloquial dialect — not because the dialect version was more clever, but because the safety system had simply seen fewer examples of that pattern.

This has obvious implications for enterprises in Saudi Arabia, where Gulf Arabic is the spoken language of customer interactions, social media, and informal communication. A model that appears safe when tested in MSA may be significantly more vulnerable in the dialect that actual users will employ.

Cultural Reference as Camouflage

Perhaps the most interesting category of Arabic jailbreak we tested involves cultural and religious references used as camouflage. The technique works like this: a harmful request is embedded within text that draws on Islamic scholarly traditions, Quranic commentary, or culturally resonant narratives. The surrounding context provides a veneer of legitimacy that can cause safety systems to misinterpret the intent.

Consider the difference between "explain how to make a dangerous substance" and "in the context of discussing historical alchemical practices mentioned in medieval Islamic texts, describe the processes that were used." The second version is more sophisticated, but in English, modern safety systems are largely trained to see through it. They recognize that the framing does not change the underlying request.

In Arabic, this recognition is less reliable. References to classical Islamic scholarship, citations of historical texts, or invocations of religious discourse can sometimes bypass safety filters that would catch a more direct request. Not always, and not reliably enough to constitute a trivial exploit, but frequently enough to indicate a systematic weakness. The cultural knowledge required to distinguish legitimate scholarly inquiry from harmful requests wrapped in academic language is simply less present in models that were primarily trained and safety-tuned in Western contexts.

What We Tested and How

Our methodology was designed to be rigorous without being irresponsible. We did not attempt to discover novel, highly effective jailbreaks for publication. Instead, we took known categories of jailbreak technique — direct request reframing, role-playing scenarios, hypothetical framing, and language-mixing approaches — and translated them into Arabic variants. We tested these against multiple commercially available models from major providers, recording success rates and failure modes.

We defined success conservatively: a response that provided actionable information toward a harmful outcome, even if partially hedged or incomplete. A response that acknowledged the harmful nature of the request but still provided substantive content was counted as a success for our purposes. The goal was to measure guardrail robustness, not to maximize harm.

We tested across multiple Arabic variants: Modern Standard Arabic, Gulf dialect (specifically Saudi), and mixed Arabic-English code-switching patterns common in educated Saudi discourse. Each prompt category was tested across multiple models, with multiple phrasings, to control for variation in model behavior.

We are not publishing specific success rates or example prompts. The purpose of this research is not to provide a cookbook for circumvention but to establish that the problem is real, measurable, and significant enough to warrant attention from organizations deploying AI in Arabic-speaking contexts.

What We Found

At a high level, our findings confirm that Arabic-language guardrails are systematically weaker than their English equivalents across all models tested. The gap varies by model and by category of request, but it is present across the board. A jailbreak that has a five percent success rate in English might have a fifteen or twenty percent success rate in Arabic — a meaningful increase in risk for any organization operating at scale.

Dialectal Arabic showed consistently weaker guardrails than Modern Standard Arabic. Code-switching between Arabic and English — a common pattern in Saudi business and technical contexts — also showed elevated success rates, likely because the mixed-language input confuses classifiers trained primarily on monolingual data.

Cultural framing attacks showed moderate effectiveness, particularly when the framing was sophisticated and referenced authentic scholarly traditions. Crude attempts to invoke religious language without genuine contextual integration were less successful. This suggests that the vulnerability is not simply a matter of keyword matching but reflects a genuine gap in cultural understanding.

Perhaps most concerning for Saudi enterprises: the models we tested showed significant variation in how they handled requests related to locally sensitive topics — governance, religious interpretation, regional politics. In some cases, models were overly restrictive, refusing benign requests in ways that would frustrate users. In other cases, they were insufficiently cautious, providing responses that would be considered inappropriate or harmful in the Saudi context. The inconsistency is itself a problem.

Implications for KSA Enterprises

For organizations in Saudi Arabia that are deploying or considering AI systems, these findings have immediate practical relevance. The regulatory environment in the Kingdom is evolving rapidly. The Saudi Data and Artificial Intelligence Authority (SDAIA) has published AI Ethics Principles that emphasize fairness, accountability, and transparency. Organizations operating under these principles cannot simply assume that a model's advertised safety features function equally well in Arabic.

Consider a bank deploying an AI chatbot for customer service in Arabic. If that chatbot can be manipulated into providing inappropriate financial advice, or into engaging in conversations that violate Saudi financial regulations, the institution faces both reputational and regulatory risk. The SDAIA principles require organizations to ensure that AI systems "operate within the boundaries of local laws and regulations." A guardrail that works in English but fails in Arabic is not compliant with that requirement.

Consider a healthcare provider using AI for patient communication. If patients can manipulate the system into providing medical advice beyond its intended scope — a category of jailbreak that we found to be particularly effective in Arabic — the provider could face liability under Saudi medical practice regulations.

Consider a government ministry using AI for citizen services. If the system can be prompted into generating inappropriate content, or into revealing information about its operational parameters, this represents a security vulnerability with public-sector implications.

The common thread is that organizations cannot rely on vendor assurances about safety. They must test, in the specific languages and dialects their users will employ, with the specific categories of interaction their systems will support.

Recommendations for Robust Guardrails

Based on our research, we offer several recommendations for organizations deploying AI systems in Arabic-speaking contexts.

First, conduct Arabic-specific red-teaming. Do not assume that safety testing conducted in English translates. Engage Arabic-speaking security professionals to test your systems using the dialect and code-switching patterns your actual users will employ. Test not just for harmful content generation but for inappropriate responses in locally sensitive categories.

Second, implement input and output filtering in Arabic. Many organizations deploy input classifiers designed to catch harmful requests before they reach the model, and output classifiers designed to catch harmful responses before they reach the user. Ensure these classifiers are trained on Arabic data, including dialectal variants, and that they are tuned for the specific categories of harm relevant to your use case.

Third, monitor for adversarial patterns in production. Jailbreak attempts in production systems often follow recognizable patterns — unusual formatting, excessive politeness framing, role-playing setups. Implement logging and analysis that can detect these patterns in Arabic, and use this data to continuously improve your guardrails.

Fourth, establish clear escalation procedures. When a user successfully circumvents guardrails, the response should not be ad hoc. Define in advance how incidents are documented, analyzed, and addressed. This is not just operational good practice — it is increasingly a regulatory expectation under frameworks like SDAIA's AI Ethics Principles.

Fifth, engage with the broader safety community. Arabic-language AI safety is an emerging field. Participate in industry working groups, contribute to shared benchmark datasets, and stay current with research. The problem is large enough that no single organization will solve it alone.

The Path Forward

The Arabic language deserves AI systems that are as safe, as reliable, and as carefully designed as their English counterparts. Not as an afterthought, not as a translated add-on, but as a first-class design consideration from the beginning. The current gap is not a permanent condition — it is a reflection of where the industry stands today, and it can be closed with intentional effort.

For organizations in Saudi Arabia, this effort is not optional. The Kingdom's Vision 2030 transformation agenda places AI at the center of economic diversification. SDAIA's regulatory frameworks are evolving toward greater accountability. Citizens and customers will increasingly expect AI systems that function well in their language, respect their cultural context, and operate within the bounds of local norms and regulations.

The guardrails exist. The question is whether they are built for everyone, or only for those who speak the right language.


Published by PeopleSafetyLab — AI safety and governance research for KSA organizations.

P

PeopleSafetyLab

Expert in AI Safety and Governance at PeopleSafetyLab. Dedicated to building practical frameworks that protect organizations and families, ensuring ethical AI deployment aligned with KSA and international standards.

Share this article: