arxiv:2601.01592

OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs

Published on Jan 4

· Submitted by

Xin Wang on Jan 7

Upvote

Authors:

Xin Wang ,

Abstract

OpenRT is a unified red-teaming framework for evaluating multimodal large language model safety through modular adversarial testing across multiple attack dimensions and models.

AI-generated summary

The rapid integration of Multimodal Large Language Models (MLLMs) into critical applications is increasingly hindered by persistent safety vulnerabilities. However, existing red-teaming benchmarks are often fragmented, limited to single-turn text interactions, and lack the scalability required for systematic evaluation. To address this, we introduce OpenRT, a unified, modular, and high-throughput red-teaming framework designed for comprehensive MLLM safety evaluation. At its core, OpenRT architects a paradigm shift in automated red-teaming by introducing an adversarial kernel that enables modular separation across five critical dimensions: model integration, dataset management, attack strategies, judging methods, and evaluation metrics. By standardizing attack interfaces, it decouples adversarial logic from a high-throughput asynchronous runtime, enabling systematic scaling across diverse models. Our framework integrates 37 diverse attack methodologies, spanning white-box gradients, multi-modal perturbations, and sophisticated multi-agent evolutionary strategies. Through an extensive empirical study on 20 advanced models (including GPT-5.2, Claude 4.5, and Gemini 3 Pro), we expose critical safety gaps: even frontier models fail to generalize across attack paradigms, with leading models exhibiting average Attack Success Rates as high as 49.14%. Notably, our findings reveal that reasoning models do not inherently possess superior robustness against complex, multi-turn jailbreaks. By open-sourcing OpenRT, we provide a sustainable, extensible, and continuously maintained infrastructure that accelerates the development and standardization of AI safety.

View arXiv page View PDF Project page GitHub 121 Add to collection

Community

xinwang22

Paper author Paper submitter 1 day ago

Even State-of-the-Art Models Fail to Hold Ground Against Sophisticated Adversaries.
Our comprehensive evaluation highlights two key findings. (1) A clear stratification in defense capability: Top-tier models such as Claude Haiku 4.5, GPT-5.2, and Qwen3-Max exhibit strong baseline robustness, effectively neutralizing static, template-based attacks and complex logic traps, often keeping ASR below 20%.This suggests that leading labs have improved defenses against recognizable, repeatable jailbreak structures, while several models (e.g., Llama-4, Mistral Large 3) remain more susceptible to these simpler patterns. (2) A shift in the attack landscape: adaptive, multi-turn, and multi-agent strategies dominate, whereas static, single-turn, and template-based approaches are increasingly ineffective. Methods like EvoSynth and X-Teaming can achieve >90% ASR even against advanced models. This indicates current safety training overfits to static templates, failing to generalize against the broad attack surface exposed by automated red-teaming.
Adversarial Robustness Exhibits Inconsistent and Polarized Vulnerability Patterns.
We observe a polarization effect where models demonstrate high resistance to specific attack families (e.g., text-based cipher) yet remain completely defenseless against others (e.g., logic nesting). For instance, Grok 4.1 Fast shows 1.5% ASR against RedQueen but 90.5% against X-Teaming. This stark performance disparity (~90%) underscores that current defenses are often patch-based rather than holistic, necessitating the multi-faceted evaluation provided by OpenRT.
Enhanced Reasoning and Multimodal Capabilities are New Vectors for Exploitation.
Contrary to the common assumption that more capable models are inherently safer, we find that enhanced capabilities often introduce new vectors for exploitation. Reasoning-enhanced models (CoT) do not demonstrate superior robustness; instead, their verbose reasoning processes can be manipulated to bypass safety filters. Similarly, Multimodal LLMs exhibit a critical modality gap: visual inputs frequently bypass text-based safety mechanisms, allowing cross-modal attacks to compromise models that are otherwise robust to purely textual jailbreaks. These findings suggest that current safety alignment has not kept pace with the architectural expansion of model capabilities.
Proprietary Models Can Be as Vulnerable as Open-Source Models Under Certain Attacks.
Our analysis reveals that proprietary and open-source models exhibit comparable susceptibility to our attack suite. Across our 20 evaluated models, only GPT-5.2 and Claude Haiku 4.5 maintained an average ASR below 30%, while all other models consistently exceeded this threshold. This universality sharply contradicts the assumption that closed deployments offer superior protection, demonstrating that the safety through obscurity of proprietary strategies fails to provide any tangible mitigation against sophisticated adversarial attacks.
Scaling MLLMs Robustness via Defense-in-Depth and Continuous Red Teaming.
Challenges such as polarized robustness, weak generalization to unseen attacks, and cross-modal bypasses highlight the limits of single-layer defense. Effective mitigation requires a paradigm shift toward Defense-in-Depth: integrating intrinsic architectural safety with runtime risk estimation and adversarial training on multimodal and multi-turn interactions. Crucially, continuous Red Teaming via infrastructure like OpenRT provides systematic evaluation to verify empirical robustness and prevent benchmark overfitting.