The problem we tackled
Multimodal models already run in safety-critical applications, but the visual channel makes them vulnerable: an image is a continuous pixel space, an ideal carrier for adversarial perturbations, text hidden inside a picture, and cross-modal compositions. RLHF alignment, meanwhile, is mostly text-oriented and poorly covers attacks through vision, and retraining a model for each new vulnerability class is too expensive. So I was interested in inference-time defenses, the ones that change only the inference pipeline (input, prompt, repeated runs, output filtering) while leaving the weights untouched. The trouble is that each such method had previously been tested in isolation, on its own model and its own benchmark. We were the first to bring them onto one bench.
What and how we tested
We took three defenses at different stages: RapGuard (an adaptive defensive prompt driven by chain-of-thought reasoning), AdaShield (a prompt against text hidden in an image), and SmoothVLM (randomized smoothing that masks pixels and takes a majority vote over several runs). We compared six configurations, from "no defense" and a simple safety prompt up to the full S+A+R combination, across eight models (4B to 38B parameters) and seven benchmarks covering typographic injection, textual and multimodal jailbreaks, and adversarial patches. To keep the comparison fair across very different benchmarks, we ran everything through the same keyword classifier and measured three numbers: the harmless rate (HR), the attack success rate (ASR), and the over-refusal rate on legitimate queries.
Five findings
The picture turned out to be down-to-earth. First: there is no universal defense, and what works depends on both a model's baseline safety and the attack type. Second: stacking everything is a bad idea. The full combination pushes over-refusal on benign queries to between 97% and 100%, and SmoothVLM on its own gives false-refusal rates of 99.2% to 100%, which makes the system unusable. Third: a simple safety prompt keeps utility intact (0 to 18.2% over-refusal, below 7% for five of the eight models) while still raising safety moderately, which makes it a good lightweight base layer. Fourth: different attack classes expose different weaknesses (typographic attacks hit InternVL, textual jailbreaks hit Qwen3-VL), so you cannot judge a defense on one benchmark. Fifth: in a preliminary whitebox PGD test, text-level defenses unexpectedly suppressed a gradient-based visual attack (ASR 25% to 0%), because they act at the output stage, where pixel-space optimization has limited reach.
What follows from this
The main practical lesson for me: there is no single "silver bullet," and a defense has to be chosen adaptively, fitted to the specific model and query stream rather than piled on indiscriminately. This comparative study is the empirical and methodological foundation for the adaptive MARS framework (part of B. Nutfullin's candidate dissertation), which selects and composes inference-time defenses to fit the situation. For me it continues the line of work on LLM safety, prompt injection, and operational security in multimodal and agentic AI.