StableVLA: Towards Robust Vision-Language-Action Models without Extra Data

Fu, Yiyang

StableVLA: Towards Robust Vision-Language-Action Models
without Extra Data

14× smaller backbone · 80× smaller training data · state-of-the-art robustness

Yiyang Fu¹, Chubin Zhang^2,3, Shukai Gong¹, Yufan Deng¹, Kaiwei Sun⁴, Qiyang Min,
Qibin Hou⁵, Yansong Tang², Jianan Wang³, Daquan Zhou^1†

¹Peking University ²Tsinghua University ³Astribot ⁴Nanjing University ⁵Nankai University
^†Corresponding author

ICML 2026

Paper Code Model

Simulation and real-robot demonstrations of StableVLA under various visual corruptions.

StableVLA teaser: robustness comparison and semantic clustering visualization

(a) Robustness comparison. Triangle marks and bar charts denote performance on clean and corrupted data (averaged across severities), respectively. StableVLA achieves state-of-the-art zero-shot robustness with a 0.5B backbone, rivaling 7B-scale models. (b) Semantic Clustering Visualization. K-means clustering (K=2) of projector features under severe corruption. StableVLA maintains coherent, object-centric semantic grouping on both simulation and real-world tasks, whereas the baseline fails to isolate task-relevant regions.

+35.2%
avg. improvement in simulation (vs. VLA-Adapter)
+31.7pp
improvement on real-robot tasks
<10M
extra parameters added
14×
smaller backbone than OpenVLA-OFT

Abstract

It is infeasible to encompass all possible disturbances within the training dataset. This raises a critical question regarding the robustness of Vision-Language-Action (VLA) models when encountering unseen real-world visual disturbances, particularly under imperfect visual conditions. In this work, we conduct a systematic study based on recent state-of-the-art VLA models and reveal a significant performance drop when visual disturbances absent from the training data are introduced.

To mitigate this issue, we propose a lightweight adapter module grounded in information theory, termed the Information Bottleneck Adapter (IB-Adapter), which selectively filters potential noise from visual inputs. Without requiring any extra data or augmentation strategies, IB-Adapter consistently improves over the baseline by an average of 30%, while adding fewer than 10M parameters, demonstrating notable efficiency and effectiveness.

Furthermore, even with a 14× smaller backbone (0.5B parameters) and no pre-training on the Open X-Embodiment dataset, our model StableVLA achieves robustness competitive with 7B-scale state-of-the-art VLAs. With negligible parameter overhead (<10M), our approach maintains accuracy on long-horizon tasks and surpasses OpenPi under real-world visual disturbances.

VLA Robustness Information Bottleneck Zero-shot Data-free Real-robot

Motivation

Performance drop under visual corruptions

VLA models are highly vulnerable to visual corruptions. A model achieving 96% success rate on clean data can drop to near 0% under severe blur. This fragility is shared across VLA-Adapter, OpenVLA, OpenVLA-OFT, and OpenPi-0.5.

Root Cause: The Projector

Through empirical analysis, we identify that the primary source of vulnerability is the projector module that bridges the vision encoder and the LLM backbone.

Standard MLP projectors act as all-pass filters, indiscriminately propagating both task-relevant semantics and visual noise into the policy network.

Our solution: enforce the Information Bottleneck (IB) principle at the modality alignment stage — without any extra data.

Method: IB-Adapter

Architecture of IB-Adapter and StableVLA. We replace the standard MLP projector with Fused IB-Adapter, a dual-stream module combining: (1) Covariance Attention: computes a Gram matrix (D×D) to capture global channel correlations; (2) Sigmoid Gating: independently suppresses noisy channels without forcing competition (unlike Softmax); (3) MLP pathway: preserves fine-grained spatial details for precise manipulation.

IB-Adapter module. Channel-wise covariance gating adaptively suppresses noise channels while preserving task-relevant semantic structure.

Key Insight

Unlike spatial-domain attention in ViTs, IB-Adapter operates in the channel dimension:

Semantic features and noise are heterogeneously distributed across channels
Stochastic corruptions are uncorrelated with structural object features → low Gram matrix values
Sigmoid gates independently suppress noisy channels (gate ≈ 0)
Result: clean, object-centric focus before features reach the policy

Simply replacing the MLP projector yields +35.2% on LIBERO.

Simulation Results

Robustness radar chart across 4 LIBERO task suites

Robustness radar chart. StableVLA consistently envelopes VLA-Adapter across all corruption categories in all four LIBERO task suites, while matching or exceeding 7B-scale baselines.

Semantic grouping of Fused IB-Adapter output features. K-means clustering (K=2) applied to projector output features under Impulse Noise (severity 3 and 5). StableVLA maintains consistent, object-centric grouping even at high noise levels, while the MLP baseline's features collapse under corruption.

Table 1: Full comparison on LIBERO and CALVIN. Success rate (%) for LIBERO; avg. completed tasks (max 5) for CALVIN. C = Clean, S3/S4/S5 = Severity 3/4/5. Bold = best, underline = second best.

Training	Method	Spatial				Object				Goal				Long				CALVIN
Training	Method	C	S3	S4	S5	C	S3	S4	S5	C	S3	S4	S5	C	S3	S4	S5	C	S3	S4	S5
OpenX Pretrain	OpenVLA (7B)	80.0	40.9	24.6	14.7	69.6	18.2	10.4	2.7	74.0	38.7	27.0	16.3	55.5	20.5	12.4	7.0	—
OpenX Pretrain	OpenVLA-OFT (7B)	92.6	89.3	84.0	72.1	98.4	82.5	69.2	52.8	96.8	94.5	84.6	70.3	94.4	77.6	61.9	40.3	—
OpenX+Web Co-train	OpenPi–0.5 (3B)	98.4	88.3	79.0	62.4	99.4	97.1	88.4	76.4	97.2	87.2	82.5	64.2	92.0	76.1	65.6	47.7	—
VLM Direct FT	VLA-Adapter (0.5B)	96.0	93.7	83.3	58.5	96.8	71.0	44.1	29.3	97.4	79.5	64.7	47.3	94.4	63.5	41.0	26.2	4.14	2.56	1.89	1.44
VLM Direct FT	StableVLA (0.5B) ★	96.2	94.4	92.1	82.0	98.8	92.4	83.6	70.2	98.0	93.4	85.0	71.9	93.6	76.3	62.4	45.3	4.17	2.77	2.11	1.51

Real-Robot Experiments

Evaluated on Astribot S1 dual-arm robot across 4 tasks under Gaussian noise, defocus blur, oil contamination, and physical shelter.

Qualitative comparison on real-robot tasks. StableVLA maintains manipulation capability under severe physical and digital visual corruptions where VLA-Adapter and π-0.5 fail. Tasks include Pick & Place, Throw Basketball, Pour Water, and Pack the Doll.

Table 2: Real-world robustness evaluation on Astribot S1. "Clean" = absolute success rate (%). All other columns = performance change (Δ pp) relative to clean. Noise & Blur: averaged across severity levels. Oil & Shelter: physical interference. Bold = smallest performance drop (best robustness).

Task	Method	Clean (%)	Noise (Δ)	Blur (Δ)	Oil (Δ)	Shelter (Δ)	Avg (Δ)
Pick & Place	π-0.5	100.0	-63.3	-16.7	-10.0	-30.0	-30.1
	VLA-Adapter	80.0	-66.7	-40.0	-30.0	-60.0	-49.2
	StableVLA ★	80.0	-30.0	-10.0	-10.0	-20.0	-17.5
Throw Basketball	π-0.5	80.0	-60.0	-33.3	-20.0	-30.0	-35.8
	VLA-Adapter	60.0	-53.0	-40.0	-20.0	-40.0	-38.3
	StableVLA ★	60.0	-36.7	-16.7	-10.0	-10.0	-18.4
Pour Water	π-0.5	70.0	-60.0	-20.0	-20.0	-20.0	-30.0
	VLA-Adapter	40.0	-40.0	-30.0	-10.0	-20.0	-25.0
	StableVLA ★	40.0	-23.3	-16.7	0.0	-10.0	-12.5
Pack the Doll	π-0.5	80.0	-63.3	-33.3	-30.0	-40.0	-41.7
	VLA-Adapter	50.0	-40.0	-26.7	-30.0	-30.0	-31.7
	StableVLA ★	60.0	-16.7	-10.0	-20.0	-10.0	-14.2

BibTeX

@inproceedings{fu2026stablevla,
  title     = {StableVLA: Towards Robust Vision-Language-Action Models without Extra Data},
  author    = {Fu, Yiyang and Zhang, Chubin and Gong, Shukai and Deng, Yufan and
               Sun, Kaiwei and Min, Qiyang and Hou, Qibin and Tang, Yansong and
               Wang, Jianan and Zhou, Daquan},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026},
}