27M → 93M → 3B → 7B: all 0% SR, including NVIDIA's official GR00T N1.6. Model size is not the bottleneck; the training-deployment distribution mismatch is.
All models evaluated on identical 68 scenarios · Franka Panda · Isaac Sim 5.1
| Model | Params | Success Rate | Scenarios Passed | Confidence (/100) |
|---|---|---|---|---|
| Scripted Controller | — | 100% | 68/68 | 76 |
| GR00T N1.6 (NVIDIA) | 3B | 0.0% | 0/68 | 1 |
| OpenVLA (Stanford + TRI) | 7B | 0.0% | 0/68 | 27 |
| Octo-Base (UC Berkeley) | 93M | 0.0% | 0/68 | 1 |
| Octo-Small (UC Berkeley) | 27M | 0.0% | 0/68 | 1 |
Scaling parameters 260×, from 27M to 7B, yields zero improvement, and even NVIDIA's official GR00T N1.6 (3B) scores 0%. The bottleneck is not capacity; it is the distribution gap between robot pre-training data and RoboGate's adversarial industrial scenarios.
- Scale tested: 27M → 7B (260×)
- NVIDIA official: GR00T N1.6 (3B)
- Improvement from scaling: 0%
- Gap vs scripted baseline: 100 points
Four VLA models — including NVIDIA's official GR00T N1.6 — evaluated on the same 68 adversarial scenarios via two-process ZMQ pipeline (Isaac Sim ↔ VLA inference).
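The two-process loop described above can be sketched in a few lines. The snippet below is a minimal illustration, not RoboGate's actual protocol: the endpoint, the pickled-dict wire format, and the field names (`rgb`, `proprio`, `action`) are all assumptions, and the server returns a zero action in place of real VLA inference.

```python
import pickle
import threading

import numpy as np
import zmq

# Hypothetical endpoint; RoboGate's real transport settings are not public here.
ENDPOINT = "tcp://127.0.0.1:5555"

def vla_server(ctx, n_requests):
    """Stand-in for the VLA inference process: receive an observation,
    reply with a 7-DoF action (zeros here, as a placeholder policy)."""
    sock = ctx.socket(zmq.REP)
    sock.bind(ENDPOINT)
    for _ in range(n_requests):
        obs = pickle.loads(sock.recv())           # observation from the sim
        action = np.zeros(7, dtype=np.float32)    # placeholder model output
        sock.send(pickle.dumps({"action": action}))
    sock.close()

def sim_client(ctx, n_steps):
    """Stand-in for the Isaac Sim process: stream observations, collect actions."""
    sock = ctx.socket(zmq.REQ)
    sock.connect(ENDPOINT)
    actions = []
    for _ in range(n_steps):
        obs = {"rgb": np.zeros((224, 224, 3), dtype=np.uint8),
               "proprio": np.zeros(7, dtype=np.float32)}
        sock.send(pickle.dumps(obs))
        reply = pickle.loads(sock.recv())
        actions.append(reply["action"])
    sock.close()
    return actions

ctx = zmq.Context()
server = threading.Thread(target=vla_server, args=(ctx, 3))
server.start()
acts = sim_client(ctx, 3)
server.join()
ctx.term()
print(len(acts), acts[0].shape)
```

In the real harness the two sides run as separate OS processes, keeping the simulator and the model's GPU runtime isolated; a thread stands in for the second process here only to keep the sketch self-contained.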
GR00T N1.6 (NVIDIA)
NVIDIA's official 3B foundation model for humanoid and manipulation, built on an Eagle-2 vision encoder and a Llama backbone with large-scale robot pre-training. Despite being the industry's flagship VLA from the GPU leader, it scores 0% SR with both grasp_miss and collision failures, showing that even a tier-1 vendor cannot bridge the training-deployment distribution gap on adversarial scenarios.
OpenVLA (Stanford + Toyota Research Institute)
Open-source 7B VLA from Stanford and Toyota Research Institute, built on a Llama-2 backbone and fine-tuned on Open X-Embodiment. The largest model tested, yet 0% SR with a different failure profile: primarily grasp_miss with zero collisions, suggesting better spatial awareness but still an inability to complete tasks.
Octo-Base (UC Berkeley)
93M-parameter version of Octo from UC Berkeley, trained on 800K episodes from Open X-Embodiment. 3.4× larger than Octo-Small, but identical 0% SR and a nearly identical failure distribution.
Octo-Small (UC Berkeley)
27M-parameter lightweight VLA from UC Berkeley, the smallest and fastest model tested. Same 0% result, with 79.4% grasp_miss and 20.6% collision failures.
Scripted Baseline
68/68 PASS · Confidence 76/100
VLA Models (all 4)
0/68 FAIL · Best Confidence: 27/100 (OpenVLA)
100-point gap
Same 0% SR but different failure patterns yield different Confidence Scores
GR00T N1.6 (Confidence 1/100): collisions present and grasp_miss dominant. Despite large-scale robot pre-training, complete failure on industrial adversarial scenarios.
OpenVLA (Confidence 27/100): zero collisions. Spatial awareness exists, but the model cannot grasp; the higher confidence means "safe but incapable".
Octo-Base/Small (Confidence 1/100): 20%+ collision rate, crashing into the table and obstacles; the low confidence means "incapable and dangerous".
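The ranking logic above can be made concrete with a toy scoring function. This is not RoboGate's actual Confidence Score formula (which is not given on this page); the 90/10 weights are invented purely to show why a collision-free 0% SR run can score higher than a collision-heavy one.

```python
def confidence_score(success_rate: float, collision_rate: float) -> float:
    """Toy score: capability dominates, while a small safety term separates
    'safe but incapable' from 'incapable and dangerous'.
    The weights (90/10) are illustrative, not RoboGate's formula."""
    capability = 90.0 * success_rate
    safety = 10.0 * (1.0 - collision_rate)
    return round(capability + safety, 1)

# Failure profiles taken from the results above: (success_rate, collision_rate)
profiles = {
    "Scripted baseline": (1.0, 0.0),
    "OpenVLA":           (0.0, 0.0),    # all grasp_miss, zero collisions
    "Octo-Small":        (0.0, 0.206),  # 20.6% collision failures
}
scores = {name: confidence_score(sr, cr) for name, (sr, cr) in profiles.items()}
print(scores)
```

Under these made-up weights the scripted baseline scores 100.0, OpenVLA's collision-free failures score above Octo-Small's collision-heavy ones, and the qualitative ordering of the real Confidence column is reproduced.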
If you use this VLA benchmark in your research:
@misc{kim2026robogate,
  title  = {ROBOGATE: Adaptive Failure Discovery for Safe Robot Policy Deployment via Two-Stage Boundary-Focused Sampling},
  author = {{AgentAI Co., Ltd.}},
  year   = {2026},
  doi    = {10.5281/zenodo.19166967},
  url    = {https://robogate.io/paper},
  note   = {VLA Benchmark: GR00T N1.6 0/68, OpenVLA 0/68, Octo-Base 0/68, Octo-Small 0/68}
}

RoboGate's 68-scenario suite is open-source. Run your VLA model against the same adversarial conditions.