VLA BENCHMARK · 9 MODELS

VLA Leaderboard: 9 Models on RoboGate's 68-Scenario Suite

The Cross-Simulator Gap: GR00T N1.6 scores 97.65% on LIBERO (MuJoCo) but 0% on RoboGate (Isaac Sim). Same model, same robot, same task — 97.65 percentage point gap.

CORE FINDING

Same Model. Same Robot. Same Task. Different Simulator: 97.65% → 0%

| Benchmark | Simulator | Success Rate | Note |
|---|---|---|---|
| LIBERO | MuJoCo | 97.65% | NVIDIA official result |
| RoboGate | Isaac Sim | 0% | Confidence 49/100 |

Gap: 97.65 percentage points

GR00T N1.6, NVIDIA's own robot foundation model, achieves 97.65% on LIBERO (MuJoCo). The same model, fine-tuned on the same LIBERO-Spatial dataset, scores 0% on RoboGate's 68 industrial scenarios (Isaac Sim). This 97.65 percentage point gap proves that deployment-environment validation is not optional — it's essential.

Model: GR00T N1.6 (3B) · Robot: Franka Panda · Fine-tune: 20K steps, H100 80GB · Data: LIBERO-Spatial

Leaderboard

All models evaluated on identical 68 scenarios · Franka Panda · Isaac Sim 5.1

| Model | Params | SR | Result | Conf. |
|---|---|---|---|---|
| Scripted Controller | — | 100% | 68/68 | 76 |
| GR00T N1.6 (LIBERO-finetuned) *new* | 3B | 0.0% | 0/68 | 49 |
| GR00T N1.7 LIBERO-10 (NVIDIA) *new* | 3B | 0.0% | 0/68 | 27 |
| GR00T N1.7 LIBERO-Goal (NVIDIA) *new* | 3B | 0.0% | 0/68 | 27 |
| GR00T N1.7 LIBERO-Object (NVIDIA) *new* | 3B | 0.0% | 0/68 | 27 |
| GR00T N1.7 LIBERO-Spatial (NVIDIA) *new* | 3B | 0.0% | 0/68 | 27 |
| PI0 Base (Physical Intelligence) | 3.5B | 0.0% | 0/68 | 27 |
| OpenVLA (Stanford + TRI) | 7B | 0.0% | 0/68 | 27 |
| GR00T N1.6 (base) | 3B | 0.0% | 0/68 | 1 |
| SmolVLA Base (HuggingFace) | 450M | 0.0% | 0/68 | 1 |
| Octo-Base (UC Berkeley) | 93M | 0.0% | 0/68 | 1 |
| Octo-Small (UC Berkeley) | 27M | 0.0% | 0/68 | 1 |
| CogACT (Embodied VLA) — mock | 7B | pending | — | — |
| X-VLA — mock | 4.5B | pending | — | — |
| OpenVLA-OFT — mock | 7B | pending | — | — |

Key Insight

Scaling from 27M to 7B (260×) parameters yields zero improvement — and even NVIDIA's official GR00T N1.6 (3B) scores 0%. The failure is not capacity — it's the distribution gap between robot pre-training data and RoboGate's adversarial industrial scenarios.

Scale tested: 27M → 7B (260×)
PI official: PI0 Base (3.5B)
NVIDIA official: GR00T N1.6 (3B)
Improvement from scaling: 0%
vs. Scripted: 100-point gap

Models Evaluated

Six VLA model families — including Physical Intelligence's PI0, NVIDIA's GR00T N1.6, and HuggingFace's SmolVLA — were evaluated on the same 68 adversarial scenarios via a two-process ZMQ pipeline (Isaac Sim ↔ VLA inference).
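The wire protocol of that two-process pipeline is not published, so the sketch below only illustrates the pattern: the Isaac Sim server sends an observation message, the VLA client replies with a 7-DOF action. JSON and the field names (`rgb_shape`, `instruction`, `step`, `action`) are assumptions standing in for whatever serialization the real ZMQ REQ/REP sockets carry.

```python
import json

# Hypothetical message layer for the Isaac Sim <-> VLA two-process pipeline.
# Field names and JSON encoding are assumptions, not RoboGate's actual schema.

def pack_observation(rgb_shape, instruction, step):
    """Isaac Sim server -> VLA client: one observation message.
    rgb_shape describes the captured frame, e.g. (256, 256, 3)."""
    return json.dumps({
        "rgb_shape": list(rgb_shape),
        "instruction": instruction,
        "step": step,
    }).encode()

def unpack_action(payload):
    """VLA client -> server reply: a 7-DOF delta EE pose + gripper command."""
    msg = json.loads(payload.decode())
    action = msg["action"]
    assert len(action) == 7, "expected 7-DOF delta EE pose + gripper"
    return action
```

In the real pipeline these payloads would travel over a ZMQ REQ/REP socket pair, which is what lets the Python 3.11 (Isaac Sim) and Python 3.10 (JAX inference) processes coexist.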

GR00T N1.6 (finetuned)

3B

NVIDIA (LIBERO-Spatial 20K)

0% SR · Confidence 49/100

GR00T N1.6 fine-tuned on LIBERO-Spatial (20K steps, H100 80GB). Achieves 97.65% SR on LIBERO (MuJoCo) — NVIDIA's official benchmark result. But scores 0% on RoboGate's 68 Isaac Sim scenarios with Confidence 49/100 (zero collisions). The highest Confidence among VLAs, yet still 0% SR — proving the cross-simulator gap.

PI0 Base

3.5B

Physical Intelligence

0% SR · Confidence 27/100

Physical Intelligence's official 3.5B VLA, evaluated via OpenPI (official inference server). PaliGemma 3B vision + 315M Flow-Matching action expert. Zero collisions like OpenVLA, but 0% SR — the Flow-Matching architecture also cannot bridge the training-deployment distribution gap without fine-tuning.

GR00T N1.6

3B

NVIDIA

0% SR · Confidence 1/100

NVIDIA's official 3B foundation model for humanoid and manipulation. Built on an Eagle-2 vision encoder + Llama backbone with large-scale robot pre-training. Despite being the industry's flagship VLA from the GPU leader, it scores 0% SR with both grasp_miss and collision failures — showing that even a tier-1 vendor cannot bridge the training-deployment distribution gap on adversarial scenarios.

OpenVLA

7B

Stanford + Toyota Research Institute

0% SR · Confidence 27/100

Open-source 7B VLA from Stanford + Toyota Research Institute. Built on Llama-2 backbone, fine-tuned on Open X-Embodiment. The largest model tested — yet 0% SR with a different failure profile: primarily grasp_miss with zero collisions, suggesting better spatial awareness but still unable to complete tasks.

Octo-Base

93M

UC Berkeley

0% SR · Confidence 1/100

93M parameter version of Octo from UC Berkeley. Trained on 800K episodes from Open X-Embodiment. 3.4× larger than Octo-Small but identical 0% SR and nearly identical failure distribution.

SmolVLA Base

450M

HuggingFace

0% SR · Confidence 1/100

HuggingFace's 450M parameter VLA built on SmolLM2 language model + SigLIP vision encoder. Designed for efficient on-device deployment. The fastest model tested (18ms/inference) — yet 0% SR, demonstrating that even purpose-built efficient VLAs cannot bridge the training-deployment gap.

Octo-Small

27M

UC Berkeley

0% SR · Confidence 1/100

27M parameter lightweight VLA from UC Berkeley. The smallest and fastest model. Same 0% result with 79.4% grasp_miss and 20.6% collision failures.

Scripted vs VLA

Scripted Baseline

100%

68/68 PASS · Confidence 76/100

VLA Models (6 models)

0.0%

0/68 FAIL · Best Confidence: 27/100 (OpenVLA, PI0)

100-point gap

Why Confidence Scores Differ

Same 0% SR but different failure patterns yield different Confidence Scores

27/100 — PI0 (NEW)

Physical Intelligence's official 3.5B model (OpenPI). Zero collisions — the same pattern as OpenVLA. The Flow-Matching architecture also cannot bridge the distribution gap.

1/100 — GR00T N1.6 (NVIDIA)

NVIDIA's official 3B model. Collisions present, grasp_miss dominant. Despite large-scale robot pre-training, complete failure on industrial adversarial scenarios.

27/100 — OpenVLA

Zero collisions — spatial awareness exists, but the model cannot grasp. Higher confidence means "safe but incapable".

1/100 — Octo (both)

20%+ collision rate — crashes into the table and obstacles. Low confidence means "incapable and dangerous".
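One way to see how identical 0% SR rows diverge in confidence is a toy scorer that rewards success and grants residual credit only for collision-free behavior. This is an illustrative sketch with made-up weights, not RoboGate's published formula; only the qualitative ordering (safe-but-incapable above colliding) is meant to match the table.

```python
def confidence_score(success_rate: float, collision_rate: float) -> int:
    """Toy 0-100 confidence score (hypothetical weights, not RoboGate's).

    Success dominates; a collision-free failure keeps partial
    'safe but incapable' credit, a colliding failure keeps almost none.
    """
    base = 100.0 * success_rate
    # The cap (27) and the steep exponent (10) are illustrative assumptions.
    safety_credit = 27.0 * (1.0 - collision_rate) ** 10 * (1.0 - success_rate)
    return max(1, min(100, round(base + safety_credit)))
```

Under these weights, a perfect policy scores 100, a zero-collision 0% SR policy scores 27 (the OpenVLA/PI0 pattern), and any meaningful collision rate collapses the score toward 1 (the Octo pattern).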

Methodology

1. Two-process ZMQ pipeline: Isaac Sim server (Python 3.11, mss camera) ↔ Octo client (Python 3.10, JAX 0.4)
2. Camera: mss screen capture, 256×256 RGB at 20 Hz
3. Action: 7-DOF delta EE pose (pos ×0.02 m, rot ×0.05 rad) → IK solver → joint targets
4. Episode: max 300 steps; success = object within 3 cm of target
5. Inference: 19,038 total inferences across 68 scenarios
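The action-scaling and success rules above can be sketched directly. The constants (0.02 m, 0.05 rad, 300 steps, 3 cm) come from the methodology; the episode-loop interface around them (`env.observe()`, `env.step()`, etc.) is a hypothetical stub, since the real harness drives Isaac Sim through its own API.

```python
import math

# Constants from the methodology above; everything else is a sketch.
POS_SCALE = 0.02    # metres per unit of model position output
ROT_SCALE = 0.05    # radians per unit of model rotation output
MAX_STEPS = 300     # episode cap
SUCCESS_TOL = 0.03  # success = object within 3 cm of target

def scale_action(raw):
    """Scale a raw 7-DOF action [dx, dy, dz, droll, dpitch, dyaw, gripper]."""
    pos = [a * POS_SCALE for a in raw[:3]]
    rot = [a * ROT_SCALE for a in raw[3:6]]
    return pos + rot + [raw[6]]  # gripper command passed through unscaled

def is_success(obj_pos, target_pos):
    """Episode success test: Euclidean distance within 3 cm."""
    return math.dist(obj_pos, target_pos) <= SUCCESS_TOL

def run_episode(policy, env, max_steps=MAX_STEPS):
    """Hypothetical harness loop. In the real pipeline, the scaled delta EE
    pose goes through an IK solver to joint targets inside Isaac Sim."""
    for _ in range(max_steps):
        env.step(scale_action(policy(env.observe())))
        if is_success(env.object_pos(), env.target_pos()):
            return True
    return False
```

With 68 scenarios and 19,038 total inferences, episodes averaged roughly 280 policy calls each, consistent with most runs exhausting the 300-step cap.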

Upcoming Evaluations

Models currently being integrated for evaluation

Cosmos Policy

IN PROGRESS

NVIDIA · 2B · Predict 2.5 · LIBERO 98.33% SOTA

GR00T N1.6 (Real)

EVALUATED

NVIDIA · 3B · LIBERO_PANDA · Isaac-GR00T API

GR00T N1.7 (LIBERO × 4)

EVALUATED

NVIDIA · 3B · libero_{10,goal,object,spatial} · Apache 2.0

GR00T N2

PENDING RELEASE

NVIDIA · DreamZero · Late 2026

The Validation Layer Pattern

On April 14, 2026, NVIDIA released Ising — an open AI model family for quantum computing — and framed it explicitly as "the control plane for quantum machines." The same pattern applies to Physical AI.

| | NVIDIA Ising (Quantum) | RoboGate (Physical AI) |
|---|---|---|
| Domain | Quantum Computing | Physical AI |
| Hardware noise | ~10⁻³ qubit errors | Physics sim-to-real gap |
| Validation method | 35B VLM + 3D CNN | 68-scenario benchmark |
| Benchmark | QCalEval (6 tests) | RoboGate Bench (68) |
| Key finding | 2.5× faster decoding | 97.65% → 0% gap |
| Target integration | CUDA-Q + NVQLink | Isaac Sim + Arena |
| Release | HF + GitHub (open) | HF + GitHub (open) |

*RoboGate is not affiliated with NVIDIA. This comparison illustrates a structural parallel: both serve as AI-based validation layers for fundamentally noisy systems. NVIDIA Ising is a trademark of NVIDIA Corporation.*

NVIDIA Ising Press Release | NVIDIA Ising Product Page

Citation

If you use this VLA benchmark in your research:

@misc{kim2026robogate,
  title   = {ROBOGATE: Adaptive Failure Discovery for Safe Robot
             Policy Deployment via Two-Stage Boundary-Focused Sampling},
  author  = {{AgentAI Co., Ltd.}},
  year    = {2026},
  doi     = {10.5281/zenodo.19166967},
  url     = {https://robogate.io/paper},
  note    = {VLA Benchmark: 8 VLA models 0/68. Cross-simulator gap: GR00T N1.6 LIBERO 97.65\% → RoboGate 0\%. PI0, SmolVLA, OpenVLA, Octo-Base, Octo-Small also 0/68}
}

Test Your Own VLA Model

RoboGate's 68-scenario suite is open-source. Run your VLA model against the same adversarial conditions.