VLA BENCHMARK · 4 MODELS

VLA Leaderboard: 4 Models on RoboGate's 68-Scenario Suite

27M → 93M → 3B → 7B — all 0% SR. Including NVIDIA's official GR00T N1.6. Model size is not the bottleneck; training-deployment distribution mismatch is.

Leaderboard

All models evaluated on identical 68 scenarios · Franka Panda · Isaac Sim 5.1

Model                      Params   SR      Result   Conf.
Scripted Controller        n/a      100%    68/68    76
GR00T N1.6 (NVIDIA)        3B       0.0%    0/68     1
OpenVLA (Stanford + TRI)   7B       0.0%    0/68     27
Octo-Base (UC Berkeley)    93M      0.0%    0/68     1
Octo-Small (UC Berkeley)   27M      0.0%    0/68     1

Key Insight

Scaling from 27M to 7B (260×) parameters yields zero improvement — and even NVIDIA's official GR00T N1.6 (3B) scores 0%. The failure is not capacity — it's the distribution gap between robot pre-training data and RoboGate's adversarial industrial scenarios.

Scale tested

27M → 7B (260×)

NVIDIA Official

GR00T N1.6 (3B)

Improvement

0%

vs Scripted

100-point gap

Models Evaluated

Four VLA models — including NVIDIA's official GR00T N1.6 — evaluated on the same 68 adversarial scenarios via two-process ZMQ pipeline (Isaac Sim ↔ VLA inference).

GR00T N1.6

3B

NVIDIA

0% SR · Confidence 1/100

NVIDIA's official 3B foundation model for humanoid and manipulation, built on an Eagle-2 vision encoder and a Llama backbone with large-scale robot pre-training. Despite being the industry's flagship VLA from the GPU leader, it scores 0% SR with both grasp_miss and collision failures — showing that even a tier-1 vendor's pre-training has not closed the training-deployment distribution gap on adversarial scenarios.

OpenVLA

7B

Stanford + Toyota Research Institute

0% SR · Confidence 27/100

Open-source 7B VLA from Stanford and the Toyota Research Institute, built on a Llama-2 backbone and fine-tuned on Open X-Embodiment. The largest model tested — yet 0% SR, with a different failure profile: primarily grasp_miss with zero collisions, suggesting better spatial awareness but still an inability to complete tasks.

Octo-Base

93M

UC Berkeley

0% SR · Confidence 1/100

93M parameter version of Octo from UC Berkeley. Trained on 800K episodes from Open X-Embodiment. 3.4× larger than Octo-Small but identical 0% SR and nearly identical failure distribution.

Octo-Small

27M

UC Berkeley

0% SR · Confidence 1/100

27M parameter lightweight VLA from UC Berkeley. The smallest and fastest model. Same 0% result with 79.4% grasp_miss and 20.6% collision failures.

Scripted vs VLA

Scripted Baseline

100%

68/68 PASS · Confidence 76/100

VLA Models (all 4)

0.0%

0/68 FAIL · Best Confidence: 27/100 (OpenVLA)

100-point gap

Why Confidence Scores Differ

Same 0% SR but different failure patterns yield different Confidence Scores

1/100 — GR00T N1.6 (NVIDIA)

NVIDIA's official 3B model. Collisions present + grasp_miss dominant. Despite large-scale robot pre-training, complete failure on industrial adversarial scenarios

27/100 — OpenVLA

Zero collisions — spatial awareness exists but unable to grasp. Higher confidence means 'safe but incapable'

1/100 — Octo (both)

20%+ collision rate — crashes into table/obstacles. Low confidence means 'incapable and dangerous'
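To make the pattern above concrete, here is a toy scoring sketch. RoboGate's actual Confidence Score formula is not given in this document; the function below is purely illustrative, showing only the qualitative behavior described: with identical 0% SR, a zero-collision failure profile ("safe but incapable") scores well above a collision-heavy one ("incapable and dangerous").

```python
def confidence_score(success_rate: float, collision_rate: float) -> int:
    """Toy confidence score in [0, 100] (NOT RoboGate's real formula).

    Success rate dominates; among failures, collisions are penalized
    far more heavily than benign grasp misses.
    """
    # Any collision rate above 20% drives the safety term to zero.
    safety = max(0.0, 1.0 - 5.0 * collision_rate)
    return round(100 * (0.7 * success_rate + 0.3 * safety))

# Same 0% SR, very different scores:
print(confidence_score(0.0, 0.0))    # zero collisions -> 30 (cf. OpenVLA's 27)
print(confidence_score(0.0, 0.206))  # 20.6% collisions -> 0 (cf. Octo's 1)
```

The exact weights here are invented; the point is the ordering, not the values.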

Methodology

01 Two-process ZMQ pipeline: Isaac Sim server (Python 3.11, mss camera) ↔ Octo client (Python 3.10, JAX 0.4)
02 Camera: mss screen capture, 256×256 RGB at 20 Hz
03 Action: 7-DOF delta EE pose (pos ×0.02 m, rot ×0.05 rad) → IK solver → joint targets
04 Episode: max 300 steps; success = object within 3 cm of target
05 Inference: 19,038 total inferences across 68 scenarios
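The action and episode steps above can be sketched in a few lines. This is a minimal illustration, not the actual RoboGate code: function names (`scale_action`, `is_success`) and the raw-action layout `[dx, dy, dz, droll, dpitch, dyaw, grip]` are assumptions; only the scale factors (pos ×0.02 m, rot ×0.05 rad), the 300-step cap, and the 3 cm success threshold come from the methodology list.

```python
import numpy as np

POS_SCALE = 0.02    # metres per unit of raw delta position (step 03)
ROT_SCALE = 0.05    # radians per unit of raw delta rotation (step 03)
MAX_STEPS = 300     # episode cap (step 04)
SUCCESS_TOL = 0.03  # success = object within 3 cm of target (step 04)

def scale_action(raw):
    """Map a raw 7-DOF VLA output [dx,dy,dz, droll,dpitch,dyaw, grip]
    to a scaled delta end-effector pose for the IK solver."""
    raw = np.asarray(raw, dtype=np.float64)
    delta_pos = raw[:3] * POS_SCALE   # metres
    delta_rot = raw[3:6] * ROT_SCALE  # radians
    gripper = raw[6]                  # passed through unscaled (assumption)
    return delta_pos, delta_rot, gripper

def is_success(obj_pos, target_pos, tol=SUCCESS_TOL):
    """Episode succeeds when the object lies within `tol` of the target."""
    return np.linalg.norm(np.asarray(obj_pos) - np.asarray(target_pos)) <= tol
```

In the two-process pipeline, each step the Isaac Sim server would send a 256×256 RGB frame over ZMQ, the VLA client would reply with a raw 7-DOF action, and the server would apply `scale_action` before the IK solver, stopping at success or at `MAX_STEPS`.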

Citation

If you use this VLA benchmark in your research:

@misc{kim2026robogate,
  title   = {ROBOGATE: Adaptive Failure Discovery for Safe Robot
             Policy Deployment via Two-Stage Boundary-Focused Sampling},
  author  = {{AgentAI Co., Ltd.}},
  year    = {2026},
  doi     = {10.5281/zenodo.19166967},
  url     = {https://robogate.io/paper},
  note    = {VLA Benchmark: GR00T N1.6 0/68, OpenVLA 0/68, Octo-Base 0/68, Octo-Small 0/68}
}

Test Your Own VLA Model

RoboGate's 68-scenario suite is open-source. Run your VLA model against the same adversarial conditions.