27M → 93M → 3B → 7B: all 0% SR, including NVIDIA's official GR00T N1.6. Model size is not the bottleneck; the training-deployment distribution mismatch is.
All models evaluated on identical 68 scenarios · Franka Panda · Isaac Sim 5.1
| Model | Params | Success Rate | Scenarios Passed | Confidence (/100) |
|---|---|---|---|---|
| Scripted Controller | — | 100% | 68/68 | 76 |
| GR00T N1.6 (NVIDIA) | 3B | 0.0% | 0/68 | 1 |
| OpenVLA (Stanford + TRI) | 7B | 0.0% | 0/68 | 27 |
| Octo-Base (UC Berkeley) | 93M | 0.0% | 0/68 | 1 |
| Octo-Small (UC Berkeley) | 27M | 0.0% | 0/68 | 1 |
Scaling parameters 260×, from 27M to 7B, yields zero improvement, and even NVIDIA's official GR00T N1.6 (3B) scores 0%. The bottleneck is not capacity; it is the distribution gap between robot pre-training data and RoboGate's adversarial industrial scenarios.
- Scale tested: 27M → 7B (260×)
- NVIDIA official: GR00T N1.6 (3B)
- Improvement from scaling: 0%
- Gap vs scripted baseline: 100 points
Four VLA models — including NVIDIA's official GR00T N1.6 — evaluated on the same 68 adversarial scenarios via two-process ZMQ pipeline (Isaac Sim ↔ VLA inference).
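The two-process loop described above can be sketched in a few lines. The snippet below is a minimal illustration, not RoboGate's actual protocol: the endpoint, the pickled-dict wire format, and the field names (`rgb`, `proprio`, `action`) are all assumptions, and the server returns a zero action in place of real VLA inference.

```python
import pickle
import threading

import numpy as np
import zmq

# Hypothetical endpoint; RoboGate's real transport settings are not public here.
ENDPOINT = "tcp://127.0.0.1:5555"

def vla_server(ctx, n_requests):
    """Stand-in for the VLA inference process: receive an observation,
    reply with a 7-DoF action (zeros here, as a placeholder policy)."""
    sock = ctx.socket(zmq.REP)
    sock.bind(ENDPOINT)
    for _ in range(n_requests):
        obs = pickle.loads(sock.recv())           # observation from the sim
        action = np.zeros(7, dtype=np.float32)    # placeholder model output
        sock.send(pickle.dumps({"action": action}))
    sock.close()

def sim_client(ctx, n_steps):
    """Stand-in for the Isaac Sim process: stream observations, collect actions."""
    sock = ctx.socket(zmq.REQ)
    sock.connect(ENDPOINT)
    actions = []
    for _ in range(n_steps):
        obs = {"rgb": np.zeros((224, 224, 3), dtype=np.uint8),
               "proprio": np.zeros(7, dtype=np.float32)}
        sock.send(pickle.dumps(obs))
        reply = pickle.loads(sock.recv())
        actions.append(reply["action"])
    sock.close()
    return actions

ctx = zmq.Context()
server = threading.Thread(target=vla_server, args=(ctx, 3))
server.start()
acts = sim_client(ctx, 3)
server.join()
ctx.term()
print(len(acts), acts[0].shape)
```

In the real harness the two sides run as separate OS processes, keeping the simulator and the model's GPU runtime isolated; a thread stands in for the second process here only to keep the sketch self-contained.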
GR00T N1.6 (NVIDIA)
NVIDIA's official 3B foundation model for humanoid and manipulation, built on an Eagle-2 vision encoder and a Llama backbone with large-scale robot pre-training. Despite being the industry's flagship VLA from the GPU leader, it scores 0% SR with both grasp_miss and collision failures, showing that even a tier-1 vendor cannot bridge the training-deployment distribution gap on adversarial scenarios.
OpenVLA (Stanford + Toyota Research Institute)
Open-source 7B VLA from Stanford and Toyota Research Institute, built on a Llama-2 backbone and fine-tuned on Open X-Embodiment. The largest model tested, yet 0% SR with a different failure profile: primarily grasp_miss with zero collisions, suggesting better spatial awareness but still an inability to complete tasks.
Octo-Base (UC Berkeley)
93M-parameter version of Octo from UC Berkeley, trained on 800K episodes from Open X-Embodiment. 3.4× larger than Octo-Small, but identical 0% SR and a nearly identical failure distribution.
Octo-Small (UC Berkeley)
27M-parameter lightweight VLA from UC Berkeley, the smallest and fastest model tested. Same 0% result, with 79.4% grasp_miss and 20.6% collision failures.
Scripted Baseline
68/68 PASS · Confidence 76/100
VLA Models (all 4)
0/68 FAIL · Best Confidence: 27/100 (OpenVLA)
100-point gap
Same 0% SR but different failure patterns yield different Confidence Scores
GR00T N1.6 (Confidence 1/100): collisions present and grasp_miss dominant. Despite large-scale robot pre-training, complete failure on industrial adversarial scenarios.
OpenVLA (Confidence 27/100): zero collisions. Spatial awareness exists, but the model cannot grasp; the higher confidence means "safe but incapable".
Octo-Base/Small (Confidence 1/100): 20%+ collision rate, crashing into the table and obstacles; the low confidence means "incapable and dangerous".
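The ranking logic above can be made concrete with a toy scoring function. This is not RoboGate's actual Confidence Score formula (which is not given on this page); the 90/10 weights are invented purely to show why a collision-free 0% SR run can score higher than a collision-heavy one.

```python
def confidence_score(success_rate: float, collision_rate: float) -> float:
    """Toy score: capability dominates, while a small safety term separates
    'safe but incapable' from 'incapable and dangerous'.
    The weights (90/10) are illustrative, not RoboGate's formula."""
    capability = 90.0 * success_rate
    safety = 10.0 * (1.0 - collision_rate)
    return round(capability + safety, 1)

# Failure profiles taken from the results above: (success_rate, collision_rate)
profiles = {
    "Scripted baseline": (1.0, 0.0),
    "OpenVLA":           (0.0, 0.0),    # all grasp_miss, zero collisions
    "Octo-Small":        (0.0, 0.206),  # 20.6% collision failures
}
scores = {name: confidence_score(sr, cr) for name, (sr, cr) in profiles.items()}
print(scores)
```

Under these made-up weights the scripted baseline scores 100.0, OpenVLA's collision-free failures score above Octo-Small's collision-heavy ones, and the qualitative ordering of the real Confidence column is reproduced.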
If you use this VLA benchmark in your research:
@misc{kim2026robogate,
  title  = {ROBOGATE: Adaptive Failure Discovery for Safe Robot Policy Deployment via Two-Stage Boundary-Focused Sampling},
  author = {{AgentAI Co., Ltd.}},
  year   = {2026},
  doi    = {10.5281/zenodo.19166967},
  url    = {https://robogate.io/paper},
  note   = {VLA Benchmark: GR00T N1.6 0/68, OpenVLA 0/68, Octo-Base 0/68, Octo-Small 0/68}
}

RoboGate's 68-scenario suite is open-source. Run your VLA model against the same adversarial conditions.