While Visual Large Language Models (VLLMs) show great promise as
embodied agents, they continue to face substantial challenges in
spatial reasoning. Existing embodied benchmarks largely focus on
passive, static household environments and evaluate only
isolated capabilities, failing to capture holistic performance
under dynamic, real-world conditions.
To fill this gap, we present IndustryNav, the
first dynamic industrial navigation benchmark for active spatial
reasoning. IndustryNav leverages 12 manually created,
high-fidelity Unity warehouse scenarios featuring dynamic
objects and human movement. Our evaluation employs a PointGoal
navigation pipeline that combines egocentric vision with global
odometry to assess holistic local-global planning.
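For concreteness, the sketch below shows how such a pipeline
might fold odometry into the agent's prompt at each step;
`goal_vector`, `navigation_step`, the action vocabulary, and the
`vllm.query` call are illustrative assumptions, not the paper's
actual interface.

```python
import math

def goal_vector(agent_pose, goal_xy):
    """Relative distance/heading to the goal from global odometry.

    agent_pose: (x, y, yaw_radians); goal_xy: (x, y).
    Hypothetical helper -- the paper does not specify this interface.
    """
    dx, dy = goal_xy[0] - agent_pose[0], goal_xy[1] - agent_pose[1]
    distance = math.hypot(dx, dy)
    heading = math.atan2(dy, dx) - agent_pose[2]
    # Wrap heading into [-pi, pi] so "turn left/right" cues stay
    # consistent no matter how far the agent has already rotated.
    heading = (heading + math.pi) % (2 * math.pi) - math.pi
    return distance, heading

def navigation_step(vllm, rgb_frame, agent_pose, goal_xy):
    """One PointGoal step: egocentric vision + odometry -> action."""
    distance, heading = goal_vector(agent_pose, goal_xy)
    prompt = (
        f"Goal is {distance:.1f} m away at {math.degrees(heading):.0f} "
        "deg relative to your heading. Choose one action: "
        "move_forward, turn_left, turn_right, stop."
    )
    return vllm.query(image=rgb_frame, text=prompt)  # assumed agent API
```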
Crucially, we introduce two metrics, "collision rate" and
"warning rate", to quantify safety-oriented behavior and the
reliability of distance estimation.
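The abstract does not give formulas for these metrics; one
plausible per-step reading is sketched below, where the episode
schema, the field names, and the 1.0 m warning radius are
assumptions for illustration.

```python
def safety_metrics(episodes, warning_radius=1.0):
    """Aggregate per-step safety events into collision/warning rates.

    Assumed definitions (not the paper's):
    - collision rate: fraction of steps where the agent contacts an
      obstacle or a moving human;
    - warning rate: fraction of steps where the agent comes within
      `warning_radius` meters of one without touching it.
    `episodes` is a list of per-step records, each a dict with
    `collided` and `min_obstacle_dist` fields -- an illustrative schema.
    """
    steps = [step for episode in episodes for step in episode]
    collisions = sum(1 for s in steps if s["collided"])
    warnings = sum(
        1 for s in steps
        if not s["collided"] and s["min_obstacle_dist"] < warning_radius
    )
    n = max(len(steps), 1)  # guard against empty logs
    return {"collision_rate": collisions / n, "warning_rate": warnings / n}
```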
A comprehensive study of nine state-of-the-art VLLMs (including
GPT-5-mini, Claude-4.5, and Gemini-2.5) reveals that
closed-source models maintain a consistent advantage; however,
all agents exhibit notable deficiencies in robust path planning,
collision avoidance, and active exploration.