
Exploring Spatial Reasoning of Embodied Agents in Dynamic Industrial Navigation

Michigan State University, UNC Chapel Hill, Ohio State University, UC Santa Barbara, Arizona State University, Independent Researcher
*Equal Contribution
IndustryNav Introduction

IndustryNav provides a zero-shot navigation setting where an embodied agent is prompted with an egocentric image, global odometry, and action–state history to reach a target while avoiding dynamic obstacles.
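
A minimal sketch of one decision step in this loop, assuming a generic environment/model interface (render_egocentric, get_odometry, query, execute) and a small discrete action set; these names are illustrative placeholders, not the benchmark's exact API.

```python
# Sketch of a single zero-shot decision step in a PointGoal-style loop.
# The environment and VLLM interfaces below are illustrative assumptions.

ACTIONS = ["move_forward", "turn_left", "turn_right", "stop"]

def navigation_step(env, vllm, history, goal_xy):
    image = env.render_egocentric()      # current first-person frame
    odom = env.get_odometry()            # global position and heading
    prompt = (
        f"Goal (global frame): {goal_xy}\n"
        f"Current odometry: {odom}\n"
        f"Action-state history (last 10): {history[-10:]}\n"
        f"Choose exactly one action from {ACTIONS}."
    )
    action = vllm.query(prompt, image)   # zero-shot multimodal call
    if action not in ACTIONS:
        action = "stop"                  # fall back on unparsable output
    state = env.execute(action)          # apply the action, observe the new state
    history.append((action, state))
    return action, state
```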

Agent Trajectory Outputs

We visualize the navigation behavior of each embodied agent using synchronized egocentric and top-down trajectory GIFs.

[Trajectory GIFs: paired Egocentric and Top-Down views for GPT-4o, GPT-5-mini, Claude-Haiku-4.5, Claude-Sonnet-4.5, Gemini-2.5-flash, Qwen3-VL-8B-Instruct, Qwen3-VL-30B-A3B-Instruct, LLaMA-4-Scout, and Nemotron-nano-12B-v2-VL.]

Abstract

While Visual Large Language Models (VLLMs) show great promise as embodied agents, they continue to face substantial challenges in spatial reasoning. Existing embodied benchmarks largely focus on passive, static household environments and evaluate only isolated capabilities, failing to capture holistic performance in dynamic, real-world complexity.

To fill this gap, we present IndustryNav, the first dynamic industrial navigation benchmark for active spatial reasoning. IndustryNav leverages 12 manually created, high-fidelity Unity warehouse scenarios featuring dynamic objects and human movement. Our evaluation employs a PointGoal navigation pipeline that effectively combines egocentric vision with global odometry to assess holistic local-global planning.

Crucially, we introduce the "collision rate" and "warning rate" metrics to measure safety-oriented behaviors and distance estimation. A comprehensive study of nine state-of-the-art VLLMs (including models such as GPT-5-mini, Claude-4.5, and Gemini-2.5) reveals that closed-source models maintain a consistent advantage; however, all agents exhibit notable deficiencies in robust path planning, collision avoidance and active exploration.

Agent Viewpoints

To capture both local and global information, we employ a multi-sensor tracking setup. The agent utilizes an Egocentric Camera for immediate perception and collision avoidance, while a Top-Down View provides global reference for trajectory monitoring.

[Agent cameras] Left: Egocentric view (what the agent sees). Right: Top-down global odometry view.

Evaluation Metrics

We propose five metrics across three dimensions: Success, Efficiency, and Safety.

Notation (used in the sketch that follows):
  • \( N \): Total number of runs.
  • \( d_i \): Final distance to target in run \(i\).
  • \( D_i \): Initial distance to target in run \(i\).
  • \( T_i \): Total steps taken in run \(i\).
  • \( \delta \): Success threshold distance.
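
Under this notation, a plausible formalization of the aggregate metrics is sketched below (an illustrative reading, not the paper's verbatim definitions; the safety metrics aggregate per-step collision and warning events analogously, and all ratios are reported as percentages):

\[
\mathrm{Success\ Ratio} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[d_i \le \delta\right],
\qquad
\mathrm{Distance\ Ratio} = \frac{1}{N}\sum_{i=1}^{N} \frac{D_i - d_i}{D_i},
\qquad
\mathrm{Avg\ Steps} = \frac{1}{N}\sum_{i=1}^{N} T_i
\]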

Warning Detection System

Safety is paramount in industrial settings. We implement a warning system using Depth Pro to estimate distance. A warning is triggered when the minimum depth value within the Region of Interest (RoI) falls below a predefined safety threshold.
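
A minimal sketch of the per-frame check, assuming a metric depth map (in meters) has already been produced by a monocular estimator such as Depth Pro; the RoI bounds and threshold below are illustrative values, not the benchmark's exact settings.

```python
import numpy as np

def warning_triggered(depth_map: np.ndarray,
                      roi: tuple[slice, slice],
                      threshold_m: float = 1.0) -> bool:
    """Return True if the closest point inside the RoI is nearer than the threshold.

    depth_map   : HxW array of estimated metric depth in meters (e.g. from Depth Pro).
    roi         : (row_slice, col_slice) selecting the region in front of the agent.
    threshold_m : illustrative safety distance; the benchmark's value may differ.
    """
    roi_depth = depth_map[roi]
    return float(np.min(roi_depth)) < threshold_m

# Example: treat the central lower band of a 480x640 frame as the RoI.
depth = np.full((480, 640), 5.0)                  # stand-in for a Depth Pro output
depth[300:320, 300:340] = 0.8                     # a nearby obstacle
print(warning_triggered(depth, (slice(240, 480), slice(160, 480))))  # True
```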

[Warning detection demo] Example of warning detection: proximity to machinery and obstacles triggers a safety warning.

Experimental Results

Comparison of nine state-of-the-art VLLMs. Red indicates the best closed-source performance; blue indicates the best open-source performance.

Embodied Agents             Task Success (%)                     Efficiency    Safety (%)
                            Success Ratio ↑   Distance Ratio ↑   Avg Steps ↓   Collision Ratio ↓   Warning Ratio ↓

Closed-Source VLLMs
GPT-4o                      21.53             49.41              66.76         7.86                13.45
GPT-5-mini                  54.17             81.90              49.91         16.89               24.13
Claude-Haiku-4.5            61.81             82.87              46.80         32.18               31.57
Claude-Sonnet-4.5           61.81             86.26              47.33         27.68               31.52
Gemini-2.5-flash            65.28             84.49              45.95         32.14               37.16

Open-Source VLLMs
Qwen3-VL-8B-Instruct        4.86              27.05              67.22         27.82               25.70
Qwen3-VL-30B-A3B-Instruct   6.25              26.20              66.70         18.97               26.28
LLaMA-4-Scout               15.28             56.40              61.53         24.38               35.06
Nemotron-nano-12B-v2-VL     55.56             80.48              50.69         31.73               36.54

Key Findings

Our comprehensive evaluation of nine state-of-the-art VLLMs reveals several critical insights about spatial reasoning in dynamic industrial environments.

No VLLM Can Reliably Navigate

Across all nine models, the Task Success Ratio remains low (less than 70%), and none of them can consistently reach the target across all scenarios. This highlights that spatial reasoning and long-horizon navigation in dynamic industrial environments remain fundamentally challenging for current VLLMs.

Closed-Source Models Lead

Closed-source VLLMs consistently outperform open-source counterparts across task success and efficiency metrics. They achieve substantially higher success rates and generally require fewer steps to reach targets, reflecting superior planning and route optimization. Most open-source VLLMs, particularly Qwen3-VL and LLaMA-4, remain less competitive for complex navigation tasks.

Nemotron Stands Out

Among open-source VLLMs, Nemotron-nano-12B-v2-VL stands out with a 55.56% Success Ratio, approaching closed-source performance levels. It demonstrates reasonable efficiency and moderate safety scores, making it the most promising open-source baseline for spatial reasoning in industrial navigation tasks.

Safety Remains Critical

Both closed-source and open-source VLLMs show high Collision and Warning Ratios, highlighting substantial deficiencies in hazard perception, navigating around dynamic obstacles, and maintaining consistent collision avoidance. All VLLMs remain far from achieving the level of safety required for real-world deployment.

Core Reasoning Deficiencies

Case analysis reveals that current VLLMs still struggle with three fundamental capabilities:

  • Robust Global Path Planning: Agents fail to recognize blocked paths and do not replan routes when obstacles are encountered.
  • Active Exploration: Models often get stuck in action loops instead of exploring alternatives, overlooking repeated action-state patterns (a detection sketch follows this list).
  • Precise Distance Estimation: Agents frequently misperceive clearances, believing paths are clear when obstacles exist, leading to collisions.
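
For illustration, a minimal sketch of how such repeated action-state patterns could be flagged from the same action-state history the agent already receives; the window size, repeat count, and history format are illustrative assumptions rather than the benchmark's own machinery.

```python
def stuck_in_loop(history, window=2, repeats=2):
    """Heuristically flag an action loop: the most recent `window`-long
    (action, state) pattern has just occurred `repeats` times back to back."""
    need = window * repeats
    if len(history) < need:
        return False
    recent = history[-need:]
    pattern = recent[-window:]
    return all(recent[i:i + window] == pattern for i in range(0, need, window))

# Example: turning left then right in place, twice in a row, is flagged.
hist = [("turn_left", (0, 0)), ("turn_right", (0, 0))] * 2
print(stuck_in_loop(hist))  # True
```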

Key Takeaway: These findings highlight a critical need for embodied research to move beyond passive perception and toward tasks that demand stable planning, active exploration, and safe behavior in dynamic, real-world environments.

Ablation Studies

To evaluate the effectiveness of key components in our navigation pipeline, we conduct ablation studies on three closed-source VLLMs: Claude-Sonnet-4.5, Gemini-2.5-flash, and GPT-5-mini.

Qualitative Analysis

We analyze representative cases from GPT-5-mini. The agent exhibits signs of spatial reasoning by identifying blocked paths but often fails in global planning and active exploration, getting stuck in action loops.

[Qualitative results for GPT-5-mini] Illustration of both correct (first row) and incorrect (second and third rows) action behaviors of GPT-5-mini in the IndustryNav scenarios.

BibTeX

If you find our work useful, please consider citing:

@article{li2025industrynav,
  title={IndustryNav: Exploring Spatial Reasoning of Embodied Agents in Dynamic Industrial Navigation},
  author={Li, Yifan and Li, Lichi and Dao, Anh and Zhou, Xinyu and Qiao, Yicheng and Mai, Zheda and Lee, Daeun and Chen, Zichen and Tan, Zhen and Bansal, Mohit and Kong, Yu},
  journal={arXiv preprint},
  year={2025}
}