
Exploring Spatial Reasoning of Embodied Agents in Dynamic Industrial Navigation

Michigan State University, UNC Chapel Hill, Ohio State University, UC Santa Barbara, Arizona State University, Independent Researcher
*Equal Contribution
IndustryNav Introduction

IndustryNav provides a zero-shot navigation setting where an embodied agent is prompted with an egocentric image, global odometry, and action–state history to reach a target while avoiding dynamic obstacles.
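
A minimal sketch of one decision step in this loop, assuming a generic environment/model interface (render_egocentric, get_odometry, query, execute) and a small discrete action set; these names are illustrative placeholders, not the benchmark's exact API.

```python
# Sketch of a single zero-shot decision step in a PointGoal-style loop.
# The environment and VLLM interfaces below are illustrative assumptions.

ACTIONS = ["move_forward", "turn_left", "turn_right", "stop"]

def navigation_step(env, vllm, history, goal_xy):
    image = env.render_egocentric()      # current first-person frame
    odom = env.get_odometry()            # global position and heading
    prompt = (
        f"Goal (global frame): {goal_xy}\n"
        f"Current odometry: {odom}\n"
        f"Action-state history (last 10): {history[-10:]}\n"
        f"Choose exactly one action from {ACTIONS}."
    )
    action = vllm.query(prompt, image)   # zero-shot multimodal call
    if action not in ACTIONS:
        action = "stop"                  # fall back on unparsable output
    state = env.execute(action)          # apply the action, observe the new state
    history.append((action, state))
    return action, state
```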

Agent Trajectory Outputs

We visualize the navigation behavior of each embodied agent using synchronized egocentric and top-down trajectory GIFs.

[Trajectory GIFs: paired Egocentric and Top-Down views for GPT-4o, GPT-5-mini, Claude-Haiku-4.5, Claude-Sonnet-4.5, Gemini-2.5-flash, Qwen3-VL-8B-Instruct, Qwen3-VL-30B-A3B-Instruct, LLaMA-4-Scout, and Nemotron-nano-12B-v2-VL.]

Abstract

While Visual Large Language Models (VLLMs) show great promise as embodied agents, they continue to face substantial challenges in spatial reasoning. Existing embodied benchmarks largely focus on passive, static household environments and evaluate only isolated capabilities, failing to capture holistic performance in dynamic, real-world complexity.

To fill this gap, we present IndustryNav, the first dynamic industrial navigation benchmark for active spatial reasoning. IndustryNav leverages 12 manually created, high-fidelity Unity warehouse scenarios featuring dynamic objects and human movement. Our evaluation employs a PointGoal navigation pipeline that effectively combines egocentric vision with global odometry to assess holistic local-global planning.

Crucially, we introduce the "collision rate" and "warning rate" metrics to measure safety-oriented behaviors and distance estimation. A comprehensive study of nine state-of-the-art VLLMs (including models such as GPT-5-mini, Claude-4.5, and Gemini-2.5) reveals that closed-source models maintain a consistent advantage; however, all agents exhibit notable deficiencies in robust path planning, collision avoidance and active exploration.

Agent Viewpoints

To capture both local and global information, we employ a multi-sensor tracking setup. The agent utilizes an Egocentric Camera for immediate perception and collision avoidance, while a Top-Down View provides global reference for trajectory monitoring.

[Agent cameras] Left: Egocentric view (what the agent sees). Right: Top-down global odometry view.

Evaluation Metrics

We propose five metrics across three dimensions: Success, Efficiency, and Safety.

Notation (used in the sketch that follows):
  • \( N \): Total number of runs.
  • \( d_i \): Final distance to target in run \(i\).
  • \( D_i \): Initial distance to target in run \(i\).
  • \( T_i \): Total steps taken in run \(i\).
  • \( \delta \): Success threshold distance.
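
Under this notation, a plausible formalization of the aggregate metrics is sketched below (an illustrative reading, not the paper's verbatim definitions; the safety metrics aggregate per-step collision and warning events analogously, and all ratios are reported as percentages):

\[
\mathrm{Success\ Ratio} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[d_i \le \delta\right],
\qquad
\mathrm{Distance\ Ratio} = \frac{1}{N}\sum_{i=1}^{N} \frac{D_i - d_i}{D_i},
\qquad
\mathrm{Avg\ Steps} = \frac{1}{N}\sum_{i=1}^{N} T_i
\]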

Warning Detection System

Safety is paramount in industrial settings. We implement a warning system using Depth Pro to estimate distance. A warning is triggered when the minimum depth value within the Region of Interest (RoI) falls below a predefined safety threshold.
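
A minimal sketch of the per-frame check, assuming a metric depth map (in meters) has already been produced by a monocular estimator such as Depth Pro; the RoI bounds and threshold below are illustrative values, not the benchmark's exact settings.

```python
import numpy as np

def warning_triggered(depth_map: np.ndarray,
                      roi: tuple[slice, slice],
                      threshold_m: float = 1.0) -> bool:
    """Return True if the closest point inside the RoI is nearer than the threshold.

    depth_map   : HxW array of estimated metric depth in meters (e.g. from Depth Pro).
    roi         : (row_slice, col_slice) selecting the region in front of the agent.
    threshold_m : illustrative safety distance; the benchmark's value may differ.
    """
    roi_depth = depth_map[roi]
    return float(np.min(roi_depth)) < threshold_m

# Example: treat the central lower band of a 480x640 frame as the RoI.
depth = np.full((480, 640), 5.0)                  # stand-in for a Depth Pro output
depth[300:320, 300:340] = 0.8                     # a nearby obstacle
print(warning_triggered(depth, (slice(240, 480), slice(160, 480))))  # True
```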

[Warning detection demo] Example of warning detection: proximity to machinery and obstacles triggers a safety warning.

Experimental Results

Comparison of nine state-of-the-art VLLMs. Red indicates the best closed-source performance; blue indicates the best open-source performance.

Embodied Agents             Task Success (%)                     Efficiency    Safety (%)
                            Success Ratio ↑   Distance Ratio ↑   Avg Steps ↓   Collision Ratio ↓   Warning Ratio ↓

Closed-Source VLLMs
GPT-4o                      21.53             49.41              66.76         7.86                13.45
GPT-5-mini                  54.17             81.90              49.91         16.89               24.13
Claude-Haiku-4.5            61.81             82.87              46.80         32.18               31.57
Claude-Sonnet-4.5           61.81             86.26              47.33         27.68               31.52
Gemini-2.5-flash            65.28             84.49              45.95         32.14               37.16

Open-Source VLLMs
Qwen3-VL-8B-Instruct        4.86              27.05              67.22         27.82               25.70
Qwen3-VL-30B-A3B-Instruct   6.25              26.20              66.70         18.97               26.28
LLaMA-4-Scout               15.28             56.40              61.53         24.38               35.06
Nemotron-nano-12B-v2-VL     55.56             80.48              50.69         31.73               36.54

Key Findings

Our comprehensive evaluation of nine state-of-the-art VLLMs reveals several critical insights about spatial reasoning in dynamic industrial environments.

No VLLM Can Reliably Navigate

Across all nine models, the Task Success Ratio remains low (less than 70%), and none of them can consistently reach the target across all scenarios. This highlights that spatial reasoning and long-horizon navigation in dynamic industrial environments remain fundamentally challenging for current VLLMs.

Closed-Source Models Lead

Closed-source VLLMs consistently outperform open-source counterparts across task success and efficiency metrics. They achieve substantially higher success rates and generally require fewer steps to reach targets, reflecting superior planning and route optimization. Most open-source VLLMs, particularly Qwen3-VL and LLaMA-4, remain less competitive for complex navigation tasks.

Nemotron Stands Out

Among open-source VLLMs, Nemotron-nano-12B-v2-VL stands out with a 55.56% Success Ratio, approaching closed-source performance levels. It demonstrates reasonable efficiency and moderate safety scores, making it the most promising open-source baseline for spatial reasoning in industrial navigation tasks.

Safety Remains Critical

Both closed-source and open-source VLLMs show high Collision and Warning Ratios, highlighting substantial deficiencies in hazard perception, navigating around dynamic obstacles, and maintaining consistent collision avoidance. All VLLMs remain far from achieving the level of safety required for real-world deployment.

Core Reasoning Deficiencies

Case analysis reveals that current VLLMs still struggle with three fundamental capabilities:

  • Robust Global Path Planning: Agents fail to recognize blocked paths and do not replan routes when obstacles are encountered.
  • Active Exploration: Models often get stuck in action loops instead of exploring alternatives, overlooking repeated action-state patterns (a detection sketch follows this list).
  • Precise Distance Estimation: Agents frequently misperceive clearances, believing paths are clear when obstacles exist, leading to collisions.
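
For illustration, a minimal sketch of how such repeated action-state patterns could be flagged from the same action-state history the agent already receives; the window size, repeat count, and history format are illustrative assumptions rather than the benchmark's own machinery.

```python
def stuck_in_loop(history, window=2, repeats=2):
    """Heuristically flag an action loop: the most recent `window`-long
    (action, state) pattern has just occurred `repeats` times back to back."""
    need = window * repeats
    if len(history) < need:
        return False
    recent = history[-need:]
    pattern = recent[-window:]
    return all(recent[i:i + window] == pattern for i in range(0, need, window))

# Example: turning left then right in place, twice in a row, is flagged.
hist = [("turn_left", (0, 0)), ("turn_right", (0, 0))] * 2
print(stuck_in_loop(hist))  # True
```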

Key Takeaway: These findings highlight a critical need for embodied research to move beyond passive perception and toward tasks that demand stable planning, active exploration, and safe behavior in dynamic, real-world environments.

Ablation Studies

To evaluate the effectiveness of key components in our navigation pipeline, we conduct ablation studies on three closed-source VLLMs: Claude-Sonnet-4.5, Gemini-2.5-flash, and GPT-5-mini.

Qualitative Analysis

We analyze representative cases from GPT-5-mini. The agent exhibits signs of spatial reasoning by identifying blocked paths but often fails in global planning and active exploration, getting stuck in action loops.

[Qualitative results for GPT-5-mini] Illustration of both correct (first row) and incorrect (second and third rows) action behaviors of GPT-5-mini in the IndustryNav scenarios.

BibTeX

If you find our work useful, please consider citing:

@article{li2025industrynav,
  title={IndustryNav: Exploring Spatial Reasoning of Embodied Agents in Dynamic Industrial Navigation},
  author={Li, Yifan and Li, Lichi and Dao, Anh and Zhou, Xinyu and Qiao, Yicheng and Mai, Zheda and Lee, Daeun and Chen, Zichen and Tan, Zhen and Bansal, Mohit and Kong, Yu},
  journal={arXiv preprint},
  year={2025}
}