Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation

A vision-centric manipulation setting where object and target spatial prompts in the first egocentric frame condition future end-effector trajectory prediction.

¹Michigan State University, ²NVIDIA Research
*Equal contribution
SP-VTP overview

We study how first-frame spatial prompts can ground egocentric manipulation and drive future 3D end-effector trajectory prediction.

SP-VTP

We formulate egocentric manipulation as spatially prompted visual trajectory prediction from first-frame object-target grounding.

EgoSPT

We collect egocentric manipulation trajectories with spatial prompt annotations and recovered 3D end-effector motion.

SPOT

We propose a prompt-centric policy that fuses task prompts, current observations, and history to generate future trajectories.

Metrics & Protocol

We evaluate scene-level generalization with complementary trajectory, rotation, gripper, and final-displacement metrics.
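The four metric families named above can be illustrated with a minimal sketch. The page does not give the exact definitions, so the formulas below (mean L2 position error, geodesic rotation angle, mean absolute gripper error, and L2 error at the final step) are common choices assumed for illustration, and all function names are hypothetical.

```python
import numpy as np

def translation_error(pred_pos, gt_pos):
    """Mean L2 distance between predicted and ground-truth positions, shape (T, 3)."""
    return float(np.linalg.norm(pred_pos - gt_pos, axis=-1).mean())

def rotation_geodesic_error(pred_rot, gt_rot):
    """Mean geodesic angle in radians between rotation matrices, shape (T, 3, 3)."""
    rel = np.swapaxes(pred_rot, -1, -2) @ gt_rot
    cos = (np.trace(rel, axis1=-2, axis2=-1) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return float(np.arccos(np.clip(cos, -1.0, 1.0)).mean())

def gripper_error(pred_width, gt_width):
    """Mean absolute gripper-width error, shape (T,)."""
    return float(np.abs(pred_width - gt_width).mean())

def final_displacement_error(pred_pos, gt_pos):
    """L2 distance between the last predicted and last ground-truth position."""
    return float(np.linalg.norm(pred_pos[-1] - gt_pos[-1]))
```

Lower values are better for all four functions. Whether errors are averaged per chunk, per episode, or per horizon step is a protocol choice that this sketch leaves open.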

Abstract

Robotic manipulation is typically specified through language or task IDs, but cluttered egocentric scenes with visually similar objects are often better described by directly indicating which object to manipulate and where it should go.

We formalize this setting as Spatially Prompted Visual Trajectory Prediction (SP-VTP): given object and target spatial prompts on the first frame, a model predicts future relative end-effector trajectories from streaming egocentric observations.

We introduce EgoSPT, an egocentric manipulation dataset with first-frame grounding annotations and recovered 3D end-effector motion, and propose SPOT, a prompt-centric policy that fuses rendered visual prompts, coordinate prompt tokens, current observations, and action history to generate trajectory chunks.

EgoSPT Dataset

EgoSPT overview

EgoSPT grounds manipulation goals with lightweight first-frame object-target prompts and supervises future 3D end-effector trajectory chunks from egocentric observations.

2,841 egocentric episodes
112,856 trajectory samples
3 scene-level splits

EgoSPT is collected with a modified Universal Manipulation Interface. Each episode contains egocentric video, object and target bounding-box annotations in the initial frame, and recovered end-effector poses. The scene-aware validation protocol keeps correlated scene units out of the training split, making evaluation sensitive to cross-scene generalization rather than layout memorization.
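A scene-aware split of this kind can be sketched as a group-wise partition: episodes are grouped by scene unit, and whole groups are assigned to either training or validation. The `scene_id` field, the validation fraction, and the function name are assumptions for illustration and do not come from the EgoSPT release.

```python
import random
from collections import defaultdict

def scene_level_split(episodes, val_fraction=0.2, seed=0):
    """Split episodes so that no scene unit appears in both train and val.

    `episodes` is a list of dicts with a hypothetical "scene_id" key.
    """
    by_scene = defaultdict(list)
    for episode in episodes:
        by_scene[episode["scene_id"]].append(episode)

    # Shuffle scene units, not individual episodes, to avoid leakage.
    scene_ids = sorted(by_scene)
    random.Random(seed).shuffle(scene_ids)

    num_val = max(1, round(len(scene_ids) * val_fraction))
    val_scenes = set(scene_ids[:num_val])

    train = [ep for sid in scene_ids if sid not in val_scenes for ep in by_scene[sid]]
    val = [ep for sid in scene_ids if sid in val_scenes for ep in by_scene[sid]]
    return train, val
```

Splitting at the episode level instead would let near-identical layouts land on both sides of the split, which is the memorization effect the protocol is meant to exclude.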

Modified UMI collection setup
Modified UMI setup for collecting egocentric manipulation trajectories.

Dataset Demos

Example egocentric manipulation videos from three scenes in EgoSPT.

Scene 1

put fork1 to bowl1

put fork1 to bowl2

put fork1 to bowl3

Scene 2

put fork1 to bowl1

put fork1 to bowl2

put fork1 to bowl3

Scene 3

subscene1

subscene10

subscene11

Processed Data Visualizations

Each processed demo synchronizes the egocentric video with recovered pose motion and gripper-width signals.

put fork1 to bowl1

put fork1 to bowl2

put fork1 to bowl3

put fork1 to cup1

put fork1 to cup2

put fork1 to plate1

SPOT Policy

SPOT policy architecture: Task Encoder, Observation Encoder, Trajectory Generator

Task Encoder

Encodes the first-frame visual prompt and coordinate prompt tokens, then lets object and target tokens attend to the first-frame visual features.
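As a rough illustration of the two prompt forms, the sketch below turns a pixel-space bounding box into a normalized coordinate vector and draws the same box onto a copy of the first frame. The page does not specify the coordinate encoding or the rendering style, so the `(x1, y1, x2, y2)` layout, the `[0, 1]` normalization, the outline rendering, and both function names are assumptions.

```python
import numpy as np

def normalize_box(box, width, height):
    """Map a pixel box (x1, y1, x2, y2) to image-relative coordinates in [0, 1]."""
    x1, y1, x2, y2 = box
    return np.array([x1 / width, y1 / height, x2 / width, y2 / height], dtype=np.float32)

def render_box(image, box, color, thickness=2):
    """Return a copy of an (H, W, 3) image with the box outline drawn in `color`."""
    out = image.copy()
    x1, y1, x2, y2 = box
    t = thickness
    out[y1:y1 + t, x1:x2] = color  # top edge
    out[y2 - t:y2, x1:x2] = color  # bottom edge
    out[y1:y2, x1:x1 + t] = color  # left edge
    out[y1:y2, x2 - t:x2] = color  # right edge
    return out

# Hypothetical first frame with an object box (red) and a target box (green).
frame = np.zeros((480, 640, 3), dtype=np.uint8)
object_box, target_box = (64, 48, 128, 96), (320, 240, 480, 360)
prompted = render_box(render_box(frame, object_box, (255, 0, 0)), target_box, (0, 255, 0))
coords = np.stack([normalize_box(b, 640, 480) for b in (object_box, target_box)])
```

Here `prompted` stands in for the rendered visual prompt and `coords` for the raw input to the coordinate prompt tokens; how SPOT embeds either one is not shown on this page.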

Observation Encoder

Processes the current egocentric frame and recent trajectory history so the policy can infer the current execution phase.

Trajectory Generator

Generates a future chunk of relative end-effector actions with translation, 6D rotation, and gripper width.
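To make the action layout concrete, here is a minimal sketch of a widely used 6D rotation parameterization: the first two columns of a rotation matrix, recovered through Gram-Schmidt orthonormalization. The page only lists the action components, so the column convention, the 10-dimensional ordering (3 translation, 6 rotation, 1 gripper), and the function names are assumptions.

```python
import numpy as np

def matrix_to_rot6d(rotation):
    """Flatten the first two columns of a 3x3 rotation matrix into a 6-vector."""
    return np.concatenate([rotation[:, 0], rotation[:, 1]])

def rot6d_to_matrix(rot6d):
    """Rebuild a rotation matrix from a 6-vector via Gram-Schmidt."""
    a1, a2 = rot6d[:3], rot6d[3:6]
    b1 = a1 / np.linalg.norm(a1)
    a2 = a2 - np.dot(b1, a2) * b1      # remove the component along b1
    b2 = a2 / np.linalg.norm(a2)
    b3 = np.cross(b1, b2)              # third axis completes a right-handed frame
    return np.stack([b1, b2, b3], axis=1)

def pack_action(translation, rotation, gripper_width):
    """Assumed 10-D action layout: [translation(3), rot6d(6), gripper_width(1)]."""
    return np.concatenate([translation, matrix_to_rot6d(rotation), [gripper_width]])
```

Because `rot6d_to_matrix` orthonormalizes its input, a network can regress the six numbers freely and still obtain a valid rotation, which is the usual motivation for this representation over Euler angles or quaternions.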

Experimental Results

Under strict scene-level validation splits, SPOT benefits from spatial task grounding, DINOv2 visual features, action history, and a flow-matching trajectory head. Lower is better for all trajectory error metrics.
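For readers unfamiliar with flow-matching heads, the sketch below shows the generic recipe under simple assumptions: a straight-line path between noise and the action chunk for training targets, and fixed-step Euler integration for sampling. The page only names the technique, so the path, solver, step count, and chunk shape are assumptions, and the toy `oracle_velocity` stands in for a learned network conditioned on prompts, observations, and history.

```python
import numpy as np

def flow_matching_pair(noise, action_chunk, t):
    """Training pair on a linear path: interpolated input and target velocity."""
    x_t = (1.0 - t) * noise + t * action_chunk
    target_velocity = action_chunk - noise
    return x_t, target_velocity

def sample_chunk(velocity_fn, shape, num_steps=10, seed=0):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy stand-in for a trained model: the exact velocity toward one fixed chunk.
target_chunk = np.full((16, 10), 0.5)  # hypothetical 16-step, 10-D action chunk

def oracle_velocity(x, t):
    return (target_chunk - x) / (1.0 - t)

chunk = sample_chunk(oracle_velocity, target_chunk.shape)
```

With the oracle field, the Euler loop lands on `target_chunk` at the final step; a learned field would only approximate this, and the number of integration steps then trades inference speed against accuracy.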

Trajectory Head

Trajectory head ablation

History Horizon

History horizon ablation

Backbone Tuning

DINOv2 tuning ablation

Trajectory Visualizations

First chunk trajectory visualization
First-chunk prediction visualizes immediate local motion and horizon-wise position, rotation, and gripper errors.
Full trajectory visualization
Full-episode stitching shows how local predictions compose into a complete manipulation trajectory.
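One way to picture the stitching step is as a running product of relative rigid transforms. The sketch below assumes each predicted action is a transform expressed in the previous end-effector frame; the page does not state the reference frame for relative actions, so that convention and the helper names are assumptions.

```python
import numpy as np

def make_transform(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    transform = np.eye(4)
    transform[:3, :3] = rotation
    transform[:3, 3] = translation
    return transform

def stitch(relative_transforms, start=None):
    """Compose per-step relative transforms into absolute poses, shape (T + 1, 4, 4)."""
    pose = np.eye(4) if start is None else start.copy()
    poses = [pose]
    for delta in relative_transforms:
        pose = pose @ delta  # delta is expressed in the current end-effector frame
        poses.append(pose)
    return np.stack(poses)
```

Because each pose depends on all earlier predictions, small per-chunk errors accumulate along the stitched trajectory, which is why full-episode plots are a useful complement to first-chunk errors.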

More Scenes

Scene 1 trajectory
Scene 1. Stitched full-trajectory predictions show how local chunks compose over complete episodes.
Scene 2 trajectory
Scene 2. Stitched full-trajectory predictions show how local chunks compose over complete episodes.
Scene 3 trajectory
Scene 3. Stitched full-trajectory predictions show how local chunks compose over complete episodes.

Annotation Tools

EgoSPT uses a lightweight annotation workflow for first-frame object and target bounding boxes. An annotation tool records the two boxes, while a modification tool supports manual inspection and correction before training samples are built.

Annotation tool
Annotation modification tool

BibTeX

@article{li2026spvtp,
  title={Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation},
  author={Li, Yifan and Zhou, Xinyu and Ge, Yunhao and Kong, Yu},
  journal={arXiv preprint},
  year={2026}
}