A vision-centric manipulation setting where object and target spatial prompts in the first egocentric frame condition future end-effector trajectory prediction.
Robotic manipulation is typically specified through language or task IDs, but cluttered egocentric scenes with visually similar objects are often better described by directly indicating which object to manipulate and where it should go.
We formalize this setting as Spatially Prompted Visual Trajectory Prediction (SP-VTP): given object and target spatial prompts on the first frame, a model predicts future relative end-effector trajectories from streaming egocentric observations.
We introduce EgoSPT, an egocentric manipulation dataset with first-frame grounding annotations and recovered 3D end-effector motion, and propose SPOT, a prompt-centric policy that fuses rendered visual prompts, coordinate prompt tokens, current observations, and action history to generate trajectory chunks.
EgoSPT is collected with a modified Universal Manipulation Interface. Each episode contains egocentric video, object and target bounding-box annotations in the initial frame, and recovered end-effector poses. The scene-aware validation protocol keeps correlated scene units out of the training split, making evaluation sensitive to cross-scene generalization rather than layout memorization.
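The scene-aware split described above can be illustrated with a minimal sketch. It assumes each episode carries a scene identifier and that a whole scene is the unit that must not be shared across splits. The `scene` key, the `val_fraction` parameter, and the extra scene names `subscene12` and `subscene13` are hypothetical, not the dataset's actual schema or protocol.

```python
import random
from collections import defaultdict


def scene_level_split(episodes, val_fraction=0.2, seed=0):
    """Split episodes so that no scene appears in both train and validation.

    `episodes` is a list of dicts with a hypothetical "scene" key.
    """
    # Group episodes by scene so a scene is always assigned as a whole.
    by_scene = defaultdict(list)
    for ep in episodes:
        by_scene[ep["scene"]].append(ep)

    # Shuffle scene names deterministically, then hold out a fraction of them.
    scenes = sorted(by_scene)
    random.Random(seed).shuffle(scenes)
    n_val = max(1, round(len(scenes) * val_fraction))
    val_scenes = set(scenes[:n_val])

    train = [ep for s in scenes if s not in val_scenes for ep in by_scene[s]]
    val = [ep for s in scenes if s in val_scenes for ep in by_scene[s]]
    return train, val


# Illustrative episodes: five scenes with three tasks each (names are made up).
episodes = [
    {"scene": f"subscene{i}", "task": f"put fork1 to bowl{j}"}
    for i in (1, 10, 11, 12, 13)
    for j in (1, 2, 3)
]
train, val = scene_level_split(episodes)
```

Splitting at the scene level rather than the episode level means that a model cannot score well on validation simply by recalling layouts it saw during training.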
Example egocentric manipulation videos from three scenes in EgoSPT.
[Video grid. Task labels: put fork1 to bowl1, put fork1 to bowl2, put fork1 to bowl3. Scene labels: subscene1, subscene10, subscene11.]
Each processed demo synchronizes the egocentric video with recovered pose motion and gripper-width signals.
[Video grid. Task labels: put fork1 to bowl1, put fork1 to bowl2, put fork1 to bowl3, put fork1 to cup1, put fork1 to cup2, put fork1 to plate1.]
Encodes the first-frame visual prompt and coordinate prompt tokens, then lets object and target tokens attend to the first-frame visual features.
Processes the current egocentric frame and recent trajectory history so the policy can infer the current execution phase.
Generates a future chunk of relative end-effector actions with translation, 6D rotation, and gripper width.
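To make the action format concrete, the sketch below decodes one action into translation, rotation matrix, and gripper width. It assumes a 10-dimensional layout (3 translation values, 6 rotation values, 1 gripper width) and the common 6D-rotation convention in which the six numbers are the first two columns of a rotation matrix, re-orthonormalized with Gram-Schmidt. The page does not state the ordering or the convention SPOT uses, so both are assumptions, and `decode_action` is a hypothetical helper.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rot6d_to_matrix(r6):
    """Map a 6D rotation to a 3x3 rotation matrix via Gram-Schmidt.

    Assumes r6 holds two 3-vectors that become the first two columns.
    """
    a1, a2 = r6[:3], r6[3:]
    b1 = _normalize(a1)
    # Remove the component of a2 along b1, then normalize.
    proj = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize([y - proj * x for x, y in zip(b1, a2)])
    # The third column completes a right-handed orthonormal frame.
    b3 = _cross(b1, b2)
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


def decode_action(action):
    """Split an assumed [translation(3), rot6d(6), gripper(1)] vector."""
    if len(action) != 10:
        raise ValueError("expected a 10-dimensional action")
    return {
        "translation": list(action[:3]),
        "rotation": rot6d_to_matrix(list(action[3:9])),
        "gripper_width": action[9],
    }


# The rotation part is deliberately not orthonormal; decoding repairs it.
step = decode_action([0.01, 0.0, -0.02, 2.0, 0.0, 0.0, 1.0, 3.0, 0.0, 0.05])
```

A 6D parameterization is often preferred for regression because it is continuous and any pair of non-parallel vectors decodes to a valid rotation.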
Under strict scene-level validation splits, SPOT benefits from spatial task grounding, DINOv2 visual features, action history, and a flow-matching trajectory head. Lower is better for all trajectory error metrics.
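For readers unfamiliar with flow matching, the toy sketch below shows the sampling side of the idea: start from noise and integrate a velocity field from t = 0 to t = 1. It is not SPOT's trajectory head. The learned, observation-conditioned network is replaced by `make_toy_velocity`, a hypothetical stand-in that returns the exact velocity for a straight-line path toward one known target, and the Euler integrator and step count are illustrative choices.

```python
import random


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Stand-in for a learned network (assumption, for illustration only).

    For the straight-line path x_t = (1 - t) * x0 + t * target, the
    velocity target - x0 can be rewritten as (target - x_t) / (1 - t).
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


rng = random.Random(0)
target_chunk = [0.01, 0.0, -0.02, 0.05]  # made-up values, not dataset output
noise = [rng.gauss(0.0, 1.0) for _ in target_chunk]
sample = euler_sample(make_toy_velocity(target_chunk), noise)
```

In a real policy the velocity function would be a network conditioned on the prompts, the current observation, and the action history, trained to regress such velocities across many demonstrations.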
EgoSPT uses a lightweight annotation workflow for first-frame object and target bounding boxes. An annotation tool records the two boxes, while a modification tool supports manual inspection and correction before training samples are built.
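A first-frame annotation of this kind could be represented and sanity-checked roughly as follows. The class names, field names, pixel-coordinate convention, and the normalization to the [0, 1] range are assumptions made for illustration; they are not the actual file format of the annotation or modification tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (assumed convention)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def normalized(self, width, height):
        """Scale the corners to [0, 1] so they do not depend on resolution."""
        return (
            self.x_min / width,
            self.y_min / height,
            self.x_max / width,
            self.y_max / height,
        )


@dataclass(frozen=True)
class FirstFrameAnnotation:
    episode_id: str  # hypothetical identifier
    object_box: Box
    target_box: Box


def validate(ann, width, height):
    """Return a list of problems; an empty list means the boxes look usable."""
    problems = []
    for name, box in (("object", ann.object_box), ("target", ann.target_box)):
        inside_x = 0 <= box.x_min < box.x_max <= width
        inside_y = 0 <= box.y_min < box.y_max <= height
        if not (inside_x and inside_y):
            problems.append(f"{name} box is empty or outside the {width}x{height} frame")
    return problems


good = FirstFrameAnnotation("demo_0001", Box(100, 200, 300, 400), Box(500, 100, 600, 250))
bad = FirstFrameAnnotation("demo_0002", Box(300, 200, 100, 400), Box(500, 100, 600, 250))
good_problems = validate(good, 640, 480)
bad_problems = validate(bad, 640, 480)
```

A check like this is the sort of thing a manual inspection pass would catch by eye; running it automatically flags swapped corners or out-of-frame boxes before samples are built.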
@article{li2026spvtp,
title={Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation},
author={Li, Yifan and Zhou, Xinyu and Ge, Yunhao and Kong, Yu},
journal={arXiv preprint},
year={2026}
}