Daily ArXiv / July 08, 2026

Personalized paper radar

A focused reading queue selected from today's ArXiv feed, ranked by topic fit, novelty, and configured author matches.

Relevant papers 12

Top score 15

Average score 12.6

Source ArXiv
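
The summary figures above are consistent with each paper's score being its Relevance plus its Novelty. That rule is inferred from the numbers listed in this digest, not stated anywhere on the page, so the sketch below is only a consistency check; the variable names are illustrative.

```python
# (relevance, novelty) per paper number, copied from the reading queue below.
ratings = {
    1: (8, 7), 2: (8, 7), 3: (8, 7), 4: (8, 7),
    5: (7, 6), 6: (7, 6), 7: (6, 7), 8: (6, 6),
    9: (6, 5), 10: (6, 5), 11: (4, 6), 12: (3, 5),
}

# Assumption: a paper's score is relevance + novelty.
totals = {paper: rel + nov for paper, (rel, nov) in ratings.items()}

paper_count = len(totals)
top_score = max(totals.values())
average_score = round(sum(totals.values()) / paper_count, 1)
```

Under that assumption the three values reproduce the 12 papers, top score of 15, and average of 12.6 shown in the header.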

Abstract word clouds

Today

action, architecture, cad-ir, condition, corpus, cost, dense, deployment, detection, driving, encoder, environment, expert, generate, generation, grounding, illusion, inference, knowledge, localization, model, motion, moworld, multimodal, object, physical, pipeline, policy, prompt, real-world, reasoning, region, response, resulting, reward, robustness, safety, scene, spatial, token, unified, vision, vision-language, visual, world

Past month

action, agent, alignment, attention, consistency, control, detection, diffusion, domain, driving, dynamic, editing, event, evidence, fine-grained, foundation, frame, generate, generation, geometric, geometry, grounding, inference, interaction, knowledge, language, latent, mechanism, memory, motion, multimodal, multiple, object, paradigm, pipeline, policy, pose, prompt, real-world, reasoning, reconstruction, reference, region, reward, scene, semantic, space, spatial, supervision, target, temporal, token, trajectory, understanding, unified, video, view, vision-language, visual, world

Reading Queue

cs.AI

2 papers

Embodied AI

1 Bridging Physical Reasoning and Task Generalization via Visual Action Outcome Reasoning Alignment Han-Jun Ko, Jr-Jen Chen, Haobo Yuan, Hsin-Ying Lee, Tiancheng Shen, Ming-Hsuan Yang, Yu-Chiang Frank Wang

Embodied AI Physical Reasoning VLM Alignment

cs.AI cs.CV

Paper 1 / arXiv:2607.06522

Relevance 8 Novelty 7

Why selected: Matches criterion 1 very closely: it proposes a new method for physical reasoning in embodied settings, specifically aligning reasoning with visual action outcomes to improve task generalization.

Vision-language models (VLMs) struggle to generalize in interactive physical reasoning, particularly under unseen tasks and environments. Two key failure modes are prominent: hallucinated chain-of-thought (CoT) reasoning that contradicts physical reality, and misalignment between the model's reasoning and actions. We present VAORA (Visual Action Outcome Reasoning Alignment), a novel reward design that directly addresses both issues. VAORA introduces two complementary rewards: Visual Alignment Reward, which anchors VLM reasoning to the visual context independent of the agent action itself, and Visual-Action Alignment Reward, which grounds reasoning in the visual outcome induced by the model's action. Together, these rewards suppress hallucinated CoT and reduce the gap between reasoning and behavior. To improve training stability, we further employ smooth, dense rewards by estimating success probabilities using a pre-trained in-domain expert agent. Experiments on PHYRE and Virtual Tool support VAORA's effectiveness across novel-task and unseen-environment settings, suggesting that grounded and generalizable physical intelligence can be induced through VAORA.

CAD Agents

8 ArtisanCAD: An Industrial-Level CAD Agent with Expert-Grounded Knowledge Distillation Yunhan Xu, Qifeng Wu, Xunjin Li, Yuanwei Bin, Qingsong Yao, Jianghang Gu, Guan Wang, Weihao Lv, Huiyu Yang, Wenfa Luo, Jiao Xiang, Yuntian Chen, Shiyi Chen

CAD Agents 3D Generation Knowledge Distillation

cs.AI cs.GR

Paper 8 / arXiv:2607.05750

Relevance 6 Novelty 6

Why selected: Matches criterion 3 moderately well: an industrial CAD agent for procedural 3D design, which is a novel embodied/interactive design workflow with expert knowledge distillation.

Computer-aided design (CAD) for industrial components requires long-horizon procedural modeling, robust feature dependencies, editable parametric geometry, and production-grade B-Rep execution. Existing text-to-CAD methods have made promising progress in generating CAD programs from natural-language descriptions, but they still struggle when user prompts are ambiguous, underspecified, or only describe high-level design intent. They also rarely exploit expert procedural knowledge naturally available in industrial workflows, such as CATIA operation recordings, macro logs, drawing notes, and engineering descriptions. We present ArtisanCAD, a skill-guided industrial CAD agent with expert-grounded knowledge distillation. The core of ArtisanCAD is the CAD intermediate representation (CAD-IR), an executable procedural representation that encodes parameters, ordered operations, MCP tool bindings, dependencies, generated entities, and verification rules. CAD-IR plays two key roles: it first serves as the carrier for distilling expert CAD procedures into reusable parameterized skills; then it provides a procedural scaffold that turns vague or intermediate-level prompts into complete executable CAD operations. ArtisanCAD retrieves expert-derived skills, instantiates and revises CAD-IR, executes the resulting procedure through a dedicated CATIA-MCP backend, uses multi-view visual feedback for iterative refinement, and finally generates production-ready B-Rep models. On the Text2CAD benchmark, CAD-IR improves generation from intermediate prompts by reducing mean Chamfer Distance from 14.83 to 9.88, showing its ability to bridge ambiguous textual intent and executable CAD construction. On four complex automotive components, CAD-IR enables expert CATIA recordings to be distilled into reusable skills, allowing ArtisanCAD to generate editable CATIA-native B-Rep models for new variant requests.
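
The Chamfer Distance cited in this abstract compares point sets sampled from a generated shape and a reference shape. Below is a minimal sketch of one common symmetric form (mean squared nearest-neighbor distance, summed over both directions). The exact variant, sampling density, and scaling used by Text2CAD or by this paper are not given in the abstract, so the function name and conventions here are assumptions.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between two point sets (lists of xyz tuples).

    Assumed convention: mean squared nearest-neighbor distance, summed over
    both directions. Other work uses unsquared distances or other scaling.
    """
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_way(src, dst):
        # Mean, over source points, of the squared distance to the nearest dst point.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


# Toy example: corners of a unit square versus the same corners shifted by 0.1 in x.
square = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
shifted = [(x + 0.1, y, z) for x, y, z in square]

cd_same = chamfer_distance(square, square)
cd_shifted = chamfer_distance(square, shifted)
```

Identical point sets give zero, and the value grows as the generated geometry drifts from the reference, which is why a lower mean Chamfer Distance is read as better generation.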

cs.CV

10 papers

Vision Foundation Models

2 Vision as Unified Multimodal Generation Xiaoyang Han, Jianhua Li, Kewang Deng, Zukai Chen, Xuanke Shi, Sihan Wang, Boxuan Li, Linyan Wang, Siyi Xie, Xin You, Jinsheng Quan, Zhongang Cai, Haiwen Diao, Ziwei Liu, Lei Yang, Dahua Lin, Quan Wang

Vision Foundation Models Unified Multimodal Generation Dense Prediction

cs.CV

Paper 2 / arXiv:2607.06560

Relevance 8 Novelty 7

Why selected: Matches criterion 4 very closely: it is about a unified multimodal generation formulation for broad computer vision tasks, i.e., a vision foundation model-style approach and applications across many vision problems.

We formulate computer vision as unified multimodal generation, where heterogeneous visual tasks are expressed in the native text and image generation spaces of a unified multimodal model, without task-specific architectures. Under this formulation, SenseNova-Vision uses natural-language instructions and optional visual prompts to specify tasks, target regions or views, and decoding conventions, and generates responses as text for symbolic outputs, images for dense spatial predictions, or mixed text-and-image outputs for compositional tasks. To support large-scale training, we convert diverse computer vision annotations into instruction-response examples compatible with these generation spaces, resulting in the SenseNova-Vision Corpus, a computer-vision instruction-response corpus spanning text, image, and mixed targets. Starting from an off-the-shelf pretrained unified multimodal model, SenseNova-Vision is trained primarily on this corpus, with auxiliary multimodal data used as a capability-preserving mixture, and requires no task-specific prediction heads or architectural modifications. The resulting model covers a broad range of vision tasks, including detection, OCR, keypoint estimation, segmentation, depth estimation, surface normal prediction, point maps, and camera pose estimation, while supporting language-defined variants that combine category, color, region, and other visual cues. Experiments show that a single unified model can match leading task-specialized systems across structured visual understanding, dense geometric prediction, segmentation, and multi-view visual geometry. These results suggest unified multimodal generation as a scalable route for integrating computer vision capabilities into general-purpose foundation models. The model and corpus are publicly available.

11 Multi-Teacher Contrastive Distillation for Edge-Efficient Pathology Foundation Models Tim Lenz, Maurice Heide, Marco Gustav, Nic G. Reitsam, Jakob Nikolas Kather

Vision Foundation Models Medical AI Contrastive Distillation

cs.CV

Paper 11 / arXiv:2607.05533

Relevance 4 Novelty 6

Why selected: Matches criterion 4 loosely: a foundation-model distillation method for pathology, which is vision-foundation-model related but not about spatial understanding or embodied AI.

Computational pathology foundation models (PFMs) have advanced whole-slide image analysis. However, their size and inference cost hinder local deployment in pathology departments. We propose MuCoDi, a pretraining framework that distills frozen tile embeddings from multiple PFMs into compact edge-oriented encoders. Instead of regressing individual teacher features, MuCoDi trains lightweight MobileOne and RepViT students with a contrastive distillation objective adapted from MoCo v3, where cached Virchow2, UNI2, and H-Optimus-1 embeddings replace momentum-encoder keys. We pretrain students on 14.3M TCGA tiles from only 11.8K WSIs and evaluate frozen encoders on 23 clinically curated downstream classification tasks. RepViT-based MuCoEdge students retain near-teacher performance while reducing model size by orders of magnitude: MuCoEdge-R2.3 and MuCoEdge-R1.5 reach 71.0% external AUROC, within 0.8 percentage points of the best teacher (Virchow2, 71.8%), while MuCoEdge-R2.3 obtains the best external F1 and the second-best AUPRC (51.8% and 53.3%). MuCoEdge-R1.0 reaches 70.9% AUROC with only 6.4M parameters and 1.12 GFLOPs. On a Raspberry Pi 5, sub-million-parameter MobileOne students achieve up to 605-fold single-tile speedup over Virchow2 while retaining 66.5% to 66.9% external AUROC, demonstrating that PFM-quality pathology representations can be moved toward practical edge deployment. Code is available at https://anonymous.4open.science/r/mucodi-6243.
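
The contrastive distillation idea in this abstract, with cached teacher embeddings standing in for momentum-encoder keys, can be illustrated with a plain InfoNCE loss. This is a rough sketch, not the authors' code: the temperature value, the per-teacher student projections, the simple averaging over teachers, and all function names are assumptions.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.2):
    """InfoNCE: queries[i] should match keys[i] against every other key in the batch."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)


def multi_teacher_loss(student_out, cached_teacher_embs, temperature=0.2):
    """Average InfoNCE over teachers.

    student_out[name]: student projections aimed at teacher `name` (hypothetical layout).
    cached_teacher_embs[name]: frozen, precomputed teacher embeddings for the same tiles.
    """
    losses = [
        info_nce(student_out[name], keys, temperature)
        for name, keys in cached_teacher_embs.items()
    ]
    return sum(losses) / len(losses)


# Toy batch of two tiles and one teacher with 2-D embeddings.
teacher = {"teacher_a": [[1.0, 0.0], [0.0, 1.0]]}
aligned = {"teacher_a": [[0.9, 0.1], [0.1, 0.9]]}
swapped = {"teacher_a": [[0.1, 0.9], [0.9, 0.1]]}

loss_aligned = multi_teacher_loss(aligned, teacher)
loss_swapped = multi_teacher_loss(swapped, teacher)
```

A student whose embeddings point at the right tile's teacher embedding gets a much lower loss than one whose embeddings are swapped, which is the signal the distillation objective relies on.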

3D Vision-Language

3 Ground3D-LMM: Fine-Grained 3D Point Grounding and Spatial Reasoning with LMM Amol Harsh, Zongyan Han, Jean Lahoud, Ye Liu, Rao Muhammad Anwer, Hisham Cholakkal, Salman Khan, Fahad Khan

3D Vision-Language Spatial Reasoning Benchmark & Evaluation

cs.CV

Paper 3 / arXiv:2607.05493

Relevance 8 Novelty 7

Why selected: Matches criteria 1 and 3 closely: a 3D LMM for fine-grained point grounding and metric spatial reasoning, with a new 3D grounded measurement dataset.

Natural-language queries about 3D environments become actionable when responses are verifiable and metric. Verifiability requires explicit grounding to the referred 3D region, while metric answers report physical measurements in real-world units (e.g., size, thickness, clearance, and distance). Existing 3D large multimodal model (LMM) approaches remain limited: conversational systems typically respond without explicit 3D grounding, while 3D grounding models are not designed for interactive, metric-aware dialogue. In this paper, we present Ground3D-LMM, a unified model that takes a point cloud and an optional RGB image as input and supports 3D spatial conversation with (i) point-grounded responses and (ii) metric numeric outputs at both object and part granularity, including multi-object queries. To evaluate this intersection of grounding and measurement, we define the 3D Grounded Measurement task, which requires predicting the referred 3D region and the corresponding metric quantities in real-world units. We introduce a large-scale dataset built on the ScanNet and ScanNet++ datasets with dense object and part annotations and roughly 2.5M question-answer pairs spanning eight tasks, along with a manually verified test set. Extensive experiments on multiple datasets and tasks show that Ground3D-LMM provides a strong baseline for grounded, metric-aware 3D conversational understanding. Our dataset and model are publicly available.
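
To make "grounded and metric" concrete: once a response is tied to a set of 3D points, quantities such as size and clearance can be read off those points directly. The sketch below is only an illustration of that idea and says nothing about how Ground3D-LMM computes its outputs. It assumes coordinates are in meters, uses axis-aligned extents, and the helper names are hypothetical.

```python
import math


def aabb_extent(points):
    """Axis-aligned size (dx, dy, dz) of a grounded point region, in the cloud's units."""
    xs, ys, zs = zip(*points)
    return (max(xs) - min(xs), max(ys) - min(ys), max(zs) - min(zs))


def clearance(region_a, region_b):
    """Smallest point-to-point distance between two grounded regions (brute force)."""
    return min(math.dist(p, q) for p in region_a for q in region_b)


# Toy scene, coordinates assumed to be meters.
table = [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0), (1.2, 0.6, 0.75), (0.0, 0.6, 0.75)]
chair = [(1.7, 0.0, 0.0), (2.1, 0.4, 0.9)]

table_size = aabb_extent(table)
table_to_chair = clearance(table, chair)
```

Because both numbers are derived from the referred points, a reader can check the answer against the region it claims to describe, which is the verifiability the abstract asks for.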

World Models

4 MoWorld: A Flash World Model Team Moxin, Deyi Ji, Tianrun Chen, Xin Zhang, Jiale Yang, Qi Zhu, An Zhao, Zihao Xie, Han Wang, Xuanyi Liu, Yixiang Zhou, Pei Liu, Yi Tan, Cheng Chen, Dayi Zhu, Mingyu Wei, Hanjie Xu, Jun Liao, Siqi Li, Lingyu Lu, Hongye Fang, Hongming Tan, Youjiang Zhu, Taiyu Zhang, Zejian Li, Chaotao Ding, Lanyun Zhu, Yunhe Pan, Lingyun Sun

World Models Efficient Inference 3D Data Generation

cs.CV

Paper 4 / arXiv:2607.06216

Relevance 8 Novelty 7

Why selected: Matches criterion 4 closely: a new world model built for efficient real-time deployment, with 3D-native data and practical inference optimizations.

The future of World Models depends not only on scaling model capability, but also on scaling practicality and inference efficiency. High-frame-rate inference enables responsive perception, planning, and control in real-world autonomous systems. To this end, we present MoWorld, a cost-effective yet high-performance Flash World Model with an end-to-end framework spanning data generation, pre-training, distillation, and efficient inference, enabling up to 50 FPS real-time interaction with cinematic visual quality without the need for high-end GPUs. To enable large-scale real-world deployment, MoWorld jointly optimizes model capability and cost throughout the entire development pipeline. Specifically, unlike existing approaches that primarily rely on large-scale video corpora, MoWorld is built upon a scalable 3D-native data engine accumulated from our large-scale 3D vision and generative modeling pipeline, enabling the efficient construction of geometrically consistent training data across diverse real-world and synthetic environments. Building on this foundation, MoWorld introduces a curriculum cross-frame pre-training strategy for stable and scalable World Model learning, an efficient denoising-step distillation algorithm to reduce diffusion training cost, and a mixed-precision parallel inference framework for low-cost real-time deployment. MoWorld is the first real-time interactive World Model built on the Neural Processing Unit (NPU) and can achieve up to 50 FPS on such devices, enabling practical and efficient deployment at scale. Comprehensive evaluations demonstrate that MoWorld achieves leading performance; notably, its average inference cost is only 30%-50% of that of existing World Models, providing a practical foundation for large-scale real-world applications of World Models. We also demonstrate diverse applications of MoWorld.

Autonomous Driving Benchmark

5 Benchmarking the Robustness of Autonomous Driving to Environmental Illusions: A Lane Perception Perspective Tianyuan Zhang, Xianglong Liu, Aishan Liu, Lu Wang, Yitong Zhang, Peng Yue, Mingchuan Zhang, Siyuan Liang, Dacheng Tao

Autonomous Driving Benchmark CARLA Simulator Robustness Evaluation

cs.CV

Paper 5 / arXiv:2607.05783

Relevance 7 Novelty 6

Why selected: Matches criterion 3 very closely: it builds a new simulator-based benchmark for embodied/autonomous driving robustness, focusing on overlooked environmental illusions and also proposes a defense method.

Environmental illusions (e.g., shadows, reflections, and tire marks) are naturally occurring yet overlooked phenomena in real-world driving environments. They can disturb visual perception, leading to misinterpretation of the scene and posing serious safety risks to autonomous driving (AD) systems. However, existing research largely overlooks these phenomena, leaving a critical gap. To address this issue, we study AD robustness through the lane perception perspective, a fundamental task supporting core functions like cruise control and lane centering. We focus on two representative models: conventional lane detection (LD) and vision-language model-based systems (ADVLMs). In this work, we introduce the first benchmark, LanEvil++, for evaluating the robustness of lane perception under environmental illusions. LanEvil++ encompasses 14 types of illusions and leverages the CARLA simulator to generate 94 high-fidelity, fully controllable 3D scenes, yielding a dataset of 90,292 annotated images, 1,596 video clips, and 41,855 visual question answering pairs. Extensive evaluations demonstrate that environmental illusions substantially degrade the performance of state-of-the-art LD methods. On average, LD models experience a 5.27% drop in Accuracy and a 10.49% decline in F1-score, while ADVLMs show a 2.03% reduction in GPT-score and a 0.75% drop in Language-score. Among all illusions, shadows emerge as the most disruptive factor, reducing accuracy by up to 7.20%. Furthermore, closed-loop simulations reveal that these illusions can lead to incorrect driving decisions. Complementary real-world case studies highlight safety-critical failures in actual traffic scenes. To enhance robustness, we propose the Multimodal Illusion Defense Approach (MIDA). MIDA achieves substantial gains under challenging conditions, boosting robustness by 4.23% on LD models and 3.82% on ADVLMs.

Multimodal LLMs

6 Propose and Attend: Training-free MLLM Grounding Confidence via Multi-Token Localized Attention Daniel Shalam, Emanuel Ben Baruch, Avi Ben Cohen, Tal Remez

Multimodal LLMs Grounding Confidence Attention Analysis

cs.CV cs.AI

Paper 6 / arXiv:2607.05978

Relevance 7 Novelty 6

Why selected: Matches criterion 2 closely: a new training-free grounding-confidence method for MLLMs with multi-token localized attention.

Multimodal large language models can emit localized predictions, bounding boxes for objects and temporal windows for video and audio events, but they hallucinate these regions prolifically. The model's own token log-probabilities are nearly uninformative: they conflate grounding quality with input ambiguity, and coordinate tokens become near-deterministic once the model commits. We propose Multi-Token Localized Attention (MTLA): a training-free, post-hoc score that measures how strongly a prediction's tokens attend to the region they claim. Prior attention-based detectors, which sum attention over the entire input modality and read a single response token, are weaker special cases; we show that summing only within the claimed region and aggregating across all prediction tokens recovers a stronger grounding signal. The same recipe applies almost trivially to other modalities and tasks: object detection in images and temporal localization in video and audio. Across multiple MLLM families and three modalities, MTLA improves hallucination AUROC by +7 to +38 over the best prior training-free baseline. Used as a confidence score for re-ranking, it nearly doubles the zero-shot COCO detection AP of an open-source 8B generalist (from 20.4 to 37.0), narrowing the gap to supervised detectors without any task-specific training.
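
The scoring recipe described in this abstract (sum attention only inside the claimed region, then aggregate over all prediction tokens) can be sketched in a few lines. The version below is a simplified illustration under stated assumptions: attention is already reduced to one weight per prediction token and input patch, aggregation is a plain mean, and the function name is hypothetical. The paper's actual choice of layers, heads, and aggregation is not given here.

```python
def localized_attention_score(attn, region):
    """Mean, over prediction tokens, of the attention mass inside the claimed region.

    attn[t][p]: attention from prediction token t to input patch p
                (assumed already averaged over heads and layers).
    region:     set of patch indices covered by the predicted box or time window.
    """
    per_token = [sum(row[p] for p in region) for row in attn]
    return sum(per_token) / len(per_token)


# Toy input with 4 patches; the prediction claims patches 1 and 2.
claimed = {1, 2}

# Four coordinate tokens that mostly look at the claimed patches.
grounded_attn = [[0.1, 0.4, 0.4, 0.1] for _ in range(4)]
# Four coordinate tokens that mostly look elsewhere.
hallucinated_attn = [[0.4, 0.1, 0.1, 0.4] for _ in range(4)]

grounded_score = localized_attention_score(grounded_attn, claimed)
hallucinated_score = localized_attention_score(hallucinated_attn, claimed)
```

Summing over the whole input would give both cases the same total of 1.0 per token, so restricting the sum to the claimed region is what separates the grounded prediction from the hallucinated one in this toy case.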

Vision-Language Models

7 Analysis-by-Proxy: Localization Signals in VLMs Operating as Condition Encoders Yoav Baron, Sara Dorfman, Roni Paiss, Daniel Cohen-Or, Or Patashnik

Vision-Language Models Localization Model Interpretability

cs.CV cs.AI

Paper 7 / arXiv:2607.06445

Relevance 6 Novelty 7

Why selected: Matches criterion 1 closely: it studies localization signals and spatial understanding inside VLMs, especially when used as condition encoders for image editing.

Vision-Language Models (VLMs) are increasingly utilized as the conditioning backbone for diffusion-based image editing due to their remarkable multimodal reasoning capabilities. While standalone VLMs demonstrate strong localization capabilities, editing pipelines frequently struggle to maintain this accuracy, particularly in complex, multi-entity scenes. In this work, we investigate this performance gap, hypothesizing that it stems from treating the VLM as a condition encoder. In this role, the model is restricted to a single forward pass, preventing the autoregressive generation process for which it was optimized, thereby failing to fully expose its capabilities. To investigate whether this spatial understanding persists when the VLM is used as a condition encoder, we introduce Analysis-by-Proxy. In this framework, we train a lightweight, interpretable proxy model on the VLM's intermediate representations using an auxiliary localization task. By analyzing the VLM through this proxy, we uncover the specific VLM representations that encode localization information. Our findings expose a fundamental mismatch between how spatial knowledge is represented within a VLM condition encoder and how it is extracted by current editing pipelines. We reveal that under single-pass constraints, the localization signal does not reliably propagate to the predefined layer configurations commonly used for conditioning. Instead, this crucial signal remains hidden within intermediate representations, at locations that vary depending on the input prompt. Using our introduced Analysis-by-Proxy framework, we reveal the fundamental failures of existing condition extraction strategies in editing pipelines, opening the door to more principled design of conditioning architectures.

12 PolicyShiftGuard: Benchmarking and Improving Policy-Adaptive Image Guardrails Mingyang Song, Luxin Xu, Haoyu Sun, Minzhou Pan, Yu Cheng, Bo Li

Vision-Language Models Safety Guardrails Benchmark & Evaluation

cs.CV cs.AI cs.CL

Paper 12 / arXiv:2607.05910

Relevance 3 Novelty 5

Why selected: Matches criterion 1 only loosely: policy-adaptive image guardrails involve VLM behavior, but this is safety/benchmarking rather than spatial understanding or embodied AI.

Image guardrails are typically trained and evaluated under a fixed safety policy, implicitly treating safety as an intrinsic property of an image. Real deployments are different: the same image may be allowed in one product, restricted in another, and newly disallowed when a policy boundary changes. We study policy-adaptive image guardrailing, where a model must decide whether an image violates the currently supplied policy and generalize to held-out policy definitions. We introduce PolicyShiftBench, a comprehensive benchmark with 2,000 policy-discriminative instances over 265 images, where each image is paired with 7.55 policy-conditioned prompts on average to test whether models adapt to the active policy rather than relying on image-level safety priors. We then propose PolicyShiftGuard, a compact policy-conditioned guardrail trained with a two-stage training recipe that combines Randomized Policy SFT (RP-SFT) with Boundary-Pair Policy Adaptation (BP-Adapt). BP-Adapt trains matched prompts for the same image and risk category using standard label supervision and a pairwise comparison loss that separates blocking policies from passing policies. Experiments show that existing VLMs and specialized guardrails remain brittle under policy shifts, while PolicyShiftGuard substantially improves policy-sensitive performance. The 7B model achieves SOTA performance of 76.9 Avg. F1 and 72.1 Avg. PSS on PolicyShiftBench, transfers well to UnSafeBench and SafeEditBench, and improves the latency-performance trade-off with a concise output format. Ablations confirm that matched pass/block boundary pairs are essential for stable policy adaptation.
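
The abstract describes BP-Adapt as combining standard label supervision with a pairwise comparison loss over matched pass/block prompts for the same image. One plausible shape for such an objective is sketched below. The logistic pairwise term, the weighting, and every name in the code are assumptions for illustration, not the paper's formulation.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bce_with_logit(logit, label):
    # Binary cross-entropy on a raw logit; label 1 = "violates policy", 0 = "allowed".
    return softplus(logit) - label * logit


def boundary_pair_loss(logit_block, logit_pass, pair_weight=1.0):
    """Loss for one matched pair: same image, one blocking policy, one passing policy.

    logit_block: violation logit under the policy that should block the image.
    logit_pass:  violation logit under the policy that should allow it.
    """
    label_term = bce_with_logit(logit_block, 1) + bce_with_logit(logit_pass, 0)
    # Pairwise term: push the blocking logit above the passing logit.
    pair_term = softplus(-(logit_block - logit_pass))
    return label_term + pair_weight * pair_term


# A model that follows the active policy versus one that gets the pair backwards.
loss_policy_aware = boundary_pair_loss(logit_block=3.0, logit_pass=-3.0)
loss_reversed = boundary_pair_loss(logit_block=-3.0, logit_pass=3.0)
```

The pairwise term only depends on the gap between the two logits for the same image, so it penalizes a model that gives the same answer under both policies even when each label term alone looks acceptable.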

Autonomous Driving

9 Synthetic-to-Real Translation for Class-Agnostic Motion Prediction Yizheng Wu, Hongwei Fan, Kewei Wang, Ruibo Li, Xingyi Li, Xiao Song, Zhe Wang, Chenjing Ding, Dongliang Wang, Zhiguo Cao, Guosheng Lin

Autonomous Driving Synthetic-to-Real Transfer Motion Prediction

cs.CV

Paper 9 / arXiv:2607.06319

Relevance 6 Novelty 5

Why selected: Matches criterion 3: it introduces a simulator/synthetic-data pipeline and a new method for motion prediction via synthetic-to-real transfer, with a new synthetic 4D LiDAR dataset.

Motion understanding is critical for ensuring safety and robustness in autonomous driving systems, driving increasing interest in motion prediction. A key challenge in this domain is the high cost associated with acquiring real-world motion labels. It would therefore be ideal to transfer motion knowledge from synthetic data to real data. In this context, we explore the potential of synthetic-to-real translation for motion prediction (SRMP). However, the most commonly used naive motion regression methods are notably sensitive to the synthetic-to-real domain shift, resulting in unreliable knowledge translation. To address this, we propose a novel approach integrating a motion knowledge translation framework with two key components: (1) objectness-aware motion prediction, which explicitly models the joint distribution of motion patterns and objectness priors to improve domain-invariant feature learning, and (2) objectness-aided motion enhancement, a motion label refinement mechanism that leverages learned objectness priors to filter motion noise. Furthermore, we present a physically-based pipeline for generating Motion4D, the first synthetic 4D LiDAR dataset tailored for SRMP research, addressing the lack of synthetic motion datasets. Experimental results demonstrate that our approach effectively bridges the domain gap and yields superior performance on real scenes.

3D Dense Captioning

10 PVCap: Towards Accurate 3D Dense Captioning via PseudoCap and VoxelCapNet Xiaopei Wu, Chenshu Hou, Liang Peng, Dan Xu, Binbin Lin, Xiaoshui Huang, Yuenan Hou, Yu Li, Wenxiao Wang, Haifeng Liu, Deng Cai, Wanli Ouyang

3D Dense Captioning Embodied Vision-Language Spatial Augmentation

cs.CV cs.AI

Paper 10 / arXiv:2607.06097

Relevance 6 Novelty 5

Why selected: Matches criterion 3: a 3D embodied perception task paper that builds a new method for 3D dense captioning with spatial-layout-aware augmentation and a stronger voxel-based captioning baseline.

3D dense captioning, an emerging vision-language task, aims to generate descriptive sentences for each object in the 3D scene. Despite the impressive results achieved by previous methods, they suffer from two limitations. First, current research often employs global rigid transformations, such as rotation, to augment scenes without changing their spatial layouts. However, diverse spatial layouts are crucial for training a 3D dense captioning model to describe spatial relations between objects. Second, previous works mainly focus on the design of the caption generation pipeline while utilizing a simple network architecture for other components, i.e., the backbone and detection head, which are crucial for extracting rich semantic information for captioning. In this paper, we propose PVCap to alleviate the aforementioned problems. Our PVCap consists of PseudoCap and VoxelCapNet. Specifically, PseudoCap employs a random mixing technique on instances within the dataset, generating numerous pseudo frames with diverse spatial layouts at the instance level. By utilizing a teacher-student framework, PseudoCap obtains pseudo caption labels for these pseudo frames. This data augmentation approach significantly increases the number of training samples and enhances the model's ability to describe the environment effectively. Regarding VoxelCapNet, we introduce a robust caption network that utilizes voxel features and adapts the caption head to the voxel-based network architecture. Our VoxelCapNet can serve as a competitive baseline for future research on 3D dense captioning. Extensive experiments are conducted on two prevalent benchmarks, i.e., ScanRefer and Nr3D. Notably, our method surpasses the current state of the art by 11.41% and 13.99% in CIDEr@0.5IoU, respectively. Code will be made publicly available.

Paper selection prompt

 1. New methodological improvements to spatial understanding and spatial intelligence in embodied agents;
 2. Papers that present new VLLMs (visual large language models) or MLLMs (multi-modal large language models);
 3. Embodied AI papers on building new benchmarks (simulator related) or new methods. These papers should focus on novel angles that previous work ignored;
 4. Papers on vision foundation models and their applications.

 In suggesting papers to your friend, remember that he enjoys papers on computer vision and machine learning, and generative modeling in multi-modal learning.
 Your friend also likes learning about surprising empirical or insightful results in vision-language models or embodied AI, as well as clever statistical tricks.