Personalized Daily ArXiv Papers 03/31/2025

Total relevant papers: 54

Paper selection prompt and criteria at the bottom

Table of contents with paper titles:

  1. M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization? Authors: Haolong Yan, Kaijun Tan, Yeqing Shen, Xin Huang, Zheng Ge, Xiangyu Zhang, Si Li, Daxin Jiang

  2. EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos Authors: Yuxuan Li, Vijay Veerabadran, Michael L. Iuzzolino, Brett D. Roads, Asli Celikyilmaz, Karl Ridgeway

  3. CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models Authors: Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Ming-Yu Liu, Donglai Xiang, Gordon Wetzstein, Tsung-Yi Lin

  4. LIM: Large Interpolator Model for Dynamic Reconstruction Authors: Remy Sabathier, Niloy J. Mitra, David Novotny

  5. ORIGEN: Zero-Shot 3D Orientation Grounding in Text-to-Image Generation Authors: Yunhong Min, Daehyeon Choi, Kyeongmin Yeo, Jihyun Lee, Minhyuk Sung

  6. MVSAnywhere: Zero-Shot Multi-View Stereo Authors: Sergio Izquierdo, Mohamed Sayed, Michael Firman, Guillermo Garcia-Hernando, Daniyar Turmukhambetov, Javier Civera, Oisin Mac Aodha, Gabriel Brostow, Jamie Watson

  7. Harmonizing Visual Representations for Unified Multimodal Understanding and Generation Authors: Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Zhonghua Wu, Qingyi Tao, Wentao Liu, Wei Li, Chen Change Loy

  8. HyperFree: A Channel-adaptive and Tuning-free Foundation Model for Hyperspectral Remote Sensing Imagery Authors: Jingtao Li, Yingyi Liu, Xinyu Wang, Yunning Peng, Chen Sun, Shaoyu Wang, Zhendong Sun, Tian Ke, Xiao Jiang, Tangwei Lu, Anran Zhao, Yanfei Zhong

  9. UFM: Unified Feature Matching Pre-training with Multi-Modal Image Assistants Authors: Yide Di, Yun Liao, Hao Zhou, Kaijun Zhu, Qing Duan, Junhui Liu, Mingyu Lu

  10. ArchCAD-400K: An Open Large-Scale Architectural CAD Dataset and New Baseline for Panoptic Symbol Spotting Authors: Ruifeng Luo, Zhengjie Liu, Tianxiao Cheng, Jie Wang, Tongjie Wang, Xingguang Wei, Haomin Wang, YanPeng Li, Fu Chai, Fei Cheng, Shenglong Ye, Wenhai Wang, Yanting Zhang, Yu Qiao, Hongjie Zhang, Xianzhong Zhao

  11. Multi-modal Knowledge Distillation-based Human Trajectory Forecasting Authors: Jaewoo Jeong, Seohee Lee, Daehee Park, Giwon Lee, Kuk-Jin Yoon

  12. Segment Any Motion in Videos Authors: Nan Huang, Wenzhao Zheng, Chenfeng Xu, Kurt Keutzer, Shanghang Zhang, Angjoo Kanazawa, Qianqian Wang

  13. NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving Authors: Fuhao Li, Huan Jin, Bin Gao, Liaoyuan Fan, Lihui Jiang, Long Zeng

  14. BOOTPLACE: Bootstrapped Object Placement with Detection Transformers Authors: Hang Zhou, Xinxin Zuo, Rui Ma, Li Cheng

  15. AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis Authors: Zhiwei Yang, Chen Gao, Jing Liu, Peng Wu, Guansong Pang, Mike Zheng Shou

  16. Learning to Instruct for Visual Instruction Tuning Authors: Zhihan Zhou, Feng Hong, Jiaan Luo, Jiangchao Yao, Dongsheng Li, Bo Han, Ya Zhang, Yanfeng Wang

  17. Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis Authors: Jiangyong Huang, Baoxiong Jia, Yan Wang, Ziyu Zhu, Xiongkun Linghu, Qing Li, Song-Chun Zhu, Siyuan Huang

  18. A Dataset for Semantic Segmentation in the Presence of Unknowns Authors: Zakaria Laskar, Tomas Vojir, Matej Grcic, Iaroslav Melekhov, Shankar Gangisettye, Juho Kannala, Jiri Matas, Giorgos Tolias, C. V. Jawahar

  19. Mono2Stereo: A Benchmark and Empirical Study for Stereo Conversion Authors: Songsong Yu, Yuxin Chen, Zhongang Qi, Zeke Xie, Yifan Wang, Lijun Wang, Ying Shan, Huchuan Lu

  20. NeRF-based Point Cloud Reconstruction using a Stationary Camera for Agricultural Applications Authors: Kibon Ku, Talukder Z Jubery, Elijah Rodriguez, Aditya Balu, Soumik Sarkar, Adarsh Krishnamurthy, Baskar Ganapathysubramanian

  21. Intrinsic Image Decomposition for Robust Self-supervised Monocular Depth Estimation on Reflective Surfaces Authors: Wonhyeok Choi, Kyumin Hwang, Minwoo Choi, Kiljoon Han, Wonjoon Choi, Mingyu Shin, Sunghoon Im

  22. RUNA: Object-level Out-of-Distribution Detection via Regional Uncertainty Alignment of Multimodal Representations Authors: Bin Zhang, Jinggang Chen, Xiaoyang Qu, Guokuan Li, Kai Lu, Jiguang Wan, Jing Xiao, Jianzong Wang

  23. Skip-Vision: A Comprehensive Framework for Accelerating Vision-Language Models Authors: Weili Zeng, Ziyuan Huang, Kaixiang Ji, Yichao Yan

  24. Breaking Language Barriers in Visual Language Models via Multilingual Textual Regularization Authors: Iñigo Pikabea, Iñaki Lacunza, Oriol Pareras, Carlos Escolano, Aitor Gonzalez-Agirre, Javier Hernando, Marta Villegas

  25. ActionStudio: A Lightweight Framework for Data and Training of Action Models Authors: Jianguo Zhang, Thai Hoang, Ming Zhu, Zuxin Liu, Shiyu Wang, Tulika Awalgaonkar, Akshara Prabhakar, Haolin Chen, Weiran Yao, Zhiwei Liu, Juntao Tan, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong

  26. VisTa: Visual-contextual and Text-augmented Zero-shot Object-level OOD Detection Authors: Bin Zhang, Xiaoyang Qu, Guokuan Li, Jiguang Wan, Jianzong Wang

  27. EndoLRMGS: Complete Endoscopic Scene Reconstruction combining Large Reconstruction Modelling and Gaussian Splatting Authors: Xu Wang, Shuai Zhang, Baoru Huang, Danail Stoyanov, Evangelos B. Mazomenos

  28. Concept-Aware LoRA for Domain-Aligned Segmentation Dataset Generation Authors: Minho Park, Sunghyun Park, Jungsoo Lee, Hyojin Park, Kyuwoong Hwang, Fatih Porikli, Jaegul Choo, Sungha Choi

  29. A Survey on Remote Sensing Foundation Models: From Vision to Multimodality Authors: Ziyue Huang, Hongxi Yan, Qiqi Zhan, Shuai Yang, Mingming Zhang, Chenkai Zhang, YiMing Lei, Zeming Liu, Qingjie Liu, Yunhong Wang

  30. KernelFusion: Assumption-Free Blind Super-Resolution via Patch Diffusion Authors: Oliver Heinimann, Assaf Shocher, Tal Zimbalist, Michal Irani

  31. Parametric Shadow Control for Portrait Generation in Text-to-Image Diffusion Models Authors: Haoming Cai, Tsung-Wei Huang, Shiv Gehlot, Brandon Y. Feng, Sachin Shah, Guan-Ming Su, Christopher Metzler

  32. Protecting Your Video Content: Disrupting Automated Video-based LLM Annotations Authors: Haitong Liu, Kuofeng Gao, Yang Bai, Jinmin Li, Jinxiao Shan, Tao Dai, Shu-Tao Xia

  33. DSO: Aligning 3D Generators with Simulation Feedback for Physical Soundness Authors: Ruining Li, Chuanxia Zheng, Christian Rupprecht, Andrea Vedaldi

  34. Mitigating Knowledge Discrepancies among Multiple Datasets for Task-agnostic Unified Face Alignment Authors: Jiahao Xia, Min Xu, Wenjian Huang, Jianguo Zhang, Haimin Zhang, Chunxia Xiao

  35. Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment Authors: Audrey Huang, Adam Block, Qinghua Liu, Nan Jiang, Dylan J. Foster, Akshay Krishnamurthy

  36. Unbiased Max-Min Embedding Classification for Transductive Few-Shot Learning: Clustering and Classification Are All You Need Authors: Yang Liu, Feixiang Liu, Jiale Du, Xinbo Gao, Jungong Han

  37. Multi-Task Semantic Communications via Large Models Authors: Wanli Ni, Zhijin Qin, Haofeng Sun, Xiaoming Tao, Zhu Han

  38. Modeling Multiple Normal Action Representations for Error Detection in Procedural Tasks Authors: Wei-Jin Huang, Yuan-Ming Li, Zhi-Wei Xia, Yu-Ming Tang, Kun-Yu Lin, Jian-Fang Hu, Wei-Shi Zheng

  39. Foveated Instance Segmentation Authors: Hongyi Zeng, Wenxuan Liu, Tianhua Xia, Jinhui Chen, Ziyun Li, Sai Qian Zhang

  40. Efficient Continual Learning through Frequency Decomposition and Integration Authors: Ruiqi Liu, Boyu Diao, Libo Huang, Hangda Liu, Chuanguang Yang, Zhulin An, Yongjun Xu

  41. Q-MambaIR: Accurate Quantized Mamba for Efficient Image Restoration Authors: Yujie Chen, Haotong Qin, Zhang Zhang, Michele Magno, Luca Benini, Yawei Li

  42. Extremely Simple Out-of-distribution Detection for Audio-visual Generalized Zero-shot Learning Authors: Yang Liu, Xun Zhang, Jiale Du, Xinbo Gao, Jungong Han

  43. Multimodal Data Integration for Sustainable Indoor Gardening: Tracking Anyplant with Time Series Foundation Model Authors: Seyed Hamidreza Nabaei, Zeyang Zheng, Dong Chen, Arsalan Heydarian

  44. Flexible Moment-Invariant Bases from Irreducible Tensors Authors: Roxana Bujack, Emily Shinkle, Alice Allen, Tomas Suk, Nicholas Lubbers

  45. MO-CTranS: A unified multi-organ segmentation model learning from multiple heterogeneously labelled datasets Authors: Zhendi Gong, Susan Francis, Eleanor Cox, Stamatios N. Sotiropoulos, Dorothee P. Auer, Guoping Qiu, Andrew P. French, Xin Chen

  46. One Look is Enough: A Novel Seamless Patchwise Refinement for Zero-Shot Monocular Depth Estimation Models on High-Resolution Images Authors: Byeongjun Kwon, Munchurl Kim

  47. High-Fidelity Diffusion Face Swapping with ID-Constrained Facial Conditioning Authors: Dailan He, Xiahong Wang, Shulun Wang, Guanglu Song, Bingqi Ma, Hao Shao, Yu Liu, Hongsheng Li

  48. Follow Your Motion: A Generic Temporal Consistency Portrait Editing Framework with Trajectory Guidance Authors: Haijie Yang, Zhenyu Zhang, Hao Tang, Jianjun Qian, Jian Yang

  49. Beyond Background Shift: Rethinking Instance Replay in Continual Semantic Segmentation Authors: Hongmei Yin, Tingliang Feng, Fan Lyu, Fanhua Shang, Hongying Liu, Wei Feng, Liang Wan

  50. Permutation-Invariant and Orientation-Aware Dataset Distillation for 3D Point Clouds Authors: Jae-Young Yim, Dongwook Kim, Jae-Young Sim

  51. Data Quality Matters: Quantifying Image Quality Impact on Machine Learning Performance Authors: Christian Steinhauser, Philipp Reis, Hubert Padusinski, Jacob Langner, Eric Sax

  52. Synergistic Bleeding Region and Point Detection in Surgical Videos Authors: Jialun Pei, Zhangjun Zhou, Diandian Guo, Zhixi Li, Jing Qin, Bo Du, Pheng-Ann Heng

  53. Contrasting Low and High-Resolution Features for HER2 Scoring using Deep Learning Authors: Ekansh Chauhan, Anila Sharma, Amit Sharma, Vikas Nishadham, Asha Ghughtyal, Ankur Kumar, Gurudutt Gupta, Anurag Mehta, C. V. Jawahar, P. K. Vinod

  54. CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models Authors: Zhihang Lin, Mingbao Lin, Yuan Xie, Rongrong Ji


1. M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization?
ArXiv ID: 2503.21839 Authors: Haolong Yan, Kaijun Tan, Yeqing Shen, Xin Huang, Zheng Ge, Xiangyu Zhang, Si Li, Daxin Jiang

Abstract: We investigate a critical yet under-explored question in Large Vision-Language Models (LVLMs): Do LVLMs genuinely comprehend interleaved image-text in the document? Existing document understanding benchmarks often assess LVLMs using question-answer formats, which are information-sparse and difficult to guarantee the coverage of long-range dependencies. To address this issue, we introduce a novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench), which comprises 500 high-quality arXiv papers, along with interleaved multimodal summaries aligned with human preferences. M-DocSum-Bench is a reference-based generation task and necessitates the generation of interleaved image-text summaries using provided reference images, thereby simultaneously evaluating capabilities in understanding, reasoning, localization, and summarization within complex multimodal document scenarios. To facilitate this benchmark, we develop an automated framework to construct summaries and propose a fine-grained evaluation method called M-DocEval. Moreover, we further develop a robust summarization baseline, i.e., M-DocSum-7B, by progressive two-stage training with diverse instruction and preference data. The extensive results on our M-DocSum-Bench reveal that the leading LVLMs struggle to maintain coherence and accurately integrate information within long and interleaved contexts, often exhibiting confusion between similar images and a lack of robustness. Notably, M-DocSum-7B achieves state-of-the-art performance compared to larger and closed-source models (including GPT-4o, Gemini Pro, Claude-3.5-Sonnet and Qwen2.5-VL-72B, etc.), demonstrating the potential of LVLMs for improved interleaved image-text understanding. The code, data, and models are available at https://github.com/stepfun-ai/M-DocSum-Bench.

Comment: Matches criterion 2 as it evaluates LVLMs for interleaved image-text document summarization and introduces a new benchmark. Relevance: 9 Novelty: 8


2. EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos
ArXiv ID: 2503.22152 Authors: Yuxuan Li, Vijay Veerabadran, Michael L. Iuzzolino, Brett D. Roads, Asli Celikyilmaz, Karl Ridgeway

Abstract: We introduce EgoToM, a new video question-answering benchmark that extends Theory-of-Mind (ToM) evaluation to egocentric domains. Using a causal ToM model, we generate multi-choice video QA instances for the Ego4D dataset to benchmark the ability to predict a camera wearer's goals, beliefs, and next actions. We study the performance of both humans and state of the art multimodal large language models (MLLMs) on these three interconnected inference problems. Our evaluation shows that MLLMs achieve close to human-level accuracy on inferring goals from egocentric videos. However, MLLMs (including the largest ones we tested with over 100B parameters) fall short of human performance when inferring the camera wearers' in-the-moment belief states and future actions that are most consistent with the unseen video future. We believe that our results will shape the future design of an important class of egocentric digital assistants which are equipped with a reasonable model of the user's internal mental states.

Comment: Matches criterion 3 as it introduces a new benchmark (EgoToM) for embodied AI focusing on Theory of Mind reasoning from egocentric videos. Relevance: 9 Novelty: 8


3. CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
ArXiv ID: 2503.22020 Authors: Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Ming-Yu Liu, Donglai Xiang, Gordon Wetzstein, Tsung-Yi Lin

Abstract: Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations for learning generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input--output mappings, lacking the intermediate reasoning steps crucial for complex manipulation tasks. As a result, existing VLAs lack temporal planning or reasoning capabilities. In this paper, we introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs) by predicting future image frames autoregressively as visual goals before generating a short action sequence to achieve these goals. We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks. Project website: https://cot-vla.github.io/

Comment: Matches criterion 3 as it introduces a novel method for embodied AI with visual chain-of-thought reasoning for vision-language-action models. Relevance: 9 Novelty: 8


4. LIM: Large Interpolator Model for Dynamic Reconstruction
ArXiv ID: 2503.22537 Authors: Remy Sabathier, Niloy J. Mitra, David Novotny

Abstract: Reconstructing dynamic assets from video data is central to many computer vision and graphics tasks. Existing 4D reconstruction approaches are limited by category-specific models or slow optimization-based methods. Inspired by the recent Large Reconstruction Model (LRM), we present the Large Interpolation Model (LIM), a transformer-based feed-forward solution, guided by a novel causal consistency loss, for interpolating implicit 3D representations across time. Given implicit 3D representations at times $t_0$ and $t_1$, LIM produces a deformed shape at any continuous time $t\in[t_0,t_1]$, delivering high-quality interpolated frames in seconds. Furthermore, LIM allows explicit mesh tracking across time, producing a consistently uv-textured mesh sequence ready for integration into existing production pipelines. We also use LIM, in conjunction with a diffusion-based multiview generator, to produce dynamic 4D reconstructions from monocular videos. We evaluate LIM on various dynamic datasets, benchmarking against image-space interpolation methods (e.g., FiLM) and direct triplane linear interpolation, and demonstrate clear advantages. In summary, LIM is the first feed-forward model capable of high-speed tracked 4D asset reconstruction across diverse categories.
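
The abstract cites direct triplane linear interpolation as one of the baselines LIM is compared against. For orientation, a minimal sketch of that baseline is below, assuming the implicit 3D representation is stored as three feature planes; shapes and names are illustrative, not LIM's actual code.

```python
import torch

def lerp_triplanes(planes_t0, planes_t1, t, t0=0.0, t1=1.0):
    """Naive baseline: linearly blend two triplane feature sets at time t.

    planes_t0 / planes_t1: tensors of shape (3, C, H, W) holding the
    XY/XZ/YZ feature planes of the implicit representation at t0 and t1.
    LIM instead predicts the intermediate representation with a feed-forward
    transformer; this is only the "direct triplane linear interpolation"
    comparison point mentioned in the abstract.
    """
    w = (t - t0) / (t1 - t0)
    return (1.0 - w) * planes_t0 + w * planes_t1

mid_planes = lerp_triplanes(torch.randn(3, 32, 64, 64), torch.randn(3, 32, 64, 64), t=0.25)
```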

Comment: Matches criterion 4 as it introduces a vision foundation model (LIM) for dynamic 4D reconstruction. Relevance: 8 Novelty: 8


5. ORIGEN: Zero-Shot 3D Orientation Grounding in Text-to-Image Generation
ArXiv ID: 2503.22194 Authors: Yunhong Min, Daehyeon Choi, Kyeongmin Yeo, Jihyun Lee, Minhyuk Sung

Abstract: We introduce ORIGEN, the first zero-shot method for 3D orientation grounding in text-to-image generation across multiple objects and diverse categories. While previous work on spatial grounding in image generation has mainly focused on 2D positioning, it lacks control over 3D orientation. To address this, we propose a reward-guided sampling approach using a pretrained discriminative model for 3D orientation estimation and a one-step text-to-image generative flow model. While gradient-ascent-based optimization is a natural choice for reward-based guidance, it struggles to maintain image realism. Instead, we adopt a sampling-based approach using Langevin dynamics, which extends gradient ascent by simply injecting random noise, requiring just a single additional line of code. Additionally, we introduce adaptive time rescaling based on the reward function to accelerate convergence. Our experiments show that ORIGEN outperforms both training-based and test-time guidance methods across quantitative metrics and user studies.
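
The sampling idea is concrete enough to sketch: plain gradient ascent on the reward becomes Langevin dynamics once Gaussian noise is injected at each step. Below is a toy illustration of that update rule on a latent; `reward_fn`, the step size, and the 2D toy usage are placeholders, not the paper's reward model or flow model.

```python
import torch

def langevin_reward_guidance(latent, reward_fn, eta=0.05, num_steps=50):
    """Reward-guided sampling via Langevin dynamics (toy sketch).

    Plain gradient ascent would keep only the first update line; Langevin
    dynamics adds Gaussian noise scaled by sqrt(2 * eta), which is the
    "single additional line" the abstract refers to.
    """
    x = latent.clone().requires_grad_(True)
    for _ in range(num_steps):
        reward = reward_fn(x)                    # scalar reward, e.g. an orientation score
        grad, = torch.autograd.grad(reward, x)
        with torch.no_grad():
            x += eta * grad                               # gradient ascent on the reward
            x += (2 * eta) ** 0.5 * torch.randn_like(x)   # Langevin noise term
    return x.detach()

# Toy usage: pull 2D latents toward the origin while keeping samples spread out.
samples = langevin_reward_guidance(torch.randn(8, 2), lambda x: -(x ** 2).sum())
```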

Comment: Matches criterion 4 as it introduces a novel method for 3D orientation grounding in text-to-image generation, which is a vision foundation model application. Relevance: 8 Novelty: 7


6. MVSAnywhere: Zero-Shot Multi-View Stereo
ArXiv ID: 2503.22430 Authors: Sergio Izquierdo, Mohamed Sayed, Michael Firman, Guillermo Garcia-Hernando, Daniyar Turmukhambetov, Javier Civera, Oisin Mac Aodha, Gabriel Brostow, Jamie Watson

Abstract: Computing accurate depth from multiple views is a fundamental and longstanding challenge in computer vision. However, most existing approaches do not generalize well across different domains and scene types (e.g. indoor vs. outdoor). Training a general-purpose multi-view stereo model is challenging and raises several questions, e.g. how to best make use of transformer-based architectures, how to incorporate additional metadata when there is a variable number of input views, and how to estimate the range of valid depths which can vary considerably across different scenes and is typically not known a priori? To address these issues, we introduce MVSA, a novel and versatile Multi-View Stereo architecture that aims to work Anywhere by generalizing across diverse domains and depth ranges. MVSA combines monocular and multi-view cues with an adaptive cost volume to deal with scale-related issues. We demonstrate state-of-the-art zero-shot depth estimation on the Robust Multi-View Depth Benchmark, surpassing existing multi-view stereo and monocular baselines.

Comment: Matches criterion 1 as it introduces a novel architecture for multi-view stereo depth estimation, focusing on spatial understanding and generalization across diverse domains. Relevance: 8 Novelty: 7


7. Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
ArXiv ID: 2503.21979 Authors: Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Zhonghua Wu, Qingyi Tao, Wentao Liu, Wei Li, Chen Change Loy

Abstract: Unifying visual understanding and generation within a single multimodal framework remains a significant challenge, as the two inherently heterogeneous tasks require representations at different levels of granularity. Current approaches that utilize vector quantization (VQ) or variational autoencoders (VAE) for unified visual representation prioritize intrinsic imagery features over semantics, compromising understanding performance. In this work, we take inspiration from masked image modelling (MIM) that learns rich semantics via a mask-and-reconstruct pre-training and its successful extension to masked autoregressive (MAR) image generation. A preliminary study on the MAR encoder's representation reveals exceptional linear probing accuracy and precise feature response to visual concepts, which indicates MAR's potential for visual understanding tasks beyond its original generation role. Based on these insights, we present Harmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder. Through a three-stage training procedure that progressively optimizes understanding and generation capabilities, Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ30K and WISE benchmarks while matching the performance of methods with dedicated semantic encoders (e.g., Janus) on image understanding benchmarks. Our code and models will be available at https://github.com/wusize/Harmon.

Comment: Matches criterion 2 as it proposes a unified multimodal framework for visual understanding and generation, which aligns with advancements in VLLMs/MLLMs. Relevance: 8 Novelty: 7


8. HyperFree: A Channel-adaptive and Tuning-free Foundation Model for Hyperspectral Remote Sensing Imagery
ArXiv ID: 2503.21841 Authors: Jingtao Li, Yingyi Liu, Xinyu Wang, Yunning Peng, Chen Sun, Shaoyu Wang, Zhendong Sun, Tian Ke, Xiao Jiang, Tangwei Lu, Anran Zhao, Yanfei Zhong

Abstract: Advanced interpretation of hyperspectral remote sensing images benefits many precise Earth observation tasks. Recently, visual foundation models have promoted remote sensing interpretation but concentrate on RGB and multispectral images. Due to the varied hyperspectral channels, existing foundation models would face an image-by-image tuning situation, imposing great pressure on hardware and time resources. In this paper, we propose a tuning-free hyperspectral foundation model called HyperFree, by adapting the existing visual prompt engineering. To process varied channel numbers, we design a learned weight dictionary covering the full spectrum from $0.4 \sim 2.5\,\mu\text{m}$, supporting building the embedding layer dynamically. To make the prompt design more tractable, HyperFree can generate multiple semantic-aware masks for one prompt by treating feature distance as semantic similarity. After pre-training HyperFree on constructed large-scale high-resolution hyperspectral images, HyperFree (1 prompt) has shown comparable results with specialized models (5 shots) on 5 tasks and 11 datasets. Code and dataset are accessible at https://rsidea.whu.edu.cn/hyperfree.htm.
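
As a rough illustration of how a wavelength-indexed weight dictionary can build a channel-adaptive embedding layer on the fly, consider the sketch below. The class name, the nearest-neighbour lookup, and all shapes are assumptions made for illustration; they are not HyperFree's released code.

```python
import torch
import torch.nn as nn

class WavelengthEmbedding(nn.Module):
    """Build a patch-embedding convolution on the fly from a wavelength-indexed
    weight dictionary, so inputs with arbitrary channel counts can be embedded."""

    def __init__(self, dim=256, patch=16, n_entries=512, wl_min=0.4, wl_max=2.5):
        super().__init__()
        # One (dim x patch x patch) filter per dictionary wavelength (0.4-2.5 um).
        self.dictionary = nn.Parameter(torch.randn(n_entries, dim, patch, patch) * 0.02)
        self.register_buffer("wavelengths", torch.linspace(wl_min, wl_max, n_entries))
        self.patch = patch

    def forward(self, image, band_wavelengths):
        # image: (B, C, H, W); band_wavelengths: (C,) centre wavelengths in micrometres.
        idx = torch.cdist(band_wavelengths[:, None],
                          self.wavelengths[:, None]).argmin(dim=1)  # nearest entry per band
        weight = self.dictionary[idx].permute(1, 0, 2, 3)           # -> (dim, C, patch, patch)
        return nn.functional.conv2d(image, weight, stride=self.patch)

embed = WavelengthEmbedding()
tokens = embed(torch.randn(1, 120, 64, 64), torch.linspace(0.4, 2.5, 120))  # any band count
```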

Comment: Matches criterion 4 as it proposes a tuning-free hyperspectral foundation model for remote sensing imagery. Relevance: 8 Novelty: 7


9. UFM: Unified Feature Matching Pre-training with Multi-Modal Image Assistants
ArXiv ID: 2503.21820 Authors: Yide Di, Yun Liao, Hao Zhou, Kaijun Zhu, Qing Duan, Junhui Liu, Mingyu Lu

Abstract: Image feature matching, a foundational task in computer vision, remains challenging for multimodal image applications, often necessitating intricate training on specific datasets. In this paper, we introduce a Unified Feature Matching pre-trained model (UFM) designed to address feature matching challenges across a wide spectrum of modal images. We present Multimodal Image Assistant (MIA) transformers, finely tunable structures adept at handling diverse feature matching problems. UFM exhibits versatility in addressing both feature matching tasks within the same modal and those across different modals. Additionally, we propose a data augmentation algorithm and a staged pre-training strategy to effectively tackle challenges arising from sparse data in specific modals and imbalanced modal datasets. Experimental results demonstrate that UFM excels in generalization and performance across various feature matching tasks. The code will be released at: https://github.com/LiaoYun0x0/UFM.

Comment: Matches criterion 4 as it introduces a unified feature matching pre-trained model for multi-modal image applications. Relevance: 8 Novelty: 7


10. ArchCAD-400K: An Open Large-Scale Architectural CAD Dataset and New Baseline for Panoptic Symbol Spotting
ArXiv ID: 2503.22346 Authors: Ruifeng Luo, Zhengjie Liu, Tianxiao Cheng, Jie Wang, Tongjie Wang, Xingguang Wei, Haomin Wang, YanPeng Li, Fu Chai, Fei Cheng, Shenglong Ye, Wenhai Wang, Yanting Zhang, Yu Qiao, Hongjie Zhang, Xianzhong Zhao

Abstract: Recognizing symbols in architectural CAD drawings is critical for various advanced engineering applications. In this paper, we propose a novel CAD data annotation engine that leverages intrinsic attributes from systematically archived CAD drawings to automatically generate high-quality annotations, thus significantly reducing manual labeling efforts. Utilizing this engine, we construct ArchCAD-400K, a large-scale CAD dataset consisting of 413,062 chunks from 5538 highly standardized drawings, making it over 26 times larger than the largest existing CAD dataset. ArchCAD-400K boasts an extended drawing diversity and broader categories, offering line-grained annotations. Furthermore, we present a new baseline model for panoptic symbol spotting, termed Dual-Pathway Symbol Spotter (DPSS). It incorporates an adaptive fusion module to enhance primitive features with complementary image features, achieving state-of-the-art performance and enhanced robustness. Extensive experiments validate the effectiveness of DPSS, demonstrating the value of ArchCAD-400K and its potential to drive innovation in architectural design and construction.

Comment: Matches criterion 3 as it introduces a new large-scale benchmark dataset and a novel baseline for architectural CAD symbol spotting. Relevance: 7 Novelty: 8


11. Multi-modal Knowledge Distillation-based Human Trajectory Forecasting
ArXiv ID: 2503.22201 Authors: Jaewoo Jeong, Seohee Lee, Daehee Park, Giwon Lee, Kuk-Jin Yoon

Abstract: Pedestrian trajectory forecasting is crucial in various applications such as autonomous driving and mobile robot navigation. In such applications, camera-based perception enables the extraction of additional modalities (human pose, text) to enhance prediction accuracy. Indeed, we find that textual descriptions play a crucial role in integrating additional modalities into a unified understanding. However, online extraction of text requires the use of VLM, which may not be feasible for resource-constrained systems. To address this challenge, we propose a multi-modal knowledge distillation framework: a student model with limited modality is distilled from a teacher model trained with full range of modalities. The comprehensive knowledge of a teacher model trained with trajectory, human pose, and text is distilled into a student model using only trajectory or human pose as a sole supplement. In doing so, we separately distill the core locomotion insights from intra-agent multi-modality and inter-agent interaction. Our generalizable framework is validated with two state-of-the-art models across three datasets on both ego-view (JRDB, SIT) and BEV-view (ETH/UCY) setups, utilizing both annotated and VLM-generated text captions. Distilled student models show consistent improvement in all prediction metrics for both full and instantaneous observations, improving up to ~13%. The code is available at https://github.com/Jaewoo97/KDTF.
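
A minimal sketch of a distillation objective in this spirit: a trajectory-only student is trained to match both the predictions and an intermediate feature of a teacher that saw trajectory, pose, and text. The loss terms and weighting are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_pred, teacher_pred, student_feat, teacher_feat, alpha=0.5):
    """Toy multi-modal knowledge-distillation objective (sketch only).

    student_pred / teacher_pred: predicted future trajectories, shape (B, T, 2).
    student_feat / teacher_feat: intermediate embeddings, shape (B, D).
    The teacher is trained with trajectory + pose + text; the student sees a
    reduced modality set and mimics the teacher's outputs and features.
    """
    pred_term = F.smooth_l1_loss(student_pred, teacher_pred)
    feat_term = (1.0 - F.cosine_similarity(student_feat, teacher_feat, dim=-1)).mean()
    return alpha * pred_term + (1.0 - alpha) * feat_term

loss = distillation_loss(torch.randn(4, 12, 2), torch.randn(4, 12, 2),
                         torch.randn(4, 128), torch.randn(4, 128))
```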

Comment: Matches criterion 1 as it proposes a new multi-modal knowledge distillation framework for spatial understanding in trajectory forecasting. Relevance: 8 Novelty: 7


12. Segment Any Motion in Videos
ArXiv ID: 2503.22268 Authors: Nan Huang, Wenzhao Zheng, Chenfeng Xu, Kurt Keutzer, Shanghang Zhang, Angjoo Kanazawa, Qianqian Wang

Abstract: Moving object segmentation is a crucial task for achieving a high-level understanding of visual scenes and has numerous downstream applications. Humans can effortlessly segment moving objects in videos. Previous work has largely relied on optical flow to provide motion cues; however, this approach often results in imperfect predictions due to challenges such as partial motion, complex deformations, motion blur and background distractions. We propose a novel approach for moving object segmentation that combines long-range trajectory motion cues with DINO-based semantic features and leverages SAM2 for pixel-level mask densification through an iterative prompting strategy. Our model employs Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to prioritize motion while integrating semantic support. Extensive testing on diverse datasets demonstrates state-of-the-art performance, excelling in challenging scenarios and fine-grained segmentation of multiple objects. Our code is available at https://motion-seg.github.io/.

Comment: Matches criterion 1 as it proposes a novel approach for moving object segmentation with spatial understanding improvements. Relevance: 8 Novelty: 7


13. NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving
ArXiv ID: 2503.22436 Authors: Fuhao Li, Huan Jin, Bin Gao, Liaoyuan Fan, Lihui Jiang, Long Zeng

Abstract: Multi-view 3D visual grounding is critical for autonomous driving vehicles to interpret natural languages and localize target objects in complex environments. However, existing datasets and methods suffer from coarse-grained language instructions, and inadequate integration of 3D geometric reasoning with linguistic comprehension. To this end, we introduce NuGrounding, the first large-scale benchmark for multi-view 3D visual grounding in autonomous driving. We present a Hierarchy of Grounding (HoG) method to construct NuGrounding to generate hierarchical multi-level instructions, ensuring comprehensive coverage of human instruction patterns. To tackle this challenging dataset, we propose a novel paradigm that seamlessly combines instruction comprehension abilities of multi-modal LLMs (MLLMs) with precise localization abilities of specialist detection models. Our approach introduces two decoupled task tokens and a context query to aggregate 3D geometric information and semantic instructions, followed by a fusion decoder to refine spatial-semantic feature fusion for precise localization. Extensive experiments demonstrate that our method significantly outperforms the baselines adapted from representative 3D scene understanding methods by a significant margin and achieves 0.59 in precision and 0.64 in recall, with improvements of 50.8% and 54.7%.

Comment: Matches criterion 3 as it introduces a new benchmark (NuGrounding) for multi-view 3D visual grounding in autonomous driving and proposes a novel method for spatial-semantic feature fusion. Relevance: 8 Novelty: 7


14. BOOTPLACE: Bootstrapped Object Placement with Detection Transformers
ArXiv ID: 2503.21991 Authors: Hang Zhou, Xinxin Zuo, Rui Ma, Li Cheng

Abstract: In this paper, we tackle the copy-paste image-to-image composition problem with a focus on object placement learning. Prior methods have leveraged generative models to reduce the reliance on dense supervision. However, this often limits their capacity to model complex data distributions. Alternatively, transformer networks with a sparse contrastive loss have been explored, but their over-relaxed regularization often leads to imprecise object placement. We introduce BOOTPLACE, a novel paradigm that formulates object placement as a placement-by-detection problem. Our approach begins by identifying suitable regions of interest for object placement. This is achieved by training a specialized detection transformer on object-subtracted backgrounds, enhanced with multi-object supervisions. It then semantically associates each target compositing object with detected regions based on their complementary characteristics. Through a bootstrapped training approach applied to randomly object-subtracted images, our model enforces meaningful placements through extensive paired data augmentation. Experimental results on established benchmarks demonstrate BOOTPLACE's superior performance in object repositioning, markedly surpassing state-of-the-art baselines on Cityscapes and OPA datasets with notable improvements in IOU scores. Additional ablation studies further showcase the compositionality and generalizability of our approach, supported by user study evaluations.

Comment: Matches criterion 3 as it introduces a novel object placement method using detection transformers, which is a new angle in embodied AI tasks. Relevance: 7 Novelty: 7


15. AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis
ArXiv ID: 2503.21904 Authors: Zhiwei Yang, Chen Gao, Jing Liu, Peng Wu, Guansong Pang, Mike Zheng Shou

Abstract: The rapid advancements in large language models (LLMs) have spurred growing interest in LLM-based video anomaly detection (VAD). However, existing approaches predominantly focus on video-level anomaly question answering or offline detection, ignoring the real-time nature essential for practical VAD applications. To bridge this gap and facilitate the practical deployment of LLM-based VAD, we introduce AssistPDA, the first online video anomaly surveillance assistant that unifies video anomaly prediction, detection, and analysis (VAPDA) within a single framework. AssistPDA enables real-time inference on streaming videos while supporting interactive user engagement. Notably, we introduce a novel event-level anomaly prediction task, enabling proactive anomaly forecasting before anomalies fully unfold. To enhance the ability to model intricate spatiotemporal relationships in anomaly events, we propose a Spatio-Temporal Relation Distillation (STRD) module. STRD transfers the long-term spatiotemporal modeling capabilities of vision-language models (VLMs) from offline settings to real-time scenarios. Thus it equips AssistPDA with a robust understanding of complex temporal dependencies and long-sequence memory. Additionally, we construct VAPDA-127K, the first large-scale benchmark designed for VLM-based online VAPDA. Extensive experiments demonstrate that AssistPDA outperforms existing offline VLM-based approaches, setting a new state-of-the-art for real-time VAPDA. Our dataset and code will be open-sourced to facilitate further research in the community.

Comment: Matches criterion 3 as it introduces a new framework (AssistPDA) for video anomaly prediction, detection, and analysis, focusing on real-time applications. Relevance: 7 Novelty: 7


16. Learning to Instruct for Visual Instruction Tuning
ArXiv ID: 2503.22215 Authors: Zhihan Zhou, Feng Hong, Jiaan Luo, Jiangchao Yao, Dongsheng Li, Bo Han, Ya Zhang, Yanfeng Wang

Abstract: We propose LIT, an advancement of visual instruction tuning (VIT). While VIT equips Multimodal LLMs (MLLMs) with promising multimodal capabilities, the current design choices for VIT often result in overfitting and shortcut learning, potentially degrading performance. This gap arises from an overemphasis on instruction-following abilities, while neglecting the proactive understanding of visual information. Inspired by this, LIT adopts a simple yet effective approach by incorporating the loss function into both the instruction and response sequences. It seamlessly expands the training data, and regularizes the MLLMs from overly relying on language priors. Based on this merit, LIT achieves a significant relative improvement of up to 9% on comprehensive multimodal benchmarks, requiring no additional training data and incurring negligible computational overhead. Surprisingly, LIT attains exceptional fundamental visual capabilities, yielding up to an 18% improvement in captioning performance, while simultaneously alleviating hallucination in MLLMs.
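
The core idea can be sketched directly from the abstract: instead of masking instruction tokens out of the next-token loss, keep them, so the model is also supervised to predict the instruction from the (visual) context. The formulation below is an assumed reading of that description, not the official implementation.

```python
import torch
import torch.nn.functional as F

def lit_style_loss(logits, input_ids, instruction_mask):
    """Next-token loss over both instruction and response tokens (sketch).

    logits: (B, L, V) language-model logits; input_ids: (B, L) token ids;
    instruction_mask: (B, L) bool, True where a token belongs to the instruction.
    Conventional visual instruction tuning keeps only the response term below.
    """
    shift_logits = logits[:, :-1].reshape(-1, logits.size(-1))
    shift_labels = input_ids[:, 1:].reshape(-1)
    token_loss = F.cross_entropy(shift_logits, shift_labels, reduction="none")

    inst = instruction_mask[:, 1:].reshape(-1).float()
    resp = 1.0 - inst
    response_loss = (token_loss * resp).sum() / resp.sum().clamp(min=1)
    instruction_loss = (token_loss * inst).sum() / inst.sum().clamp(min=1)
    return response_loss + instruction_loss

B, L, V = 2, 16, 100
loss = lit_style_loss(torch.randn(B, L, V), torch.randint(0, V, (B, L)),
                      torch.arange(L).expand(B, L) < 6)  # first 6 tokens act as the instruction
```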

Comment: Matches criterion 2 as it proposes advancements in visual instruction tuning for MLLMs, improving multimodal capabilities. Relevance: 8 Novelty: 6


17. Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis
ArXiv ID: 2503.22420 Authors: Jiangyong Huang, Baoxiong Jia, Yan Wang, Ziyu Zhu, Xiongkun Linghu, Qing Li, Song-Chun Zhu, Siyuan Huang

Abstract: Existing 3D vision-language (3D-VL) benchmarks fall short in evaluating 3D-VL models, creating a "mist" that obscures rigorous insights into model capabilities and 3D-VL tasks. This mist persists due to three key limitations. First, flawed test data, like ambiguous referential text in the grounding task, can yield incorrect and unreliable test results. Second, oversimplified metrics such as simply averaging accuracy per question answering (QA) pair, cannot reveal true model capability due to their vulnerability to language variations. Third, existing benchmarks isolate the grounding and QA tasks, disregarding the underlying coherence that QA should be based on solid grounding capabilities. To unveil the "mist", we propose Beacon3D, a benchmark for 3D-VL grounding and QA tasks, delivering a perspective shift in the evaluation of 3D-VL understanding. Beacon3D features (i) high-quality test data with precise and natural language, (ii) object-centric evaluation with multiple tests per object to ensure robustness, and (iii) a novel chain-of-analysis paradigm to address language robustness and model performance coherence across grounding and QA. Our evaluation of state-of-the-art 3D-VL models on Beacon3D reveals that (i) object-centric evaluation elicits true model performance and particularly weak generalization in QA; (ii) grounding-QA coherence remains fragile in current 3D-VL models, and (iii) incorporating large language models (LLMs) to 3D-VL models, though as a prevalent practice, hinders grounding capabilities and has yet to elevate QA capabilities. We hope Beacon3D and our comprehensive analysis could benefit the 3D-VL community towards faithful developments.

Comment: Matches criterion 3 as it introduces a new benchmark (Beacon3D) for 3D vision-language understanding with novel evaluation methods. Relevance: 7 Novelty: 7


18. A Dataset for Semantic Segmentation in the Presence of Unknowns
ArXiv ID: 2503.22309 Authors: Zakaria Laskar, Tomas Vojir, Matej Grcic, Iaroslav Melekhov, Shankar Gangisettye, Juho Kannala, Jiri Matas, Giorgos Tolias, C. V. Jawahar

Abstract: Before deployment in the real world, deep neural networks require thorough evaluation of how they handle both knowns, inputs represented in the training data, and unknowns (anomalies). This is especially important for scene understanding tasks with safety critical applications, such as in autonomous driving. Existing datasets allow evaluation of only knowns or unknowns - but not both, which is required to establish "in the wild" suitability of deep neural network models. To bridge this gap, we propose a novel anomaly segmentation dataset, ISSU, that features a diverse set of anomaly inputs from cluttered real-world environments. The dataset is twice as large as existing anomaly segmentation datasets, and provides a training, validation and test set for controlled in-domain evaluation. The test set consists of a static and temporal part, with the latter comprised of videos. The dataset provides annotations for both closed-set (knowns) and anomalies, enabling closed-set and open-set evaluation. The dataset covers diverse conditions, such as domain and cross-sensor shift, illumination variation and allows ablation of anomaly detection methods with respect to these variations. Evaluation results of current state-of-the-art methods confirm the need for improvements especially in domain-generalization, small and large object segmentation.

Comment: Matches criterion 3 as it introduces a new dataset for semantic segmentation in the presence of unknowns, focusing on real-world evaluation. Relevance: 7 Novelty: 7


19. Mono2Stereo: A Benchmark and Empirical Study for Stereo Conversion
ArXiv ID: 2503.22262 Authors: Songsong Yu, Yuxin Chen, Zhongang Qi, Zeke Xie, Yifan Wang, Lijun Wang, Ying Shan, Huchuan Lu

Abstract: With the rapid proliferation of 3D devices and the shortage of 3D content, stereo conversion is attracting increasing attention. Recent works introduce pretrained Diffusion Models (DMs) into this task. However, due to the scarcity of large-scale training data and comprehensive benchmarks, the optimal methodologies for employing DMs in stereo conversion and the accurate evaluation of stereo effects remain largely unexplored. In this work, we introduce the Mono2Stereo dataset, providing high-quality training data and benchmark to support in-depth exploration of stereo conversion. With this dataset, we conduct an empirical study that yields two primary findings. 1) The differences between the left and right views are subtle, yet existing metrics consider overall pixels, failing to concentrate on regions critical to stereo effects. 2) Mainstream methods adopt either one-stage left-to-right generation or warp-and-inpaint pipeline, facing challenges of degraded stereo effect and image distortion respectively. Based on these findings, we introduce a new evaluation metric, Stereo Intersection-over-Union, which prioritizes disparity and achieves a high correlation with human judgments on stereo effect. Moreover, we propose a strong baseline model, harmonizing the stereo effect and image quality simultaneously, and notably surpassing current mainstream methods. Our code and data will be open-sourced to promote further research in stereo conversion. Our models are available at mono2stereo-bench.github.io.

Comment: Matches criterion 3 as it introduces a new benchmark and evaluation metric for stereo conversion, focusing on novel angles. Relevance: 7 Novelty: 7


20. NeRF-based Point Cloud Reconstruction using a Stationary Camera for Agricultural Applications
ArXiv ID: 2503.21958 Authors: Kibon Ku, Talukder Z Jubery, Elijah Rodriguez, Aditya Balu, Soumik Sarkar, Adarsh Krishnamurthy, Baskar Ganapathysubramanian

Abstract: This paper presents a NeRF-based framework for point cloud (PCD) reconstruction, specifically designed for indoor high-throughput plant phenotyping facilities. Traditional NeRF-based reconstruction methods require cameras to move around stationary objects, but this approach is impractical for high-throughput environments where objects are rapidly imaged while moving on conveyors or rotating pedestals. To address this limitation, we develop a variant of NeRF-based PCD reconstruction that uses a single stationary camera to capture images as the object rotates on a pedestal. Our workflow comprises COLMAP-based pose estimation, a straightforward pose transformation to simulate camera movement, and subsequent standard NeRF training. A defined Region of Interest (ROI) excludes irrelevant scene data, enabling the generation of high-resolution point clouds (10M points). Experimental results demonstrate excellent reconstruction fidelity, with precision-recall analyses yielding an F-score close to 100.00 across all evaluated plant objects. Although pose estimation remains computationally intensive with a stationary camera setup, overall training and reconstruction times are competitive, validating the method's feasibility for practical high-throughput indoor phenotyping applications. Our findings indicate that high-quality NeRF-based 3D reconstructions are achievable using a stationary camera, eliminating the need for complex camera motion or costly imaging equipment. This approach is especially beneficial when employing expensive and delicate instruments, such as hyperspectral cameras, for 3D plant phenotyping. Future work will focus on optimizing pose estimation techniques and further streamlining the methodology to facilitate seamless integration into automated, high-throughput 3D phenotyping pipelines.
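
The key geometric step is the pose transformation: images of an object spinning on a pedestal in front of a fixed camera are equivalent to images of a fixed object taken by a camera orbiting the pedestal axis. The sketch below only illustrates that equivalence; the paper's actual workflow estimates poses with COLMAP.

```python
import numpy as np

def equivalent_camera_pose(cam_to_world, theta, axis_point=np.zeros(3)):
    """Turn "object rotates by +theta, camera fixed" into "camera orbits, object fixed".

    cam_to_world: 4x4 camera-to-world matrix of the stationary camera.
    The same images are explained by a camera rotated by -theta about the
    vertical (z) axis passing through axis_point.
    """
    c, s = np.cos(-theta), np.sin(-theta)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = axis_point - R @ axis_point   # rotate about axis_point rather than the origin
    return T @ cam_to_world

# One virtual camera pose per captured frame, e.g. 72 frames over a full turn.
poses = [equivalent_camera_pose(np.eye(4), 2 * np.pi * i / 72) for i in range(72)]
```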

Comment: Matches criterion 3 as it focuses on a novel NeRF-based method for point cloud reconstruction using a stationary camera, which is a unique angle for embodied AI and simulator-related benchmarks. Relevance: 6 Novelty: 7


21. Intrinsic Image Decomposition for Robust Self-supervised Monocular Depth Estimation on Reflective Surfaces
ArXiv ID: 2503.22209 Authors: Wonhyeok Choi, Kyumin Hwang, Minwoo Choi, Kiljoon Han, Wonjoon Choi, Mingyu Shin, Sunghoon Im

Abstract: Self-supervised monocular depth estimation (SSMDE) has gained attention in the field of deep learning as it estimates depth without requiring ground truth depth maps. This approach typically uses a photometric consistency loss between a synthesized image, generated from the estimated depth, and the original image, thereby reducing the need for extensive dataset acquisition. However, the conventional photometric consistency loss relies on the Lambertian assumption, which often leads to significant errors when dealing with reflective surfaces that deviate from this model. To address this limitation, we propose a novel framework that incorporates intrinsic image decomposition into SSMDE. Our method synergistically trains for both monocular depth estimation and intrinsic image decomposition. The accurate depth estimation facilitates multi-image consistency for intrinsic image decomposition by aligning different view coordinate systems, while the decomposition process identifies reflective areas and excludes corrupted gradients from the depth training process. Furthermore, our framework introduces a pseudo-depth generation and knowledge distillation technique to further enhance the performance of the student model across both reflective and non-reflective surfaces. Comprehensive evaluations on multiple datasets show that our approach significantly outperforms existing SSMDE baselines in depth prediction, especially on reflective surfaces.
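
For context, the conventional photometric consistency loss whose Lambertian assumption fails on reflective surfaces looks roughly like the sketch below, here with an extra mask that drops pixels flagged as reflective, in the spirit of excluding corrupted gradients as described in the abstract. The 3x3 SSIM approximation and the masking details are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def masked_photometric_loss(synth, target, reflective_mask, alpha=0.85):
    """SSIM + L1 photometric loss with reflective pixels excluded (sketch).

    synth:  image synthesized by warping with the predicted depth, (B, 3, H, W)
    target: original image, (B, 3, H, W)
    reflective_mask: (B, 1, H, W) in [0, 1], 1 where a pixel is flagged reflective
    and should not supervise depth.
    """
    l1 = (synth - target).abs().mean(1, keepdim=True)
    mu_x, mu_y = F.avg_pool2d(synth, 3, 1, 1), F.avg_pool2d(target, 3, 1, 1)
    sig_x = F.avg_pool2d(synth ** 2, 3, 1, 1) - mu_x ** 2
    sig_y = F.avg_pool2d(target ** 2, 3, 1, 1) - mu_y ** 2
    sig_xy = F.avg_pool2d(synth * target, 3, 1, 1) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + 1e-4) * (2 * sig_xy + 9e-4)) / (
        (mu_x ** 2 + mu_y ** 2 + 1e-4) * (sig_x + sig_y + 9e-4))
    ssim_term = ((1 - ssim.mean(1, keepdim=True)) / 2).clamp(0, 1)
    per_pixel = alpha * ssim_term + (1 - alpha) * l1
    valid = 1.0 - reflective_mask
    return (per_pixel * valid).sum() / valid.sum().clamp(min=1)
```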

Comment: Matches criterion 1 as it proposes a novel framework for improving spatial understanding in monocular depth estimation, particularly addressing reflective surfaces. Relevance: 7 Novelty: 6


22. RUNA: Object-level Out-of-Distribution Detection via Regional Uncertainty Alignment of Multimodal Representations
ArXiv ID: 2503.22285 Authors: Bin Zhang, Jinggang Chen, Xiaoyang Qu, Guokuan Li, Kai Lu, Jiguang Wan, Jing Xiao, Jianzong Wang

Abstract: Enabling object detectors to recognize out-of-distribution (OOD) objects is vital for building reliable systems. A primary obstacle stems from the fact that models frequently do not receive supervisory signals from unfamiliar data, leading to overly confident predictions regarding OOD objects. Despite previous progress that estimates OOD uncertainty based on the detection model and in-distribution (ID) samples, we explore using pre-trained vision-language representations for object-level OOD detection. We first discuss the limitations of applying image-level CLIP-based OOD detection methods to object-level scenarios. Building upon these insights, we propose RUNA, a novel framework that leverages a dual encoder architecture to capture rich contextual information and employs a regional uncertainty alignment mechanism to distinguish ID from OOD objects effectively. We introduce a few-shot fine-tuning approach that aligns region-level semantic representations to further improve the model's capability to discriminate between similar objects. Our experiments show that RUNA substantially surpasses state-of-the-art methods in object-level OOD detection, particularly in challenging scenarios with diverse and complex object instances.

Comment: Matches criterion 2 as it leverages vision-language representations for object-level out-of-distribution detection, which is relevant to multi-modal large language models. Relevance: 7 Novelty: 6


23. Skip-Vision: A Comprehensive Framework for Accelerating Vision-Language Models
ArXiv ID: 2503.21817 Authors: Weili Zeng, Ziyuan Huang, Kaixiang Ji, Yichao Yan

Abstract: Transformer-based models have driven significant advancements in Multimodal Large Language Models (MLLMs), yet their computational costs surge drastically when scaling resolution, training data, and model parameters. A key bottleneck stems from the proliferation of visual tokens required for fine-grained image understanding. We propose Skip-Vision, a unified framework addressing both training and inference inefficiencies in vision-language models. On top of conventional token compression approaches, our method introduces two complementary acceleration strategies. For training acceleration, we observe that Feed-Forward Network (FFN) computations on visual tokens induce marginal feature updates. This motivates our Skip-FFN strategy, which bypasses FFN layers for redundant visual tokens. For inference acceleration, we design a selective KV-cache removal mechanism that prunes the skipped key-value pairs during decoding while preserving model performance. Experimental results demonstrate that Skip-Vision reduces training time by up to 35%, inference FLOPs by 75%, and latency by 45%, while achieving comparable or superior performance to existing methods. Our work provides a practical solution for scaling high-performance MLLMs with enhanced efficiency.
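
A toy sketch of the Skip-FFN idea: attention still covers all tokens, but only non-redundant visual tokens are routed through the feed-forward network. This is illustrative only; the actual framework also compresses tokens and prunes the corresponding KV-cache entries at decoding time, which is not shown here.

```python
import torch
import torch.nn as nn

class SkipFFNBlock(nn.Module):
    """Transformer block that bypasses the FFN for tokens marked as redundant."""

    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x, skip_mask):
        # x: (B, L, D); skip_mask: (B, L) bool, True where the FFN is skipped.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        keep = ~skip_mask                      # only these tokens pass through the FFN
        ffn_out = torch.zeros_like(x)
        ffn_out[keep] = self.ffn(self.norm2(x)[keep])
        return x + ffn_out

block = SkipFFNBlock()
visual_tokens = torch.randn(2, 196, 768)
redundant = torch.rand(2, 196) > 0.5           # placeholder redundancy criterion
out = block(visual_tokens, redundant)
```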

Comment: Matches criterion 2 as it proposes a framework to accelerate vision-language models, which are a type of multi-modal large language model. Relevance: 7 Novelty: 6


24. Breaking Language Barriers in Visual Language Models via Multilingual Textual Regularization
ArXiv ID: 2503.22577 Authors: Iñigo Pikabea, Iñaki Lacunza, Oriol Pareras, Carlos Escolano, Aitor Gonzalez-Agirre, Javier Hernando, Marta Villegas

Abstract: Rapid advancements in Visual Language Models (VLMs) have transformed multimodal understanding but are often constrained by generating English responses regardless of the input language. This phenomenon has been termed as Image-induced Fidelity Loss (IFL) and stems from limited multimodal multilingual training data. To address this, we propose a continuous multilingual integration strategy that injects text-only multilingual data during visual instruction tuning, preserving the language model's original multilingual capabilities. Extensive evaluations demonstrate that our approach significantly improves linguistic fidelity across languages without degradation in visual performance. We also explore model merging, which improves language fidelity but comes at the cost of visual performance. In contrast, our core method achieves robust multilingual alignment without trade-offs, offering a scalable and effective path to mitigating IFL for global VLM adoption.

Comment: Matches criterion 2 as it addresses multilingual capabilities in visual language models, which is a novel angle in VLLMs. Relevance: 7 Novelty: 6


25. ActionStudio: A Lightweight Framework for Data and Training of Action Models
ArXiv ID: 2503.22673 Authors: Jianguo Zhang, Thai Hoang, Ming Zhu, Zuxin Liu, Shiyu Wang, Tulika Awalgaonkar, Akshara Prabhakar, Haolin Chen, Weiran Yao, Zhiwei Liu, Juntao Tan, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong

Abstract: Action models are essential for enabling autonomous agents to perform complex tasks. However, training large action models remains challenging due to the diversity of agent environments and the complexity of agentic data. Despite growing interest, existing infrastructure provides limited support for scalable, agent-specific fine-tuning. We present ActionStudio, a lightweight and extensible data and training framework designed for action models. ActionStudio unifies heterogeneous agent trajectories through a standardized format, supports diverse training paradigms including LoRA, full fine-tuning, and distributed setups, and integrates robust preprocessing and verification tools. We validate its effectiveness across both public and realistic industry benchmarks, demonstrating strong performance and practical scalability. We open-sourced code and data at https://github.com/SalesforceAIResearch/xLAM to facilitate research in the community.
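
As a concrete picture of what unifying heterogeneous agent trajectories through a standardized format can look like, here is a hypothetical trajectory record; the field names are invented for illustration and are not ActionStudio's actual schema.

```python
# Hypothetical unified trajectory record (illustrative only).
trajectory = {
    "task": "Book a table for two at 7pm",
    "steps": [
        {"role": "user", "content": "Find an Italian restaurant nearby."},
        {"role": "assistant", "content": None,
         "tool_call": {"name": "search_restaurants",
                       "arguments": {"cuisine": "italian", "radius_km": 2}}},
        {"role": "tool", "name": "search_restaurants",
         "content": "[{\"name\": \"Trattoria Roma\", \"rating\": 4.6}]"},
        {"role": "assistant", "content": "I found Trattoria Roma (4.6 stars)."},
    ],
    "success": True,   # trajectory-level signal used for filtering and training
}
```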

Comment: Matches criterion 3 as it introduces a lightweight framework (ActionStudio) for training action models, focusing on scalable and agent-specific fine-tuning. Relevance: 7 Novelty: 6


26. VisTa: Visual-contextual and Text-augmented Zero-shot Object-level OOD Detection
ArXiv ID: 2503.22291 Authors: Bin Zhang, Xiaoyang Qu, Guokuan Li, Jiguang Wan, Jianzong Wang

Abstract: As object detectors are increasingly deployed as black-box cloud services or pre-trained models with restricted access to the original training data, the challenge of zero-shot object-level out-of-distribution (OOD) detection arises. This task becomes crucial in ensuring the reliability of detectors in open-world settings. While existing methods have demonstrated success in image-level OOD detection using pre-trained vision-language models like CLIP, directly applying such models to object-level OOD detection presents challenges due to the loss of contextual information and reliance on image-level alignment. To tackle these challenges, we introduce a new method that leverages visual prompts and text-augmented in-distribution (ID) space construction to adapt CLIP for zero-shot object-level OOD detection. Our method preserves critical contextual information and improves the ability to differentiate between ID and OOD objects, achieving competitive performance across different benchmarks.
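
For orientation, a generic sketch of CLIP-style zero-shot OOD scoring at the object level: embed a (visually prompted) region crop, compare it with text embeddings of the in-distribution classes, and treat a low maximum similarity as evidence the object is OOD. This simplified baseline is not VisTa's full visual-prompt and text-augmented ID-space construction.

```python
import torch
import torch.nn.functional as F

def object_ood_score(region_embed, id_text_embeds, temperature=0.07):
    """Maximum softmax probability over in-distribution classes (sketch).

    region_embed:   (D,) image embedding of a cropped or visually prompted region.
    id_text_embeds: (K, D) text embeddings of the K in-distribution class names.
    Low scores suggest the detected object is out-of-distribution.
    """
    region = F.normalize(region_embed, dim=-1)
    texts = F.normalize(id_text_embeds, dim=-1)
    sims = region @ texts.T / temperature       # temperature-scaled cosine similarities
    return sims.softmax(dim=-1).max().item()

score = object_ood_score(torch.randn(512), torch.randn(20, 512))
is_ood = score < 0.5   # the threshold is a tunable assumption
```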

Comment: Matches criterion 2 as it leverages vision-language models (CLIP) for zero-shot object-level OOD detection with novel adaptations. Relevance: 7 Novelty: 6


27. EndoLRMGS: Complete Endoscopic Scene Reconstruction combining Large Reconstruction Modelling and Gaussian Splatting
ArXiv ID: 2503.22437 Authors: Xu Wang, Shuai Zhang, Baoru Huang, Danail Stoyanov, Evangelos B. Mazomenos

Abstract: Complete reconstruction of surgical scenes is crucial for robot-assisted surgery (RAS). Deep depth estimation is promising but existing works struggle with depth discontinuities, resulting in noisy predictions at object boundaries and do not achieve complete reconstruction omitting occluded surfaces. To address these issues we propose EndoLRMGS, which combines Large Reconstruction Modelling (LRM) and Gaussian Splatting (GS), for complete surgical scene reconstruction. GS reconstructs deformable tissues and LRM generates 3D models for surgical tools while position and scale are subsequently optimized by introducing orthogonal perspective joint projection optimization (OPjPO) to enhance accuracy. In experiments on four surgical videos from three public datasets, our method improves the Intersection-over-union (IoU) of tool 3D models in 2D projections by >40%. Additionally, EndoLRMGS improves the PSNR of the tools projection from 3.82% to 11.07%. Tissue rendering quality also improves, with PSNR increasing from 0.46% to 49.87%, and SSIM from 1.53% to 29.21% across all test videos.

Comment: Matches criterion 3 as it focuses on a novel method for complete surgical scene reconstruction, addressing previously ignored challenges in depth discontinuities and occluded surfaces. Relevance: 6 Novelty: 6


28. Concept-Aware LoRA for Domain-Aligned Segmentation Dataset Generation
ArXiv ID: 2503.22172 Authors: Minho Park, Sunghyun Park, Jungsoo Lee, Hyojin Park, Kyuwoong Hwang, Fatih Porikli, Jaegul Choo, Sungha Choi

Abstract: This paper addresses the challenge of data scarcity in semantic segmentation by generating datasets through text-to-image (T2I) generation models, reducing image acquisition and labeling costs. Segmentation dataset generation faces two key challenges: 1) aligning generated samples with the target domain and 2) producing informative samples beyond the training data. Fine-tuning T2I models can help generate samples aligned with the target domain. However, it often overfits and memorizes training data, limiting their ability to generate diverse and well-aligned samples. To overcome these issues, we propose Concept-Aware LoRA (CA-LoRA), a novel fine-tuning approach that selectively identifies and updates only the weights associated with necessary concepts (e.g., style or viewpoint) for domain alignment while preserving the pretrained knowledge of the T2I model to produce informative samples. We demonstrate its effectiveness in generating datasets for urban-scene segmentation, outperforming baseline and state-of-the-art methods in in-domain (few-shot and fully-supervised) settings, as well as in domain generalization tasks, especially under challenging conditions such as adverse weather and varying illumination, further highlighting its superiority.

Comment: Matches criterion 4 as it focuses on generating segmentation datasets using text-to-image models, which relates to vision foundation models and their applications. Relevance: 5 Novelty: 6


29. A Survey on Remote Sensing Foundation Models: From Vision to Multimodality
ArXiv ID: 2503.22081 Authors: Ziyue Huang, Hongxi Yan, Qiqi Zhan, Shuai Yang, Mingming Zhang, Chenkai Zhang, YiMing Lei, Zeming Liu, Qingjie Liu, Yunhong Wang

Abstract: The rapid advancement of remote sensing foundation models, particularly vision and multimodal models, has significantly enhanced the capabilities of intelligent geospatial data interpretation. These models combine various data modalities, such as optical, radar, and LiDAR imagery, with textual and geographic information, enabling more comprehensive analysis and understanding of remote sensing data. The integration of multiple modalities allows for improved performance in tasks like object detection, land cover classification, and change detection, which are often challenged by the complex and heterogeneous nature of remote sensing data. However, despite these advancements, several challenges remain. The diversity in data types, the need for large-scale annotated datasets, and the complexity of multimodal fusion techniques pose significant obstacles to the effective deployment of these models. Moreover, the computational demands of training and fine-tuning multimodal models require significant resources, further complicating their practical application in remote sensing image interpretation tasks. This paper provides a comprehensive review of the state-of-the-art in vision and multimodal foundation models for remote sensing, focusing on their architecture, training methods, datasets and application scenarios. We discuss the key challenges these models face, such as data alignment, cross-modal transfer learning, and scalability, while also identifying emerging research directions aimed at overcoming these limitations. Our goal is to provide a clear understanding of the current landscape of remote sensing foundation models and inspire future research that can push the boundaries of what these models can achieve in real-world applications. The list of resources collected by the paper can be found in the https://github.com/IRIP-BUAA/A-Review-for-remote-sensing-vision-language-models.

Comment: Matches criterion 4 as it provides a comprehensive review of vision and multimodal foundation models for remote sensing. Relevance: 6 Novelty: 5


ArXiv ID: 2503.21907 Authors: Oliver Heinimann, Assaf Shocher, Tal Zimbalist, Michal Irani

Abstract: Traditional super-resolution (SR) methods assume an "ideal" downscaling SR-kernel (e.g., bicubic downscaling) between the high-resolution (HR) image and the low-resolution (LR) image. Such methods fail once the LR images are generated differently. Current blind-SR methods aim to remove this assumption, but are still fundamentally restricted to rather simplistic downscaling SR-kernels (e.g., anisotropic Gaussian kernels), and fail on more complex (out of distribution) downscaling degradations. However, using the correct SR-kernel is often more important than using a sophisticated SR algorithm. In "KernelFusion" we introduce a zero-shot diffusion-based method that makes no assumptions about the kernel. Our method recovers the unique image-specific SR-kernel directly from the LR input image, while simultaneously recovering its corresponding HR image. KernelFusion exploits the principle that the correct SR-kernel is the one that maximizes patch similarity across different scales of the LR image. We first train an image-specific patch-based diffusion model on the single LR input image, capturing its unique internal patch statistics. We then reconstruct a larger HR image with the same learned patch distribution, while simultaneously recovering the correct downscaling SR-kernel that maintains this cross-scale relation between the HR and LR images. Empirical results show that KernelFusion vastly outperforms all SR baselines on complex downscaling degradations, where existing SotA Blind-SR methods fail miserably. By breaking free from predefined kernel assumptions, KernelFusion pushes Blind-SR into a new assumption-free paradigm, handling downscaling kernels previously thought impossible.

Comment: Does not closely match any specific criterion but is generally relevant to generative modeling and super-resolution tasks. Relevance: 3 Novelty: 6
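
The cross-scale principle quoted in the abstract (the correct SR-kernel is the one that maximizes patch similarity across scales of the LR image) can be illustrated with a simple scoring function. The sketch below is not the paper's diffusion-based method, just a hedged stand-in that scores a candidate kernel by how well patches of the kernel-downscaled image match patches of the original LR image.

    import torch
    import torch.nn.functional as F

    def cross_scale_patch_score(lr, kernel, patch=7, stride=3):
        """Score a candidate SR-kernel by cross-scale patch similarity:
        downscale the LR image with the kernel (stride-2 depthwise conv),
        then match each patch of the downscaled image to its most similar
        patch in the original LR image. Higher scores mean the kernel better
        preserves the image's internal patch statistics."""
        c = lr.shape[1]
        k = kernel.expand(c, 1, *kernel.shape[-2:])
        small = F.conv2d(lr, k, stride=2, padding=kernel.shape[-1] // 2, groups=c)

        def patches(img):
            p = F.unfold(img, patch, stride=stride)        # (B, C*patch*patch, N)
            return F.normalize(p.transpose(1, 2), dim=-1)  # (B, N, C*patch*patch)

        q, ref = patches(small), patches(lr)
        sim = torch.bmm(q, ref.transpose(1, 2))            # cosine similarities
        return sim.max(dim=-1).values.mean()               # best match per patch

    # Usage: score a simple isotropic Gaussian kernel on a random "LR image".
    lr = torch.rand(1, 3, 64, 64)
    g = torch.exp(-(torch.arange(5) - 2.0) ** 2 / 2.0)
    iso = g[:, None] * g[None, :]
    iso = (iso / iso.sum())[None, None]                    # shape (1, 1, 5, 5)
    print(cross_scale_patch_score(lr, iso))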


ArXiv ID: 2503.21943 Authors: Haoming Cai, Tsung-Wei Huang, Shiv Gehlot, Brandon Y. Feng, Sachin Shah, Guan-Ming Su, Christopher Metzler

Abstract: Text-to-image diffusion models excel at generating diverse portraits, but lack intuitive shadow control. Existing editing approaches, as post-processing, struggle to offer effective manipulation across diverse styles. Additionally, these methods either rely on expensive real-world light-stage data collection or require extensive computational resources for training. To address these limitations, we introduce Shadow Director, a method that extracts and manipulates hidden shadow attributes within well-trained diffusion models. Our approach uses a small estimation network that requires only a few thousand synthetic images and hours of training, with no costly real-world light-stage data needed. Shadow Director enables parametric and intuitive control over shadow shape, placement, and intensity during portrait generation while preserving artistic integrity and identity across diverse styles. Despite training only on synthetic data built on real-world identities, it generalizes effectively to generated portraits with diverse styles, making it a more accessible and resource-friendly solution.

Comment: Does not match any specific criterion but is tangentially related to generative modeling in multi-modal learning. Relevance: 3 Novelty: 6


ArXiv ID: 2503.21824 Authors: Haitong Liu, Kuofeng Gao, Yang Bai, Jinmin Li, Jinxiao Shan, Tao Dai, Shu-Tao Xia

Abstract: Recently, video-based large language models (video-based LLMs) have achieved impressive performance across various video comprehension tasks. However, this rapid advancement raises significant privacy and security concerns, particularly regarding the unauthorized use of personal video data in automated annotation by video-based LLMs. These unauthorized annotated video-text pairs can then be used to improve the performance of downstream tasks, such as text-to-video generation. To safeguard personal videos from unauthorized use, we propose two series of protective video watermarks with imperceptible adversarial perturbations, named Ramblings and Mutes. Concretely, Ramblings aim to mislead video-based LLMs into generating inaccurate captions for the videos, thereby degrading the quality of video annotations through inconsistencies between video content and captions. Mutes, on the other hand, are designed to prompt video-based LLMs to produce exceptionally brief captions, lacking descriptive detail. Extensive experiments demonstrate that our video watermarking methods effectively protect video data by significantly reducing video annotation performance across various video-based LLMs, showcasing both stealthiness and robustness in protecting personal video content. Our code is available at https://github.com/ttthhl/Protecting_Your_Video_Content.

Comment: Does not match any specific criterion but is tangentially related to vision-language models due to its focus on video-based LLMs. Relevance: 3 Novelty: 6


ArXiv ID: 2503.22677 Authors: Ruining Li, Chuanxia Zheng, Christian Rupprecht, Andrea Vedaldi

Abstract: Most 3D object generators focus on aesthetic quality, often neglecting physical constraints necessary in applications. One such constraint is that the 3D object should be self-supporting, i.e., remains balanced under gravity. Prior approaches to generating stable 3D objects used differentiable physics simulators to optimize geometry at test-time, which is slow, unstable, and prone to local optima. Inspired by the literature on aligning generative models to external feedback, we propose Direct Simulation Optimization (DSO), a framework to use the feedback from a (non-differentiable) simulator to increase the likelihood that the 3D generator outputs stable 3D objects directly. We construct a dataset of 3D objects labeled with a stability score obtained from the physics simulator. We can then fine-tune the 3D generator using the stability score as the alignment metric, via direct preference optimization (DPO) or direct reward optimization (DRO), a novel objective, which we introduce, to align diffusion models without requiring pairwise preferences. Our experiments show that the fine-tuned feed-forward generator, using either DPO or DRO objective, is much faster and more likely to produce stable objects than test-time optimization. Notably, the DSO framework works even without any ground-truth 3D objects for training, allowing the 3D generator to self-improve by automatically collecting simulation feedback on its own outputs.

Comment: Does not match any specific criterion but is related to generative modeling and simulation feedback, which aligns with your friend's general interest. Relevance: 3 Novelty: 6
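
As a rough illustration of the alignment step, the sketch below converts simulator stability scores into a pairwise preference and applies the standard DPO loss. The paper's DRO objective, which avoids pairwise preferences, is not reproduced because its exact form is not given in the abstract, and all tensor names here are hypothetical.

    import torch
    import torch.nn.functional as F

    def dpo_loss_from_stability(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
        """Standard DPO loss where the 'winner'/'loser' of a pair is the
        generated 3D object with the higher simulator stability score."""
        margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
        return -F.logsigmoid(beta * margin).mean()

    # Toy usage with hypothetical log-likelihoods from the 3D generator
    # (policy) and its frozen reference copy for a pair of samples.
    stability = torch.tensor([0.9, 0.2])       # simulator stability scores
    logp = torch.tensor([-12.0, -11.5])        # policy log-probs of the samples
    ref = torch.tensor([-12.2, -11.4])         # reference log-probs
    w, l = (0, 1) if stability[0] > stability[1] else (1, 0)
    print(dpo_loss_from_stability(logp[w], logp[l], ref[w], ref[l]))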


ArXiv ID: 2503.22359 Authors: Jiahao Xia, Min Xu, Wenjian Huang, Jianguo Zhang, Haimin Zhang, Chunxia Xiao

Abstract: Despite the similar structures of human faces, existing face alignment methods cannot learn unified knowledge from multiple datasets with different landmark annotations. The limited training samples in a single dataset commonly result in fragile robustness in this field. To mitigate knowledge discrepancies among different datasets and train a task-agnostic unified face alignment (TUFA) framework, this paper presents a strategy to unify knowledge from multiple datasets. Specifically, we calculate a mean face shape for each dataset. To explicitly align these mean shapes on an interpretable plane based on their semantics, each shape is then incorporated with a group of semantic alignment embeddings. The 2D coordinates of these aligned shapes can be viewed as the anchors of the plane. By encoding them into structure prompts and further regressing the corresponding facial landmarks using image features, a mapping from the plane to the target faces is finally established, which unifies the learning target of different datasets. Consequently, multiple datasets can be utilized to boost the generalization ability of the model. The successful mitigation of discrepancies also enhances the efficiency of knowledge transfer to a novel dataset, significantly boosting the performance of few-shot face alignment. Additionally, the interpretable plane endows TUFA with a task-agnostic characteristic, enabling it to locate landmarks unseen during training in a zero-shot manner. Extensive experiments are carried out on seven benchmarks and the results demonstrate an impressive improvement in face alignment brought by knowledge discrepancy mitigation.

Comment: Does not match any specific criterion but is relevant to general computer vision interests, focusing on face alignment and dataset unification. Relevance: 3 Novelty: 5


ArXiv ID: 2503.21878 Authors: Audrey Huang, Adam Block, Qinghua Liu, Nan Jiang, Dylan J. Foster, Akshay Krishnamurthy

Abstract: Inference-time computation provides an important axis for scaling language model performance, but naively scaling compute through techniques like Best-of-$N$ sampling can cause performance to degrade due to reward hacking. Toward a theoretical understanding of how to best leverage additional computation, we focus on inference-time alignment, which we formalize as the problem of improving a pre-trained policy's responses for a prompt of interest, given access to an imperfect reward model. We analyze the performance of inference-time alignment algorithms in terms of (i) response quality, and (ii) compute, and provide new results that highlight the importance of the pre-trained policy's coverage over high-quality responses for performance and compute scaling: 1. We show that Best-of-$N$ alignment with an ideal choice for $N$ can achieve optimal performance under stringent notions of coverage, but provably suffers from reward hacking when $N$ is large, and fails to achieve tight guarantees under more realistic coverage conditions. 2. We introduce $\texttt{InferenceTimePessimism}$, a new algorithm which mitigates reward hacking through deliberate use of inference-time compute, implementing the principle of pessimism in the face of uncertainty via rejection sampling; we prove that its performance is optimal and does not degrade with $N$, meaning it is scaling-monotonic. We complement our theoretical results with an experimental evaluation that demonstrates the benefits of $\texttt{InferenceTimePessimism}$ across a variety of tasks and models.

Comment: Does not closely match any specific criterion but is generally relevant to inference-time optimization and language modeling. Relevance: 3 Novelty: 5
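
For orientation, here is a toy sketch of plain Best-of-N selection alongside a simplified threshold-based rejection-sampling variant; the latter only loosely illustrates the idea of not always chasing the reward argmax and is not the paper's InferenceTimePessimism algorithm.

    import random

    def best_of_n(generate, reward, prompt, n=16):
        """Plain Best-of-N: draw n responses and return the one with the
        highest score under a (possibly imperfect) reward model."""
        samples = [generate(prompt) for _ in range(n)]
        return max(samples, key=reward)

    def capped_rejection_sampling(generate, reward, prompt, threshold, max_tries=64):
        """Accept the first response whose reward clears a fixed cap instead
        of always chasing the argmax; fall back to the best seen so far."""
        best, best_r = None, float("-inf")
        for _ in range(max_tries):
            s = generate(prompt)
            r = reward(s)
            if r >= threshold:
                return s
            if r > best_r:
                best, best_r = s, r
        return best

    # Toy usage with stub generator / reward functions.
    gen = lambda p: p + " " + random.choice(["a", "bb", "ccc", "dddd"])
    rm = len
    print(best_of_n(gen, rm, "answer:", n=8))
    print(capped_rejection_sampling(gen, rm, "answer:", threshold=10))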


ArXiv ID: 2503.22193 Authors: Yang Liu, Feixiang Liu, Jiale Du, Xinbo Gao, Jungong Han

Abstract: Convolutional neural networks and supervised learning have achieved remarkable success in various fields but are limited by the need for large annotated datasets. Few-shot learning (FSL) addresses this limitation by enabling models to generalize from only a few labeled examples. Transductive few-shot learning (TFSL) enhances FSL by leveraging both labeled and unlabeled data, though it faces challenges like the hubness problem. To overcome these limitations, we propose the Unbiased Max-Min Embedding Classification (UMMEC) Method, which addresses the key challenges in few-shot learning through three innovative contributions. First, we introduce a decentralized covariance matrix to mitigate the hubness problem, ensuring a more uniform distribution of embeddings. Second, our method combines local alignment and global uniformity through adaptive weighting and nonlinear transformation, balancing intra-class clustering with inter-class separation. Third, we employ a Variational Sinkhorn Few-Shot Classifier to optimize the distances between samples and class prototypes, enhancing classification accuracy and robustness. These combined innovations allow the UMMEC method to achieve superior performance with minimal labeled data. Our UMMEC method significantly improves classification performance with minimal labeled data, advancing the state-of-the-art in TFSL.

Comment: Does not closely match any specific criterion but is generally relevant to few-shot learning and classification tasks. Relevance: 3 Novelty: 5
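
For context, the sketch below shows the standard Sinkhorn-Knopp balanced assignment of query embeddings to class prototypes, a common building block in transductive few-shot learning; the paper's Variational Sinkhorn Few-Shot Classifier is more elaborate, so treat this only as background.

    import torch
    import torch.nn.functional as F

    def sinkhorn_assign(queries, prototypes, eps=0.05, n_iter=50):
        """Standard Sinkhorn-Knopp balanced soft assignment of query
        embeddings to class prototypes, using a cosine-distance cost.
        Returns a (num_queries, num_classes) matrix whose rows are class
        probabilities."""
        q = F.normalize(queries, dim=1)
        p = F.normalize(prototypes, dim=1)
        cost = 1.0 - q @ p.t()                            # cosine distance in [0, 2]
        K = torch.exp(-cost / eps)                        # Gibbs kernel
        u = torch.full((K.shape[0],), 1.0 / K.shape[0])   # uniform row marginal
        v = torch.full((K.shape[1],), 1.0 / K.shape[1])   # uniform column marginal
        a, b = torch.ones_like(u), torch.ones_like(v)
        for _ in range(n_iter):
            a = u / (K @ b)
            b = v / (K.t() @ a)
        P = a[:, None] * K * b[None, :]
        return P / P.sum(dim=1, keepdim=True)

    # Toy usage: 30 query embeddings, 5 class prototypes.
    P = sinkhorn_assign(torch.randn(30, 64), torch.randn(5, 64))
    print(P.argmax(dim=1))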


ArXiv ID: 2503.22064 Authors: Wanli Ni, Zhijin Qin, Haofeng Sun, Xiaoming Tao, Zhu Han

Abstract: Artificial intelligence (AI) promises to revolutionize the design, optimization and management of next-generation communication systems. In this article, we explore the integration of large AI models (LAMs) into semantic communications (SemCom) by leveraging their multi-modal data processing and generation capabilities. Although LAMs bring unprecedented abilities to extract semantics from raw data, this integration entails multifaceted challenges including high resource demands, model complexity, and the need for adaptability across diverse modalities and tasks. To overcome these challenges, we propose a LAM-based multi-task SemCom (MTSC) architecture, which includes an adaptive model compression strategy and a federated split fine-tuning approach to facilitate the efficient deployment of LAM-based semantic models in resource-limited networks. Furthermore, a retrieval-augmented generation scheme is implemented to synthesize the most recent local and global knowledge bases to enhance the accuracy of semantic extraction and content generation, thereby improving the inference performance. Finally, simulation results demonstrate the efficacy of the proposed LAM-based MTSC architecture, highlighting the performance enhancements across various downstream tasks under varying channel conditions.

Comment: Does not closely match any specific criterion but is generally relevant to multi-modal learning and large models. Relevance: 3 Novelty: 5


ArXiv ID: 2503.22405 Authors: Wei-Jin Huang, Yuan-Ming Li, Zhi-Wei Xia, Yu-Ming Tang, Kun-Yu Lin, Jian-Fang Hu, Wei-Shi Zheng

Abstract: Error detection in procedural activities is essential for consistent and correct outcomes in AR-assisted and robotic systems. Existing methods often focus on temporal ordering errors or rely on static prototypes to represent normal actions. However, these approaches typically overlook the common scenario where multiple, distinct actions are valid following a given sequence of executed actions. This leads to two issues: (1) the model cannot effectively detect errors using static prototypes when the inference environment or action execution distribution differs from training; and (2) the model may also use the wrong prototypes to detect errors if the ongoing action label is not the same as the predicted one. To address this problem, we propose an Adaptive Multiple Normal Action Representation (AMNAR) framework. AMNAR predicts all valid next actions and reconstructs their corresponding normal action representations, which are compared against the ongoing action to detect errors. Extensive experiments demonstrate that AMNAR achieves state-of-the-art performance, highlighting the effectiveness of AMNAR and the importance of modeling multiple valid next actions in error detection. The code is available at https://github.com/iSEE-Laboratory/AMNAR.

Comment: Does not match any specific criteria but is related to error detection in procedural tasks, which is tangentially relevant to embodied AI. Relevance: 3 Novelty: 5


ArXiv ID: 2503.21854 Authors: Hongyi Zeng, Wenxuan Liu, Tianhua Xia, Jinhui Chen, Ziyun Li, Sai Qian Zhang

Abstract: Instance segmentation is essential for augmented reality and virtual reality (AR/VR) as it enables precise object recognition and interaction, enhancing the integration of virtual and real-world elements for an immersive experience. However, the high computational overhead of segmentation limits its application on resource-constrained AR/VR devices, causing large processing latency and degrading user experience. In contrast to conventional scenarios, AR/VR users typically focus on only a few regions within their field of view before shifting perspective, allowing segmentation to be concentrated on gaze-specific areas. This insight drives the need for efficient segmentation methods that prioritize processing instances of interest, reducing computational load and enhancing real-time performance. In this paper, we present a foveated instance segmentation (FovealSeg) framework that leverages real-time user gaze data to perform instance segmentation exclusively on instances of interest, resulting in substantial computational savings. Evaluation results show that FSNet achieves an IoU of 0.56 on ADE20K and 0.54 on LVIS, notably outperforming the baseline. The code is available at https://github.com/SAI-

Comment: Does not match any specific criteria but is related to efficient instance segmentation for AR/VR, which is tangentially relevant to spatial intelligence. Relevance: 3 Novelty: 5
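
A generic sketch of the gaze-driven idea, assuming a placeholder segmentation model: crop a window around the gaze point, segment only the crop, and paste the logits back into a full-resolution map. The paper's FSNet architecture is not reproduced.

    import torch

    def foveated_segment(frame, gaze_xy, seg_model, crop=256):
        """Segment only a gaze-centred crop and paste the logits back into a
        full-resolution label map (everything outside the crop stays class 0).

        frame:   (1, 3, H, W) image tensor
        gaze_xy: (x, y) gaze location in pixels"""
        _, _, H, W = frame.shape
        x, y = gaze_xy
        half = crop // 2
        x0 = max(0, min(W - crop, int(x) - half))
        y0 = max(0, min(H - crop, int(y) - half))
        roi = frame[:, :, y0:y0 + crop, x0:x0 + crop]
        logits = seg_model(roi)                              # (1, C, crop, crop)
        full = torch.zeros(1, logits.shape[1], H, W)
        full[:, :, y0:y0 + crop, x0:x0 + crop] = logits
        return full.argmax(dim=1)                            # (1, H, W) labels

    # Toy usage with a hypothetical 1x1-conv "segmenter".
    model = torch.nn.Conv2d(3, 21, kernel_size=1)
    labels = foveated_segment(torch.rand(1, 3, 720, 1280), (640, 360), model)
    print(labels.shape)  # torch.Size([1, 720, 1280])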


ArXiv ID: 2503.22175 Authors: Ruiqi Liu, Boyu Diao, Libo Huang, Hangda Liu, Chuanguang Yang, Zhulin An, Yongjun Xu

Abstract: Continual learning (CL) aims to learn new tasks while retaining past knowledge, addressing the challenge of forgetting during task adaptation. Rehearsal-based methods, which replay previous samples, effectively mitigate forgetting. However, research on enhancing the efficiency of these methods, especially in resource-constrained environments, remains limited, hindering their application in real-world systems with dynamic data streams. The human perceptual system processes visual scenes through complementary frequency channels: low-frequency signals capture holistic cues, while high-frequency components convey structural details vital for fine-grained discrimination. Inspired by this, we propose the Frequency Decomposition and Integration Network (FDINet), a novel framework that decomposes and integrates information across frequencies. FDINet designs two lightweight networks to independently process low- and high-frequency components of images. When integrated with rehearsal-based methods, this frequency-aware design effectively enhances cross-task generalization through low-frequency information, preserves class-specific details using high-frequency information, and facilitates efficient training due to its lightweight architecture. Experiments demonstrate that FDINet reduces backbone parameters by 78%, improves accuracy by up to 7.49% over state-of-the-art (SOTA) methods, and decreases peak memory usage by up to 80%. Additionally, on edge devices, FDINet accelerates training by up to 5$\times$.

Comment: Does not match any specific criteria but is related to continual learning, which is tangentially relevant to embodied AI. Relevance: 3 Novelty: 5
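
A minimal sketch of the frequency-decomposition idea, assuming a Gaussian low-pass split and a toy two-branch classifier; FDINet's actual branch designs and its integration with rehearsal buffers are not reproduced here.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def frequency_split(x, sigma=2.0, ksize=9):
        """Split an image batch into low- and high-frequency components with
        a Gaussian low-pass filter; the residual carries high frequencies."""
        coords = (torch.arange(ksize) - ksize // 2).float()
        g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
        g = g / g.sum()
        kernel = (g[:, None] * g[None, :]).expand(x.shape[1], 1, ksize, ksize)
        low = F.conv2d(x, kernel, padding=ksize // 2, groups=x.shape[1])
        return low, x - low

    class TwoBranchNet(nn.Module):
        """Two lightweight branches process the two frequency bands; logits
        are fused by summation (the fusion choice is illustrative)."""
        def __init__(self, num_classes=10):
            super().__init__()
            def branch():
                return nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                     nn.Linear(16, num_classes))
            self.low_net, self.high_net = branch(), branch()

        def forward(self, x):
            low, high = frequency_split(x)
            return self.low_net(low) + self.high_net(high)

    print(TwoBranchNet()(torch.rand(2, 3, 32, 32)).shape)  # torch.Size([2, 10])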


ArXiv ID: 2503.21970 Authors: Yujie Chen, Haotong Qin, Zhang Zhang, Michelo Magno, Luca Benini, Yawei Li

Abstract: State-Space Models (SSMs) have attracted considerable attention in Image Restoration (IR) due to their ability to scale linearly with sequence length while effectively capturing long-distance dependencies. However, deploying SSMs to edge devices is challenging due to the constraints in memory, computing capacity, and power consumption, underscoring the need for efficient compression strategies. While low-bit quantization is an efficient model compression strategy for reducing size and accelerating IR tasks, SSMs suffer substantial performance drops at ultra-low bit-widths (2-4 bits), primarily due to outliers that exacerbate quantization error. To address this challenge, we propose Q-MambaIR, an accurate, efficient, and flexible Quantized Mamba for IR tasks. Specifically, we introduce a Statistical Dynamic-balancing Learnable Scalar (DLS) to dynamically adjust the quantization mapping range, thereby mitigating the peak truncation loss caused by extreme values. Furthermore, we design a Range-floating Flexible Allocator (RFA) with an adaptive threshold to flexibly round values. This approach preserves high-frequency details and maintains the SSM's feature extraction capability. Notably, RFA also enables pre-deployment weight quantization, striking a balance between computational efficiency and model accuracy. Extensive experiments on IR tasks demonstrate that Q-MambaIR consistently outperforms existing quantized SSMs, achieving state-of-the-art (SOTA) accuracy with only a negligible increase in training computation and storage.

Comment: Does not match any specific criterion but is related to efficient image restoration, which is tangentially relevant to computer vision. Relevance: 3 Novelty: 5
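
As background for the learnable-range idea, below is a generic fake-quantizer with a learnable clipping scale and a straight-through estimator; the paper's DLS and RFA components are not reproduced, so this is a common baseline formulation rather than Q-MambaIR itself.

    import torch
    import torch.nn as nn

    class LearnableScaleFakeQuant(nn.Module):
        """Uniform fake-quantizer with a learnable clipping scale and a
        straight-through estimator for the rounding step. The scale controls
        the quantization mapping range and is trained jointly with the model."""

        def __init__(self, bits=4, init_scale=1.0):
            super().__init__()
            self.qmax = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit signed
            self.scale = nn.Parameter(torch.tensor(float(init_scale)))

        def forward(self, x):
            s = self.scale.abs() + 1e-8
            q = x / s
            q = q + (torch.round(q) - q).detach()    # straight-through rounding
            q = torch.clamp(q, -self.qmax - 1, self.qmax)
            return q * s                             # dequantized value

    # Usage: fake-quantize a weight tensor and backprop into the scale.
    fq = LearnableScaleFakeQuant(bits=4)
    w = torch.randn(64, 64, requires_grad=True)
    loss = (fq(w) ** 2).mean()
    loss.backward()
    print(fq.scale.grad)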


ArXiv ID: 2503.22197 Authors: Yang Liu, Xun Zhang, Jiale Du, Xinbo Gao, Jungong Han

Abstract: Zero-shot learning (ZSL) attains knowledge transfer from seen classes to unseen classes by exploiting auxiliary category information, which is a promising yet difficult research topic. In this field, Audio-Visual Generalized Zero-Shot Learning (AV-GZSL) has aroused great interest among researchers, since the intricate relations among the three modalities (audio, video, and natural language) render the task challenging but highly research-worthy. However, both existing embedding-based and generative-based AV-GZSL methods tend to suffer from the domain shift problem, so we propose an extremely simple out-of-distribution (OOD) detection-based AV-GZSL method (EZ-AVOOD) that further mitigates the bias problem by differentiating seen and unseen samples at the very beginning. EZ-AVOOD accomplishes effective seen-unseen separation by exploiting the intrinsic discriminative information held in class-specific logits and the class-agnostic feature subspace, without training an extra OOD detector network. Following the seen-unseen binary classification, we employ two expert models to classify seen samples and unseen samples separately. Compared to existing state-of-the-art methods, our model achieves superior ZSL and GZSL performance on three audio-visual datasets and becomes the new SOTA, which comprehensively demonstrates the effectiveness of the proposed EZ-AVOOD.

Comment: Does not match any specific criterion but is related to zero-shot learning and multimodal learning, which aligns with your friend's general interest. Relevance: 3 Novelty: 5


ArXiv ID: 2503.21932 Authors: Seyed Hamidreza Nabaei, Zeyang Zheng, Dong Chen, Arsalan Heydarian

Abstract: Indoor gardening within sustainable buildings offers a transformative solution to urban food security and environmental sustainability. By 2030, urban farming, including Controlled Environment Agriculture (CEA) and vertical farming, is expected to grow at a compound annual growth rate (CAGR) of 13.2% from 2024 to 2030, according to market reports. This growth is fueled by advancements in Internet of Things (IoT) technologies, sustainable innovations such as smart growing systems, and the rising interest in green interior design. This paper presents a novel framework that integrates computer vision, machine learning (ML), and environmental sensing for the automated monitoring of plant health and growth. Unlike previous approaches, this framework combines RGB imagery, plant phenotyping data, and environmental factors such as temperature and humidity, to predict plant water stress in a controlled growth environment. The system utilizes high-resolution cameras to extract phenotypic features such as RGB values, plant area, height, and width, while employing the Lag-Llama time series model to analyze and predict water stress. Experimental results demonstrate that integrating RGB, size ratios, and environmental data significantly enhances predictive accuracy, with the fine-tuned model achieving the lowest errors (MSE = 0.420777, MAE = 0.595428) and reduced uncertainty. These findings highlight the potential of multimodal data and intelligent systems to automate plant care, optimize resource consumption, and align indoor gardening with sustainable building management practices, paving the way for resilient, green urban spaces.

Comment: Does not match any specific criterion but is related to multimodal data integration and machine learning applications. Relevance: 3 Novelty: 5


ArXiv ID: 2503.21939 Authors: Roxana Bujack, Emily Shinkle, Alice Allen, Tomas Suk, Nicholas Lubbers

Abstract: Moment invariants are a powerful tool for the generation of rotation-invariant descriptors needed for many applications in pattern detection, classification, and machine learning. A set of invariants is optimal if it is complete, independent, and robust against degeneracy in the input. In this paper, we show that the current state of the art for the generation of these bases of moment invariants, despite being robust against moment tensors being identically zero, is vulnerable to a degeneracy that is common in real-world applications, namely spherical functions. We show how to overcome this vulnerability by combining two popular moment invariant approaches: one based on spherical harmonics and one based on Cartesian tensor algebra.

Comment: Does not match any specific criterion but is related to pattern detection and machine learning, which aligns with your friend's general interest. Relevance: 3 Novelty: 5


ArXiv ID: 2503.22557 Authors: Zhendi Gong, Susan Francis, Eleanor Cox, Stamatios N. Sotiropoulos, Dorothee P. Auer, Guoping Qiu, Andrew P. French, Xin Chen

Abstract: Multi-organ segmentation holds paramount significance in many clinical tasks. In practice, compared to large fully annotated datasets, multiple small datasets are often more accessible and organs are not labelled consistently. Normally, an individual model is trained for each of these datasets, which is not an effective way of using data for model learning. It remains challenging to train a single model that can robustly learn from several partially labelled datasets due to label conflict and data imbalance problems. We propose MO-CTranS: a single model that can overcome such problems. MO-CTranS contains a CNN-based encoder and a Transformer-based decoder, which are connected in a multi-resolution manner. Task-specific tokens are introduced in the decoder to help differentiate label discrepancies. Our method was evaluated and compared to several baseline models and state-of-the-art (SOTA) solutions on abdominal MRI datasets that were acquired in different views (i.e. axial and coronal) and annotated for different organs (i.e. liver, kidney, spleen). Our method achieved better performance (most were statistically significant) than the compared methods. Github link: https://github.com/naisops/MO-CTranS.

Comment: Does not match any specific criteria but is relevant to the general interest area of multi-organ segmentation and medical imaging. Relevance: 3 Novelty: 5


ArXiv ID: 2503.22351 Authors: Byeongjun Kwon, Munchurl Kim

Abstract: Zero-shot depth estimation (DE) models exhibit strong generalization performance as they are trained on large-scale datasets. However, existing models struggle with high-resolution images due to the discrepancy in image resolutions between training (with smaller resolutions) and inference (for high resolutions). Processing them at full resolution leads to decreased depth estimation accuracy and tremendous memory consumption, while downsampling to the training resolution results in blurred edges in the estimated depth images. Prevailing high-resolution depth estimation methods adopt a patch-based approach, which introduces depth discontinuity issues when reassembling the estimated depth patches and results in test-time inefficiency. Additionally, to obtain fine-grained depth details, these methods rely on synthetic datasets due to the real-world sparse ground truth depth, leading to poor generalizability. To tackle these limitations, we propose Patch Refine Once (PRO), an efficient and generalizable tile-based framework. Our PRO consists of two key components: (i) Grouped Patch Consistency Training that enhances test-time efficiency while mitigating the depth discontinuity problem by jointly processing four overlapping patches and enforcing a consistency loss on their overlapping regions within a single backpropagation step, and (ii) Bias Free Masking that prevents the DE models from overfitting to dataset-specific biases, enabling better generalization to real-world datasets even after training on synthetic data. Zero-shot evaluation on Booster, ETH3D, Middlebury 2014, and NuScenes demonstrates that DE models into which our PRO is integrated remain effective on grid inputs of high-resolution images, with few depth discontinuities at the grid boundaries. Our PRO also runs fast at inference time.

Comment: Does not match any specific criteria but is relevant to the general interest area of depth estimation and high-resolution image processing. Relevance: 3 Novelty: 5
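
To make the consistency term concrete, here is a two-patch simplification (the paper uses four overlapping patches) that penalizes disagreement between depth predictions on the shared region of two overlapping crops; the depth network is a stand-in.

    import torch
    import torch.nn.functional as F

    def overlap_consistency_loss(depth_model, image, patch=256, overlap=64):
        """Consistency term on the shared region of two horizontally
        overlapping crops: the last `overlap` columns of the left crop and
        the first `overlap` columns of the right crop cover the same pixels,
        so their depth predictions should agree."""
        left = image[:, :, :patch, :patch]
        right = image[:, :, :patch, patch - overlap:2 * patch - overlap]
        d_left = depth_model(left)
        d_right = depth_model(right)
        return F.l1_loss(d_left[:, :, :, -overlap:], d_right[:, :, :, :overlap])

    # Toy usage with a stand-in "depth network".
    net = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)
    img = torch.rand(1, 3, 256, 512)
    print(overlap_consistency_loss(net, img))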


ArXiv ID: 2503.22179 Authors: Dailan He, Xiahong Wang, Shulun Wang, Guanglu Song, Bingqi Ma, Hao Shao, Yu Liu, Hongsheng Li

Abstract: Face swapping aims to seamlessly transfer a source facial identity onto a target while preserving target attributes such as pose and expression. Diffusion models, known for their superior generative capabilities, have recently shown promise in advancing face-swapping quality. This paper addresses two key challenges in diffusion-based face swapping: the prioritized preservation of identity over target attributes and the inherent conflict between identity and attribute conditioning. To tackle these issues, we introduce an identity-constrained attribute-tuning framework for face swapping that first ensures identity preservation and then fine-tunes for attribute alignment, achieved through a decoupled condition injection. We further enhance fidelity by incorporating identity and adversarial losses in a post-training refinement stage. Our proposed identity-constrained diffusion-based face-swapping model outperforms existing methods in both qualitative and quantitative evaluations, demonstrating superior identity similarity and attribute consistency, achieving a new state-of-the-art performance in high-fidelity face swapping.

Comment: Does not match any specific criteria but is relevant to the general interest area of generative modeling and face-swapping applications. Relevance: 3 Novelty: 5


ArXiv ID: 2503.22225 Authors: Haijie Yang, Zhenyu Zhang, Hao Tang, Jianjun Qian, Jian Yang

Abstract: Pre-trained conditional diffusion models have demonstrated remarkable potential in image editing. However, they often face challenges with temporal consistency, particularly in the talking head domain, where continuous changes in facial expressions intensify the level of difficulty. These issues stem from the independent editing of individual images and the inherent loss of temporal continuity during the editing process. In this paper, we introduce Follow Your Motion (FYM), a generic framework for maintaining temporal consistency in portrait editing. Specifically, given portrait images rendered by a pre-trained 3D Gaussian Splatting model, we first develop a diffusion model that intuitively and inherently learns motion trajectory changes at different scales and pixel coordinates, from the first frame to each subsequent frame. This approach ensures that temporally inconsistent edited avatars inherit the motion information from the rendered avatars. Secondly, to maintain fine-grained expression temporal consistency in talking head editing, we propose a dynamic re-weighted attention mechanism. This mechanism assigns higher weight coefficients to landmark points in space and dynamically updates these weights based on landmark loss, achieving more consistent and refined facial expressions. Extensive experiments demonstrate that our method outperforms existing approaches in terms of temporal consistency and can be used to optimize and compensate for temporally inconsistent outputs in a range of applications, such as text-driven editing, relighting, and various other applications.

Comment: Does not match any specific criteria but is relevant to the general interest area of generative modeling and temporal consistency in image editing. Relevance: 3 Novelty: 5


ArXiv ID: 2503.22136 Authors: Hongmei Yin, Tingliang Feng, Fan Lyu, Fanhua Shang, Hongying Liu, Wei Feng, Liang Wan

Abstract: In this work, we focus on continual semantic segmentation (CSS), where segmentation networks are required to continuously learn new classes without erasing knowledge of previously learned ones. Although storing images of old classes and directly incorporating them into the training of new models has proven effective in mitigating catastrophic forgetting in classification tasks, this strategy presents notable limitations in CSS. Specifically, the stored and new images with partial category annotations lead to confusion between unannotated categories and the background, complicating model fitting. To tackle this issue, this paper proposes a novel Enhanced Instance Replay (EIR) method, which not only preserves knowledge of old classes and eliminates background confusion by storing instances of old classes, but also mitigates background shifts in the new images by integrating stored instances with new images. By effectively resolving background shifts in both stored and new images, EIR alleviates catastrophic forgetting in the CSS task, thereby enhancing the model's capacity for CSS. Experimental results validate the efficacy of our approach, which significantly outperforms state-of-the-art CSS methods.

Comment: Does not match any specific criteria but is relevant to the general interest area of continual learning and semantic segmentation. Relevance: 3 Novelty: 5


ArXiv ID: 2503.22154 Authors: Jae-Young Yim, Dongwook Kim, Jae-Young Sim

Abstract: Training deep neural networks for various applications requires collecting large amounts of data. Recently, dataset distillation for images and texts has been attracting a lot of attention; it reduces the original dataset to a synthetic dataset while preserving essential task-relevant information. However, 3D point cloud distillation remains almost unexplored due to the challenges posed by the unordered structure of points. In this paper, we propose a novel distribution matching-based dataset distillation method for 3D point clouds that jointly optimizes the geometric structures of the synthetic dataset as well as the orientations of the synthetic models. To ensure consistent feature alignment between different 3D point cloud models, we devise a permutation-invariant distribution matching loss with sorted feature vectors. We also employ learnable rotation angles to transform each synthetic model according to the optimal orientation that best represents the original feature distribution. Extensive experimental results on four widely used benchmark datasets, including ModelNet10, ModelNet40, ShapeNet, and ScanObjectNN, demonstrate that the proposed method consistently outperforms existing methods.

Comment: Does not match any specific criteria but is relevant to the general interest area of machine learning and dataset distillation. Relevance: 3 Novelty: 5
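
The permutation-invariance trick is easy to illustrate: sorting per-point features along the point dimension before matching makes the loss independent of point ordering. The sketch below shows this with a plain MSE between sorted features; the paper's full distribution-matching loss and rotation optimization are not reproduced.

    import torch

    def sorted_feature_matching_loss(feat_syn, feat_real):
        """Permutation-invariant matching: per-point features are sorted
        along the point dimension channel-wise before comparison, so the
        loss is unchanged by any reordering of the points.

        feat_syn, feat_real: (B, N, C) per-point features for the synthetic
        and real point-cloud batches (same shapes for simplicity)."""
        syn_sorted, _ = torch.sort(feat_syn, dim=1)
        real_sorted, _ = torch.sort(feat_real, dim=1)
        return torch.mean((syn_sorted - real_sorted) ** 2)

    # Sanity check: the loss is invariant to a random permutation of points.
    x, y = torch.randn(4, 1024, 64), torch.randn(4, 1024, 64)
    perm = torch.randperm(1024)
    print(sorted_feature_matching_loss(x, y))
    print(sorted_feature_matching_loss(x[:, perm], y))   # same value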


ArXiv ID: 2503.22375 Authors: Christian Steinhauser, Philipp Reis, Hubert Padusinski, Jacob Langner, Eric Sax

Abstract: Precise perception of the environment is essential in highly automated driving systems, which rely on machine learning tasks such as object detection and segmentation. Compression of sensor data is commonly used for data handling, while virtualization is used for hardware-in-the-loop validation. Both methods can alter sensor data and degrade model performance. This necessitates a systematic approach to quantifying image validity. This paper presents a four-step framework to evaluate the impact of image modifications on machine learning tasks. First, a dataset with modified images is prepared to ensure one-to-one matching image pairs, enabling measurement of deviations resulting from compression and virtualization. Second, image deviations are quantified by comparing the effects of compression and virtualization against original camera-based sensor data. Third, the performance of state-of-the-art object detection models is analyzed to determine how altered input data affects perception tasks, including bounding box accuracy and reliability. Finally, a correlation analysis is performed to identify relationships between image quality and model performance. As a result, the LPIPS metric achieves the highest correlation between image deviation and machine learning performance across all evaluated machine learning tasks.

Comment: Does not match any specific criterion but is tangentially related to your friend's general interest in computer vision and machine learning, focusing on image quality and its impact on ML performance. Relevance: 3 Novelty: 4
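
A minimal sketch of the final correlation step, assuming the lpips package is installed and that a per-image detection score (e.g., mean IoU of matched boxes) has already been computed for each modified image; variable names are illustrative.

    import torch
    import lpips                          # pip install lpips
    from scipy.stats import pearsonr

    loss_fn = lpips.LPIPS(net='alex')     # perceptual distance model

    def lpips_distance(img_a, img_b):
        """img_a, img_b: (3, H, W) tensors scaled to [-1, 1]."""
        with torch.no_grad():
            return loss_fn(img_a.unsqueeze(0), img_b.unsqueeze(0)).item()

    def deviation_performance_correlation(originals, modified, det_scores):
        """originals, modified: lists of paired image tensors;
        det_scores: per-image detection quality (e.g., mean IoU)."""
        deviations = [lpips_distance(a, b) for a, b in zip(originals, modified)]
        return pearsonr(deviations, det_scores)   # (correlation, p-value)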


ArXiv ID: 2503.22174 Authors: Jialun Pei, Zhangjun Zhou, Diandian Guo, Zhixi Li, Jing Qin, Bo Du, Pheng-Ann Heng

Abstract: Intraoperative bleeding in laparoscopic surgery causes rapid obscuration of the operative field to hinder the surgical process. Intelligent detection of bleeding regions can quantify the blood loss to assist decision-making, while locating the bleeding point helps surgeons quickly identify the source of bleeding and achieve hemostasis in time. In this study, we first construct a real-world surgical bleeding detection dataset, named SurgBlood, comprising 5,330 frames from 95 surgical video clips with bleeding region and point annotations. Accordingly, we develop a dual-task synergistic online detector called BlooDet, designed to perform simultaneous detection of bleeding regions and points in surgical videos. Our framework embraces a dual-branch bidirectional guidance design based on Segment Anything Model 2 (SAM 2). The mask branch detects bleeding regions through adaptive edge and point prompt embeddings, while the point branch leverages mask memory to induce bleeding point memory modeling and captures the direction of bleed point movement through inter-frame optical flow. By interactive guidance and prompts, the two branches explore potential spatial-temporal relationships while leveraging memory modeling from previous frames to infer the current bleeding condition. Extensive experiments demonstrate that our approach outperforms other counterparts on SurgBlood in both bleeding region and point detection tasks, e.g., achieving 64.88% IoU for bleeding region detection and 83.69% PCK-10% for bleeding point detection.

Comment: Does not match any specific criteria but is related to computer vision applications in surgical videos. Relevance: 3 Novelty: 4


ArXiv ID: 2503.22069 Authors: Ekansh Chauhan, Anila Sharma, Amit Sharma, Vikas Nishadham, Asha Ghughtyal, Ankur Kumar, Gurudutt Gupta, Anurag Mehta, C. V. Jawahar, P. K. Vinod

Abstract: Breast cancer, the most common malignancy among women, requires precise detection and classification for effective treatment. Immunohistochemistry (IHC) biomarkers like HER2, ER, and PR are critical for identifying breast cancer subtypes. However, traditional IHC classification relies on pathologists' expertise, making it labor-intensive and subject to significant inter-observer variability. To address these challenges, this study introduces the India Pathology Breast Cancer Dataset (IPD-Breast), comprising 1,272 IHC slides (HER2, ER, and PR) aimed at automating receptor status classification. The primary focus is on developing predictive models for HER2 3-way classification (0, Low, High) to enhance prognosis. Evaluation of multiple deep learning models revealed that an end-to-end ConvNeXt network utilizing low-resolution IHC images achieved an AUC, F1, and accuracy of 91.79%, 83.52%, and 83.56%, respectively, for 3-way classification, outperforming patch-based methods by over 5.35% in F1 score. This study highlights the potential of simple yet effective deep learning techniques to significantly improve accuracy and reproducibility in breast cancer classification, supporting their integration into clinical workflows for better patient outcomes.

Comment: Does not match any specific criterion but is related to computer vision and deep learning applications in medical imaging. Relevance: 3 Novelty: 4


ArXiv ID: 2503.22342 Authors: Zhihang Lin, Mingbao Lin, Yuan Xie, Rongrong Ji

Abstract: This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models based on Group Relative Policy Optimization (GRPO). GRPO, while effective, incurs high training costs due to the need for sampling multiple completions for each question. Our experiments and theoretical analysis reveal that the number of completions impacts model accuracy yet increases training time multiplicatively, and that not all completions contribute equally to policy training -- their contribution depends on their relative advantage. To address these issues, we propose CPPO, which prunes completions with low absolute advantages, significantly reducing the number needed for gradient calculation and updates. Additionally, we introduce a dynamic completion allocation strategy to maximize GPU utilization by incorporating additional questions, further enhancing training efficiency. Experimental results demonstrate that CPPO achieves up to $8.32\times$ speedup on GSM8K and $3.51\times$ on Math while preserving or even enhancing the accuracy compared to the original GRPO. We release our code at https://github.com/lzhxmu/CPPO.

Comment: Does not match any specific criteria but is relevant to the general interest area of optimization techniques in reasoning models. Relevance: 3 Novelty: 4
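
The pruning rule is straightforward to sketch: compute GRPO-style group-relative advantages for the completions of one question, then keep only those with the largest absolute advantage. The keep ratio and the exact advantage normalization below are simplifications, not the paper's precise settings.

    import torch

    def prune_completions_by_advantage(rewards, keep_ratio=0.5, eps=1e-6):
        """GRPO-style group-relative advantages (rewards standardized within
        the group of completions for one question), followed by keeping only
        the completions with the largest absolute advantage.

        rewards: (G,) rewards for the G completions of a single question.
        Returns (kept_indices, advantages)."""
        adv = (rewards - rewards.mean()) / (rewards.std() + eps)
        k = max(1, int(keep_ratio * rewards.numel()))
        kept = torch.topk(adv.abs(), k).indices
        return kept, adv

    # Toy usage: 8 completions, keep the 4 with the highest |advantage|.
    r = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.5, 0.0, 1.0, 0.0])
    kept, adv = prune_completions_by_advantage(r, keep_ratio=0.5)
    print(kept, adv[kept])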



Paper selection prompt

  1. New methodological improvements to spatial understanding, spatial intelligence on embodied agents;
  2. Shows new VLLMs (visual large language models) or MLLMs (multi-modal large language models)
  3. Embodied AI papers on building new benchmarks (simulator related) or new methods. These papers should focus on novel angles that previous work ignored.
  4. Vision foundation models and their applications.

In suggesting papers to your friend, remember that he enjoys papers on computer vision and machine learning, and generative modeling in multi-modal learning. Your friend also likes learning about surprising empirical or insightful results in vision-language models or embodied AI, as well as clever statistical tricks.