Personalized Daily ArXiv Papers 01/06/2025
Total relevant papers: 41
Paper selection prompt and criteria at the bottom
Table of contents with paper titles:
- AR4D: Autoregressive 4D Generation from Monocular Videos Authors: Hanxin Zhu, Tianyu He, Xiqian Yu, Junliang Guo, Zhibo Chen, Jiang Bian
- HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding Authors: Heqing Zou (Xiao Jie), Tianze Luo (Xiao Jie), Guiyang Xie (Xiao Jie), Victor (Xiao Jie), Zhang, Fengmao Lv, Guangcong Wang, Junyang Chen, Zhuochen Wang, Hansheng Zhang, Huaijian Zhang
- Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding Authors: Jiaming Li, Jiacheng Zhang, Zequn Jie, Lin Ma, Guanbin Li
- SDPO: Segment-Level Direct Preference Optimization for Social Agents Authors: Aobo Kong, Wentao Ma, Shiwan Zhao, Yongbin Li, Yuchuan Wu, Ke Wang, Xiaoqian Liu, Qicheng Li, Yong Qin, Fei Huang
- PG-SAG: Parallel Gaussian Splatting for Fine-Grained Large-Scale Urban Buildings Reconstruction via Semantic-Aware Grouping Authors: Tengfei Wang, Xin Wang, Yongmao Hou, Yiwei Xu, Wendi Zhang, Zongqian Zhan
- LogicAD: Explainable Anomaly Detection via VLM-based Text Feature Extraction Authors: Er Jin, Qihui Feng, Yongli Mou, Stefan Decker, Gerhard Lakemeyer, Oliver Simons, Johannes Stegmaier
- Cloth-Splatting: 3D Cloth State Estimation from RGB Supervision Authors: Alberta Longhini, Marcel Büsching, Bardienus P. Duisterhof, Jens Lundell, Jeffrey Ichnowski, Mårten Björkman, Danica Kragic
- AgentRefine: Enhancing Agent Generalization through Refinement Tuning Authors: Dayuan Fu, Keqing He, Yejie Wang, Wentao Hong, Zhuoma Gongque, Weihao Zeng, Wei Wang, Jingang Wang, Xunliang Cai, Weiran Xu
- ANTHROPOS-V: benchmarking the novel task of Crowd Volume Estimation Authors: Luca Collorone, Stefano D’Arrigo, Massimiliano Pappa, Guido Maria D’Amely di Melendugno, Giovanni Ficarra, Fabio Galasso
- MoColl: Agent-Based Specific and General Model Collaboration for Image Captioning Authors: Pu Yang, Bin Dong
- MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders Authors: Jiajun Cao, Yuan Zhang, Tao Huang, Ming Lu, Qizhe Zhang, Ruichuan An, Ningning MA, Shanghang Zhang
- CrossView-GS: Cross-view Gaussian Splatting For Large-scale Scene Reconstruction Authors: Chenhao Zhang, Yuanping Cao, Lei Zhang
- A Minimal Subset Approach for Efficient and Scalable Loop Closure Authors: Nikolaos Stathoulopoulos, Christoforos Kanellakis, George Nikolakopoulos
- IGAF: Incremental Guided Attention Fusion for Depth Super-Resolution Authors: Athanasios Tragakis, Chaitanya Kaul, Kevin J. Mitchell, Hang Dai, Roderick Murray-Smith, Daniele Faccio
- VideoLifter: Lifting Videos to 3D with Fast Hierarchical Stereo Alignment Authors: Wenyan Cong, Kevin Wang, Jiahui Lei, Colton Stearns, Yuanhao Cai, Dilin Wang, Rakesh Ranjan, Matt Feiszli, Leonidas Guibas, Zhangyang Wang, Weiyao Wang, Zhiwen Fan
- Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models Authors: Guosheng Zhang, Keyao Wang, Haixiao Yue, Ajian Liu, Gang Zhang, Kun Yao, Errui Ding, Jingdong Wang
- VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction Authors: Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Yangze Li, Zuwei Long, Heting Gao, Ke Li, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He
- Virgo: A Preliminary Exploration on Reproducing o1-like MLLM Authors: Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen
- Enhancing Large Vision Model in Street Scene Semantic Understanding through Leveraging Posterior Optimization Trajectory Authors: Wei-Bin Kou, Qingfeng Lin, Ming Tang, Shuai Wang, Rongguang Ye, Guangxu Zhu, Yik-Chung Wu
- IAM: Enhancing RGB-D Instance Segmentation with New Benchmarks Authors: Aecheon Jung, Soyun Choi, Junhong Min, Sungeun Hong
- SAFER: Sharpness Aware layer-selective Finetuning for Enhanced Robustness in vision transformers Authors: Bhavna Gopal, Huanrui Yang, Mark Horton, Yiran Chen
- BLAST: A Stealthy Backdoor Leverage Attack against Cooperative Multi-Agent Deep Reinforcement Learning based Systems Authors: Yinbo Yu, Saihao Yan, Xueyu Yin, Jing Fang, Jiajia Liu
- Robust Self-Paced Hashing for Cross-Modal Retrieval with Noisy Labels Authors: Ruitao Pu, Yuan Sun, Yang Qin, Zhenwen Ren, Xiaomin Song, Huiming Zheng, Dezhong Peng
- MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation Authors: Huaize Liu, Wenzhang Sun, Donglin Di, Shibo Sun, Jiahui Yang, Changqing Zou, Hujun Bao
- Aesthetic Matters in Music Perception for Image Stylization: A Emotion-driven Music-to-Visual Manipulation Authors: Junjie Xu, Xingjiao Wu, Tanren Yao, Zihao Zhang, Jiayang Bei, Wu Wen, Liang He
- JoyGen: Audio-Driven 3D Depth-Aware Talking-Face Video Editing Authors: Qili Wang, Dajiang Wu, Zihang Xu, Junshi Huang, Jun Lv
- Few-shot Implicit Function Generation via Equivariance Authors: Suizhi Huang, Xingyi Yang, Hongtao Lu, Xinchao Wang
- Task-Driven Fixation Network: An Efficient Architecture with Fixation Selection Authors: Shuguang Wang, Yuanjing Wang
- KeyNode-Driven Geometry Coding for Real-World Scanned Human Dynamic Mesh Compression Authors: Huong Hoang, Truong Nguyen, Pamela Cosman
- ACE: Anti-Editing Concept Erasure in Text-to-Image Models Authors: Zihao Wang, Yuxiang Wei, Fan Li, Renjing Pei, Hang Xu, Wangmeng Zuo
- Multi-modal classification of forest biodiversity potential from 2D orthophotos and 3D airborne laser scanning point clouds Authors: Simon B. Jensen, Stefan Oehmcke, Andreas Møgelmose, Meysam Madadi, Christian Igel, Sergio Escalera, Thomas B. Moeslund
- From Age Estimation to Age-Invariant Face Recognition: Generalized Age Feature Extraction Using Order-Enhanced Contrastive Learning Authors: Haoyi Wang, Victor Sanchez, Chang-Tsun Li, Nathan Clarke
- Adverse Weather Conditions Augmentation of LiDAR Scenes with Latent Diffusion Models Authors: Andrea Matteazzi, Pascal Colling, Michael Arnold, Dietmar Tutsch
- Quantitative Gait Analysis from Single RGB Videos Using a Dual-Input Transformer-Based Network Authors: Hiep Dinh, Son Le, My Than, Minh Ho, Nicolas Vuillerme, Hieu Pham
- VidFormer: A novel end-to-end framework fused by 3DCNN and Transformer for Video-based Remote Physiological Measurement Authors: Jiachen Li, Shisheng Guo, Longzhen Tang, Cuolong Cui, Lingjiang Kong, Xiaobo Yang
- UAV-DETR: Efficient End-to-End Object Detection for Unmanned Aerial Vehicle Imagery Authors: Huaxiang Zhang, Kai Liu, Zhongxue Gan, Guo-Niu Zhu
- Detecting and Mitigating Adversarial Attacks on Deep Learning-Based MRI Reconstruction Without Any Retraining Authors: Mahdi Saberi, Chi Zhang, Mehmet Akcakaya
- Dedicated Inference Engine and Binary-Weight Neural Networks for Lightweight Instance Segmentation Authors: Tse-Wei Chen, Wei Tao, Dongyue Zhao, Kazuhiro Mima, Tadayuki Ito, Kinya Osa, Masami Kato
- Semantic Segmentation for Sequential Historical Maps by Learning from Only One Map Authors: Yunshuang Yuan, Frank Thiemann, Monika Sester
- iCBIR-Sli: Interpretable Content-Based Image Retrieval with 2D Slice Embeddings Authors: Shuhei Tomoshige, Hayato Muraki, Kenichi Oishi, Hitoshi Iyatomi
- Prism: Mining Task-aware Domains in Non-i.i.d. IMU Data for Flexible User Perception Authors: Yunzhe Li, Facheng Hu, Hongzi Zhu, Quan Liu, Xiaoke Zhao, Jiangang Shen, Shan Chang, Minyi Guo
0. AR4D: Autoregressive 4D Generation from Monocular Videos
ArXiv ID: 2501.01722 Authors: Hanxin Zhu, Tianyu He, Xiqian Yu, Junliang Guo, Zhibo Chen, Jiang Bian
Abstract: Recent advancements in generative models have ignited substantial interest in dynamic 3D content creation (i.e., 4D generation). Existing approaches primarily rely on Score Distillation Sampling (SDS) to infer novel-view videos, typically leading to issues such as limited diversity, spatial-temporal inconsistency and poor prompt alignment, due to the inherent randomness of SDS. To tackle these problems, we propose AR4D, a novel paradigm for SDS-free 4D generation. Specifically, our paradigm consists of three stages. To begin with, for a monocular video that is either generated or captured, we first utilize pre-trained expert models to create a 3D representation of the first frame, which is further fine-tuned to serve as the canonical space. Subsequently, motivated by the fact that videos unfold naturally in an autoregressive manner, we propose to generate each frame’s 3D representation based on its previous frame’s representation, as this autoregressive generation manner can facilitate more accurate geometry and motion estimation. Meanwhile, to prevent overfitting during this process, we introduce a progressive view sampling strategy, utilizing priors from pre-trained large-scale 3D reconstruction models. To avoid appearance drift introduced by autoregressive generation, we further incorporate a refinement stage based on a global deformation field and the geometry of each frame’s 3D representation. Extensive experiments have demonstrated that AR4D can achieve state-of-the-art 4D generation without SDS, delivering greater diversity, improved spatial-temporal consistency, and better alignment with input prompts.
Comment: Matches criteria 1 and 3 closely with new methodological improvements in 4D generation and spatial understanding. Relevance: 5 Novelty: 7
1. HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding
ArXiv ID: 2501.01645 Authors: Heqing Zou (Xiao Jie), Tianze Luo (Xiao Jie), Guiyang Xie (Xiao Jie), Victor (Xiao Jie), Zhang, Fengmao Lv, Guangcong Wang, Junyang Chen, Zhuochen Wang, Hansheng Zhang, Huaijian Zhang
Abstract: Multimodal large language models have become a popular topic in deep visual understanding due to many promising real-world applications. However, hour-long video understanding, spanning over one hour and containing tens of thousands of visual frames, remains under-explored because of 1) challenging long-term video analyses, 2) inefficient large-model approaches, and 3) lack of large-scale benchmark datasets. Among them, in this paper, we focus on building a large-scale hour-long long video benchmark, HLV-1K, designed to evaluate long video understanding models. HLV-1K comprises 1009 hour-long videos with 14,847 high-quality question answering (QA) and multi-choice question answering (MCQA) pairs with time-aware queries and diverse annotations, covering frame-level, within-event-level, cross-event-level, and long-term reasoning tasks. We evaluate our benchmark using existing state-of-the-art methods and demonstrate its value for testing deep long video understanding capabilities at different levels and for various tasks. This includes promoting future long video understanding tasks at a granular level, such as deep understanding of long live videos, meeting recordings, and movies.
Comment: Matches criterion 3. Relevance: 5 Novelty: 7
2. Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding
ArXiv ID: 2501.01926 Authors: Jiaming Li, Jiacheng Zhang, Zequn Jie, Lin Ma, Guanbin Li
Abstract: Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks. Despite their success, LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. To address this issue, some approaches have introduced inference-time interventions, such as contrastive decoding and attention rectification, to reduce overreliance on language priors. However, these approaches overlook hallucinations stemming from spurious inter-modality correlations. In this paper, we propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner. In this method, we design a Cross-Modal Value-Enhanced Decoding (CMVED) module to alleviate hallucination via a novel contrastive decoding mechanism. During the estimation of the distorted distribution, CMVED masks the value vectors associated with significant cross-modal attention weights, which addresses both uni-modality overreliance and misleading inter-modality correlations. Additionally, a Content-Driven Attention Refinement (CDAR) module refines cross-modal attention weights, guiding LVLMs to focus on important visual content. Experimental results on diverse hallucination benchmarks validate the superiority of our method over existing state-of-the-art techniques in reducing hallucinations in LVLM text generation. Our code will be available at https://github.com/lijm48/IMCCD.
Comment: Matches criterion 2 for addressing hallucination in large vision-language models. Relevance: 5 Novelty: 7
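As a rough illustration of the contrastive-decoding idea summarized above, the sketch below combines logits from a normal forward pass with logits from a "distorted" pass (e.g., one in which selected cross-modal value vectors are masked), plus an adaptive plausibility cutoff. This is a generic sketch of the technique, not the paper's IMCCD implementation; the function name and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_decode(logits_orig: torch.Tensor,
                       logits_distorted: torch.Tensor,
                       alpha: float = 1.0,
                       beta: float = 0.1) -> torch.Tensor:
    """Suppress tokens that the distorted pass favours relative to the original pass."""
    # Contrastive combination: amplify what the original pass supports.
    contrast = (1.0 + alpha) * logits_orig - alpha * logits_distorted

    # Adaptive plausibility constraint: keep only tokens that are already
    # reasonably likely under the original distribution.
    probs = F.softmax(logits_orig, dim=-1)
    cutoff = beta * probs.max(dim=-1, keepdim=True).values
    return contrast.masked_fill(probs < cutoff, float("-inf"))

# Usage: sample or argmax the next token from the returned logits.
```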
3. SDPO: Segment-Level Direct Preference Optimization for Social Agents
ArXiv ID: 2501.01821 Authors: Aobo Kong, Wentao Ma, Shiwan Zhao, Yongbin Li, Yuchuan Wu, Ke Wang, Xiaoqian Liu, Qicheng Li, Yong Qin, Fei Huang
Abstract: Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex goal-oriented social dialogues. Direct Preference Optimization (DPO) has proven effective in aligning LLM behavior with human preferences across a variety of agent tasks. Existing DPO-based approaches for multi-turn interactions are divided into turn-level and session-level methods. The turn-level method is overly fine-grained, focusing exclusively on individual turns, while session-level methods are too coarse-grained, often introducing training noise. To address these limitations, we propose Segment-Level Direct Preference Optimization (SDPO), which focuses on specific key segments within interactions to optimize multi-turn agent behavior while minimizing training noise. Evaluations on the SOTOPIA benchmark demonstrate that SDPO-tuned agents consistently outperform both existing DPO-based methods and proprietary LLMs like GPT-4o, underscoring SDPO’s potential to advance the social intelligence of LLM-based agents. We release our code and data at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/SDPO.
Comment: The paper proposes a new method for optimizing multi-turn agent behavior, which could be relevant to criterion 3 as it involves novel methods for embodied AI. Relevance: 5 Novelty: 6
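To make the segment-level idea concrete, here is a minimal PyTorch sketch of a DPO-style objective evaluated only on tokens inside selected dialogue segments. It is a generic illustration under assumed tensor layouts (per-token log-probabilities and a 0/1 segment mask), not the released SDPO code.

```python
import torch
import torch.nn.functional as F

def segment_dpo_loss(policy_logps: torch.Tensor,   # (B, T) per-token log-probs, policy model
                     ref_logps: torch.Tensor,      # (B, T) per-token log-probs, reference model
                     segment_mask: torch.Tensor,   # (B, T) 1 for tokens inside the key segment
                     chosen_idx: torch.Tensor,     # (N,) indices of preferred trajectories
                     rejected_idx: torch.Tensor,   # (N,) indices of dispreferred trajectories
                     beta: float = 0.1) -> torch.Tensor:
    """DPO objective restricted to tokens inside the selected segments."""
    # Sum log-prob ratios over the masked segment tokens only.
    ratios = ((policy_logps - ref_logps) * segment_mask).sum(dim=-1)  # (B,)
    margin = ratios[chosen_idx] - ratios[rejected_idx]                # (N,)
    return -F.logsigmoid(beta * margin).mean()
```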
4. PG-SAG: Parallel Gaussian Splatting for Fine-Grained Large-Scale Urban Buildings Reconstruction via Semantic-Aware Grouping
ArXiv ID: 2501.01677 Authors: Tengfei Wang, Xin Wang, Yongmao Hou, Yiwei Xu, Wendi Zhang, Zongqian Zhan
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a transformative method in the field of real-time novel view synthesis. Based on 3DGS, recent advancements cope with large-scale scenes via spatial-based partition strategies to reduce video memory and optimization time costs. In this work, we introduce a parallel Gaussian splatting method, termed PG-SAG, which fully exploits semantic cues for both partitioning and Gaussian kernel optimization, enabling fine-grained building surface reconstruction of large-scale urban areas without downsampling the original image resolution. First, the cross-modal Language Segment Anything model is leveraged to segment building masks. Then, the segmented building regions are grouped into sub-regions according to the visibility check across registered images. The Gaussian kernels for these sub-regions are optimized in parallel with masked pixels. In addition, the normal loss is re-formulated for the detected edges of masks to alleviate the ambiguities in normal vectors on edges. Finally, to improve the optimization of 3D Gaussians, we introduce a gradient-constrained balance-load loss that accounts for the complexity of the corresponding scenes, effectively minimizing the thread waiting time in the pixel-parallel rendering stage as well as the reconstruction loss. Extensive experiments on various urban datasets demonstrate the superior performance of our PG-SAG on building surface reconstruction, compared to several state-of-the-art 3DGS-based methods. Project page: https://github.com/TFWang-9527/PG-SAG.
Comment: Matches criterion 4 with applications of vision foundation models in urban building reconstruction. Relevance: 5 Novelty: 6
5. LogicAD: Explainable Anomaly Detection via VLM-based Text Feature Extraction
ArXiv ID: 2501.01767 Authors: Er Jin, Qihui Feng, Yongli Mou, Stefan Decker, Gerhard Lakemeyer, Oliver Simons, Johannes Stegmaier
Abstract: Logical image understanding involves interpreting and reasoning about the relationships and consistency within an image’s visual content. This capability is essential in applications such as industrial inspection, where logical anomaly detection is critical for maintaining high-quality standards and minimizing costly recalls. Previous research in anomaly detection (AD) has relied on prior knowledge for designing algorithms, which often requires extensive manual annotations, significant computing power, and large amounts of data for training. Autoregressive, multimodal Vision Language Models (AVLMs) offer a promising alternative due to their exceptional performance in visual reasoning across various domains. Despite this, their application to logical AD remains unexplored. In this work, we investigate using AVLMs for logical AD and demonstrate that they are well-suited to the task. Combining AVLMs with format embedding and a logic reasoner, we achieve SOTA performance on the public MVTec LOCO AD benchmark, with an AUROC of 86.0% and F1-max of 83.7%, along with explanations of anomalies. This outperforms the existing SOTA method by a large margin.
Comment: Matches criterion 2 with the use of AVLMs for anomaly detection. Relevance: 5 Novelty: 6
6. Cloth-Splatting: 3D Cloth State Estimation from RGB Supervision
ArXiv ID: 2501.01715 Authors: Alberta Longhini, Marcel Büsching, Bardienus P. Duisterhof, Jens Lundell, Jeffrey Ichnowski, Mårten Björkman, Danica Kragic
Abstract: We introduce Cloth-Splatting, a method for estimating 3D states of cloth from RGB images through a prediction-update framework. Cloth-Splatting leverages an action-conditioned dynamics model for predicting future states and uses 3D Gaussian Splatting to update the predicted states. Our key insight is that coupling a 3D mesh-based representation with Gaussian Splatting allows us to define a differentiable map between the cloth state space and the image space. This enables the use of gradient-based optimization techniques to refine inaccurate state estimates using only RGB supervision. Our experiments demonstrate that Cloth-Splatting not only improves state estimation accuracy over current baselines but also reduces convergence time.
Comment: Matches criterion 1 with new methods for 3D cloth state estimation using RGB images. Relevance: 5 Novelty: 6
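The prediction-update loop described above can be illustrated with a generic gradient-based refinement step: a predicted state is adjusted so that a differentiable renderer reproduces the observed RGB frame. In this sketch, `render_fn` is a stand-in for a differentiable (e.g., Gaussian-splat) renderer and is an assumption, not code from the paper.

```python
import torch

def refine_state(pred_state: torch.Tensor,
                 target_rgb: torch.Tensor,
                 render_fn,               # differentiable renderer: state -> image (assumed)
                 steps: int = 50,
                 lr: float = 1e-2) -> torch.Tensor:
    """Refine a predicted state by minimising a photometric loss through a renderer."""
    state = pred_state.clone().requires_grad_(True)
    opt = torch.optim.Adam([state], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        rendered = render_fn(state)                          # render the current estimate
        loss = torch.nn.functional.l1_loss(rendered, target_rgb)
        loss.backward()                                      # gradients flow through the renderer
        opt.step()
    return state.detach()
```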
7. AgentRefine: Enhancing Agent Generalization through Refinement Tuning
ArXiv ID: 2501.01702 Authors: Dayuan Fu, Keqing He, Yejie Wang, Wentao Hong, Zhuoma Gongque, Weihao Zeng, Wei Wang, Jingang Wang, Xunliang Cai, Weiran Xu
Abstract: Large Language Model (LLM) based agents have proved their ability to perform complex tasks like humans. However, there is still a large gap between open-sourced LLMs and commercial models like the GPT series. In this paper, we focus on improving the agent generalization capabilities of LLMs via instruction tuning. We first observe that the existing agent training corpus exhibits satisfactory results on held-in evaluation sets but fails to generalize to held-out sets. These agent-tuning works suffer from severe formatting errors and frequently get stuck repeating the same mistake. We analyze that the poor generalization ability comes from overfitting to several manual agent environments and a lack of adaptation to new situations. The models struggle with wrong action steps and cannot learn from experience, merely memorizing existing observation-action relations. Inspired by this insight, we propose a novel AgentRefine framework for agent-tuning. The core idea is to enable the model to learn to correct its mistakes via observation in the trajectory. Specifically, we propose an agent synthesis framework to encompass a diverse array of environments and tasks and prompt a strong LLM to refine its erroneous actions according to the environment feedback. AgentRefine significantly outperforms state-of-the-art agent-tuning work in terms of generalization ability on diverse agent tasks. It also has better robustness facing perturbation and can generate diversified thought in inference. Our findings establish the correlation between agent generalization and self-refinement and provide a new paradigm for future research.
Comment: Matches criterion 3 with a focus on new methods for agent generalization and self-refinement. Relevance: 5 Novelty: 6
8. ANTHROPOS-V: benchmarking the novel task of Crowd Volume Estimation
ArXiv ID: 2501.01877 Authors: Luca Collorone, Stefano D’Arrigo, Massimiliano Pappa, Guido Maria D’Amely di Melendugno, Giovanni Ficarra, Fabio Galasso
Abstract: We introduce the novel task of Crowd Volume Estimation (CVE), defined as the process of estimating the collective body volume of crowds using only RGB images. Besides event management and public safety, CVE can be instrumental in approximating body weight, unlocking weight-sensitive applications such as infrastructure stress assessment, and ensuring even weight balance. We propose the first benchmark for CVE, comprising ANTHROPOS-V, a synthetic photorealistic video dataset featuring crowds in diverse urban environments. Its annotations include each person’s volume, SMPL shape parameters, and keypoints. Also, we explore metrics pertinent to CVE, define baseline models adapted from the Human Mesh Recovery and Crowd Counting domains, and propose a CVE-specific methodology that surpasses the baselines. Although synthetic, the weights and heights of individuals are aligned with the real-world population distribution across genders, and they transfer to the downstream task of CVE from real images. Benchmark and code are available at github.com/colloroneluca/Crowd-Volume-Estimation.
Comment: Matches criterion 3. Relevance: 5 Novelty: 6
9. MoColl: Agent-Based Specific and General Model Collaboration for Image Captioning
ArXiv ID: 2501.01834 Authors: Pu Yang, Bin Dong
Abstract: Image captioning is a critical task at the intersection of computer vision and natural language processing, with wide-ranging applications across various domains. For complex tasks such as diagnostic report generation, deep learning models require not only domain-specific image-caption datasets but also the incorporation of relevant general knowledge to provide contextual accuracy. Existing approaches exhibit inherent limitations: specialized models excel in capturing domain-specific details but lack generalization, while vision-language models (VLMs) built on large language models (LLMs) leverage general knowledge but struggle with domain-specific adaptation. To address these limitations, this paper proposes a novel agent-enhanced model collaboration framework, which we call MoColl, designed to effectively integrate domain-specific and general knowledge. Specifically, our approach is to decompose complex image captioning tasks into a series of interconnected question-answer subtasks. A trainable visual question answering (VQA) model is employed as a specialized tool to focus on domain-specific visual analysis, answering task-specific questions based on image content. Concurrently, an LLM-based agent with general knowledge formulates these questions and synthesizes the resulting question-answer pairs into coherent captions. Beyond its role in leveraging the VQA model, the agent further guides its training to enhance its domain-specific capabilities. Experimental results on radiology report generation validate the effectiveness of the proposed framework, demonstrating significant improvements in the quality of generated reports.
Comment: Matches criterion 2. Relevance: 5 Novelty: 6
10. MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders
ArXiv ID: 2501.01709 Authors: Jiajun Cao, Yuan Zhang, Tao Huang, Ming Lu, Qizhe Zhang, Ruichuan An, Ningning MA, Shanghang Zhang
Abstract: Visual encoders are fundamental components in vision-language models (VLMs), each showcasing unique strengths derived from various pre-trained visual foundation models. To leverage the various capabilities of these encoders, recent studies incorporate multiple encoders within a single VLM, leading to a considerable increase in computational cost. In this paper, we present Mixture-of-Visual-Encoder Knowledge Distillation (MoVE-KD), a novel framework that distills the unique proficiencies of multiple vision encoders into a single, efficient encoder model. Specifically, to mitigate conflicts and retain the unique characteristics of each teacher encoder, we employ low-rank adaptation (LoRA) and mixture-of-experts (MoEs) to selectively activate specialized knowledge based on input features, enhancing both adaptability and efficiency. To regularize the KD process and enhance performance, we propose an attention-based distillation strategy that adaptively weighs the different visual encoders and emphasizes valuable visual tokens, reducing the burden of replicating comprehensive but distinct features from multiple teachers. Comprehensive experiments on popular VLMs, such as LLaVA and LLaVA-NeXT, validate the effectiveness of our method. The code will be released.
Comment: Matches criterion 2 as it discusses a novel framework for vision-language models (VLMs) with a focus on knowledge distillation. Relevance: 5 Novelty: 6
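As a loose sketch of distilling several visual encoders into one student with adaptive weighting, the snippet below weighs per-token feature losses against K teachers using softmax weights. It is an assumed, simplified stand-in for the attention-based strategy described in the abstract, not the authors' code; all names and tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def weighted_multiteacher_kd(student_feats: torch.Tensor,        # (B, N, D) student visual tokens
                             teacher_feats: list,                # K tensors, each (B, N, D)
                             weight_logits: torch.Tensor         # (B, N, K) unnormalised teacher weights
                             ) -> torch.Tensor:
    """Distillation loss where each visual token adaptively weighs K teacher encoders."""
    w = torch.softmax(weight_logits, dim=-1)                     # (B, N, K), sums to 1 over teachers
    loss = student_feats.new_zeros(())
    for k, t_feat in enumerate(teacher_feats):
        # Per-token squared error against teacher k, averaged over the feature dim.
        per_token = F.mse_loss(student_feats, t_feat, reduction="none").mean(-1)  # (B, N)
        loss = loss + (w[..., k] * per_token).mean()
    return loss
```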
11. CrossView-GS: Cross-view Gaussian Splatting For Large-scale Scene Reconstruction
ArXiv ID: 2501.01695 Authors: Chenhao Zhang, Yuanping Cao, Lei Zhang
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a prominent method for scene representation and reconstruction, leveraging densely distributed Gaussian primitives to enable real-time rendering of high-resolution images. While existing 3DGS methods perform well in scenes with minor view variation, large view changes in cross-view scenes pose optimization challenges for these methods. To address these issues, we propose a novel cross-view Gaussian Splatting method for large-scale scene reconstruction, based on dual-branch fusion. Our method independently reconstructs models from aerial and ground views as two independent branches to establish baselines of the Gaussian distribution, providing reliable priors for cross-view reconstruction during both initialization and densification. Specifically, a gradient-aware regularization strategy is introduced to mitigate smoothing issues caused by significant view disparities. Additionally, a unique Gaussian supplementation strategy is utilized to incorporate complementary information from the two branches into the cross-view model. Extensive experiments on benchmark datasets demonstrate that our method achieves superior performance in novel view synthesis compared to state-of-the-art methods.
Comment: Matches criterion 4 for vision foundation models related to large-scale scene reconstruction. Relevance: 5 Novelty: 6
12. A Minimal Subset Approach for Efficient and Scalable Loop Closure
ArXiv ID: 2501.01791 Authors: Nikolaos Stathoulopoulos, Christoforos Kanellakis, George Nikolakopoulos
Abstract: Loop closure detection in large-scale and long-term missions can be computationally demanding due to the need to identify, verify, and process numerous candidate pairs to establish edge connections for the pose graph optimization. Keyframe sampling mitigates this by reducing the number of frames stored and processed in the back-end system. In this article, we address the gap in optimized keyframe sampling for the combined problem of pose graph optimization and loop closure detection. Our Minimal Subset Approach (MSA) employs an optimization strategy with two key factors, redundancy minimization and information preservation, within a sliding window framework to efficiently reduce redundant keyframes, while preserving essential information. This method delivers comparable performance to baseline approaches, while enhancing scalability and reducing computational overhead. Finally, we evaluate MSA on relevant publicly available datasets, showcasing that it consistently performs across a wide range of environments, without requiring any manual parameter tuning.
Comment: Matches criterion 3 for new methods in embodied AI focusing on loop closure detection. Relevance: 5 Novelty: 6
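A minimal sketch of the redundancy-minimization idea: within a window, keep a keyframe only if its descriptor is sufficiently dissimilar from the descriptors already kept. The greedy rule and the cosine-similarity threshold here are illustrative assumptions, not the paper's exact optimization.

```python
import numpy as np

def minimal_keyframe_subset(descriptors: np.ndarray,   # (N, D) one descriptor per keyframe
                            sim_thresh: float = 0.9) -> list:
    """Greedy redundancy-minimising selection of keyframes."""
    desc = descriptors / (np.linalg.norm(descriptors, axis=1, keepdims=True) + 1e-9)
    kept = []
    for i in range(desc.shape[0]):
        if not kept:
            kept.append(i)
            continue
        sims = desc[kept] @ desc[i]          # cosine similarity to already-kept frames
        if sims.max() < sim_thresh:          # sufficiently novel -> information preserved
            kept.append(i)
    return kept

# Usage: subset = minimal_keyframe_subset(frame_descriptors); retain only those indices.
```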
13. IGAF: Incremental Guided Attention Fusion for Depth Super-Resolution
ArXiv ID: 2501.01723 Authors: Athanasios Tragakis, Chaitanya Kaul, Kevin J. Mitchell, Hang Dai, Roderick Murray-Smith, Daniele Faccio
Abstract: Accurate depth estimation is crucial for many fields, including robotics, navigation, and medical imaging. However, conventional depth sensors often produce low-resolution (LR) depth maps, making detailed scene perception challenging. To address this, enhancing LR depth maps to high-resolution (HR) ones has become essential, guided by HR-structured inputs like RGB or grayscale images. We propose a novel sensor fusion methodology for guided depth super-resolution (GDSR), a technique that combines LR depth maps with HR images to estimate detailed HR depth maps. Our key contribution is the Incremental guided attention fusion (IGAF) module, which effectively learns to fuse features from RGB images and LR depth maps, producing accurate HR depth maps. Using IGAF, we build a robust super-resolution model and evaluate it on multiple benchmark datasets. Our model achieves state-of-the-art results compared to all baseline models on the NYU v2 dataset for ×4, ×8, and ×16 upsampling. It also outperforms all baselines in a zero-shot setting on the Middlebury, Lu, and RGB-D-D datasets. Code, environments, and models are available on GitHub.
Comment: Matches criterion 1 for new methodological improvements in spatial understanding through depth super-resolution. Relevance: 5 Novelty: 6
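The guided-fusion idea, RGB features steering where depth features are refined, can be sketched as a small attention-gated module. This generic layer is an assumption for illustration only, not the paper's IGAF block.

```python
import torch
import torch.nn as nn

class GuidedAttentionFusion(nn.Module):
    """Fuse low-resolution depth features with RGB guidance features:
    the combined features predict per-pixel attention that gates the RGB detail."""
    def __init__(self, channels: int):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, depth_feat: torch.Tensor, rgb_feat: torch.Tensor) -> torch.Tensor:
        a = self.attn(torch.cat([depth_feat, rgb_feat], dim=1))   # guidance attention map
        guided = rgb_feat * a + depth_feat                        # inject gated RGB detail
        return self.merge(torch.cat([guided, depth_feat], dim=1))
```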
14. VideoLifter: Lifting Videos to 3D with Fast Hierarchical Stereo Alignment
ArXiv ID: 2501.01949 Authors: Wenyan Cong, Kevin Wang, Jiahui Lei, Colton Stearns, Yuanhao Cai, Dilin Wang, Rakesh Ranjan, Matt Feiszli, Leonidas Guibas, Zhangyang Wang, Weiyao Wang, Zhiwen Fan
Abstract: Efficiently reconstructing accurate 3D models from monocular video is a key challenge in computer vision, critical for advancing applications in virtual reality, robotics, and scene understanding. Existing approaches typically require pre-computed camera parameters and frame-by-frame reconstruction pipelines, which are prone to error accumulation and entail significant computational overhead. To address these limitations, we introduce VideoLifter, a novel framework that leverages geometric priors from a learnable model to incrementally optimize a globally sparse to dense 3D representation directly from video sequences. VideoLifter segments the video sequence into local windows, where it matches and registers frames, constructs consistent fragments, and aligns them hierarchically to produce a unified 3D model. By tracking and propagating sparse point correspondences across frames and fragments, VideoLifter incrementally refines camera poses and 3D structure, minimizing reprojection error for improved accuracy and robustness. This approach significantly accelerates the reconstruction process, reducing training time by over 82% while surpassing current state-of-the-art methods in visual fidelity and computational efficiency.
Comment: Matches criterion 1 as it introduces a novel framework for 3D reconstruction from video, which is related to spatial understanding. Relevance: 5 Novelty: 6
15. Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models
ArXiv ID: 2501.01720 Authors: Guosheng Zhang, Keyao Wang, Haixiao Yue, Ajian Liu, Gang Zhang, Kun Yao, Errui Ding, Jingdong Wang
Abstract: Face Anti-Spoofing (FAS) is essential for ensuring the security and reliability of facial recognition systems. Most existing FAS methods are formulated as binary classification tasks, providing confidence scores without interpretation. They exhibit limited generalization in out-of-domain scenarios, such as new environments or unseen spoofing types. In this work, we introduce a multimodal large language model (MLLM) framework for FAS, termed Interpretable Face Anti-Spoofing (I-FAS), which transforms the FAS task into an interpretable visual question answering (VQA) paradigm. Specifically, we propose a Spoof-aware Captioning and Filtering (SCF) strategy to generate high-quality captions for FAS images, enriching the model’s supervision with natural language interpretations. To mitigate the impact of noisy captions during training, we develop a Lopsided Language Model (L-LM) loss function that separates loss calculations for judgment and interpretation, prioritizing the optimization of the former. Furthermore, to enhance the model’s perception of global visual features, we design a Globally Aware Connector (GAC) to align multi-level visual representations with the language model. Extensive experiments on standard and newly devised One to Eleven cross-domain benchmarks, comprising 12 public datasets, demonstrate that our method significantly outperforms state-of-the-art methods.
Comment: Matches criterion 2. Relevance: 5 Novelty: 6
16. VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
ArXiv ID: 2501.01957 Authors: Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Yangze Li, Zuwei Long, Heting Gao, Ke Li, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He
Abstract: Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a significant challenge due to the fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities, enabling near real-time vision and speech interaction.
Comment: Matches criterion 2. Relevance: 5 Novelty: 6
17. Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
ArXiv ID: 2501.01904 Authors: Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen
Abstract: Recently, slow-thinking reasoning systems, built upon large language models (LLMs), have garnered widespread attention by scaling the thinking time during inference. There is also growing interest in adapting this capability to multimodal large language models (MLLMs). Given that MLLMs handle more complex data semantics across different modalities, it is intuitively more challenging to implement multimodal slow-thinking systems. To address this issue, in this paper, we explore a straightforward approach by fine-tuning a capable MLLM with a small amount of textual long-form thought data, resulting in a multimodal slow-thinking system, Virgo (Visual reasoning with long thought). We find that these long-form reasoning processes, expressed in natural language, can be effectively transferred to MLLMs. Moreover, it seems that such textual reasoning data can be even more effective than visual reasoning data in eliciting the slow-thinking capacities of MLLMs. While this work is preliminary, it demonstrates that slow-thinking capacities are fundamentally associated with the language model component, which can be transferred across modalities or domains. This finding can be leveraged to guide the development of more powerful slow-thinking reasoning systems. We release our resources at https://github.com/RUCAIBox/Virgo.
Comment: Matches criterion 2 as it explores a multimodal large language model (MLLM) with a focus on slow-thinking reasoning systems. Relevance: 5 Novelty: 5
18. Enhancing Large Vision Model in Street Scene Semantic Understanding through Leveraging Posterior Optimization Trajectory
ArXiv ID: 2501.01710 Authors: Wei-Bin Kou, Qingfeng Lin, Ming Tang, Shuai Wang, Rongguang Ye, Guangxu Zhu, Yik-Chung Wu
Abstract: To improve the generalization of the autonomous driving (AD) perception model, vehicles need to update the model over time based on the continuously collected data. As time progresses, the amount of data fitted by the AD model expands, which helps to improve the AD model generalization substantially. However, such ever-expanding data is a double-edged sword for the AD model. Specifically, as the fitted data volume grows to exceed the AD model’s fitting capacities, the AD model is prone to under-fitting. To address this issue, we propose to use pretrained Large Vision Models (LVMs) as the backbone coupled with a downstream perception head to understand AD semantic information. This design can not only surmount the aforementioned under-fitting problem due to LVMs’ powerful fitting capabilities, but also enhance the perception generalization thanks to LVMs’ vast and diverse training data. On the other hand, to mitigate vehicles’ computational burden of training the perception head while running the LVM backbone, we introduce a Posterior Optimization Trajectory (POT)-Guided optimization scheme (POTGui) to accelerate the convergence. Concretely, we propose a POT Generator (POTGen) to generate the posterior (future) optimization direction in advance to guide the current optimization iteration, through which the model can generally converge within 10 epochs. Extensive experiments demonstrate that the proposed method improves the performance by over 66.48% and converges over 6 times faster, compared to the existing state-of-the-art approach.
Comment: Matches criterion 4 as it discusses enhancing large vision models, which is related to vision foundation models. Relevance: 5 Novelty: 5
19. IAM: Enhancing RGB-D Instance Segmentation with New Benchmarks
ArXiv ID: 2501.01685 Authors: Aecheon Jung, Soyun Choi, Junhong Min, Sungeun Hong
Abstract: Image segmentation is a vital task for providing human assistance and enhancing autonomy in our daily lives. In particular, RGB-D segmentation, which leverages both visual and depth cues, has attracted increasing attention as it promises richer scene understanding than RGB-only methods. However, most existing efforts have primarily focused on semantic segmentation and thus leave a critical gap. There is a relative scarcity of instance-level RGB-D segmentation datasets, which restricts current methods to broad category distinctions rather than fully capturing the fine-grained details required for recognizing individual objects. To bridge this gap, we introduce three RGB-D instance segmentation benchmarks, distinguished at the instance level. These datasets are versatile, supporting a wide range of applications from indoor navigation to robotic manipulation. In addition, we present an extensive evaluation of various baseline models on these benchmarks. This comprehensive analysis identifies both their strengths and shortcomings, guiding future work toward more robust, generalizable solutions. Finally, we propose a simple yet effective method for RGB-D data integration. Extensive evaluations affirm the effectiveness of our approach, offering a robust framework for advancing toward more nuanced scene understanding.
Comment: Matches criterion 3 as it introduces new RGB-D instance segmentation benchmarks, which is related to embodied AI and new benchmarks. Relevance: 5 Novelty: 5
20. SAFER: Sharpness Aware layer-selective Finetuning for Enhanced Robustness in vision transformers
ArXiv ID: 2501.01529 Authors: Bhavna Gopal, Huanrui Yang, Mark Horton, Yiran Chen
Abstract: Vision transformers (ViTs) have become essential backbones in advanced computer vision applications and multi-modal foundation models. Despite their strengths, ViTs remain vulnerable to adversarial perturbations, comparable to or even exceeding the vulnerability of convolutional neural networks (CNNs). Furthermore, the large parameter count and complex architecture of ViTs make them particularly prone to adversarial overfitting, often compromising both clean and adversarial accuracy. This paper mitigates adversarial overfitting in ViTs through a novel, layer-selective fine-tuning approach: SAFER. Instead of optimizing the entire model, we identify and selectively fine-tune a small subset of layers most susceptible to overfitting, applying sharpness-aware minimization to these layers while freezing the rest of the model. Our method consistently enhances both clean and adversarial accuracy over baseline approaches. Typical improvements are around 5%, with some cases achieving gains as high as 20% across various ViT architectures and datasets.
Comment: The paper presents a novel method for improving robustness in vision transformers, which relates to vision foundation models. Relevance: 4 Novelty: 6
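The core recipe, sharpness-aware minimization applied only to a chosen subset of layers, can be sketched as follows. The two-pass SAM update is the standard formulation; the layer-selection criterion itself is not shown, and `loss_fn`, the optimizer setup, and parameter freezing are assumptions, not the paper's implementation.

```python
import torch

def sam_step_on_selected(model, loss_fn, batch, selected: set,
                         base_opt: torch.optim.Optimizer, rho: float = 0.05):
    """One sharpness-aware update applied only to parameters named in `selected`.
    All other parameters are assumed frozen (requires_grad=False) and untouched."""
    params = [p for n, p in model.named_parameters() if n in selected and p.requires_grad]

    # First pass: gradients at the current point.
    loss_fn(model, batch).backward()
    grad_norm = torch.norm(torch.stack([p.grad.norm() for p in params])) + 1e-12

    # Ascend to the nearby worst-case point: e = rho * g / ||g||.
    eps = []
    with torch.no_grad():
        for p in params:
            e = rho * p.grad / grad_norm
            p.add_(e)
            eps.append(e)
    base_opt.zero_grad()

    # Second pass: gradients at the perturbed point, then restore weights and step.
    loss_fn(model, batch).backward()
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    base_opt.step()
    base_opt.zero_grad()
```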
21. BLAST: A Stealthy Backdoor Leverage Attack against Cooperative Multi-Agent Deep Reinforcement Learning based Systems
ArXiv ID: 2501.01593 Authors: Yinbo Yu, Saihao Yan, Xueyu Yin, Jing Fang, Jiajia Liu
Abstract: Recent studies have shown that cooperative multi-agent deep reinforcement learning (c-MADRL) is under the threat of backdoor attacks. Once a backdoor trigger is observed, the backdoored agent will perform malicious actions leading to failures or malicious goals. However, existing backdoor attacks suffer from several issues, e.g., instant trigger patterns lack stealthiness, the backdoor is trained or activated by an additional network, or all agents are backdoored. To this end, in this paper, we propose a novel backdoor leverage attack against c-MADRL, BLAST, which attacks the entire multi-agent team by embedding the backdoor only in a single agent. Firstly, we introduce adversary spatiotemporal behavior patterns as the backdoor trigger rather than manually injected fixed visual patterns or instant status, and control the period over which malicious actions are performed. This method can guarantee the stealthiness and practicality of BLAST. Secondly, we hack the original reward function of the backdoor agent via unilateral guidance to inject BLAST, so as to achieve the leverage attack effect that can pry open the entire multi-agent system via a single backdoor agent. We evaluate our BLAST against 3 classic c-MADRL algorithms (VDN, QMIX, and MAPPO) in 2 popular c-MADRL environments (SMAC and Pursuit), and 2 existing defense mechanisms. The experimental results demonstrate that BLAST can achieve a high attack success rate while maintaining a low clean performance variance rate.
Comment: The paper discusses a novel backdoor attack in multi-agent reinforcement learning, which could be relevant to embodied AI and new methods. Relevance: 3 Novelty: 6
22. Robust Self-Paced Hashing for Cross-Modal Retrieval with Noisy Labels
ArXiv ID: 2501.01699 Authors: Ruitao Pu, Yuan Sun, Yang Qin, Zhenwen Ren, Xiaomin Song, Huiming Zheng, Dezhong Peng
Abstract: Cross-modal hashing (CMH) has appeared as a popular technique for cross-modal retrieval due to its low storage cost and high computational efficiency in large-scale data. Most existing methods implicitly assume that multi-modal data is correctly labeled, which is expensive and even unattainable due to the inevitable imperfect annotations (i.e., noisy labels) in real-world scenarios. Inspired by human cognitive learning, a few methods introduce self-paced learning (SPL) to gradually train the model from easy to hard samples, which is often used to mitigate the effects of feature noise or outliers. However, how to utilize SPL to alleviate the misleading effect of noisy labels on the hash model remains a largely unexplored problem. To tackle this problem, we propose a new cognitive cross-modal retrieval method called Robust Self-paced Hashing with Noisy Labels (RSHNL), which can mimic the human cognitive process to identify the noise while embracing robustness against noisy labels. Specifically, we first propose a contrastive hashing learning (CHL) scheme to improve multi-modal consistency, thereby reducing the inherent semantic gap. Afterward, we propose center aggregation learning (CAL) to mitigate the intra-class variations. Finally, we propose Noise-tolerance Self-paced Hashing (NSH) that dynamically estimates the learning difficulty for each instance and distinguishes noisy labels through the difficulty level. For all estimated clean pairs, we further adopt a self-paced regularizer to gradually learn hash codes from easy to hard. Extensive experiments demonstrate that the proposed RSHNL performs remarkably well over the state-of-the-art CMH methods.
Comment: The paper introduces a new method for cross-modal retrieval, which is related to multi-modal learning. Relevance: 3 Novelty: 6
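Self-paced learning of the kind referenced above typically reweights samples by their current loss so that training proceeds from easy to hard and high-loss (potentially mislabeled) samples are down-weighted. Below is one common soft-weighting scheme as a sketch; the paper's noise-tolerant formulation may differ, and the parameter names are illustrative.

```python
import torch

def self_paced_weights(losses: torch.Tensor, lam: float, gamma: float = 1.0) -> torch.Tensor:
    """Weight 1 for samples with loss <= lam, linearly decaying to 0 at lam + gamma.
    Increasing `lam` over training gradually admits harder samples."""
    w = torch.clamp((lam + gamma - losses) / gamma, min=0.0, max=1.0)
    return w.detach()   # weights are treated as constants when reweighting the loss

# Usage: w = self_paced_weights(per_sample_losses, lam)
#        total_loss = (w * per_sample_losses).sum() / w.sum().clamp(min=1.0)
```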
23. MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation
ArXiv ID: 2501.01808 Authors: Huaize Liu, Wenzhang Sun, Donglin Di, Shibo Sun, Jiahui Yang, Changqing Zou, Hujun Bao
Abstract: The generation of talking avatars has achieved significant advancements in precise audio synchronization. However, crafting lifelike talking head videos requires capturing a broad spectrum of emotions and subtle facial expressions. Current methods face fundamental challenges: a) the absence of frameworks for modeling single basic emotional expressions, which restricts the generation of complex emotions such as compound emotions; b) the lack of comprehensive datasets rich in human emotional expressions, which limits the potential of models. To address these challenges, we propose the following innovations: 1) the Mixture of Emotion Experts (MoEE) model, which decouples six fundamental emotions to enable the precise synthesis of both singular and compound emotional states; 2) the DH-FaceEmoVid-150 dataset, specifically curated to include six prevalent human emotional expressions as well as four types of compound emotions, thereby expanding the training potential of emotion-driven models. Furthermore, to enhance the flexibility of emotion control, we propose an emotion-to-latents module that leverages multimodal inputs, aligning diverse control signals, such as audio, text, and labels, to ensure more varied control inputs as well as the ability to control emotions using audio alone. Through extensive quantitative and qualitative evaluations, we demonstrate that the MoEE framework, in conjunction with the DH-FaceEmoVid-150 dataset, excels in generating complex emotional expressions and nuanced facial details, setting a new benchmark in the field. These datasets will be publicly released.
Comment: Relevance: 3 Novelty: 5
24. Aesthetic Matters in Music Perception for Image Stylization: A Emotion-driven Music-to-Visual Manipulation
ArXiv ID: 2501.01700 Authors: Junjie Xu, Xingjiao Wu, Tanren Yao, Zihao Zhang, Jiayang Bei, Wu Wen, Liang He
Abstract: Emotional information is essential for enhancing human-computer interaction and deepening image understanding. However, while deep learning has advanced image recognition, the intuitive understanding and precise control of emotional expression in images remain challenging. Similarly, music research largely focuses on theoretical aspects, with limited exploration of its emotional dimensions and their integration with visual arts. To address these gaps, we introduce EmoMV, an emotion-driven music-to-visual manipulation method that manipulates images based on musical emotions. EmoMV combines bottom-up processing of music elements, such as pitch and rhythm, with top-down application of these emotions to visual aspects like color and lighting. We evaluate EmoMV using a multi-scale framework that includes image quality metrics, aesthetic assessments, and EEG measurements to capture real-time emotional responses. Our results demonstrate that EmoMV effectively translates music’s emotional content into visually compelling images, advancing multimodal emotional integration and opening new avenues for creative industries and interactive technologies.
Comment: Does not match any specific criteria but is related to multi-modal learning and generative modeling. Relevance: 3 Novelty: 5
25. JoyGen: Audio-Driven 3D Depth-Aware Talking-Face Video Editing
ArXiv ID: 2501.01798 Authors: Qili Wang, Dajiang Wu, Zihang Xu, Junshi Huang, Jun Lv
Abstract: Significant progress has been made in talking-face video generation research; however, precise lip-audio synchronization and high visual quality remain challenging in editing lip shapes based on input audio. This paper introduces JoyGen, a novel two-stage framework for talking-face generation, comprising audio-driven lip motion generation and visual appearance synthesis. In the first stage, a 3D reconstruction model and an audio2motion model predict identity and expression coefficients respectively. Next, by integrating audio features with a facial depth map, we provide comprehensive supervision for precise lip-audio synchronization in facial generation. Additionally, we constructed a Chinese talking-face dataset containing 130 hours of high-quality video. JoyGen is trained on the open-source HDTF dataset and our curated dataset. Experimental results demonstrate superior lip-audio synchronization and visual quality achieved by our method.
Comment: Relevance: 3 Novelty: 5
26. Few-shot Implicit Function Generation via Equivariance
ArXiv ID: 2501.01601 Authors: Suizhi Huang, Xingyi Yang, Hongtao Lu, Xinchao Wang
Abstract: Implicit Neural Representations (INRs) have emerged as a powerful framework for representing continuous signals. However, generating diverse INR weights remains challenging due to limited training data. We introduce Few-shot Implicit Function Generation, a new problem setup that aims to generate diverse yet functionally consistent INR weights from only a few examples. This is challenging because even for the same signal, the optimal INRs can vary significantly depending on their initializations. To tackle this, we propose EquiGen, a framework that can generate new INRs from limited data. The core idea is that functionally similar networks can be transformed into one another through weight permutations, forming an equivariance group. By projecting these weights into an equivariant latent space, we enable diverse generation within these groups, even with few examples. EquiGen implements this through an equivariant encoder trained via contrastive learning and smooth augmentation, an equivariance-guided diffusion process, and controlled perturbations in the equivariant subspace. Experiments on 2D image and 3D shape INR datasets demonstrate that our approach effectively generates diverse INR weights while preserving their functional properties in few-shot scenarios.
Comment: Relevance: 3 Novelty: 5
27. Task-Driven Fixation Network: An Efficient Architecture with Fixation Selection
ArXiv ID: 2501.01548 Authors: Shuguang Wang, Yuanjing Wang
Abstract: This paper presents a novel neural network architecture featuring automatic fixation point selection, designed to efficiently address complex tasks with reduced network size and computational overhead. The proposed model consists of: a low-resolution channel that captures low-resolution global features from input images; a high-resolution channel that sequentially extracts localized high-resolution features; and a hybrid encoding module that integrates the features from both channels. A defining characteristic of the hybrid encoding module is the inclusion of a fixation point generator, which dynamically produces fixation points, enabling the high-resolution channel to focus on regions of interest. The fixation points are generated in a task-driven manner, enabling the automatic selection of regions of interest. This approach avoids exhaustive high-resolution analysis of the entire image, maintaining task performance and computational efficiency.
Comment: Relevance: 3 Novelty: 5
28. KeyNode-Driven Geometry Coding for Real-World Scanned Human Dynamic Mesh Compression
ArXiv ID: 2501.01717 Authors: Huong Hoang, Truong Nguyen, Pamela Cosman
Abstract: The compression of real-world scanned 3D human dynamic meshes is an emerging research area, driven by applications such as telepresence, virtual reality, and 3D digital streaming. Unlike synthesized dynamic meshes with fixed topology, scanned dynamic meshes often not only have varying topology across frames but also scan defects such as holes and outliers, increasing the complexity of prediction and compression. Additionally, human meshes often combine rigid and non-rigid motions, making accurate prediction and encoding significantly more difficult compared to objects that exhibit purely rigid motion. To address these challenges, we propose a compression method designed for real-world scanned human dynamic meshes, leveraging embedded key nodes. The temporal motion of each vertex is formulated as a distance-weighted combination of transformations from neighboring key nodes, requiring the transmission of solely the key nodes’ transformations. To enhance the quality of the KeyNode-driven prediction, we introduce an octree-based residual coding scheme and a Dual-direction prediction mode, which uses I-frames from both directions. Extensive experiments demonstrate that our method achieves significant improvements over the state-of-the-art, with an average bitrate saving of 24.51% across the evaluated sequences, particularly excelling at low bitrates.
Comment: Relevance: 3 Novelty: 5
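The abstract's core prediction model, each vertex moving by a distance-weighted blend of nearby key-node transformations, resembles embedded deformation / linear blend skinning. Below is a small NumPy sketch of that general idea; the nearest-node count, inverse-distance weights, and function name are assumptions for illustration, not the paper's coder.

```python
import numpy as np

def deform_vertices(vertices: np.ndarray,     # (V, 3) rest positions
                    node_pos: np.ndarray,     # (K, 3) key-node positions
                    node_R: np.ndarray,       # (K, 3, 3) per-node rotations
                    node_t: np.ndarray,       # (K, 3) per-node translations
                    k_nn: int = 4) -> np.ndarray:
    """Move each vertex by a distance-weighted blend of its nearest key nodes' transforms."""
    deformed = np.zeros_like(vertices)
    for i, v in enumerate(vertices):
        d = np.linalg.norm(node_pos - v, axis=1)          # distance to every key node
        nn = np.argsort(d)[:k_nn]                         # nearest key nodes
        w = 1.0 / (d[nn] + 1e-6)
        w = w / w.sum()                                    # normalised inverse-distance weights
        out = np.zeros(3)
        for wj, j in zip(w, nn):
            # Rigidly transform the vertex in node j's local frame, then blend.
            out += wj * (node_R[j] @ (v - node_pos[j]) + node_pos[j] + node_t[j])
        deformed[i] = out
    return deformed
```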
29. ACE: Anti-Editing Concept Erasure in Text-to-Image Models
ArXiv ID: 2501.01633 Authors: Zihao Wang, Yuxiang Wei, Fan Li, Renjing Pei, Hang Xu, Wangmeng Zuo
Abstract: Recent advances in text-to-image diffusion models have significantly facilitated the generation of high-quality images, but have also raised concerns about the illegal creation of harmful content, such as copyrighted images. Existing concept erasure methods achieve superior results in preventing the production of erased concepts from prompts, but typically perform poorly in preventing undesired editing. To address this issue, we propose an Anti-Editing Concept Erasure (ACE) method, which not only erases the target concept during generation but also filters it out during editing. Specifically, we propose to inject the erasure guidance into both the conditional and the unconditional noise prediction, enabling the model to effectively prevent the creation of erased concepts during both editing and generation. Furthermore, a stochastic correction guidance is introduced during training to address the erosion of unrelated concepts. We conducted erasure editing experiments with representative editing methods (i.e., LEDITS++ and MasaCtrl) to erase IP characters, and the results indicate that our ACE effectively filters out target concepts in both types of edits. Additional experiments on erasing explicit concepts and artistic styles further demonstrate that our ACE performs favorably against state-of-the-art methods. Our code will be publicly available at https://github.com/120L020904/ACE.
Comment: Relevance: 3 Novelty: 5
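Injecting an "erasure" direction into guided diffusion sampling is commonly done by adding a negative-guidance term to classifier-free guidance. The sketch below shows only that generic pattern; it is not claimed to be ACE's actual formulation, and the scales and names are illustrative.

```python
import torch

def guided_noise_with_erasure(eps_uncond: torch.Tensor,
                              eps_cond: torch.Tensor,
                              eps_erase: torch.Tensor,   # prediction conditioned on the erased concept
                              guidance: float = 7.5,
                              erase_scale: float = 3.0) -> torch.Tensor:
    """Classifier-free guidance with an extra term steering the sample away
    from the direction associated with the concept to be erased."""
    cfg = eps_uncond + guidance * (eps_cond - eps_uncond)   # standard CFG combination
    erase_dir = eps_erase - eps_uncond                      # direction that generates the concept
    return cfg - erase_scale * erase_dir                    # push away from that direction
```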
30. Multi-modal classification of forest biodiversity potential from 2D orthophotos and 3D airborne laser scanning point clouds
ArXiv ID: 2501.01728 Authors: Simon B. Jensen, Stefan Oehmcke, Andreas Møgelmose, Meysam Madadi, Christian Igel, Sergio Escalera, Thomas B. Moeslund
Abstract: Accurate assessment of forest biodiversity is crucial for ecosystem management and conservation. While traditional field surveys provide high-quality assessments, they are labor-intensive and spatially limited. This study investigates whether deep learning-based fusion of close-range sensing data from 2D orthophotos (12.5 cm resolution) and 3D airborne laser scanning (ALS) point clouds (8 points/m^2) can enhance biodiversity assessment. We introduce the BioVista dataset, comprising 44,378 paired samples of orthophotos and ALS point clouds from temperate forests in Denmark, designed to explore multi-modal fusion approaches for biodiversity potential classification. Using deep neural networks (ResNet for orthophotos and PointVector for ALS point clouds), we investigate each data modality’s ability to assess forest biodiversity potential, achieving mean accuracies of 69.4% and 72.8%, respectively. We explore two fusion approaches: a confidence-based ensemble method and a feature-level concatenation strategy, with the latter achieving a mean accuracy of 75.5%. Our results demonstrate that spectral information from orthophotos and structural information from ALS point clouds effectively complement each other in forest biodiversity assessment.
Comment: Relevance: 3 Novelty: 5
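The feature-level concatenation strategy mentioned above is the simpler of the two fusion approaches to write down. A minimal sketch, assuming each backbone returns a fixed-size embedding; the dimensions and class count are placeholders, not the BioVista configuration.

    import torch
    import torch.nn as nn

    class LateConcatFusion(nn.Module):
        def __init__(self, img_backbone, pc_backbone, img_dim=512, pc_dim=512, n_classes=2):
            super().__init__()
            self.img_backbone = img_backbone  # e.g. a ResNet trunk returning (B, img_dim) features
            self.pc_backbone = pc_backbone    # e.g. a point-cloud encoder returning (B, pc_dim) features
            self.head = nn.Sequential(
                nn.Linear(img_dim + pc_dim, 256), nn.ReLU(), nn.Linear(256, n_classes))

        def forward(self, image, points):
            fused = torch.cat([self.img_backbone(image), self.pc_backbone(points)], dim=1)
            return self.head(fused)

The confidence-based ensemble, the other fusion variant mentioned in the abstract, would presumably combine the two branches' class probabilities weighted by their confidences instead of concatenating features.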
31. From Age Estimation to Age-Invariant Face Recognition: Generalized Age Feature Extraction Using Order-Enhanced Contrastive Learning
ArXiv ID: 2501.01760 Authors: Haoyi Wang, Victor Sanchez, Chang-Tsun Li, Nathan Clarke
Abstract: Generalized age feature extraction is crucial for age-related facial analysis tasks, such as age estimation and age-invariant face recognition (AIFR). Despite the recent successes of models in homogeneous-dataset experiments, their performance drops significantly in cross-dataset evaluations. Most of these models fail to extract generalized age features as they only attempt to map extracted features with training age labels directly without explicitly modeling the natural progression of aging. In this paper, we propose Order-Enhanced Contrastive Learning (OrdCon), which aims to extract generalized age features to minimize the domain gap across different datasets and scenarios. OrdCon aligns the direction vector of two features with either the natural aging direction or its reverse to effectively model the aging process. The method also leverages metric learning, incorporating a novel soft proxy matching loss to ensure that features are positioned around the center of each age cluster with minimum intra-class variance. We demonstrate that our proposed method achieves comparable results to state-of-the-art methods on various benchmark datasets in homogeneous-dataset evaluations for both age estimation and AIFR. In cross-dataset experiments, our method reduces the mean absolute error by about 1.38 on average for the age estimation task and boosts the average accuracy for AIFR by 1.87%.
Comment: The paper focuses on age-invariant face recognition using contrastive learning, which is a specific application of vision models. Relevance: 3 Novelty: 5
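The direction-alignment idea can be made concrete as a cosine loss between the displacement of two face embeddings and a global aging direction. This is a sketch of the general idea under stated assumptions, not the OrdCon loss; aging_dir is a hypothetical unit vector.

    import torch
    import torch.nn.functional as F

    def order_alignment_loss(f_i, f_j, age_i, age_j, aging_dir):
        """f_i, f_j: (B, D) embeddings; age_i, age_j: (B,) ages; aging_dir: (D,) unit vector."""
        delta = F.normalize(f_j - f_i, dim=1)            # direction between the two face features
        sign = torch.sign(age_j - age_i).unsqueeze(1)    # +1 if the second face is older, -1 if younger
        cos = (delta * aging_dir.unsqueeze(0)).sum(dim=1, keepdim=True)
        return (1.0 - sign * cos).mean()                 # zero when displacements follow the aging direction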
32. Adverse Weather Conditions Augmentation of LiDAR Scenes with Latent Diffusion Models
ArXiv ID: 2501.01761 Authors: Andrea Matteazzi, Pascal Colling, Michael Arnold, Dietmar Tutsch
Abstract: LiDAR scenes constitute a fundamental source for several autonomous driving applications. Despite the existence of several datasets, scenes from adverse weather conditions are rarely available. This limits the robustness of downstream machine learning models and constrains the reliability of autonomous driving systems in particular locations and seasons. Collecting feature-diverse scenes under adverse weather conditions is challenging due to seasonal limitations. Generative models are therefore essential, especially for generating adverse weather conditions for specific driving scenarios. In our work, we propose a latent diffusion pipeline composed of an autoencoder and latent diffusion models. Moreover, we leverage clear-condition LiDAR scenes with a postprocessing step to improve the realism of the generated adverse-weather scenes.
Comment: The paper discusses generative modeling for LiDAR scenes, which could be relevant to spatial understanding in autonomous driving. Relevance: 3 Novelty: 5
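The described pipeline reduces to encode, diffuse in latent space, decode. A minimal sketch under the assumption that scenes are handled as range images; encoder, decoder, and latent_diffusion are placeholders rather than the paper's modules.

    def augment_lidar_scene(range_image, encoder, latent_diffusion, decoder, condition="fog"):
        z = encoder(range_image)                    # compress the clear-weather scene into a latent
        z_adverse = latent_diffusion(z, condition)  # sample an adverse-weather latent conditioned on the scene
        return decoder(z_adverse)                   # decode back to a LiDAR scene (postprocessing would follow)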
33. Quantitative Gait Analysis from Single RGB Videos Using a Dual-Input Transformer-Based Network
ArXiv ID: 2501.01689 Authors: Hiep Dinh, Son Le, My Than, Minh Ho, Nicolas Vuillerme, Hieu Pham
Abstract: Gait and movement analysis have become well-established clinical tools for diagnosing health conditions, monitoring disease progression across a wide spectrum of diseases, and implementing and assessing treatment, surgery, and/or rehabilitation interventions. However, quantitative motion assessment remains limited to costly motion capture systems and specialized personnel, restricting its accessibility and broader application. Recent advancements in deep neural networks have enabled quantitative movement analysis using single-camera videos, offering an accessible alternative to conventional motion capture systems. In this paper, we present an efficient approach for clinical gait analysis through a dual-pattern input convolutional Transformer network. The proposed system leverages a dual-input Transformer model to estimate essential gait parameters from single RGB videos captured by a single-view camera. The system demonstrates high accuracy in estimating critical metrics such as the gait deviation index (GDI), knee flexion angle, step length, and walking cadence, validated on a dataset of individuals with movement disorders. Notably, our approach surpasses state-of-the-art methods in various scenarios, using fewer resources and proving highly suitable for clinical application, particularly in resource-constrained environments.
Comment: Relevance: 3 Novelty: 4
34. VidFormer: A novel end-to-end framework fused by 3DCNN and Transformer for Video-based Remote Physiological Measurement
ArXiv ID: 2501.01691 Authors: Jiachen Li, Shisheng Guo, Longzhen Tang, Cuolong Cui, Lingjiang Kong, Xiaobo Yang
Abstract: Remote physiological signal measurement based on facial videos, also known as remote photoplethysmography (rPPG), involves predicting changes in facial vascular blood flow from facial videos. While most deep learning-based methods have achieved good results, they often struggle to balance performance across small and large-scale datasets due to the inherent limitations of convolutional neural networks (CNNs) and Transformers. In this paper, we introduce VidFormer, a novel end-to-end framework that integrates 3-Dimensional Convolutional Neural Network (3DCNN) and Transformer models for rPPG tasks. Initially, we conduct an analysis of the traditional skin reflection model and subsequently introduce an enhanced model for the reconstruction of rPPG signals. Based on this improved model, VidFormer utilizes 3DCNN and Transformer to extract local and global features from input data, respectively. To enhance the spatiotemporal feature extraction capabilities of VidFormer, we incorporate temporal-spatial attention mechanisms tailored for both 3DCNN and Transformer. Additionally, we design a module to facilitate information exchange and fusion between the 3DCNN and Transformer. Our evaluation on five publicly available datasets demonstrates that VidFormer outperforms current state-of-the-art (SOTA) methods. Finally, we discuss the essential roles of each VidFormer module and examine the effects of ethnicity, makeup, and exercise on its performance.
Comment: Does not match any specific criteria but is related to computer vision and machine learning. Relevance: 3 Novelty: 4
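One plausible reading of the two-branch design is a 3D-CNN branch for local features, a Transformer branch for global features, and an attention-based exchange step between them. The sketch below is a generic guess at such a layout, not VidFormer itself; the branch modules, feature dimension, and output head are assumptions.

    import torch
    import torch.nn as nn

    class TwoBranchRPPG(nn.Module):
        def __init__(self, cnn_branch, transformer_branch, dim=128):
            super().__init__()
            self.cnn_branch = cnn_branch                  # assumed: 3D-CNN returning (B, T, dim) local features
            self.transformer_branch = transformer_branch  # assumed: Transformer returning (B, T, dim) global features
            self.exchange = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
            self.head = nn.Linear(2 * dim, 1)             # per-frame rPPG value

        def forward(self, video):
            local_f = self.cnn_branch(video)
            global_f = self.transformer_branch(video)
            mixed, _ = self.exchange(local_f, global_f, global_f)  # local features attend to global ones
            return self.head(torch.cat([mixed, global_f], dim=-1)).squeeze(-1)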
35. UAV-DETR: Efficient End-to-End Object Detection for Unmanned Aerial Vehicle Imagery
ArXiv ID: 2501.01855 Authors: Huaxiang Zhang, Kai Liu, Zhongxue Gan, Guo-Niu Zhu
Abstract: Unmanned aerial vehicle object detection (UAV-OD) has been widely used in various scenarios. However, most existing UAV-OD algorithms rely on manually designed components, which require extensive tuning. End-to-end models that do not depend on such manually designed components are mainly designed for natural images, which are less effective for UAV imagery. To address such challenges, this paper proposes an efficient detection transformer (DETR) framework tailored for UAV imagery, i.e., UAV-DETR. The framework includes a multi-scale feature fusion with frequency enhancement module, which captures both spatial and frequency information at different scales. In addition, a frequency-focused down-sampling module is presented to retain critical spatial details during down-sampling. A semantic alignment and calibration module is developed to align and fuse features from different fusion paths. Experimental results demonstrate the effectiveness and generalization of our approach across various UAV imagery datasets. On the VisDrone dataset, our method improves AP by 3.1% and AP50 by 4.2% over the baseline. Similar enhancements are observed on the UAVVaste dataset. The project page: https://github.com/ValiantDiligent/UAV-DETR
Comment: Does not match any specific criteria but is related to computer vision and machine learning. Relevance: 3 Novelty: 4
36. Detecting and Mitigating Adversarial Attacks on Deep Learning-Based MRI Reconstruction Without Any Retraining
ArXiv ID: 2501.01908 Authors: Mahdi Saberi, Chi Zhang, Mehmet Akcakaya
Abstract: Deep learning (DL) methods, especially those based on physics-driven DL, have become the state-of-the-art for reconstructing sub-sampled magnetic resonance imaging (MRI) data. However, studies have shown that these methods are susceptible to small adversarial input perturbations, or attacks, resulting in major distortions in the output images. Various strategies have been proposed to reduce the effects of these attacks, but they require retraining and may lower reconstruction quality for non-perturbed/clean inputs. In this work, we propose a novel approach for detecting and mitigating adversarial attacks on MRI reconstruction models without any retraining. Our detection strategy is based on the idea of cyclic measurement consistency. The output of the model is mapped to another set of MRI measurements for a different sub-sampling pattern, and this synthesized data is reconstructed with the same model. Intuitively, without an attack, the second reconstruction is expected to be consistent with the first, while with an attack, disruptions are present. Subsequently, this idea is extended to devise a novel objective function, which is minimized within a small ball around the attack input for mitigation. Experimental results show that our method substantially reduces the impact of adversarial perturbations across different datasets, attack types/strengths and PD-DL networks, and qualitatively and quantitatively outperforms conventional mitigation methods that involve retraining.
Comment: This paper does not match any specific criteria closely. Relevance: 3 Novelty: 4
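The cyclic measurement consistency check is easy to state in code. A minimal sketch under assumed operators (recon for the trained PD-DL reconstruction network, forward_op for simulating measurements with a given sub-sampling mask), not the paper's implementation.

    import torch

    def cycle_inconsistency(y, mask_a, mask_b, recon, forward_op):
        """y: measured k-space data; recon(y, mask) reconstructs an image; forward_op(x, mask) simulates measurements."""
        x1 = recon(y, mask_a)                   # reconstruction from the (possibly attacked) input
        y2 = forward_op(x1, mask_b)             # re-measure with a different sub-sampling pattern
        x2 = recon(y2, mask_b)                  # reconstruct the synthesized measurements
        return torch.norm(x2 - x1) / torch.norm(x1)   # large values suggest an adversarial input

    # usage sketch: flag the input if cycle_inconsistency(...) exceeds a threshold tuned on clean data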
37. Dedicated Inference Engine and Binary-Weight Neural Networks for Lightweight Instance Segmentation
ArXiv ID: 2501.01841 Authors: Tse-Wei Chen, Wei Tao, Dongyue Zhao, Kazuhiro Mima, Tadayuki Ito, Kinya Osa, Masami Kato
Abstract: Reducing computational costs is an important issue for development of embedded systems. Binary-weight Neural Networks (BNNs), in which weights are binarized and activations are quantized, are employed to reduce computational costs of various kinds of applications. In this paper, a design methodology of hardware architecture for inference engines is proposed to handle modern BNNs with two operation modes. Multiply-Accumulate (MAC) operations can be simplified by replacing multiply operations with bitwise operations. The proposed method can effectively reduce the gate count of inference engines by removing a part of computational costs from the hardware system. The architecture of MAC operations can calculate the inference results of BNNs efficiently with only 52% of the hardware cost of related works. To show that the inference engine can handle practical applications, two lightweight networks which combine the backbones of SegNeXt and the decoder of SparseInst for instance segmentation are also proposed. The output results of the lightweight networks are computed using only bitwise operations and add operations. The proposed inference engine has lower hardware costs than related works. The experimental results show that the proposed inference engine can handle the proposed instance-segmentation networks and achieves higher accuracy than YOLACT on the “Person” category, while the model size is 77.7× smaller than that of YOLACT.
Comment: This paper does not match any specific criteria closely. Relevance: 3 Novelty: 4
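The replacement of multiplies in a binary-weight MAC can be illustrated in a few lines: with weights restricted to {-1, +1}, each product reduces to a sign-conditioned add or subtract selectable by a single weight bit. This is an illustrative sketch of the standard binary-weight trick, not the proposed hardware design.

    def binary_weight_mac(activations, weight_bits):
        """activations: list of quantized ints; weight_bits: list of 0/1, with 1 encoding +1 and 0 encoding -1."""
        acc = 0
        for a, b in zip(activations, weight_bits):
            acc += a if b else -a   # a sign-conditioned add/subtract replaces the multiply
        return acc

    # e.g. binary_weight_mac([3, -1, 4], [1, 0, 1]) == 3 + 1 + 4 == 8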
38. Semantic Segmentation for Sequential Historical Maps by Learning from Only One Map
ArXiv ID: 2501.01845 Authors: Yunshuang Yuan, Frank Thiemann, Monika Sester
Abstract: Historical maps are valuable resources that capture detailed geographical information from the past. However, these maps are typically available in printed formats, which are not conducive to modern computer-based analyses. Digitizing these maps into a machine-readable format enables efficient computational analysis. In this paper, we propose an automated approach to digitization using deep-learning-based semantic segmentation, which assigns a semantic label to each pixel in scanned historical maps. A key challenge in this process is the lack of ground-truth annotations required for training deep neural networks, as manual labeling is time-consuming and labor-intensive. To address this issue, we introduce a weakly-supervised age-tracing strategy for model fine-tuning. This approach exploits the similarity in appearance and land-use patterns between historical maps from neighboring time periods to guide the training process. Specifically, model predictions for one map are utilized as pseudo-labels for training on maps from adjacent time periods. Experiments conducted on our newly curated Hameln dataset demonstrate that the proposed age-tracing strategy significantly enhances segmentation performance compared to baseline models. In the best-case scenario, the mean Intersection over Union (mIoU) achieved 77.3%, reflecting an improvement of approximately 20% over baseline methods. Additionally, the fine-tuned model achieved an average overall accuracy of 97%, highlighting the effectiveness of our approach for digitizing historical maps.
Comment: Relevance: 3 Novelty: 4
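The age-tracing strategy amounts to a pseudo-label fine-tuning loop across time periods. A minimal sketch under assumed data handling (tile batches, a standard cross-entropy objective), not the authors' pipeline.

    import torch
    import torch.nn.functional as F

    def age_trace_step(model, optimizer, next_period_tiles, num_epochs=5):
        """Fine-tune `model` on tiles of the next time period's map using its own predictions as labels."""
        model.eval()
        with torch.no_grad():
            pseudo = [model(tile).argmax(dim=1) for tile in next_period_tiles]  # per-pixel pseudo-labels
        model.train()
        for _ in range(num_epochs):
            for tile, target in zip(next_period_tiles, pseudo):
                loss = F.cross_entropy(model(tile), target)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return model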
39. iCBIR-Sli: Interpretable Content-Based Image Retrieval with 2D Slice Embeddings
ArXiv ID: 2501.01642 Authors: Shuhei Tomoshige, Hayato Muraki, Kenichi Oishi, Hitoshi Iyatomi
Abstract: Current methods for searching brain MR images rely on text-based approaches, highlighting a significant need for content-based image retrieval (CBIR) systems. Directly applying 3D brain MR images to machine learning models offers the benefit of effectively learning the brain’s structure; however, building the generalized model necessitates a large amount of training data. While models that consider depth direction and utilize continuous 2D slices have demonstrated success in segmentation and classification tasks involving 3D data, concerns remain. Specifically, using general 2D slices may lead to the oversight of pathological features and discontinuities in depth direction information. Furthermore, to the best of the authors’ knowledge, there have been no attempts to develop a practical CBIR system that preserves the entire brain’s structural information. In this study, we propose an interpretable CBIR method for brain MR images, named iCBIR-Sli (Interpretable CBIR with 2D Slice Embedding), which is, to the authors’ knowledge, the first to utilize a series of 2D slices for this purpose. iCBIR-Sli addresses the challenges associated with using 2D slices by effectively aggregating slice information, thereby achieving low-dimensional representations with high completeness, usability, robustness, and interoperability, which are qualities essential for effective CBIR. In retrieval evaluation experiments utilizing five publicly available brain MR datasets (ADNI2/3, OASIS3/4, AIBL) for Alzheimer’s disease and cognitively normal subjects, iCBIR-Sli demonstrated top-1 retrieval performance (macro F1 = 0.859), comparable to existing deep learning models explicitly designed for classification, without the need for an external classifier. Additionally, the method provided high interpretability by clearly identifying the brain regions indicative of the searched-for disease.
Comment: Relevance: 3 Novelty: 4
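The slice-aggregation idea can be sketched as: embed each 2D slice, pool the slice embeddings into one volume descriptor, and retrieve by cosine similarity. The pooling choice (mean) and the tensor shapes below are assumptions, not the iCBIR-Sli aggregation.

    import torch
    import torch.nn.functional as F

    def volume_descriptor(volume, slice_encoder):
        """volume: (S, 1, H, W) stack of 2D slices; slice_encoder maps (S, 1, H, W) -> (S, D)."""
        z = slice_encoder(volume)                  # embed every slice
        return F.normalize(z.mean(dim=0), dim=0)   # aggregate by mean pooling, then L2-normalize

    def retrieve(query_desc, gallery_descs, top_k=1):
        """gallery_descs: (N, D) L2-normalized descriptors of the database volumes."""
        sims = gallery_descs @ query_desc          # cosine similarities
        return sims.topk(top_k).indices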
40. Prism: Mining Task-aware Domains in Non-i.i.d. IMU Data for Flexible User Perception
ArXiv ID: 2501.01598 Authors: Yunzhe Li, Facheng Hu, Hongzi Zhu, Quan Liu, Xiaoke Zhao, Jiangang Shen, Shan Chang, Minyi Guo
Abstract: A wide range of user perception applications leverage inertial measurement unit (IMU) data for online prediction. However, restricted by the non-i.i.d. nature of IMU data collected from mobile devices, most systems work well only in a controlled setting (e.g., for a specific user in particular postures), limiting application scenarios. Achieving uncontrolled online prediction on mobile devices, referred to as the flexible user perception (FUP) problem, is attractive but hard. In this paper, we propose a novel scheme, called Prism, which can obtain high FUP accuracy on mobile devices. The core of Prism is to discover task-aware domains embedded in an IMU dataset and to train a domain-aware model on each identified domain. To this end, we design an expectation-maximization (EM) algorithm to estimate latent domains with respect to the specific downstream perception task. Finally, the best-fit model can be automatically selected for use by comparing the test sample against all identified domains in the feature space. We implement Prism on various mobile devices and conduct extensive experiments. Results demonstrate that Prism can achieve the best FUP performance with low latency.
Comment: Relevance: 3 Novelty: 4
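The model-selection step can be pictured as a nearest-domain rule in feature space. The sketch below assumes each EM-estimated domain is summarized by a centroid, which is an assumption rather than Prism's actual criterion.

    import numpy as np

    def select_model(feature, domain_centroids, domain_models):
        """feature: (D,) test-sample feature; domain_centroids: (K, D); domain_models: K per-domain predictors."""
        dists = np.linalg.norm(domain_centroids - feature, axis=1)
        return domain_models[int(np.argmin(dists))]   # use the model of the closest domain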
Paper selection prompt
- New methodological improvements to spatial understanding, spatial intelligence on embodied agents;
- Papers that show new VLLMs (visual large language models) or MLLMs (multi-modal large language models)
- Embodied AI papers on building new benchmarks (simulator related) or new methods. These papers should focus on novel angles that previous work ignored.
- Vision foundation models and their applications.
In suggesting papers to your friend, remember that he enjoys papers on computer vision and machine learning, and generative modeling in multi-modal learning. Your friend also likes learning about surprising empirical or insightful results in vision-language models or embodied AI, as well as clever statistical tricks.