Personalized Daily ArXiv Papers 01/09/2025
Total relevant papers: 29
Paper selection prompt and criteria at the bottom
Table of contents with paper titles:
- Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs. Authors: Zeyi Huang, Yuyang Ji, Xiaofang Wang, Nikhil Mehta, Tong Xiao, Donghyun Lee, Sigmund Vanvalkenburgh, Shengxin Zha, Bolin Lai, Licheng Yu, Ning Zhang, Yong Jae Lee, Miao Liu
- Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs. Authors: Yikang Zhou, Tao Zhang, Shilin Xu, Shihao Chen, Qianyu Zhou, Yunhai Tong, Shunping Ji, Jiangning Zhang, Xiangtai Li, Lu Qi
- RSAR: Restricted State Angle Resolver and Rotated SAR Benchmark. Authors: Xin Zhang, Xue Yang, Yuxuan Li, Jian Yang, Ming-Ming Cheng, Xiang Li
- EditAR: Unified Conditional Generation with Autoregressive Models. Authors: Jiteng Mu, Nuno Vasconcelos, Xiaolong Wang
- Supervision-free Vision-Language Alignment. Authors: Giorgio Giannone, Ruoteng Li, Qianli Feng, Evgeny Perevodchikov, Rui Chen, Aleix Martinez
- MedCoDi-M: A Multi-Prompt Foundation Model for Multimodal Medical Data Generation. Authors: Daniele Molino, Francesco Di Feola, Eliodoro Faiella, Deborah Fazzini, Domiziana Santucci, Linlin Shen, Valerio Guarrasi, Paolo Soda
- InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection. Authors: Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, Fei Wu
- FatesGS: Fast and Accurate Sparse-View Surface Reconstruction using Gaussian Splatting with Depth-Feature Consistency. Authors: Han Huang, Yulun Wu, Chao Deng, Ge Gao, Ming Gu, Yu-Shen Liu
- Online Gaussian Test-Time Adaptation of Vision-Language Models. Authors: Clément Fuchs, Maxime Zanella, Christophe De Vleeschouwer
- Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts. Authors: Miao Rang, Zhenni Bi, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang
- Benchmarking Large and Small MLLMs. Authors: Xuelu Feng, Yunsheng Li, Dongdong Chen, Mei Gao, Mengchen Liu, Junsong Yuan, Chunming Qiao
- DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests. Authors: Charles Corbière, Simon Roburin, Syrielle Montariol, Antoine Bosselut, Alexandre Alahi
- Hybrid Artificial Intelligence Strategies for Drone Navigation. Authors: Rubén San-Segundo, Lucía Angulo, Manuel Gil-Martín, David Carramiñana, Ana M. Bernardos
- FGU3R: Fine-Grained Fusion via Unified 3D Representation for Multimodal 3D Object Detection. Authors: Guoxin Zhang, Ziying Song, Lin Liu, Zhonghong Ou
- Edit as You See: Image-guided Video Editing via Masked Motion Modeling. Authors: Zhi-Lin Huang, Yixuan Liu, Chujun Qin, Zhongdao Wang, Dong Zhou, Dong Li, Emad Barsoum
- Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation. Authors: Kam Woh Ng, Jing Yang, Jia Wei Sii, Jiankang Deng, Chee Seng Chan, Yi-Zhe Song, Tao Xiang, Xiatian Zhu
- ContextMRI: Enhancing Compressed Sensing MRI through Metadata Conditioning. Authors: Hyungjin Chung, Dohun Lee, Zihui Wu, Byung-Hoon Kim, Katherine L. Bouman, Jong Chul Ye
- Graph-Based Multimodal and Multi-view Alignment for Keystep Recognition. Authors: Julia Lee Romero, Kyle Min, Subarna Tripathi, Morteza Karimzadeh
- NSA: Neuro-symbolic ARC Challenge. Authors: Paweł Batorski, Jannik Brinkmann, Paul Swoboda
- An Efficient Adaptive Compression Method for Human Perception and Machine Vision Tasks. Authors: Lei Liu, Zhenghao Chen, Zhihao Hu, Dong Xu
- TADFormer: Task-Adaptive Dynamic Transformer for Efficient Multi-Task Learning. Authors: Seungmin Baek, Soyul Lee, Hayeon Jo, Hyesong Choi, Dongbo Min
- Exploring Unbiased Deepfake Detection via Token-Level Shuffling and Mixing. Authors: Xinghe Fu, Zhiyuan Yan, Taiping Yao, Shen Chen, Xi Li
- LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition. Authors: Bowen Hao, Dongliang Zhou, Xiaojie Li, Xingyu Zhang, Liang Xie, Jianlong Wu, Erwei Yin
- Generative Dataset Distillation Based on Self-knowledge Distillation. Authors: Longzhen Li, Guang Li, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
- iFADIT: Invertible Face Anonymization via Disentangled Identity Transform. Authors: Lin Yuan, Kai Liang, Xiong Li, Tao Wu, Nannan Wang, Xinbo Gao
- Enhancing Virtual Try-On with Synthetic Pairs and Error-Aware Noise Scheduling. Authors: Nannan Li, Kevin J. Shih, Bryan A. Plummer
- DGQ: Distribution-Aware Group Quantization for Text-to-Image Diffusion Models. Authors: Hyogon Ryu, NaHyeon Park, Hyunjung Shim
- Research on environment perception and behavior prediction of intelligent UAV based on semantic communication. Authors: Kechong Ren, Li Gao, Qi Guan
- Continual Self-supervised Learning Considering Medical Domain Knowledge in Chest CT Images. Authors: Ren Tasai, Guang Li, Ren Togo, Minghui Tang, Takaaki Yoshimura, Hiroyuki Sugimori, Kenji Hirata, Takahiro Ogawa, Kohsuke Kudo, Miki Haseyama
0. Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs
ArXiv ID: 2501.04336 Authors: Zeyi Huang, Yuyang Ji, Xiaofang Wang, Nikhil Mehta, Tong Xiao, Donghyun Lee, Sigmund Vanvalkenburgh, Shengxin Zha, Bolin Lai, Licheng Yu, Ning Zhang, Yong Jae Lee, Miao Liu
Abstract: Long-form video understanding with Large Vision Language Models is challenged by the need to analyze temporally dispersed yet spatially concentrated key moments within limited context windows. In this work, we introduce VideoMindPalace, a new framework inspired by the “Mind Palace”, which organizes critical video moments into a topologically structured semantic graph. VideoMindPalace organizes key information through (i) hand-object tracking and interaction, (ii) clustered activity zones representing specific areas of recurring activities, and (iii) environment layout mapping, allowing natural language parsing by LLMs to provide grounded insights on spatio-temporal and 3D context. In addition, we propose the Video MindPalace Benchmark (VMB), to assess human-like reasoning, including spatial localization, temporal reasoning, and layout-aware sequential understanding. Evaluated on VMB and established video QA datasets, including EgoSchema, NExT-QA, IntentQA, and the Active Memories Benchmark, VideoMindPalace demonstrates notable gains in spatio-temporal coherence and human-aligned reasoning, advancing long-form video analysis capabilities in VLMs.
Comment: Matches criteria 1 and 3 closely with a new framework for spatial understanding and a new benchmark for embodied AI. Relevance: 8 Novelty: 7
1. Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs
ArXiv ID: 2501.04670 Authors: Yikang Zhou, Tao Zhang, Shilin Xu, Shihao Chen, Qianyu Zhou, Yunhai Tong, Shunping Ji, Jiangning Zhang, Xiangtai Li, Lu Qi
Abstract: Recent advancements in multimodal models have shown a strong ability in visual perception, reasoning, and vision-language understanding. However, studies on visual matching ability are missing, even though finding the visual correspondence of objects is essential in vision research. Our research reveals that the matching capabilities of recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings, even for strong current MLLMs such as GPT-4o. In particular, we construct a Multimodal Visual Matching (MMVM) benchmark to fairly benchmark over 30 different MLLMs. The MMVM benchmark is built from 15 open-source datasets and Internet videos with manual annotation. We categorize the data samples of the MMVM benchmark into eight aspects based on the required cues and capabilities to more comprehensively evaluate and analyze current MLLMs. In addition, we have designed an automatic annotation pipeline to generate the MMVM SFT dataset, including 220K visual matching data with reasoning annotation. Finally, we present CoLVA, a novel contrastive MLLM with two novel technical designs: a fine-grained vision expert with object-level contrastive learning and an instruction augmentation strategy. CoLVA achieves 51.06% overall accuracy (OA) on the MMVM benchmark, surpassing GPT-4o and the baseline by 8.41% and 23.58% OA, respectively. The results show the effectiveness of our MMVM SFT dataset and our novel technical designs. Code, benchmark, dataset, and models are available at https://github.com/zhouyiks/CoLVA.
Comment: Matches criterion 2 with an exploration of MLLMs and a new benchmark for visual matching. Relevance: 7 Novelty: 6
2. RSAR: Restricted State Angle Resolver and Rotated SAR Benchmark
ArXiv ID: 2501.04440 Authors: Xin Zhang, Xue Yang, Yuxuan Li, Jian Yang, Ming-Ming Cheng, Xiang Li
Abstract: Rotated object detection has made significant progress in optical remote sensing. However, advancements in the Synthetic Aperture Radar (SAR) field lag behind, primarily due to the absence of a large-scale dataset. Annotating such a dataset is inefficient and costly. A promising solution is to employ a weakly supervised model (e.g., trained with available horizontal boxes only) to generate pseudo-rotated boxes for reference before manual calibration. Unfortunately, existing weakly supervised models exhibit limited accuracy in predicting the object's angle. Previous works attempt to enhance angle prediction by using angle resolvers that decouple angles into cosine and sine encodings. In this work, we first reevaluate these resolvers from a unified perspective of dimension mapping and expose that they share the same shortcoming: these methods overlook the unit cycle constraint inherent in these encodings, easily leading to prediction biases. To address this issue, we propose the Unit Cycle Resolver (UCR), which incorporates a unit circle constraint loss to improve angle prediction accuracy. Our approach can effectively improve the performance of existing state-of-the-art weakly supervised methods and even surpasses fully supervised models on existing optical benchmarks (i.e., the DOTA-v1.0 dataset). With the aid of UCR, we further annotate and introduce RSAR, the largest multi-class rotated SAR object detection dataset to date. Extensive experiments on both RSAR and optical datasets demonstrate that our UCR enhances angle prediction accuracy. Our dataset and code can be found at: https://github.com/zhasion/RSAR.
Comment: Matches criterion 3 with a new benchmark for SAR object detection and a novel angle prediction method. Relevance: 6 Novelty: 6
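For intuition, here is a minimal sketch of a unit-circle constraint on (cos, sin) angle encodings in the spirit of the Unit Cycle Resolver described above. The loss form, weighting, and how it plugs into the detector are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def unit_circle_loss(cos_pred: torch.Tensor, sin_pred: torch.Tensor) -> torch.Tensor:
    """Penalize predicted (cos, sin) angle encodings that drift off the unit circle."""
    return ((cos_pred ** 2 + sin_pred ** 2 - 1.0) ** 2).mean()

def decode_angle(cos_pred: torch.Tensor, sin_pred: torch.Tensor) -> torch.Tensor:
    """Recover the angle in radians from the (possibly unnormalized) encoding."""
    return torch.atan2(sin_pred, cos_pred)
```

In practice such a term would be added to the detector's regression loss with some weight, keeping the two encoding channels consistent with a single angle.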
3. EditAR: Unified Conditional Generation with Autoregressive Models
ArXiv ID: 2501.04699 Authors: Jiteng Mu, Nuno Vasconcelos, Xiaolong Wang
Abstract: Recent progress in controllable image generation and editing is largely driven by diffusion-based methods. Although diffusion models perform exceptionally well in specific tasks with tailored designs, establishing a unified model is still challenging. In contrast, autoregressive models inherently feature a unified tokenized representation, which simplifies the creation of a single foundational model for various tasks. In this work, we propose EditAR, a single unified autoregressive framework for a variety of conditional image generation tasks, e.g., image editing, depth-to-image, edge-to-image, and segmentation-to-image. The model takes both images and instructions as inputs, and predicts the edited image tokens in a vanilla next-token paradigm. To enhance text-to-image alignment, we further propose to distill knowledge from foundation models into the autoregressive modeling process. We evaluate its effectiveness across diverse tasks on established benchmarks, showing competitive performance with various state-of-the-art task-specific methods. Project page: https://jitengmu.github.io/EditAR/
Comment: Matches criterion 4 as it discusses a unified autoregressive model for various image generation tasks, which is related to vision foundation models. Relevance: 5 Novelty: 7
4. Supervision-free Vision-Language Alignment
ArXiv ID: 2501.04568 Authors: Giorgio Giannone, Ruoteng Li, Qianli Feng, Evgeny Perevodchikov, Rui Chen, Aleix Martinez
Abstract: Vision-language models (VLMs) have demonstrated remarkable potential in integrating visual and linguistic information, but their performance is often constrained by the need for extensive, high-quality image-text training data. Curation of these image-text pairs is both time-consuming and computationally expensive. To address this challenge, we introduce SVP (Supervision-free Visual Projection), a novel framework that enhances vision-language alignment without relying on curated data or preference annotation. SVP leverages self-captioning and a pre-trained grounding model as a feedback mechanism to elicit latent information in VLMs. We evaluate our approach across six key areas: captioning, referring, visual question answering, multitasking, hallucination control, and object recall. Results demonstrate significant improvements, including a 14% average improvement in captioning tasks, up to 12% increase in object recall, and substantial reduction in hallucination rates. Notably, a small VLM using SVP achieves hallucination reductions comparable to a model five times larger, while a VLM with initially poor referring capabilities more than doubles its performance, approaching parity with a model twice its size.
Comment: Matches criterion 2 as it discusses improvements in vision-language models. Relevance: 5 Novelty: 7
5. MedCoDi-M: A Multi-Prompt Foundation Model for Multimodal Medical Data Generation
ArXiv ID: 2501.04614 Authors: Daniele Molino, Francesco Di Feola, Eliodoro Faiella, Deborah Fazzini, Domiziana Santucci, Linlin Shen, Valerio Guarrasi, Paolo Soda
Abstract: Artificial Intelligence is revolutionizing medical practice, enhancing diagnostic accuracy and healthcare delivery. However, its adoption in medical settings still faces significant challenges related to data availability and privacy constraints. Synthetic data has emerged as a promising solution to mitigate these issues, addressing data scarcity while preserving privacy. Recently, Latent Diffusion Models have emerged as a powerful tool for generating high-quality synthetic data. Meanwhile, the integration of different modalities has gained interest, emphasizing the need for models capable of handling multimodal medical data. Existing approaches struggle to integrate complementary information and lack the ability to generate modalities simultaneously. To address this challenge, we present MedCoDi-M, a 6.77-billion-parameter model designed for multimodal medical data generation that, following the Foundation Model paradigm, exploits contrastive learning and a large quantity of data to build a shared latent space which captures the relationships between different data modalities. Further, we introduce the Multi-Prompt training technique, which significantly boosts MedCoDi-M's generation under different settings. We extensively validate MedCoDi-M: first, we benchmark it against five competitors on the MIMIC-CXR dataset, a state-of-the-art dataset for Chest X-ray and radiological report generation. Second, we perform a Visual Turing Test with expert radiologists to assess the realism and clinical relevance of the generated data, ensuring alignment with real-world scenarios. Finally, we assess the utility of MedCoDi-M in addressing key challenges in the medical field, such as anonymization, data scarcity and imbalanced learning. The results are promising, demonstrating the applicability of MedCoDi-M in medical contexts. Project page is at https://cosbidev.github.io/MedCoDi-M/.
Comment: Matches criterion 2 as it presents a new multi-modal foundation model for medical data generation. Relevance: 5 Novelty: 7
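The shared latent space described above is built with contrastive learning across modalities. A CLIP-style symmetric contrastive loss between two modality embeddings is one common way to do this; the sketch below is generic and not MedCoDi-M's actual objective, temperature, or pairing scheme.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss pulling paired embeddings from two modalities together."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)   # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```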
6. InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
ArXiv ID: 2501.04575 Authors: Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, Fei Wu
Abstract: Graphical User Interface (GUI) Agents, powered by multimodal large language models (MLLMs), have shown great potential for task automation on computing devices such as computers and mobile phones. However, existing agents face challenges in multi-step reasoning and reliance on textual annotations, limiting their effectiveness. We introduce InfiGUIAgent, an MLLM-based GUI Agent trained with a two-stage supervised fine-tuning pipeline. Stage 1 enhances fundamental skills such as GUI understanding and grounding, while Stage 2 integrates hierarchical reasoning and expectation-reflection reasoning skills using synthesized data to enable native reasoning abilities of the agents. InfiGUIAgent achieves competitive performance on several GUI benchmarks, highlighting the impact of native reasoning skills in enhancing GUI interaction for automation tasks. Resources are available at https://github.com/Reallm-Labs/InfiGUIAgent.
Comment: Matches criterion 2: Shows new MLLMs. Relevance: 5 Novelty: 6
7. FatesGS: Fast and Accurate Sparse-View Surface Reconstruction using Gaussian Splatting with Depth-Feature Consistency
ArXiv ID: 2501.04628 Authors: Han Huang, Yulun Wu, Chao Deng, Ge Gao, Ming Gu, Yu-Shen Liu
Abstract: Recently, Gaussian Splatting has sparked a new trend in the field of computer vision. Apart from novel view synthesis, it has also been extended to the area of multi-view reconstruction. The latest methods facilitate complete, detailed surface reconstruction while ensuring fast training speed. However, these methods still require dense input views, and their output quality significantly degrades with sparse views. We observed that the Gaussian primitives tend to overfit the few training views, leading to noisy floaters and incomplete reconstruction surfaces. In this paper, we present an innovative sparse-view reconstruction framework that leverages intra-view depth and multi-view feature consistency to achieve remarkably accurate surface reconstruction. Specifically, we utilize monocular depth ranking information to supervise the consistency of depth distribution within patches and employ a smoothness loss to enhance the continuity of the distribution. To achieve finer surface reconstruction, we optimize the absolute position of depth through multi-view projection features. Extensive experiments on DTU and BlendedMVS demonstrate that our method outperforms state-of-the-art methods with a speedup of 60x to 200x, achieving swift and fine-grained mesh reconstruction without the need for costly pre-training.
Comment: Matches criterion 1: New methodological improvements to spatial understanding. Relevance: 5 Novelty: 6
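A plausible reading of the monocular depth-ranking supervision above is a pairwise ordinal loss within a patch: pairs whose ordering is given by the monocular depth should keep that ordering in the rendered depth. The hinge form, margin, and random pair sampling below are assumptions for illustration, not the paper's exact loss.

```python
import torch

def depth_ranking_loss(rendered: torch.Tensor, mono: torch.Tensor,
                       num_pairs: int = 1024, margin: float = 1e-4) -> torch.Tensor:
    """rendered, mono: (N,) depth samples from the same patch."""
    n = rendered.numel()
    i = torch.randint(0, n, (num_pairs,), device=rendered.device)
    j = torch.randint(0, n, (num_pairs,), device=rendered.device)
    # +1 where the monocular prior says depth_i > depth_j, -1 for the opposite ordering.
    sign = torch.sign(mono[i] - mono[j])
    # Hinge: orderings that violate the prior are pushed apart by at least `margin`
    # (exact ties only contribute a constant).
    return torch.clamp(margin - sign * (rendered[i] - rendered[j]), min=0.0).mean()
```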
8. Online Gaussian Test-Time Adaptation of Vision-Language Models
ArXiv ID: 2501.04352 Authors: Clément Fuchs, Maxime Zanella, Christophe De Vleeschouwer
Abstract: Online test-time adaptation (OTTA) of vision-language models (VLMs) has recently garnered increased attention to take advantage of data observed along a stream to improve future predictions. Unfortunately, existing methods rely on dataset-specific hyperparameters, significantly limiting their adaptability to unseen tasks. In response, we propose Online Gaussian Adaptation (OGA), a novel method that models the likelihoods of visual features using Gaussian distributions and incorporates zero-shot priors into an interpretable Maximum A Posteriori (MAP) estimation framework with fixed hyperparameters across all datasets. We demonstrate that OGA outperforms state-of-the-art methods on most datasets and runs. Additionally, we show that combining OTTA with popular few-shot techniques (a practical yet overlooked setting in prior research) is highly beneficial. Furthermore, our experimental study reveals that common OTTA evaluation protocols, which average performance over at most three runs per dataset, are inadequate due to the substantial variability observed across runs for all OTTA methods. Therefore, we advocate for more rigorous evaluation practices, including increasing the number of runs and considering additional quantitative metrics, such as our proposed Expected Tail Accuracy (ETA), calculated as the average accuracy in the worst 10% of runs. We hope these contributions will encourage more rigorous and diverse evaluation practices in the OTTA community. Code is available at https://github.com/cfuchs2023/OGA.
Comment: Matches criterion 2 with a test-time adaptation method for vision-language models. Relevance: 5 Novelty: 6
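The proposed Expected Tail Accuracy (ETA) is described in the abstract as the average accuracy over the worst 10% of runs, which is straightforward to compute. A small sketch, with the tail fraction exposed as a parameter:

```python
import numpy as np

def expected_tail_accuracy(run_accuracies, tail_frac: float = 0.10) -> float:
    """Average accuracy over the worst `tail_frac` of runs (ETA as described in the abstract)."""
    accs = np.sort(np.asarray(run_accuracies, dtype=float))   # ascending: worst runs come first
    k = max(1, int(np.ceil(tail_frac * accs.size)))
    return float(accs[:k].mean())

# Example: with 20 runs, ETA averages the 2 worst runs.
print(expected_tail_accuracy(np.random.default_rng(0).uniform(0.6, 0.8, size=20)))
```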
9. Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts
ArXiv ID: 2501.04322 Authors: Miao Rang, Zhenni Bi, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang
Abstract: Multimodal vision language models (VLMs) have made significant progress with the support of continuously increasing model sizes and data volumes. Running VLMs on edge devices has become a challenge for their widespread application. There are several efficient VLM efforts, but they often sacrifice linguistic capabilities to enhance multimodal abilities, or require extensive training. To address this quandary, we introduce the innovative framework of Efficient Vision Language Models with Elastic Visual Experts (Eve). By strategically incorporating adaptable visual expertise at multiple stages of training, Eve strikes a balance between preserving linguistic abilities and augmenting multimodal capabilities. This balanced approach results in a versatile model with only 1.8B parameters that delivers significant improvements in both multimodal and linguistic tasks. Notably, in configurations below 3B parameters, Eve distinctly outperforms on language benchmarks and achieves state-of-the-art results of 68.87% on VLM benchmarks. Additionally, its multimodal accuracy outstrips that of the larger 7B LLaVA-1.5 model.
Comment: Matches criterion 2 with an efficient multimodal vision-language model design. Relevance: 5 Novelty: 6
10. Benchmarking Large and Small MLLMs
ArXiv ID: 2501.04150 Authors: Xuelu Feng, Yunsheng Li, Dongdong Chen, Mei Gao, Mengchen Liu, Junsong Yuan, Chunming Qiao
Abstract: Large multimodal language models (MLLMs) such as GPT-4V and GPT-4o have achieved remarkable advancements in understanding and generating multimodal content, showcasing superior quality and capabilities across diverse tasks. However, their deployment faces significant challenges, including slow inference, high computational cost, and impracticality for on-device applications. In contrast, the emergence of small MLLMs, exemplified by the LLaVA-series models and Phi-3-Vision, offers promising alternatives with faster inference, reduced deployment costs, and the ability to handle domain-specific scenarios. Despite their growing presence, the capability boundaries between large and small MLLMs remain underexplored. In this work, we conduct a systematic and comprehensive evaluation to benchmark both small and large MLLMs, spanning general capabilities such as object recognition, temporal reasoning, and multimodal comprehension, as well as real-world applications in domains like industry and automotive. Our evaluation reveals that small MLLMs can achieve comparable performance to large models in specific scenarios but lag significantly in complex tasks requiring deeper reasoning or nuanced understanding. Furthermore, we identify common failure cases in both small and large MLLMs, highlighting domains where even state-of-the-art models struggle. We hope our findings will guide the research community in pushing the quality boundaries of MLLMs, advancing their usability and effectiveness across diverse applications.
Comment: Matches criterion 2 as it discusses benchmarking of MLLMs, including both large and small models. Relevance: 5 Novelty: 6
11. DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests
ArXiv ID: 2501.04671 Authors: Charles Corbière, Simon Roburin, Syrielle Montariol, Antoine Bosselut, Alexandre Alahi
Abstract: Large vision-language models (LVLMs) augment language models with visual understanding, enabling multimodal reasoning. However, due to the modality gap between textual and visual data, they often face significant challenges, such as over-reliance on text priors, hallucinations, and limited capacity for complex visual reasoning. Existing benchmarks to evaluate visual reasoning in LVLMs often rely on schematic or synthetic images and on imprecise machine-generated explanations. To bridge the modality gap, we present DrivingVQA, a new benchmark derived from driving theory tests to evaluate visual chain-of-thought reasoning in complex real-world scenarios. It offers 3,931 expert-crafted multiple-choice problems and interleaved explanations grounded with entities relevant to the reasoning process. We leverage this dataset to perform an extensive study of LVLMs' ability to reason about complex visual scenarios. Our experiments reveal that open-source and proprietary LVLMs struggle with visual chain-of-thought reasoning under zero-shot settings. We investigate training strategies that leverage relevant entities to improve visual reasoning. Notably, we observe a performance boost of up to 7% when reasoning over image tokens of cropped regions tied to these entities.
Comment: Matches criterion 2 as it discusses visual large language models and their reasoning capabilities. Relevance: 5 Novelty: 6
12. Hybrid Artificial Intelligence Strategies for Drone Navigation
ArXiv ID: 2501.04472 Authors: Rubén San-Segundo, Lucía Angulo, Manuel Gil-Martín, David Carramiñana, Ana M. Bernardos
Abstract: Objective: This paper describes the development of hybrid artificial intelligence strategies for drone navigation. Methods: The navigation module combines a deep learning model with a rule-based engine depending on the agent state. The deep learning model has been trained using reinforcement learning. The rule-based engine uses expert knowledge to deal with specific situations. The navigation module incorporates several strategies to explain the drone decision based on its observation space, and different mechanisms for including human decisions in the navigation process. Finally, this paper proposes an evaluation methodology based on defining several scenarios and analyzing the performance of the different strategies according to metrics adapted to each scenario. Results: Two main navigation problems have been studied. For the first scenario (reaching known targets), it has been possible to obtain a 90% task completion rate, reducing significantly the number of collisions thanks to the rule-based engine. For the second scenario, it has been possible to reduce 20% of the time required to locate all the targets using the reinforcement learning model. Conclusions: Reinforcement learning is a very good strategy to learn policies for drone navigation, but in critical situations, it is necessary to complement it with a rule-based module to increase task success rate.
Comment: Matches criterion 3 as it discusses new methods for drone navigation with hybrid AI strategies. Relevance: 5 Novelty: 6
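The hybrid design above (an RL policy complemented by a rule-based engine for specific situations) can be pictured as a simple dispatcher. The `Rule` interface, priority order, and the example rule below are hypothetical, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class Rule:
    matches: Callable[[Any], bool]   # expert-written predicate on the agent state
    action: Callable[[Any], Any]     # expert-written action for that situation

def hybrid_action(state: Any, rules: Sequence[Rule], rl_policy: Callable[[Any], Any]) -> Any:
    """Rules handle critical situations first; otherwise fall back to the learned policy."""
    for rule in rules:
        if rule.matches(state):
            return rule.action(state)
    return rl_policy(state)

# Example: a collision-avoidance rule overrides the RL policy when an obstacle is too close.
rules = [Rule(matches=lambda s: s["obstacle_distance"] < 2.0,
              action=lambda s: "brake_and_ascend")]
print(hybrid_action({"obstacle_distance": 1.0}, rules, rl_policy=lambda s: "move_forward"))
```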
13. FGU3R: Fine-Grained Fusion via Unified 3D Representation for Multimodal 3D Object Detection
ArXiv ID: 2501.04373 Authors: Guoxin Zhang, Ziying Song, Lin Liu, Zhonghong Ou
Abstract: Multimodal 3D object detection has garnered considerable interest in autonomous driving. However, multimodal detectors suffer from dimension mismatches that derive from fusing 3D points with 2D pixels coarsely, which leads to sub-optimal fusion performance. In this paper, we propose a multimodal framework, FGU3R, to tackle the issue mentioned above via unified 3D representation and fine-grained fusion, which consists of two important components. First, we propose an efficient feature extractor for raw and pseudo points, termed Pseudo-Raw Convolution (PRConv), which modulates multimodal features synchronously and aggregates the features from different types of points on key points based on multimodal interaction. Second, a Cross-Attention Adaptive Fusion (CAAF) module is designed to fuse homogeneous 3D RoI (Region of Interest) features adaptively via a cross-attention variant in a fine-grained manner. Together they enable fine-grained fusion on a unified 3D representation. Experiments conducted on the KITTI and nuScenes datasets show the effectiveness of our proposed method.
Comment: Matches criterion 3 as it proposes a novel method for multimodal 3D object detection. Relevance: 5 Novelty: 6
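Fusing RoI features from two modalities with cross-attention, as CAAF does in a fine-grained manner, can be sketched with a generic cross-attention block. This is not the paper's CAAF module; the shapes, residual connection, and normalization placement are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Queries come from point-based RoI features; keys/values from image-based RoI features."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, roi_points: torch.Tensor, roi_image: torch.Tensor) -> torch.Tensor:
        # roi_points: (B, Np, dim), roi_image: (B, Ni, dim)
        fused, _ = self.attn(roi_points, roi_image, roi_image)
        return self.norm(roi_points + fused)

fusion = CrossAttentionFusion(dim=128)
out = fusion(torch.randn(2, 32, 128), torch.randn(2, 64, 128))  # -> (2, 32, 128)
```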
14. Edit as You See: Image-guided Video Editing via Masked Motion Modeling
ArXiv ID: 2501.04325 Authors: Zhi-Lin Huang, Yixuan Liu, Chujun Qin, Zhongdao Wang, Dong Zhou, Dong Li, Emad Barsoum
Abstract: Recent advancements in diffusion models have significantly facilitated text-guided video editing. However, there is a relative scarcity of research on image-guided video editing, a method that empowers users to edit videos by merely indicating a target object in the initial frame and providing an RGB image as reference, without relying on text prompts. In this paper, we propose a novel Image-guided Video Editing Diffusion model, termed IVEDiff, for image-guided video editing. IVEDiff is built on top of image editing models and is equipped with learnable motion modules to maintain the temporal consistency of the edited video. Inspired by self-supervised learning concepts, we introduce a masked motion modeling fine-tuning strategy that empowers the motion module's capabilities for capturing inter-frame motion dynamics, while preserving the base image editing model's capabilities for modeling intra-frame semantic correlations. Moreover, an optical-flow-guided motion reference network is proposed to ensure the accurate propagation of information between edited video frames, alleviating the misleading effects of invalid information. We also construct a benchmark to facilitate further research. Comprehensive experiments demonstrate that our method is able to generate temporally smooth edited videos while robustly handling various editing objects with high quality.
Comment: Matches criterion 3 as it introduces a new benchmark for image-guided video editing. Relevance: 5 Novelty: 6
15. Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation
ArXiv ID: 2501.04144 Authors: Kam Woh Ng, Jing Yang, Jia Wei Sii, Jiankang Deng, Chee Seng Chan, Yi-Zhe Song, Tao Xiang, Xiatian Zhu
Abstract: In this paper, we push the boundaries of fine-grained 3D generation into truly creative territory. Current methods either lack intricate details or simply mimic existing objects – we enable both. By lifting 2D fine-grained understanding into 3D through multi-view diffusion and modeling part latents as continuous distributions, we unlock the ability to generate entirely new, yet plausible parts through interpolation and sampling. A self-supervised feature consistency loss further ensures stable generation of these unseen parts. The result is the first system capable of creating novel 3D objects with species-specific details that transcend existing examples. While we demonstrate our approach on birds, the underlying framework extends beyond things that can chirp! Code will be released at https://github.com/kamwoh/chirpy3d.
Comment: Does not match any specific criteria but is related to generative modeling in multi-modal learning. Relevance: 3 Novelty: 7
16. ContextMRI: Enhancing Compressed Sensing MRI through Metadata Conditioning
ArXiv ID: 2501.04284 Authors: Hyungjin Chung, Dohun Lee, Zihui Wu, Byung-Hoon Kim, Katherine L. Bouman, Jong Chul Ye
Abstract: Compressed sensing MRI seeks to accelerate MRI acquisition processes by sampling fewer k-space measurements and then reconstructing the missing data algorithmically. The success of these approaches often relies on strong priors or learned statistical models. While recent diffusion model-based priors have shown great potential, previous methods typically ignore clinically available metadata (e.g. patient demographics, imaging parameters, slice-specific information). In practice, metadata contains meaningful cues about the anatomy and acquisition protocol, suggesting it could further constrain the reconstruction problem. In this work, we propose ContextMRI, a text-conditioned diffusion model for MRI that integrates granular metadata into the reconstruction process. We train a pixel-space diffusion model directly on minimally processed, complex-valued MRI images. During inference, metadata is converted into a structured text prompt and fed to the model via CLIP text embeddings. By conditioning the prior on metadata, we unlock more accurate reconstructions and show consistent gains across multiple datasets, acceleration factors, and undersampling patterns. Our experiments demonstrate that increasing the fidelity of metadata, ranging from slice location and contrast to patient age, sex, and pathology, systematically boosts reconstruction performance. This work highlights the untapped potential of leveraging clinical context for inverse problems and opens a new direction for metadata-driven MRI reconstruction.
Comment: Does not match any specific criteria but is related to computer vision and machine learning. Relevance: 3 Novelty: 6
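Conditioning on metadata as described above reduces to serializing the available fields into a text prompt for the CLIP text encoder. The field names and template below are purely illustrative and not the paper's schema.

```python
def metadata_to_prompt(meta: dict) -> str:
    """Turn whatever clinical metadata is available into a structured text prompt."""
    fields = ["age", "sex", "contrast", "slice_location", "pathology"]  # hypothetical field set
    parts = [f"{k.replace('_', ' ')}: {meta[k]}" for k in fields if meta.get(k) is not None]
    if not parts:
        return "MRI reconstruction."
    return "MRI reconstruction. " + ", ".join(parts)

print(metadata_to_prompt({"age": 54, "sex": "F", "contrast": "T2", "slice_location": 12}))
# -> "MRI reconstruction. age: 54, sex: F, contrast: T2, slice location: 12"
```

The prompt string would then be embedded (e.g., with a CLIP text encoder) and used as the conditioning signal for the diffusion prior; the abstract's point is that richer, higher-fidelity metadata in this prompt systematically improves reconstruction.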
17. Graph-Based Multimodal and Multi-view Alignment for Keystep Recognition
ArXiv ID: 2501.04121 Authors: Julia Lee Romero, Kyle Min, Subarna Tripathi, Morteza Karimzadeh
Abstract: Egocentric videos capture scenes from a wearer's viewpoint, resulting in dynamic backgrounds, frequent motion, and occlusions, posing challenges to accurate keystep recognition. We propose a flexible graph-learning framework for fine-grained keystep recognition that is able to effectively leverage long-term dependencies in egocentric videos, and to leverage alignment between egocentric and exocentric videos during training for improved inference on egocentric videos. Our approach consists of constructing a graph where each video clip of the egocentric video corresponds to a node. During training, we consider each clip of each exocentric video (if available) as additional nodes. We examine several strategies to define connections across these nodes and pose keystep recognition as a node classification task on the constructed graphs. We perform extensive experiments on the Ego-Exo4D dataset and show that our proposed flexible graph-based framework notably outperforms existing methods by more than 12 points in accuracy. Furthermore, the constructed graphs are sparse and compute-efficient. We also present a study examining the use of several multimodal features, including narrations, depth, and object class labels, on a heterogeneous graph and discuss their corresponding contributions to keystep recognition performance.
Comment: Does not closely match any specific criteria but is relevant to the general interest area of multi-modal learning. Relevance: 3 Novelty: 5
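One way to picture the graph construction: ego clips become nodes connected to temporal neighbors, and exo clips (available only during training) are linked to temporally aligned ego clips. The window size and index-based alignment rule below are simplifying assumptions, not the connection strategies studied in the paper.

```python
def build_keystep_graph(num_ego: int, num_exo: int = 0, window: int = 2):
    """Return (num_nodes, edge_list) for a simple ego/exo clip graph."""
    edges = []
    # Temporal edges among ego clips within a small window.
    for i in range(num_ego):
        for j in range(i + 1, min(i + window + 1, num_ego)):
            edges.append((i, j))
    # Cross-view edges linking each exo clip to the ego clip with the same index.
    for k in range(min(num_ego, num_exo)):
        edges.append((k, num_ego + k))
    return num_ego + num_exo, edges

print(build_keystep_graph(num_ego=4, num_exo=2, window=1))
# -> (6, [(0, 1), (1, 2), (2, 3), (0, 4), (1, 5)])
```

Keystep recognition is then node classification over this graph, with clip features attached to each node.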
18. NSA: Neuro-symbolic ARC Challenge
ArXiv ID: 2501.04424 Authors: Paweł Batorski, Jannik Brinkmann, Paul Swoboda
Abstract: The Abstraction and Reasoning Corpus (ARC) evaluates general reasoning capabilities that are difficult for both machine learning models and combinatorial search methods. We propose a neuro-symbolic approach that combines a transformer for proposal generation with combinatorial search using a domain-specific language. The transformer narrows the search space by proposing promising search directions, which allows the combinatorial search to find the actual solution in a short time. We pre-train the transformer with synthetically generated data. At test time, we generate additional task-specific training tasks and fine-tune our model. Our results surpass comparable state of the art on the ARC evaluation set by 27% and compare favourably on the ARC train set. We make our code and dataset publicly available at https://github.com/Batorskq/NSA.
Comment: No specific criteria match. Relevance: 3 Novelty: 5
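The propose-then-search pattern described above can be sketched as a loop: the transformer narrows the space of DSL programs, and the combinatorial search completes and verifies candidates on the task's demonstration pairs. The `proposer` and `search` interfaces below are hypothetical, not the paper's API.

```python
from typing import Any, Callable, Iterable, Optional

def solve_arc_task(train_pairs, proposer: Callable[..., Iterable[Any]],
                   search: Callable[..., Optional[Callable]], max_proposals: int = 32):
    """train_pairs: list of (input_grid, output_grid) demonstrations for one ARC task."""
    for sketch in proposer(train_pairs, n=max_proposals):   # neural proposal step
        program = search(train_pairs, sketch)                # symbolic completion step
        if program is not None and all(program(x) == y for x, y in train_pairs):
            return program                                   # verified on all demonstrations
    return None
```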
19. An Efficient Adaptive Compression Method for Human Perception and Machine Vision Tasks
ArXiv ID: 2501.04329 Authors: Lei Liu, Zhenghao Chen, Zhihao Hu, Dong Xu
Abstract: While most existing neural image compression (NIC) and neural video compression (NVC) methodologies have achieved remarkable success, their optimization is primarily focused on human visual perception. However, with the rapid development of artificial intelligence, many images and videos will be used for various machine vision tasks. Consequently, such existing compression methodologies cannot achieve competitive performance in machine vision. In this work, we introduce an efficient adaptive compression (EAC) method tailored for both human perception and multiple machine vision tasks. Our method involves two key modules: 1) an adaptive compression mechanism that adaptively selects several subsets from latent features to balance the optimizations for multiple machine vision tasks (e.g., segmentation and detection) and human vision; 2) a task-specific adapter that uses the parameter-efficient delta-tuning strategy to stimulate the comprehensive downstream analytical networks for specific machine vision tasks. By using the above two modules, we can optimize the bit-rate costs and improve machine vision performance. In general, our proposed EAC can seamlessly integrate with existing NIC (i.e., Ballé2018 and Cheng2020) and NVC (i.e., DVC and FVC) methods. Extensive evaluation on various benchmark datasets (i.e., VOC2007, ILSVRC2012, VOC2012, COCO, UCF101, and DAVIS) shows that our method enhances performance for multiple machine vision tasks while maintaining the quality of human vision.
Comment: No specific criteria match. Relevance: 3 Novelty: 5
20. TADFormer: Task-Adaptive Dynamic Transformer for Efficient Multi-Task Learning
ArXiv ID: 2501.04293 Authors: Seungmin Baek, Soyul Lee, Hayeon Jo, Hyesong Choi, Dongbo Min
Abstract: The transfer learning paradigm has driven substantial advancements in various vision tasks. However, as state-of-the-art models continue to grow, classical full fine-tuning often becomes computationally impractical, particularly in the multi-task learning (MTL) setup where training complexity increases proportionally to the number of tasks. Consequently, recent studies have explored Parameter-Efficient Fine-Tuning (PEFT) for MTL architectures. Despite some progress, these approaches still exhibit limitations in capturing fine-grained, task-specific features that are crucial to MTL. In this paper, we introduce the Task-Adaptive Dynamic transFormer, termed TADFormer, a novel PEFT framework that performs task-aware feature adaptation in a fine-grained manner by dynamically considering task-specific input contexts. TADFormer proposes parameter-efficient prompting for task adaptation and a Dynamic Task Filter (DTF) to capture task information conditioned on input contexts. Experiments on the PASCAL-Context benchmark demonstrate that the proposed method achieves higher accuracy in dense scene understanding tasks, while reducing the number of trainable parameters by up to 8.4 times compared to full fine-tuning of MTL models. TADFormer also demonstrates superior parameter efficiency and accuracy compared to recent PEFT methods.
Comment: Does not match any specific criteria but is related to efficient multi-task learning in vision. Relevance: 3 Novelty: 5
21. Exploring Unbiased Deepfake Detection via Token-Level Shuffling and Mixing
ArXiv ID: 2501.04376 Authors: Xinghe Fu, Zhiyuan Yan, Taiping Yao, Shen Chen, Xi Li
Abstract: The generalization problem is broadly recognized as a critical challenge in detecting deepfakes. Most previous work believes that the generalization gap is caused by the differences among various forgery methods. However, our investigation reveals that the generalization issue can still occur when forgery-irrelevant factors shift. In this work, we identify two biases that detectors may also be prone to overfitting: position bias and content bias, as depicted in Fig. 1. For the position bias, we observe that detectors tend to lazily depend on specific positions within an image (e.g., central regions, even when no forgery is present). As for content bias, we argue that detectors may potentially and mistakenly utilize forgery-unrelated information for detection (e.g., background and hair). To counteract these biases, we propose two branches for shuffling and mixing tokens in the latent space of transformers. For the shuffling branch, we rearrange the tokens and corresponding position embeddings for each image while maintaining the local correlation. For the mixing branch, we randomly select and mix the tokens in the latent space between two images with the same label within the mini-batch to recombine the content information. During the learning process, we align the outputs of detectors from different branches in both feature space and logit space. Contrastive losses for features and divergence losses for logits are applied to obtain unbiased feature representations and classifiers. We demonstrate and verify the effectiveness of our method through extensive experiments on widely used evaluation datasets.
Comment: Does not match any specific criteria but is related to deepfake detection in computer vision. Relevance: 3 Novelty: 5
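A rough sketch of the two interventions above: shuffling rearranges tokens together with their position embeddings, and mixing swaps a random subset of tokens between two same-label samples. The fully random permutation below ignores the paper's local-correlation constraint and is only illustrative.

```python
import torch

def shuffle_tokens(tokens: torch.Tensor, pos_emb: torch.Tensor):
    """tokens: (B, N, D), pos_emb: (1 or B, N, D). Apply the same permutation to both."""
    perm = torch.randperm(tokens.size(1), device=tokens.device)
    return tokens[:, perm], pos_emb[:, perm]

def mix_tokens(tokens_a: torch.Tensor, tokens_b: torch.Tensor, ratio: float = 0.5):
    """Swap a random `ratio` of token positions from sample b into sample a (same label)."""
    n = tokens_a.size(1)
    idx = torch.randperm(n, device=tokens_a.device)[: int(ratio * n)]
    mixed = tokens_a.clone()
    mixed[:, idx] = tokens_b[:, idx]
    return mixed
```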
22. LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition
ArXiv ID: 2501.04204 Authors: Bowen Hao, Dongliang Zhou, Xiaojie Li, Xingyu Zhang, Liang Xie, Jianlong Wu, Erwei Yin
Abstract: Visual speech recognition (VSR), commonly known as lip reading, has garnered significant attention due to its wide-ranging practical applications. The advent of deep learning techniques and advancements in hardware capabilities have significantly enhanced the performance of lip reading models. Despite these advancements, existing datasets predominantly feature stable video recordings with limited variability in lip movements. This limitation results in models that are highly sensitive to variations encountered in real-world scenarios. To address this issue, we propose a novel framework, LipGen, which aims to improve model robustness by leveraging speech-driven synthetic visual data, thereby mitigating the constraints of current datasets. Additionally, we introduce an auxiliary task that incorporates viseme classification alongside attention mechanisms. This approach facilitates the efficient integration of temporal information, directing the model’s focus toward the relevant segments of speech, thereby enhancing discriminative capabilities. Our method demonstrates superior performance compared to the current state-of-the-art on the lip reading in the wild (LRW) dataset and exhibits even more pronounced advantages under challenging conditions.
Comment: Does not match any specific criteria but is related to visual speech recognition, which is a computer vision application. Relevance: 3 Novelty: 5
23. Generative Dataset Distillation Based on Self-knowledge Distillation
ArXiv ID: 2501.04202 Authors: Longzhen Li, Guang Li, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
Abstract: Dataset distillation is an effective technique for reducing the cost and complexity of model training while maintaining performance by compressing large datasets into smaller, more efficient versions. In this paper, we present a novel generative dataset distillation method that can improve the accuracy of aligning prediction logits. Our approach integrates self-knowledge distillation to achieve more precise distribution matching between the synthetic and original data, thereby capturing the overall structure and relationships within the data. To further improve the accuracy of alignment, we introduce a standardization step on the logits before performing distribution matching, ensuring consistency in the range of logits. Through extensive experiments, we demonstrate that our method outperforms existing state-of-the-art methods, resulting in superior distillation performance.
Comment: Does not match any specific criteria but is related to generative modeling. Relevance: 3 Novelty: 5
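The standardization step described above amounts to normalizing logits to zero mean and unit variance per sample before comparing distributions. A KL-based sketch follows; the paper's actual matching objective and temperature may differ.

```python
import torch
import torch.nn.functional as F

def standardize(logits: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Zero-mean, unit-variance logits per sample, so ranges are comparable before matching."""
    return (logits - logits.mean(dim=-1, keepdim=True)) / (logits.std(dim=-1, keepdim=True) + eps)

def logit_matching_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                        tau: float = 1.0) -> torch.Tensor:
    """KL divergence between softened, standardized logit distributions."""
    p_teacher = F.softmax(standardize(teacher_logits) / tau, dim=-1)
    log_p_student = F.log_softmax(standardize(student_logits) / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```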
24. iFADIT: Invertible Face Anonymization via Disentangled Identity Transform
ArXiv ID: 2501.04390 Authors: Lin Yuan, Kai Liang, Xiong Li, Tao Wu, Nannan Wang, Xinbo Gao
Abstract: Face anonymization aims to conceal the visual identity of a face to safeguard the individual's privacy. Traditional methods like blurring and pixelation can largely remove identifying features, but these techniques significantly degrade image quality and are vulnerable to deep reconstruction attacks. Generative models have emerged as a promising solution for anonymizing faces while preserving a natural appearance. However, many still face limitations in visual quality and often overlook the potential to recover the original face from the anonymized version, which can be valuable in specific contexts such as image forensics. This paper proposes a novel framework named iFADIT, an acronym for Invertible Face Anonymization via Disentangled Identity Transform. The framework features a disentanglement architecture coupled with a secure flow-based model: the former decouples identity information from non-identifying attributes, while the latter transforms the decoupled identity into an anonymized version in an invertible manner controlled by a secret key. The anonymized face can then be reconstructed based on a pre-trained StyleGAN that ensures high image quality and realistic facial details. Recovery of the original face (aka de-anonymization) is possible upon availability of the matching secret, by inverting the anonymization process based on the same set of model parameters. Furthermore, a dedicated secret-key mechanism along with a dual-phase training strategy is devised to ensure the desired properties of face anonymization. Qualitative and quantitative experiments demonstrate the superiority of the proposed approach in anonymity, reversibility, security, diversity, and interpretability over competing methods.
Comment: Does not match any specific criteria but is related to computer vision and generative modeling. Relevance: 3 Novelty: 5
25. Enhancing Virtual Try-On with Synthetic Pairs and Error-Aware Noise Scheduling
ArXiv ID: 2501.04666 Authors: Nannan Li, Kevin J. Shih, Bryan A. Plummer
Abstract: Given an isolated garment image in a canonical product view and a separate image of a person, the virtual try-on task aims to generate a new image of the person wearing the target garment. Prior virtual try-on works face two major challenges in achieving this goal: a) the paired (human, garment) training data has limited availability; b) generating textures on the human that perfectly match that of the prompted garment is difficult, often resulting in distorted text and faded textures. Our work explores ways to tackle these issues through both synthetic data as well as model refinement. We introduce a garment extraction model that generates (human, synthetic garment) pairs from a single image of a clothed individual. The synthetic pairs can then be used to augment the training of virtual try-on. We also propose an Error-Aware Refinement-based Schrödinger Bridge (EARSB) that surgically targets localized generation errors for correcting the output of a base virtual try-on model. To identify likely errors, we propose a weakly-supervised error classifier that localizes regions for refinement, subsequently augmenting the Schrödinger Bridge's noise schedule with its confidence heatmap. Experiments on VITON-HD and DressCode-Upper demonstrate that our synthetic data augmentation enhances the performance of prior work, while EARSB improves the overall image quality. In user studies, our model is preferred by the users in an average of 59% of cases.
Comment: Does not match any specific criteria but is related to computer vision and generative modeling. Relevance: 3 Novelty: 5
26. DGQ: Distribution-Aware Group Quantization for Text-to-Image Diffusion Models
ArXiv ID: 2501.04304 Authors: Hyogon Ryu, NaHyeon Park, Hyunjung Shim
Abstract: Despite the widespread use of text-to-image diffusion models across various tasks, their computational and memory demands limit practical applications. To mitigate this issue, quantization of diffusion models has been explored. It reduces memory usage and computational costs by compressing weights and activations into lower-bit formats. However, existing methods often struggle to preserve both image quality and text-image alignment, particularly in lower-bit (< 8-bit) quantization. In this paper, we analyze the challenges associated with quantizing text-to-image diffusion models from a distributional perspective. Our analysis reveals that activation outliers play a crucial role in determining image quality. Additionally, we identify distinctive patterns in cross-attention scores, which significantly affect text-image alignment. To address these challenges, we propose Distribution-aware Group Quantization (DGQ), a method that identifies and adaptively handles pixel-wise and channel-wise outliers to preserve image quality. Furthermore, DGQ applies prompt-specific logarithmic quantization scales to maintain text-image alignment. Our method demonstrates remarkable performance on datasets such as MS-COCO and PartiPrompts. We are the first to successfully achieve low-bit quantization of text-to-image diffusion models without requiring additional fine-tuning of weight quantization parameters.
Comment: Does not closely match any specific criteria but is relevant to the general interest area of generative modeling. Relevance: 3 Novelty: 5
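A bare-bones version of group quantization with per-group scales and full-precision outliers is sketched below for intuition. DGQ's pixel-wise and channel-wise outlier handling and prompt-specific logarithmic scales are not reproduced here; this is a generic fake-quantization routine.

```python
import torch

def group_quantize(x: torch.Tensor, n_bits: int = 4, group_size: int = 64,
                   outlier_frac: float = 0.01) -> torch.Tensor:
    """Fake-quantize x with one scale per group, keeping the largest-magnitude values intact."""
    flat = x.reshape(-1)
    pad = (-flat.numel()) % group_size
    padded = torch.cat([flat, flat.new_zeros(pad)])
    groups = padded.view(-1, group_size)

    k = max(1, int(outlier_frac * group_size))
    thresh = groups.abs().topk(k, dim=1).values[:, -1:]    # k-th largest |value| per group
    outliers = groups.abs() >= thresh                      # kept in full precision

    qmax = 2 ** (n_bits - 1) - 1
    scale = groups.abs().masked_fill(outliers, 0).amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    quantized = torch.clamp(torch.round(groups / scale), -qmax - 1, qmax) * scale
    out = torch.where(outliers, groups, quantized)
    return out.view(-1)[: flat.numel()].view_as(x)

w = torch.randn(8, 128)
print((w - group_quantize(w)).abs().max())  # quantization error stays small
```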
27. Research on environment perception and behavior prediction of intelligent UAV based on semantic communication
ArXiv ID: 2501.04480 Authors: Kechong Ren, Li Gao, Qi Guan
Abstract: The convergence of drone delivery systems, virtual worlds, and blockchain has transformed logistics and supply chain management, providing a fast and environmentally friendly alternative to traditional ground transportation methods. To provide users with a real-world experience, virtual service providers need to collect up-to-the-minute delivery information from edge devices. To address this challenge, 1) a reinforcement learning approach is introduced to enable drones with fast training capabilities and the ability to autonomously adapt to new virtual scenarios for effective resource allocation; 2) a semantic communication framework for metaverses is proposed, which utilizes the extraction of semantic information to reduce the communication cost and incentivize the transmission of information for metaverse services; and 3) to ensure user information security, a lightweight authentication and key agreement scheme is designed between the drone and the user by introducing blockchain technology. In our experiments, drone adaptation performance is improved by about 35%, and the local offloading rate can reach 90% as the number of base stations increases. The proposed semantic communication system is compared with a Cross Entropy baseline model. With blockchain technology, transaction throughput is maintained at a stable value across different numbers of drones.
Comment: No specific criteria match. Relevance: 3 Novelty: 4
28. Continual Self-supervised Learning Considering Medical Domain Knowledge in Chest CT Images
ArXiv ID: 2501.04217 Authors: Ren Tasai, Guang Li, Ren Togo, Minghui Tang, Takaaki Yoshimura, Hiroyuki Sugimori, Kenji Hirata, Takahiro Ogawa, Kohsuke Kudo, Miki Haseyama
Abstract: We propose a novel continual self-supervised learning (CSSL) method that considers medical domain knowledge in chest CT images. Our approach addresses the challenge of sequential learning by effectively capturing the relationship between previously learned knowledge and new information at different stages. By incorporating an enhanced DER into CSSL and maintaining both diversity and representativeness within the rehearsal buffer of DER, the risk of data interference during pretraining is reduced, enabling the model to learn richer and more robust feature representations. In addition, we incorporate a mixup strategy and feature distillation to further enhance the model's ability to learn meaningful representations. We validate our method using chest CT images obtained under two different imaging conditions, demonstrating superior performance compared to state-of-the-art methods.
Comment: Does not match any specific criteria but is related to self-supervised learning for medical imaging. Relevance: 3 Novelty: 4
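The mixup strategy mentioned in the abstract is the standard convex combination of two inputs (and, in supervised settings, their labels). A generic sketch, with the Beta parameter chosen as an assumption:

```python
import numpy as np
import torch

def mixup(x1: torch.Tensor, x2: torch.Tensor, alpha: float = 0.4):
    """Return a convex combination of two inputs and the mixing coefficient."""
    lam = float(np.random.beta(alpha, alpha))
    return lam * x1 + (1.0 - lam) * x2, lam

mixed, lam = mixup(torch.randn(1, 1, 64, 64), torch.randn(1, 1, 64, 64))
```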
Paper selection prompt
- New methodological improvements to spatial understanding, spatial intelligence on embodied agents;
- Shows new VLLMs (visual large language models) or MLLMs (multi-modal large language models)
- Embodied AI papers on building new benchmarks (simulator-related) or new methods. These papers should focus on novel angles that previous work ignored.
- Vision foundation models related and its applications.
In suggesting papers to your friend, remember that he enjoys papers on computer vision and machine learning, and generative modeling in multi-modal learning. Your friend also likes learning about surprising empirical or insightful results in vision-language models or embodied AI, as well as clever statistical tricks.