Summary
Synopsis & Commentary
arXiv Computer Vision research for Thursday, June 13, 2024.
00:21: LRM-Zero: Training Large Reconstruction Models with Synthesized Data
01:56: Scale-Invariant Monocular Depth Estimation via SSI Depth
03:08: GGHead: Fast and Generalizable 3D Gaussian Heads
04:55: Multiagent Multitraversal Multimodal Self-Driving: Open MARS Dataset
06:34: Towards Vision-Language Geo-Foundation Model: A Survey
08:11: SimGen: Simulator-conditioned Driving Scene Generation
09:44: Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition
11:03: Sagiri: Low Dynamic Range Image Enhancement with Generative Diffusion Prior
12:32: LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living
13:56: WonderWorld: Interactive 3D Scene Generation from a Single Image
15:21: Modeling Ambient Scene Dynamics for Free-view Synthesis
16:29: Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA
17:50: Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms
19:39: Real-Time Deepfake Detection in the Real-World
21:17: OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation
23:02: Yo'LLaVA: Your Personalized Language and Vision Assistant
24:30: MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations
26:26: Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion
28:03: Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
29:59: ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing
31:24: 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
33:16: Towards Evaluating the Robustness of Visual State Space Models
34:57: Data Attribution for Text-to-Image Models by Unlearning Synthesized Images
36:09: CodedEvents: Optimal Point-Spread-Function Engineering for 3D-Tracking with Event Cameras
37:37: Scene Graph Generation in Large-Size VHR Satellite Imagery: A Large-Scale Dataset and A Context-Aware Approach
40:02: MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
41:40: Explore the Limits of Omni-modal Pretraining at Scale
42:46: Interpreting the Weight Space of Customized Diffusion Models
43:58: Depth Anything V2
45:12: An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
46:23: Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models
48:11: Rethinking Score Distillation as a Bridge Between Image Distributions
49:44: VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding