• Ep. 247 - Part 3 - June 13, 2024

  • 2024/06/15
  • Duration: 52 min
  • Podcast

Ep. 247 - Part 3 - June 13, 2024

  • Summary

  • ArXiv Computer Vision research for Thursday, June 13, 2024.


    00:21: LRM-Zero: Training Large Reconstruction Models with Synthesized Data

    01:56: Scale-Invariant Monocular Depth Estimation via SSI Depth

    03:08: GGHead: Fast and Generalizable 3D Gaussian Heads

    04:55: Multiagent Multitraversal Multimodal Self-Driving: Open MARS Dataset

    06:34: Towards Vision-Language Geo-Foundation Model: A Survey

    08:11: SimGen: Simulator-conditioned Driving Scene Generation

    09:44: Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition

    11:03: Sagiri: Low Dynamic Range Image Enhancement with Generative Diffusion Prior

    12:32: LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living

    13:56: WonderWorld: Interactive 3D Scene Generation from a Single Image

    15:21: Modeling Ambient Scene Dynamics for Free-view Synthesis

    16:29: Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA

    17:50: Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms

    19:39: Real-Time Deepfake Detection in the Real-World

    21:17: OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

    23:02: Yo'LLaVA: Your Personalized Language and Vision Assistant

    24:30: MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

    26:26: Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion

    28:03: Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

    29:59: ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing

    31:24: 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

    33:16: Towards Evaluating the Robustness of Visual State Space Models

    34:57: Data Attribution for Text-to-Image Models by Unlearning Synthesized Images

    36:09: CodedEvents: Optimal Point-Spread-Function Engineering for 3D-Tracking with Event Cameras

    37:37: Scene Graph Generation in Large-Size VHR Satellite Imagery: A Large-Scale Dataset and A Context-Aware Approach

    40:02: MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

    41:40: Explore the Limits of Omni-modal Pretraining at Scale

    42:46: Interpreting the Weight Space of Customized Diffusion Models

    43:58: Depth Anything V2

    45:12: An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

    46:23: Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models

    48:11: Rethinking Score Distillation as a Bridge Between Image Distributions

    49:44: VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding


