Learning Library

Multimodal

8 items in this topic

Paper

PlenopticDreamer: Coherent Multi‑View Video Synthesis

  • Introduces a camera‑guided retrieval module that pulls relevant latent frames from a pre‑built spatio‑temporal memory, ensuring consistent geometry across different viewpoints.
  • Employs progressive, stage‑wise training (spatial finetuning followed by temporal finetuning) to stabilize GAN learning and significantly boost temporal coherence without sacrificing spatial detail.
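The camera‑guided retrieval step above can be sketched as a nearest‑neighbor lookup over a pose‑keyed memory. This is a minimal illustration, not the paper's actual API: the class and method names (`SpatioTemporalMemory`, `retrieve`) and the simplification of camera poses to 3D positions are assumptions.

```python
import numpy as np

class SpatioTemporalMemory:
    """Toy pose-keyed memory of latent frames (illustrative only)."""

    def __init__(self):
        self.poses = []    # camera poses, simplified here to 3D positions
        self.latents = []  # latent frames associated with each pose

    def add(self, pose, latent):
        self.poses.append(np.asarray(pose, dtype=float))
        self.latents.append(latent)

    def retrieve(self, query_pose, k=2):
        # Rank stored entries by Euclidean distance to the query camera
        # pose and return the k closest latent frames.
        dists = [np.linalg.norm(p - query_pose) for p in self.poses]
        order = np.argsort(dists)[:k]
        return [self.latents[i] for i in order]

memory = SpatioTemporalMemory()
memory.add([0.0, 0.0, 0.0], "latent_front")
memory.add([1.0, 0.0, 0.0], "latent_right")
memory.add([0.0, 1.0, 0.0], "latent_top")

# A query pose near the "right" camera pulls that view's latent frame.
print(memory.retrieve(np.array([0.9, 0.1, 0.0]), k=1))  # ['latent_right']
```

A real system would key the memory on full camera extrinsics and retrieve in latent space; the distance‑ranked lookup is the core idea.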
Paper

Efficient Video Reasoning with Dual-Answer Training

  • Introduces a “reason‑when‑necessary” policy that triggers deep reasoning only for ambiguous video frames, reducing unnecessary computation.
  • Proposes a “Thinking Once, Answering Twice” paradigm where the model generates an intermediate reasoning trace before producing two complementary answers, improving answer consistency.
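The "reason‑when‑necessary" trigger can be sketched as an entropy gate over per‑frame predictions: deep reasoning fires only when the label distribution is ambiguous. The threshold value and function names here are assumptions for illustration, not taken from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def needs_reasoning(frame_probs, threshold=0.5):
    """Return True when the frame's prediction is ambiguous enough
    to warrant the expensive deep-reasoning path."""
    return entropy(frame_probs) > threshold

confident = [0.97, 0.02, 0.01]  # peaked distribution: skip deep reasoning
ambiguous = [0.40, 0.35, 0.25]  # flat distribution: trigger deep reasoning

print(needs_reasoning(confident))  # False
print(needs_reasoning(ambiguous))  # True
```

The point of such a gate is that most frames are easy, so the costly reasoning trace is generated only for the small ambiguous subset.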
Paper

4D Geometric Control for Realistic Video World Modeling

  • Introduces a unified 4D representation (static background point cloud + per‑object 3D Gaussian trajectories) that captures both camera motion and object dynamics in space‑time.
  • Leverages this representation as conditioning for a pretrained video diffusion model, yielding view‑consistent, high‑fidelity videos that strictly follow specified 4D motions.
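The unified 4D representation described above can be sketched as a container holding a static background point cloud plus per‑object trajectories over time. Class and field names are illustrative, and the paper's actual Gaussian parameterization (covariances, opacities, etc.) is omitted for brevity.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ObjectTrajectory:
    object_id: int
    positions: np.ndarray  # positions[t] is the object's 3D center at step t

@dataclass
class Scene4D:
    background: np.ndarray                 # static point cloud, shape (N, 3)
    objects: list = field(default_factory=list)

    def points_at(self, t):
        """Background points plus each object's position at timestep t."""
        dynamic = np.array([obj.positions[t] for obj in self.objects])
        if len(dynamic) == 0:
            return self.background
        return np.vstack([self.background, dynamic])

bg = np.zeros((4, 3))  # 4 static background points
car = ObjectTrajectory(0, np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0]], float))
scene = Scene4D(background=bg, objects=[car])

print(scene.points_at(2).shape)  # (5, 3): 4 background points + 1 object
```

Querying `points_at(t)` yields the scene geometry at a timestep, which is the kind of space‑time snapshot that could condition a video diffusion model.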
Paper

One‑Shot Functional Dexterous Grasp Learning via Synthetic Transfer

  • A correspondence‑based data engine turns a single human demonstration into thousands of high‑quality, category‑wide synthetic training examples by morphing object meshes, transferring the expert grasp, and locally optimizing it.
  • The generated dataset encodes both semantic (tool function) and geometric cues, enabling a multimodal network to predict grasps that respect the intended usage (e.g., pulling, cutting).
Paper

Generalized Referring Expressions for Multi‑Target Vision‑Language Tasks

  • Introduces GREx, a unified benchmark that expands traditional referring expression tasks (RES, REC, REG) to support single‑target, multi‑target, and no‑target expressions, enabling more realistic and flexible language‑vision interactions.
  • Releases gRefCOCO, the first large‑scale dataset containing annotated images with all three expression types, while remaining backward‑compatible with existing RES/REC datasets for fair comparison.
Paper

Entropy‑Guided Token Attacks on Vision‑Language Models

  • Tokens with the highest predictive entropy dominate the semantic output of vision‑language models; tampering with only these few tokens yields large degradations.
  • Entropy‑driven attacks achieve comparable (or greater) success with far lower perturbation budgets than naïve or gradient‑based token attacks.
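The entropy‑guided selection step can be sketched as ranking tokens by the entropy of their predictive distributions and attacking only the top‑k. The perturbation itself (gradient step, token substitution, etc.) is out of scope here, and all names are illustrative rather than the paper's.

```python
import numpy as np

def token_entropies(logits):
    """Per-token Shannon entropy from a (num_tokens, vocab_size) logit matrix."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def select_attack_targets(logits, k=2):
    """Indices of the k highest-entropy (most uncertain) tokens."""
    ent = token_entropies(logits)
    return np.argsort(ent)[::-1][:k]

logits = np.array([
    [5.0, 0.0, 0.0],  # confident token  -> low entropy
    [0.1, 0.0, 0.1],  # ambiguous token  -> high entropy
    [0.0, 0.0, 0.0],  # uniform          -> highest entropy
    [8.0, 0.0, 0.0],  # very confident   -> lowest entropy
])

print(sorted(select_attack_targets(logits, k=2).tolist()))  # [1, 2]
```

Concentrating the perturbation budget on these few high‑entropy positions is what lets such attacks match gradient‑based baselines at far lower cost.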
Paper

Mamba: Fast Linear‑Time Sequence Modeling with Input‑Conditioned State Spaces

  • Making SSM parameters input‑dependent gives the model content‑based gating, allowing selective propagation or forgetting of information and closing the performance gap with attention on discrete modalities.
  • A hardware‑aware parallel recurrence algorithm restores efficiency lost by dropping convolutions, delivering true linear‑time computation with constant‑factor speedups on modern GPUs/TPUs.
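The selective‑SSM idea can be illustrated with a toy scalar recurrence in which the transition and write coefficients are functions of the current input (content‑based gating). This is a deliberately simplified sketch, not the paper's full parameterization or its hardware‑aware parallel scan; the gate functions chosen here are assumptions.

```python
import numpy as np

def selective_scan(x, w_a, w_b):
    """Sequential recurrence h_t = a(x_t) * h_{t-1} + b(x_t) * x_t.

    a(x) in (0, 1) acts as an input-conditioned forget gate;
    b(x) scales how strongly the current input is written into the state.
    Because a and b depend on x_t, the model can selectively propagate
    or discard information, unlike a fixed linear time-invariant SSM.
    """
    h = 0.0
    states = []
    for x_t in x:
        a_t = 1.0 / (1.0 + np.exp(-w_a * x_t))  # input-dependent decay
        b_t = np.tanh(w_b * x_t)                # input-dependent write gate
        h = a_t * h + b_t * x_t
        states.append(h)
    return np.array(states)

x = np.array([1.0, -1.0, 0.5])
print(selective_scan(x, w_a=2.0, w_b=1.0))
```

The actual model applies this per channel with learned projections producing the gates, and evaluates the recurrence with a parallel scan so the sequential loop above never runs on hardware.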
Paper

DiffThinker: Diffusion‑Based Generative Multimodal Reasoning

  • Reformulates multimodal reasoning as a native image‑to‑image generation task, enabling direct manipulation of visual information instead of indirect text prompts.
  • Demonstrates four intrinsic advantages—efficiency, controllability, native parallelism, and seamless collaboration between vision and language modules—leading to more logically consistent and spatially precise outputs.