Paper · research-paper · advanced
- Introduces **Pixel‑Perfect Depth (PPD)**, a monocular depth model that operates directly in pixel space using diffusion transformers, eliminating flying pixels and preserving fine scene details.
- **Semantics‑Prompted DiT** injects high‑level semantic embeddings from large vision foundation models into the diffusion process, guiding global structure while still allowing the model to recover sharp local geometry.
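The injection step can be sketched in a minimal form. This is an illustrative assumption, not the paper's actual architecture: semantic features from a vision foundation model are projected into the diffusion token space and added as a per-token prompt, so the global cue reaches every pixel token while local detail stays with the tokens themselves. The function name `semantics_prompt` and the additive scheme are hypothetical.

```python
import numpy as np

def semantics_prompt(tokens, sem_feats, w_proj):
    """Project foundation-model features into the token space and add
    them as a per-token semantic prompt (one plausible injection scheme,
    not the paper's exact mechanism)."""
    prompt = sem_feats @ w_proj          # (n_tokens, d_model)
    return tokens + prompt

rng = np.random.default_rng(0)
tokens = rng.standard_normal((256, 64))        # noisy pixel-space tokens
sem = rng.standard_normal((256, 32))           # vision-foundation-model features
w_proj = rng.standard_normal((32, 64)) * 0.02  # learned projection (random here)
out = semantics_prompt(tokens, sem, w_proj)
```

In a real DiT the prompt would more likely enter through cross-attention or adaptive layer norm; the additive form above just makes the data flow explicit.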
Paper · research-paper · advanced
- A correspondence‑based data engine turns a single human demonstration into thousands of high‑quality, category‑wide synthetic training examples by morphing object meshes, transferring the expert grasp, and locally optimizing it.
- The generated dataset encodes both semantic (tool function) and geometric cues, enabling a multimodal network to predict grasps that respect the intended usage (e.g., pulling, cutting).
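The grasp-transfer step of such a data engine can be sketched under simplifying assumptions: given index-aligned point correspondences between the demonstrated object and a morphed instance, the demo grasp is mapped through its nearest correspondence and then locally refined on the target surface. The helper `transfer_grasp`, the neighborhood radius, and the mean-based refinement are all hypothetical stand-ins for the paper's local optimization.

```python
import numpy as np

def transfer_grasp(src_pts, tgt_pts, grasp_pt, radius=0.1):
    """Map a grasp point from a source object to a morphed target via
    index-aligned correspondences, then refine it locally (a sketch,
    not the paper's optimizer)."""
    # nearest source point to the demonstrated grasp
    i = np.argmin(np.linalg.norm(src_pts - grasp_pt, axis=1))
    cand = tgt_pts[i]                     # transferred candidate grasp
    # local refinement: average the target points within `radius`
    d = np.linalg.norm(tgt_pts - cand, axis=1)
    return tgt_pts[d < radius].mean(axis=0)

# toy example: the target is the source shifted by +1 along y
src = np.array([[0.0, 0, 0], [1.0, 0, 0], [2.0, 0, 0]])
tgt = src + np.array([0.0, 1.0, 0.0])
grasp = transfer_grasp(src, tgt, np.array([0.9, 0.0, 0.0]))
```

Repeating this over many morphed meshes is what scales one demonstration into a category-wide dataset.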
Paper · research-paper · advanced
- “Visual identity prompting” supplies diffusion models with explicit object cues, enabling them to generate consistent multi‑view videos that preserve object appearance across frames.
- The generated videos serve as high‑fidelity data augmentations, enriching the visual diversity of manipulation datasets without manual collection.
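One plausible way to realize such prompting is to prepend the same object-identity tokens to every frame's conditioning sequence, so each denoising step sees an identical appearance cue. The function name `with_identity_prompt` and the token-concatenation scheme are assumptions for illustration, not the paper's stated mechanism.

```python
import numpy as np

def with_identity_prompt(frame_cond, identity_tokens):
    """Prepend shared identity tokens to each frame's conditioning
    (a hypothetical sketch of visual identity prompting)."""
    n_frames = frame_cond.shape[0]
    ident = np.broadcast_to(
        identity_tokens, (n_frames,) + identity_tokens.shape
    )
    return np.concatenate([ident, frame_cond], axis=1)

rng = np.random.default_rng(0)
frame_cond = rng.standard_normal((4, 8, 16))  # 4 frames, 8 tokens each
identity = rng.standard_normal((2, 16))       # 2 object-identity tokens
cond = with_identity_prompt(frame_cond, identity)
```

Because every frame shares the identity tokens, the object cue stays constant across views while per-frame conditioning still varies.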