Research Papers
25 papers from HuggingFace Daily Papers
- Introduces a camera‑guided retrieval module that pulls relevant latent frames from a pre‑built spatio‑temporal memory, ensuring consistent geometry across different viewpoints.
- Introduces **Pixel‑Perfect Depth (PPD)**, a monocular depth model that operates directly in pixel space using diffusion transformers, eliminating flying pixels and preserving fine scene details.
- Introduces a “reason‑when‑necessary” policy that triggers deep reasoning only for ambiguous video frames, reducing unnecessary computation.
- Directly applying GRPO’s group‑wise normalization to a mixture of rewards collapses distinct advantage signals into near‑identical values, hurting learning dynamics.
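The collapse is easy to see numerically. Below is a toy sketch (invented reward values, not the paper's setup) of GRPO‑style group‑wise normalization applied to the sum of two reward components on very different scales:

```python
import numpy as np

# Toy numbers: a small-scale reward (e.g. formatting) summed with a
# large-scale reward (e.g. correctness) before group normalization.
format_r = np.array([0.0, 1.0, 0.0, 1.0])     # small-scale component
correct_r = np.array([10.0, 10.0, 0.0, 0.0])  # large-scale component
mixed = format_r + correct_r

# GRPO-style group-wise normalization: subtract the group mean,
# divide by the group standard deviation.
adv = (mixed - mixed.mean()) / (mixed.std() + 1e-8)

# Samples 0 and 1 differ by a full unit of format reward, yet their
# normalized advantages are nearly identical: the large-scale component
# dominates the group statistics and the small signal is squashed.
print(adv)  # ≈ [ 0.90  1.09 -1.09 -0.90]
```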
- Attaching a learnable scalar multiplier to each weight matrix lets the model escape the suboptimal weight‑norm equilibrium imposed by fixed weight decay.
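A minimal sketch of the simplest form of this parameterization (one scalar g per matrix, computing y = g·Wx; the paper's exact placement of the scalar may differ). Because weight decay acts only on W, any shrinkage of ‖W‖ can be offset by growing g, so the effective weight norm is no longer pinned at the decay equilibrium:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each weight matrix W gets a learnable scalar multiplier g.
W = rng.normal(size=(4, 4))
g = np.array(1.0)  # learnable scalar, initialized to 1

def layer(x, W, g):
    # layer output: y = g * (W @ x)
    return g * (W @ x)

# Weight decay shrinks ||W||, but the effective weights g*W can keep
# any norm: halving ||W|| while doubling g leaves the function intact.
x = rng.normal(size=4)
assert np.allclose(layer(x, W, g), layer(x, 0.5 * W, 2.0 * g))
```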
- Introduces a unified 4D representation (static background point cloud + per‑object 3D Gaussian trajectories) that captures both camera motion and object dynamics in space‑time.
- The paper proves an Ω(T^{2/3}) information‑theoretic lower bound on expected multicalibration error even when only three disjoint binary groups are used, matching known upper bounds up to log factors.
- Across diverse domains and architectures, a tiny, fixed subset of experts (the “standing committee”) receives the majority of routing votes, contradicting the expected domain‑specific specialization.
- A correspondence‑based data engine turns a single human demonstration into thousands of high‑quality, category‑wide synthetic training examples by morphing object meshes, transferring the expert grasp...
- QNeRF replaces large MLPs in NeRF with parameterised quantum circuits, exploiting superposition and entanglement to encode spatial and view‑dependent features.
- RelayLLM lets a small language model act as a controller, emitting a special command token to summon the large model only for critical tokens, reducing LLM usage to ~1 % of generated tokens.
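The control flow can be sketched as a toy decode loop; all identifiers below are illustrative stand‑ins, not RelayLLM's actual interface, and the "critical token" condition is a placeholder:

```python
# Toy relay loop: the small model generates tokens, and when it emits a
# special CALL token, the large model supplies that token instead.
CALL = "<call_llm>"

def small_model(prefix):
    # stand-in policy: defer on every 4th position ("hard" tokens),
    # otherwise emit a filler token
    return CALL if len(prefix) % 4 == 3 else "small"

def large_model(prefix):
    return "LARGE"

def generate(n_tokens):
    out, large_calls = [], 0
    for _ in range(n_tokens):
        tok = small_model(out)
        if tok == CALL:
            tok = large_model(out)  # summon the large model once
            large_calls += 1
        out.append(tok)
    return out, large_calls

tokens, calls = generate(20)
print(f"large-model usage: {calls}/{len(tokens)} tokens")
```

In the paper's setting the controller triggers far more rarely (~1 % of tokens); the toy condition here just makes the relay visible in a short run.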
- A compact spatio‑temporal latent space encodes an entire animation sequence in one forward pass, enabling “one‑shot” reconstruction of 3D shape and motion.
- Introduces a hybrid pipeline that first applies a bespoke statistical gray‑pixel detector to estimate illumination in noisy, low‑light scenes.
- Pure LLM judges often mis‑evaluate complex, multi‑step outputs because they lack explicit reasoning and verification mechanisms.
- Introduces GREx, a unified benchmark that expands traditional referring expression tasks (RES, REC, REG) to support single‑target, multi‑target, and no‑target expressions, enabling more realistic and ...
- Introducing “visual identity prompting” supplies diffusion models with explicit object cues, enabling generation of consistent multi‑view videos that preserve object appearance across frames.
- Turn‑level tree search injects diverse, forward‑looking trajectories, dramatically improving exploration in multi‑turn environments.
- Tokens with the highest predictive entropy dominate the semantic output of V‑L models; tampering only with these few tokens yields large degradations.
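The ranking step is straightforward to sketch. The snippet below uses hand‑picked toy distributions rather than real model outputs, and the top‑k selection is a generic stand‑in for the paper's procedure:

```python
import numpy as np

def token_entropy(probs):
    """Shannon entropy (nats) of each token's predictive distribution."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

# Toy predictive distributions for 4 tokens over a 5-word vocabulary.
probs = np.array([
    [0.20, 0.20, 0.20, 0.20, 0.20],  # uniform: maximal entropy
    [0.40, 0.30, 0.15, 0.10, 0.05],
    [0.96, 0.01, 0.01, 0.01, 0.01],  # peaked: near-zero entropy
    [0.30, 0.30, 0.20, 0.10, 0.10],
])

H = token_entropy(probs)
top_k = np.argsort(H)[::-1][:2]  # the few highest-entropy tokens
print("highest-entropy token indices:", top_k)
```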
- Traditional Transformers and RNNs reside in a “Metric Phase” where causal order can be broken by semantic noise, causing hallucinations.
- A shared hypernetwork generates client‑specific VAE decoders and class‑conditional latent priors from lightweight private codes, enabling personalization without exposing raw data.
- Making SSM parameters input‑dependent gives the model content‑based gating, allowing selective propagation or forgetting of information and closing the performance gap with attention on discrete modalities.
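A one‑dimensional toy of the idea, with a sigmoid gate standing in for the actual SSM parameterization: the decay of the recurrent state depends on the current input, so the scan can retain its state through some inputs and be overwritten by others.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_scan(x, w_gate):
    """1-D toy of an input-dependent SSM recurrence:
    h_t = a(x_t) * h_{t-1} + (1 - a(x_t)) * x_t."""
    h, hs = 0.0, []
    for x_t in x:
        a_t = sigmoid(w_gate * x_t)      # content-based gate in (0, 1)
        h = a_t * h + (1.0 - a_t) * x_t  # retain past vs. write input
        hs.append(h)
    return np.array(hs)

# With w_gate = 5, positive inputs are mostly filtered out (gate ~ 1,
# state retained), while a negative input overwrites the state (gate ~ 0).
x = np.array([1.0, 0.0, 0.0, 0.0, -1.0])
out = selective_scan(x, w_gate=5.0)
print(out)
```

A fixed (input‑independent) gate cannot make this per‑token choice, which is the gap this mechanism closes.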
- Treating attention matrices as token‑level graphs lets spectral analysis separate sound from unsound mathematical proofs.
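One way to make this concrete (a toy proxy; the paper's actual spectral statistic may differ) is to symmetrize the attention matrix into a weighted token graph and inspect its Laplacian spectrum, where near‑zero algebraic connectivity flags a fragmented reasoning chain:

```python
import numpy as np

def attention_spectrum(A):
    """Treat an attention matrix as a weighted token graph and return
    the sorted eigenvalues of its combinatorial graph Laplacian."""
    W = 0.5 * (A + A.T)             # symmetrize attention into weights
    np.fill_diagonal(W, 0.0)        # drop self-attention loops
    L = np.diag(W.sum(axis=1)) - W  # Laplacian: degree minus adjacency
    return np.sort(np.linalg.eigvalsh(L))

# Coherent "proof": attention flows along one chain of tokens.
chain = np.eye(4, k=-1) * 1.0 + np.eye(4) * 0.5
# Fragmented "proof": two disconnected token clusters.
blocks = np.kron(np.eye(2), np.ones((2, 2)))

ev_chain = attention_spectrum(chain)
ev_blocks = attention_spectrum(blocks)
# 2nd-smallest eigenvalue (algebraic connectivity):
# ~0 means the token graph is disconnected.
print(ev_chain[1], ev_blocks[1])
```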
- Reformulates multimodal reasoning as a native image‑to‑image generation task, enabling direct manipulation of visual information instead of indirect text prompts.
- Conventional RAG memories act as static fact repositories, neglecting the higher‑order relations needed for deep reasoning.
- DLCM learns variable‑length “concepts” on the fly, moving computation from dense token streams to a compact latent space where reasoning is cheaper and more focused.