Paper • ▲ 27 • research-paper • advanced
- Attaching a learnable scalar multiplier to each weight matrix lets the model escape the suboptimal weight‑norm equilibrium imposed by fixed weight decay.
- Extending this idea to per‑row and per‑column multipliers further frees individual dimension scales, yielding a more expressive variant of μP‑style scaling.
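The two bullets above can be sketched on a plain dense layer. This is a minimal illustration, not the paper's implementation: the names `effective_weight`, `s`, `row`, and `col` are hypothetical, and in training the multipliers would be learned parameters updated alongside `W`.

```python
def effective_weight(W, s=1.0, row=None, col=None):
    """Return s * diag(row) @ W @ diag(col) for a list-of-lists matrix W.

    s   : scalar multiplier for the whole matrix (hypothetical name)
    row : one multiplier per output row, defaults to all ones
    col : one multiplier per input column, defaults to all ones
    """
    n_out, n_in = len(W), len(W[0])
    row = row if row is not None else [1.0] * n_out
    col = col if col is not None else [1.0] * n_in
    return [[s * row[i] * W[i][j] * col[j] for j in range(n_in)]
            for i in range(n_out)]


def matvec(W, x):
    """Plain matrix-vector product."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(w_row, x)) for w_row in W]


W = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]

# Scalar multiplier only: rescales the whole matrix at once.
y_scalar = matvec(effective_weight(W, s=0.5), x)

# Per-row and per-column multipliers: each dimension gets its own scale.
y_rowcol = matvec(effective_weight(W, row=[2.0, 1.0], col=[1.0, 0.5]), x)
```

The scalar case changes only the overall scale of the layer, while the row and column case can rescale individual output and input dimensions independently.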
Paper • ▲ 16 • research-paper • advanced
- RelayLLM lets a small language model (SLM) act as a controller that emits a special command token to summon the large model only for critical tokens, reducing large‑model usage to ~1% of generated tokens.
- A two‑stage training regimen (warm‑up plus Group Relative Policy Optimization) teaches the SLM when to generate autonomously and when to request help, balancing independence with strategic assistance.
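The control flow described above might look roughly like the loop below. It is a sketch under stated assumptions: the token name `<call_llm>`, the function `relay_generate`, and the toy models are all hypothetical, and it assumes the large model contributes exactly one token per request, which the summary does not specify.

```python
from typing import Callable, List, Tuple

CALL_TOKEN = "<call_llm>"  # hypothetical command token name


def relay_generate(
    small_step: Callable[[List[str]], str],
    large_step: Callable[[List[str]], str],
    prompt: List[str],
    max_tokens: int,
) -> Tuple[List[str], int]:
    """Generate with the small model, deferring to the large one on request.

    Returns (generated_tokens, number_of_large_model_calls).
    """
    context = list(prompt)
    generated: List[str] = []
    large_calls = 0
    while len(generated) < max_tokens:
        token = small_step(context)
        if token == CALL_TOKEN:
            # The command token itself is not emitted; the large model
            # supplies the next token instead.
            token = large_step(context)
            large_calls += 1
        generated.append(token)
        context.append(token)
    return generated, large_calls


# Toy stand-ins for the two models (assumptions for illustration only).
def toy_small(context: List[str]) -> str:
    # Ask for help at every fourth context length, else emit a filler token.
    return CALL_TOKEN if len(context) % 4 == 3 else "s"


def toy_large(context: List[str]) -> str:
    return "L"


tokens, calls = relay_generate(toy_small, toy_large, ["<bos>"], max_tokens=8)
```

The ratio `calls / len(tokens)` is the quantity the summary reports as roughly 1%; the toy policy here calls far more often, purely to make the branch visible.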
Paper • ▲ 5 • research-paper • advanced
- Pure LLM judges often mis‑evaluate complex, multi‑step outputs because they lack explicit reasoning and verification mechanisms.
- The paper introduces a modular “agent‑as‑judge” system that first plans an evaluation strategy, then invokes external tools (e.g., calculators, code runners) to verify intermediate claims.
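A plan-then-verify judge can be illustrated with a single tool. The sketch below is an assumption-heavy toy, not the paper's system: `calculator`, `plan`, and `judge` are hypothetical names, the planner is trivial, and the only tool is a small arithmetic evaluator standing in for the calculators and code runners mentioned above.

```python
import ast
import operator

# Tool: a small arithmetic evaluator (stands in for a calculator tool).
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def calculator(expr: str) -> float:
    """Evaluate +, -, *, / over numeric literals without using eval()."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)


def plan(claims):
    """Planning step: assign a tool to each claim (trivial in this sketch)."""
    return [("calculator", expr, claimed) for expr, claimed in claims]


def judge(claims, tol=1e-9):
    """Run the plan and return one verdict per intermediate claim."""
    verdicts = []
    for _tool, expr, claimed in plan(claims):
        actual = calculator(expr)  # only one tool exists in this sketch
        verdicts.append(abs(actual - claimed) <= tol)
    return verdicts


# Intermediate claims extracted from a model answer (hypothetical input).
steps = [("12 * 7", 84), ("84 + 15", 99), ("99 / 3", 32)]
verdicts = judge(steps)
```

The third claim is deliberately wrong, so a judge that checks each step with a tool flags it even though the first two steps are correct.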
Paper • research-paper • advanced
- Introduces GREx, a unified benchmark that expands traditional referring expression tasks (RES, REC, REG) to support single‑target, multi‑target, and no‑target expressions, enabling more realistic and flexible language‑vision interactions.
- Releases gRefCOCO, the first large‑scale dataset containing annotated images with all three expression types, while remaining backward‑compatible with existing RES/REC datasets for fair comparison.
Paper • research-paper • advanced
- Traditional Transformers and RNNs reside in a “Metric Phase” where causal order can be broken by semantic noise, causing hallucinations.
- By formulating inference as a Symmetry‑Protected Topological (SPT) phase, logical operations become analogous to non‑Abelian anyon braiding, giving them immunity to local perturbations.
Paper • research-paper • advanced
- Making SSM parameters input‑dependent gives the model content‑based gating, allowing selective propagation or forgetting of information and closing the performance gap with attention on discrete modalities.
- A hardware‑aware parallel recurrence algorithm restores efficiency lost by dropping convolutions, delivering true linear‑time computation with constant‑factor speedups on modern GPUs/TPUs.
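Input-dependent gating can be shown with a one-dimensional recurrence. This is a simplified sketch, not the paper's parameterization: only the step size `delta` depends on the input, the names `w_delta` and `b_delta` are hypothetical, and the loop is the plain sequential form rather than the hardware-aware parallel algorithm mentioned above.

```python
import math


def softplus(z: float) -> float:
    """Smooth positive function used to keep the step size above zero."""
    return math.log1p(math.exp(z))


def selective_scan(xs, w_delta=1.0, b_delta=0.0):
    """Sequential scan of a 1-D state space model with an input-dependent step.

    Uses the zero-order-hold discretization of dh/dt = -h + x:
        h_t = exp(-delta_t) * h_{t-1} + (1 - exp(-delta_t)) * x_t
    where delta_t is computed from x_t (the "selective" part).
    """
    h = 0.0
    ys = []
    for x in xs:
        delta = softplus(w_delta * x + b_delta)  # input-dependent step size
        a = math.exp(-delta)                     # decay factor in (0, 1)
        h = a * h + (1.0 - a) * x                # large delta: overwrite state
        ys.append(h)
    return ys


# A salient input followed by two zero inputs.
ys = selective_scan([5.0, 0.0, 0.0])
```

The large first input produces a large step, so the state is almost entirely overwritten by it; the zero inputs give `delta = ln 2`, so the state is halved at each step. The loop makes one pass over the sequence, so its cost grows linearly with sequence length.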
Paper • research-paper • advanced
- Treating attention matrices as token‑level graphs lets spectral analysis separate sound from unsound mathematical proofs.
- Four graph‑spectral metrics (Fiedler value, high‑frequency energy ratio, smoothness, spectral entropy) achieve very large effect sizes (Cohen’s d up to 3.30) across seven models from four families, without any training or fine‑tuning.
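Two of the four metrics can be computed directly from a single attention matrix. The sketch below uses standard textbook definitions and is not the paper's exact formulation: the symmetrization step, the use of the combinatorial Laplacian, and the function name `spectral_metrics` are all assumptions. The other two metrics need a signal defined on the graph and are omitted.

```python
import numpy as np


def spectral_metrics(attn):
    """Fiedler value and spectral entropy of an attention matrix as a graph."""
    A = np.asarray(attn, dtype=float)
    W = 0.5 * (A + A.T)                     # symmetrize: undirected edge weights
    np.fill_diagonal(W, 0.0)                # drop self-loops
    L = np.diag(W.sum(axis=1)) - W          # combinatorial graph Laplacian
    eigvals = np.linalg.eigvalsh(L)         # ascending order, real (L symmetric)
    eigvals = np.clip(eigvals, 0.0, None)   # remove tiny negative round-off
    fiedler = float(eigvals[1])             # second-smallest eigenvalue
    total = eigvals.sum()
    if total == 0.0:                        # graph with no edges
        return fiedler, 0.0
    p = eigvals / total                     # spectrum as a distribution
    p = p[p > 0]
    entropy = float(-(p * np.log(p)).sum())
    return fiedler, entropy


# Uniform attention over four tokens: a fully connected token graph.
uniform = np.full((4, 4), 0.25)
f_uniform, h_uniform = spectral_metrics(uniform)

# Two pairs of tokens that attend only within their own pair.
blocks = np.kron(np.eye(2), np.full((2, 2), 0.5))
f_blocks, h_blocks = spectral_metrics(blocks)
```

The Fiedler value measures how well connected the token graph is: it is zero for the two-block matrix, whose graph is disconnected, and positive for uniform attention.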
Paper • ▲ 73 • research-paper • advanced
- Conventional RAG memories act as static fact repositories, neglecting the higher‑order relations needed for deep reasoning.
- HGMem models the working memory as a hypergraph where each hyperedge groups related facts, enabling progressive construction of complex relational structures.
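A hypergraph memory of this kind can be represented with two dictionaries. The class below is an illustrative sketch only: `HypergraphMemory`, `add_fact`, `relate`, and `neighbors` are hypothetical names, not HGMem's API.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class HypergraphMemory:
    """Facts are nodes; each hyperedge groups any number of related facts."""
    facts: Dict[str, str] = field(default_factory=dict)
    edges: Dict[str, Set[str]] = field(default_factory=dict)

    def add_fact(self, fact_id: str, text: str) -> None:
        self.facts[fact_id] = text

    def relate(self, edge_id: str, fact_ids: Iterable[str]) -> None:
        """Create a hyperedge, or grow an existing one with more facts."""
        ids = set(fact_ids)
        unknown = ids - set(self.facts)
        if unknown:
            raise KeyError(f"unknown facts: {sorted(unknown)}")
        self.edges.setdefault(edge_id, set()).update(ids)

    def neighbors(self, fact_id: str) -> Set[str]:
        """Facts that share at least one hyperedge with fact_id."""
        out: Set[str] = set()
        for members in self.edges.values():
            if fact_id in members:
                out |= members
        out.discard(fact_id)
        return out


mem = HypergraphMemory()
mem.add_fact("f1", "Alice founded Acme.")
mem.add_fact("f2", "Acme acquired Beta.")
mem.add_fact("f3", "Beta builds batteries.")

mem.relate("acme_story", ["f1", "f2"])  # initial relation over two facts
mem.relate("acme_story", ["f3"])        # same hyperedge grows as evidence arrives
related = mem.neighbors("f1")
```

Unlike an ordinary graph edge, one hyperedge here links all three facts at once, and calling `relate` again on the same edge is one simple way to model progressive construction.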
Paper • ▲ 32 • research-paper • advanced
- DLCM learns variable‑length “concepts” on the fly, moving computation from dense token streams to a compact latent space where reasoning is cheaper and more focused.
- A new compression‑aware scaling law separates token‑level capacity, concept‑level reasoning capacity, and compression ratio, allowing principled FLOP allocation across the hierarchy.