NLP - Learning Library

Paper

Learnable Multipliers for Adaptive Scale in LLM Matrix Layers

▲ 27 • research-paper • advanced

Attaching a learnable scalar multiplier to each weight matrix lets the model escape the suboptimal weight‑norm equilibrium imposed by fixed weight decay.
Extending this idea to per‑row and per‑column multipliers further frees individual dimension scales, yielding a more expressive variant of μP‑style scaling.
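
As a minimal sketch of the two summary points above, assume a plain linear layer y = W x. A scalar multiplier rescales the whole matrix, while per-row and per-column vectors rescale individual output and input dimensions. The names `scaled_matvec`, `s`, `r`, and `c` are illustrative and not taken from the paper.

```python
def scaled_matvec(W, x, s=1.0, r=None, c=None):
    """Compute y_i = s * r_i * sum_j W_ij * c_j * x_j.

    s: scalar multiplier for the whole matrix (hypothetical name)
    r: per-row (output-dimension) multipliers, defaults to all ones
    c: per-column (input-dimension) multipliers, defaults to all ones
    """
    n_out, n_in = len(W), len(W[0])
    r = [1.0] * n_out if r is None else r
    c = [1.0] * n_in if c is None else c
    return [
        s * r[i] * sum(W[i][j] * c[j] * x[j] for j in range(n_in))
        for i in range(n_out)
    ]


W = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]

y_plain = scaled_matvec(W, x)                                # no multipliers
y_scalar = scaled_matvec(W, x, s=0.5)                        # one scalar per matrix
y_rowcol = scaled_matvec(W, x, r=[2.0, 1.0], c=[1.0, 0.0])   # per-row and per-column
```

In training, `s`, `r`, and `c` would presumably be trainable parameters alongside `W`, so the effective scale of the layer can move even when weight decay holds the norm of `W` near its equilibrium.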

Paper

▲ 16 • research-paper • advanced

RelayLLM lets a small language model (SLM) act as a controller, emitting a special command token to summon the large model only for critical tokens, which reduces large‑model usage to about 1% of generated tokens.
A two‑stage training regimen (warm‑up plus Group Relative Policy Optimization) teaches the SLM when to generate autonomously and when to request help, balancing independence with strategic assistance.
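
The relay loop described above might look roughly like the following sketch. Everything here is an assumption for illustration: the token name `CALL_TOKEN`, the stub models, and the simplification that each call to the large model supplies exactly one token.

```python
CALL_TOKEN = "<call_llm>"  # hypothetical name for the special command token


def relay_generate(small_step, large_step, prompt, max_tokens):
    """Generate with the small model, handing single tokens to the large model on request."""
    tokens = list(prompt)
    large_calls = 0
    for _ in range(max_tokens):
        nxt = small_step(tokens)
        if nxt == CALL_TOKEN:
            # The small model asked for help: the large model supplies this token.
            nxt = large_step(tokens)
            large_calls += 1
        tokens.append(nxt)
    generated = tokens[len(prompt):]
    usage = large_calls / max(1, len(generated))  # fraction of tokens from the large model
    return generated, usage


# Stub models for demonstration only: the small model requests help whenever
# the current context length is a multiple of 4.
def small_stub(tokens):
    return CALL_TOKEN if len(tokens) % 4 == 0 else "s"


def large_stub(tokens):
    return "L"


generated, usage = relay_generate(small_stub, large_stub, ["q"], 8)
```

With these stubs the large model contributes 2 of 8 tokens. A trained controller would instead learn when to emit the command token.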

Paper

▲ 5 • research-paper • advanced

Pure LLM judges often mis‑evaluate complex, multi‑step outputs because they lack explicit reasoning and verification mechanisms.
The paper introduces a modular “agent‑as‑judge” system that first plans an evaluation strategy, then invokes external tools (e.g., calculators, code runners) to verify intermediate claims.
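
A toy version of the plan-then-verify flow is sketched below, assuming each step of the output being judged may carry one arithmetic claim that a calculator tool can check. The data layout and the function names `plan`, `verify`, and `judge` are hypothetical, not the paper's interface.

```python
import operator

# A tiny "calculator" tool: maps an operator symbol to a function.
TOOLS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def plan(steps):
    """Planning stage: select the steps that contain a checkable claim."""
    return [s for s in steps if s.get("claim") is not None]


def verify(step):
    """Verification stage: recompute the claim with the tool and compare."""
    a, op, b, claimed = step["claim"]
    return TOOLS[op](a, b) == claimed


def judge(steps):
    """Return an overall verdict plus the per-step check results."""
    checks = [(s["text"], verify(s)) for s in plan(steps)]
    verdict = all(ok for _, ok in checks)
    return verdict, checks


steps = [
    {"text": "3 boxes of 4 apples give 12 apples.", "claim": (3, "*", 4, 12)},
    {"text": "Apples are fruit.", "claim": None},
    {"text": "Eating 5 leaves 8.", "claim": (12, "-", 5, 8)},
]
verdict, checks = judge(steps)
```

Here the third step fails verification (12 - 5 is 7, not 8), so the verdict is negative even though the final answer might look plausible to a judge that only reads the text.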

Paper

research-paper • advanced

Introduces GREx, a unified benchmark that expands traditional referring expression tasks (RES, REC, REG) to support single‑target, multi‑target, and no‑target expressions, enabling more realistic and flexible language‑vision interactions.
Releases gRefCOCO, the first large‑scale dataset containing annotated images with all three expression types, while remaining backward‑compatible with existing RES/REC datasets for fair comparison.

Paper

research-paper • advanced

The paper argues that traditional Transformers and RNNs reside in a “Metric Phase” where causal order can be broken by semantic noise, causing hallucinations.
It proposes formulating inference as a Symmetry‑Protected Topological (SPT) phase, in which logical operations become analogous to non‑Abelian anyon braiding and are claimed to be immune to local perturbations.

Paper

research-paper • advanced

Making SSM parameters input‑dependent gives the model content‑based gating, allowing selective propagation or forgetting of information and closing the performance gap with attention on discrete modalities.
A hardware‑aware parallel recurrence algorithm restores efficiency lost by dropping convolutions, delivering true linear‑time computation with constant‑factor speedups on modern GPUs/TPUs.
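
The content-based gating in the first point can be illustrated with a one-dimensional recurrence whose gate depends on the current input. This is a simplified sketch, not the paper's implementation: the parameters `w` and `b` are hypothetical, the state is a single scalar, and the loop is a plain sequential scan rather than the hardware-aware parallel algorithm.

```python
import math


def selective_scan(xs, w, b):
    """Run h_t = (1 - g_t) * h_{t-1} + g_t * x_t with g_t = sigmoid(w * x_t + b)."""
    h = 0.0
    out = []
    for x in xs:
        g = 1.0 / (1.0 + math.exp(-(w * x + b)))  # input-dependent gate in (0, 1)
        h = (1.0 - g) * h + g * x                 # small g keeps the old state
        out.append(h)
    return out


# With w = 0 the gate ignores the input and the model is a fixed-rate average.
fixed = selective_scan([2.0, 0.0], w=0.0, b=0.0)

# With an input-dependent gate, a zero input barely changes the state,
# so the earlier value 1.0 is remembered across it.
selective = selective_scan([1.0, 0.0], w=10.0, b=-5.0)
```

Each step costs constant work, so the scan is linear in sequence length. Because the gate changes with the input, the recurrence can no longer be written as a fixed convolution, which is the efficiency loss the second point refers to.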

Paper

research-paper • advanced

Treating attention matrices as token‑level graphs lets spectral analysis separate sound from unsound mathematical proofs.
Four graph‑spectral metrics (Fiedler value, high‑frequency energy ratio, smoothness, spectral entropy) achieve very large effect sizes (Cohen’s d up to 3.30) across seven models from four families, without any training or fine‑tuning.
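
As an illustration of one of the metrics named above, the Fiedler value is the second-smallest eigenvalue of a graph Laplacian. The sketch below assumes the attention matrix is symmetrized into an undirected token graph with self-loops removed. That preprocessing and the function name `fiedler_value` are assumptions, not details confirmed by the summary.

```python
import numpy as np


def fiedler_value(attn):
    """Second-smallest eigenvalue of the Laplacian of a token graph."""
    A = np.asarray(attn, dtype=float)
    W = (A + A.T) / 2.0              # symmetrize into an undirected graph (assumption)
    np.fill_diagonal(W, 0.0)         # drop self-loops
    L = np.diag(W.sum(axis=1)) - W   # combinatorial Laplacian L = D - W
    eigvals = np.linalg.eigvalsh(L)  # eigenvalues in ascending order
    return float(eigvals[1])


# Fully connected 3-token graph: well connected, large Fiedler value.
dense = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]

# Two token pairs with no attention between them: Fiedler value near zero.
split = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]

f_dense = fiedler_value(dense)
f_split = fiedler_value(split)
```

A value near zero signals a token graph that almost falls apart into disconnected pieces. Larger values indicate tighter global connectivity.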

Paper

▲ 73 • research-paper • advanced

Conventional RAG memories act as static fact repositories, neglecting the higher‑order relations needed for deep reasoning.
HGMem models the working memory as a hypergraph where each hyperedge groups related facts, enabling progressive construction of complex relational structures.
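
A hypergraph memory of this kind could be represented as in the sketch below, where a hyperedge is a named set that may join any number of facts and can grow over time. The class and method names are hypothetical and are not HGMem's actual interface.

```python
class HypergraphMemory:
    """Toy working memory: facts are nodes, hyperedges group related facts."""

    def __init__(self):
        self.facts = {}        # fact_id -> text
        self.hyperedges = {}   # edge name -> set of fact_ids

    def add_fact(self, fact_id, text):
        self.facts[fact_id] = text

    def link(self, edge, fact_ids):
        """Create a hyperedge or extend an existing one with more facts."""
        missing = [f for f in fact_ids if f not in self.facts]
        if missing:
            raise KeyError(f"unknown facts: {missing}")
        self.hyperedges.setdefault(edge, set()).update(fact_ids)

    def related(self, fact_id):
        """Return all facts that share at least one hyperedge with fact_id."""
        out = set()
        for members in self.hyperedges.values():
            if fact_id in members:
                out |= members
        out.discard(fact_id)
        return out


mem = HypergraphMemory()
mem.add_fact("f1", "Company A acquired Company B.")
mem.add_fact("f2", "Company B makes sensors.")
mem.add_fact("f3", "Company A sells cars.")

mem.link("acquisition", ["f1", "f2"])   # first retrieval round
mem.link("acquisition", ["f3"])         # later round extends the same relation
neighbors = mem.related("f1")
```

Unlike a flat fact store, or a graph whose edges join only pairs, one hyperedge here ties three facts into a single relation that is built up across retrieval rounds.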

Paper

▲ 32 • research-paper • advanced

DLCM learns variable‑length “concepts” on the fly, moving computation from dense token streams to a compact latent space where reasoning is cheaper and more focused.
A new compression‑aware scaling law separates token‑level capacity, concept‑level reasoning capacity, and compression ratio, allowing principled FLOP allocation across the hierarchy.
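
To make the compression ratio concrete, the sketch below groups a token sequence into variable-length segments and pools each segment into one "concept". The boundary rule (a jump between adjacent scalar features above a threshold) and mean pooling are stand-in assumptions. DLCM learns its boundaries, and the summary does not describe how.

```python
def segment_into_concepts(features, threshold):
    """Start a new segment whenever adjacent features differ by more than threshold."""
    if not features:
        return []
    segments = [[features[0]]]
    for prev, cur in zip(features, features[1:]):
        if abs(cur - prev) > threshold:
            segments.append([cur])      # boundary: open a new concept
        else:
            segments[-1].append(cur)    # same concept: extend the current segment
    return segments


def pool(segments):
    """Represent each concept by the mean of its token features."""
    return [sum(s) / len(s) for s in segments]


features = [0.1, 0.2, 0.15, 0.9, 1.0, 0.05]   # one scalar feature per token
segments = segment_into_concepts(features, threshold=0.5)
concepts = pool(segments)
compression_ratio = len(features) / len(concepts)   # tokens per concept
```

Six tokens collapse into three concepts of lengths 3, 2, and 1, for a compression ratio of 2. Any reasoning stack that runs over `concepts` processes half as many positions as one that runs over tokens.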