Hierarchical Language Modeling with Dynamic Concept Compression

Authors: Xingwei Qu,
Organization: Hugging Face
Published: 2026-01-03 • Added: 2026-01-03

Key Insights

  • DLCM learns variable‑length “concepts” on the fly, moving computation from dense token streams to a compact latent space where reasoning is cheaper and more focused (see the sketch after this list).
  • A new compression‑aware scaling law separates token‑level capacity, concept‑level reasoning capacity, and compression ratio, allowing principled FLOP allocation across the hierarchy.
  • The decoupled μP parametrization lets hyperparameters transfer zero‑shot across model widths and compression settings, stabilizing training of heterogeneous architectures.
  • In experiments (R = 4, ~4 tokens per concept) DLCM redirects about one‑third of inference FLOPs to a higher‑capacity reasoning backbone, yielding ~+2.7% average gains on 12 zero‑shot tasks at equal compute.
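
As a rough illustration of the first insight, here is a minimal, hypothetical sketch of boundary-based concept compression: a learned per-token boundary score ends each variable-length span, and the tokens in a span are mean-pooled into a single concept vector, shrinking the sequence by roughly the compression ratio R. The `ConceptCompressor` class, the 0.5 threshold, and the mean-pooling rule are assumptions made for this sketch, not the paper's actual DLCM architecture.

```python
# Hypothetical sketch of boundary-based concept compression (not the paper's code).
# A token whose boundary score exceeds the threshold ends a "concept"; tokens within
# each variable-length span are mean-pooled into one concept vector.
import torch
import torch.nn as nn

class ConceptCompressor(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.boundary_scorer = nn.Linear(d_model, 1)  # per-token end-of-concept logit

    def forward(self, hidden: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
        # hidden: (seq_len, d_model) token-level representations for one sequence
        probs = torch.sigmoid(self.boundary_scorer(hidden)).squeeze(-1)  # (seq_len,)
        is_boundary = probs > threshold
        is_boundary[-1] = True  # force the final token to close the last concept

        concepts, start = [], 0
        for t in range(hidden.size(0)):
            if is_boundary[t]:
                # pool the variable-length span [start, t] into a single concept vector
                concepts.append(hidden[start : t + 1].mean(dim=0))
                start = t + 1
        # (num_concepts, d_model), with num_concepts roughly seq_len / R
        return torch.stack(concepts)

# Toy usage: 32 token states compressed into a handful of concept vectors.
compressor = ConceptCompressor(d_model=64)
tokens = torch.randn(32, 64)
concepts = compressor(tokens)
print(tokens.shape, "->", concepts.shape)
```

A heavier "reasoning" stack would then run only over the shorter concept sequence, which is where the compute savings in the last insight come from.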

Abstract

Large Language Models (LLMs) apply uniform computation to all tokens, despite language exhibiting highly non-uniform information density. This token-uniform regime wastes capacity on locally predictable spans while under-allocating computation to semantically critical transitions. We propose Dynamic Large Concept Models (DLCM), a hierarchical language modeling framework that learns semantic boundaries from latent representations and shifts computation from tokens to a compressed concept space where reasoning is more efficient. DLCM discovers variable-length concepts end-to-end without relying on predefined linguistic units. Hierarchical compression fundamentally changes scaling behavior. We introduce the first compression-aware scaling law, which disentangles token-level capacity, concept-level reasoning capacity, and compression ratio, enabling principled compute allocation under fixed FLOPs. To stably train this heterogeneous architecture, we further develop a decoupled μP parametrization that supports zero-shot hyperparameter transfer across widths and compression regimes. At a practical setting (R=4, corresponding to an average of four tokens per concept), DLCM reallocates roughly one-third of inference compute into a higher-capacity reasoning backbone, achieving a +2.69% average improvement across 12 zero-shot benchmarks under matched inference FLOPs.
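
The matched-FLOPs claim can be made concrete with back-of-the-envelope accounting. The sketch below uses a hypothetical split between token-level modules and the reasoning backbone, plus the standard ~2·params FLOPs-per-token estimate; none of these numbers come from the paper. It shows the mechanism only: running the backbone on a sequence 1/R as long frees per-token compute that can buy a larger backbone at equal inference cost.

```python
# Illustrative FLOP accounting for compute reallocation under concept compression.
# The parameter split and the 2*params-per-token cost model are assumptions for
# this sketch, not figures reported in the paper.

def flops_per_token(params: float) -> float:
    """Rough dense-transformer forward cost: ~2 FLOPs per parameter per token."""
    return 2.0 * params

R = 4.0                       # compression ratio: average tokens per concept
token_level_params = 1.0e9    # hypothetical modules that still see every token
baseline_backbone = 2.0e9     # hypothetical reasoning backbone in a flat baseline

# Flat baseline: every parameter is applied at every token position.
baseline_cost = flops_per_token(token_level_params + baseline_backbone)

def dlcm_cost(backbone_params: float) -> float:
    # The backbone runs on seq_len / R concept positions, so its cost amortized
    # over tokens shrinks by a factor of R.
    return flops_per_token(token_level_params) + flops_per_token(backbone_params) / R

# Largest backbone affordable at matched inference FLOPs.
matched_backbone = (baseline_cost - flops_per_token(token_level_params)) * R / 2.0

print(f"backbone can grow {matched_backbone / baseline_backbone:.1f}x at equal FLOPs")
print(f"check: {dlcm_cost(matched_backbone):.3e} vs baseline {baseline_cost:.3e}")
```

With these made-up numbers the backbone can grow 4x at matched cost; the paper's reported effect (roughly one-third of inference compute redirected, +2.69% average over 12 benchmarks) depends on its actual module sizes and compression ratio.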

Full Analysis

Source: [HuggingFace](https://huggingface.co/papers/2512.24617) | [arXiv](https://arxiv.org/abs/2512.24617)

Topics: nlp, efficiency • Difficulty: advanced • Upvotes: 32