The Latent of Latents: Hierarchical Finite Representations of Knowledge Families
Abstract
The Latent Theorem guarantees that any smooth system has a finite representation whose size depends on regularity and accuracy, not on ambient dimensionality. We extend this result to families of smooth systems. When a collection of systems — indexed by a parameter such as domain, task, or model identity — varies smoothly, the family's Latent representations form a structured object that itself admits a finite Latent. We call this the Latent of Latents and formalize it as an element of a bi-graded Hilbert tensor algebra \(\Lambda^{(i,j)}\), where \(i\) is the grade within each system and \(j\) is the grade across the family.
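To make the construction concrete, here is a minimal numerical sketch (Python/NumPy; the sizes, names, and the SVD-based factorization are our illustrative assumptions, not the paper's pipeline): each system's within-Latent is summarized by a coefficient vector, the family stacks into a matrix, and that matrix's own low-rank factorization plays the role of the Latent of Latents.

```python
import numpy as np

# Illustrative sketch: each of D systems in a family has a within-Latent
# summarized here as a length-r coefficient vector; the family stacks into
# a D x r matrix. D = 41 and a meta-rank near 10 echo the abstract's numbers,
# but the data below is synthetic.
rng = np.random.default_rng(0)
D, r, R_true = 41, 64, 10
meta_axes = rng.standard_normal((R_true, r))      # shared cross-family structure
coords = rng.standard_normal((D, R_true))         # each system's meta-coordinates
latents = coords @ meta_axes + 0.01 * rng.standard_normal((D, r))

# Factor across the family: rows of Vt are candidate meta-axes, and the
# 95%-variance rank plays the role of R_{95%} in the text.
U, s, Vt = np.linalg.svd(latents, full_matrices=False)
energy = np.cumsum(s**2) / np.sum(s**2)
R = int(np.searchsorted(energy, 0.95)) + 1
print(R)  # estimated meta-rank; close to R_true when the noise term is small
```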
The main theoretical result is the Hierarchical Latent Theorem: if each system in the family has a within-Latent of rank \(r\) and the family mapping has meta-analyticity parameter \(\rho_\mathrm{meta} > 1\), then, for any target accuracy \(\varepsilon\), the entire family is characterized by \(r \times R\) numbers, where \(R = O(\log(1/\varepsilon)/\log \rho_\mathrm{meta})\). This description length is doubly logarithmic in the product of the ambient dimensionalities.
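For a concrete instance (illustrative values of ours, with the constant hidden in the \(O(\cdot)\) suppressed): taking \(\varepsilon = 10^{-3}\) and \(\rho_\mathrm{meta} = 2\) gives
\[
R \approx \frac{\log(1/\varepsilon)}{\log \rho_\mathrm{meta}} = \frac{\ln 10^{3}}{\ln 2} \approx 10,
\]
so a family whose members each carry a rank-\(r = 10\) within-Latent is described by roughly \(r \times R \approx 100\) numbers, with no dependence on the ambient dimensions.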
Applied to large language models, the framework predicts that a model's entire knowledge organization across many domains is captured by a meta-Latent of bounded rank. We validate all five testable predictions across three GPT-2 variants (Small 124M, Medium 355M, DistilGPT-2 82M): meta-rank \(R_{95\%} = 10\) (P1), sub-linear growth \(R \sim 3.6 \log D\) (P2), semantically interpretable meta-axes (P3), factored reconstruction with cosine similarity 0.9998 (P4), and distillation invariance \(S_\mathrm{meta} = 0.90\) (P5).

Cross-scale validation on GPT-2 Medium shows the meta-rank decreasing to \(R_{95\%} = 6\): within the same architecture family, larger models organize knowledge more efficiently. Cross-architecture validation on TinyLlama 1.1B (LLaMA architecture, \(d = 2048\)) reveals that meta-rank scales with hidden dimension across architecture families (\(R_{95\%} = 31\)), while the meta-axes remain semantically interpretable and cross-architecture meta-similarity reaches \(S_\mathrm{meta} = 0.74\). A controlled comparison of TinyLlama Base vs. Chat shows that instruction tuning has minimal effect on meta-rank (\(\Delta R = 1\), \(S_\mathrm{meta} = 0.990\)), establishing that knowledge organization is determined during pre-training. The scaling ratio \(R_{95\%}/d \approx 0.014\) is consistent across both architectures.

Cross-model knowledge transfer via projected \(\Delta W\) successfully moves fine-tuning artifacts between architectures of different dimensionality (768\(\to\)1024). A pure Rust inference engine (built on candle, 10 tok/s) enables the full pipeline. The entire knowledge organization of GPT-2 across 41 domains compresses to 410 numbers, a 302,439:1 compression relative to the model's 124M parameters. Cross-validation against Anthropic's sparse autoencoder (SAE) features confirms that meta-axes and SAE features capture the same underlying structure.
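The headline arithmetic can be checked directly (a sketch; we treat "124M" as exactly 124,000,000 parameters, a rounding assumption of ours):

```python
# Compression quoted above: 41 domains x meta-rank 10 = 410 numbers.
domains, meta_rank = 41, 10
meta_numbers = domains * meta_rank                 # 410
params = 124_000_000                               # GPT-2 Small, approximated
print(meta_numbers, round(params / meta_numbers))  # 410 302439

# Scaling ratio R_95% / d for the two architecture families:
print(10 / 768, 31 / 2048)  # ~0.013 and ~0.015, bracketing the quoted ~0.014
```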