The Spectral Theory of LLM Understanding: When Can We Trust a Language Model?
Abstract
Large language models are deployed in medicine, law, education, and critical infrastructure — yet we cannot mathematically characterize when their outputs are trustworthy. We present a spectral framework that answers three fundamental questions about LLM internals: (1) When is the output grounded? We define a grounding ratio \(G = n \cdot a_{\max}\) from the attention matrix's spectral properties and prove \(G \geq 1 + \gamma\) for doubly stochastic attention, where \(\gamma\) is the spectral gap — the same quantity that controls token convergence. (2) When does the model hallucinate? We define the capacity mismatch \(M = (r_{\text{eff}} - d_{\text{ctx}}) / r_{\text{eff}}\) as the fraction of attention capacity exceeding the context's intrinsic dimension, and prove it is monotonically increasing in effective rank and decreasing in context richness. (3) When is a layer interpretable? We prove that the number of spectral components needed to decompose a layer's computation is \(N^* = \log(1/\varepsilon) / \log \rho\), establishing a phase transition: for \(\rho \geq \rho^*\), features are monosemantic (one neuron per feature); below \(\rho^*\), features superpose and interpretation requires exponentially more effort.
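The three quantities above can be sketched numerically. This is an illustrative sketch only: the estimators below (the entropy-based effective rank, and treating \(a_{\max}\) as the largest entry of the attention matrix) are our assumptions, not the paper's verified definitions.

```python
import numpy as np

def grounding_ratio(A):
    """G = n * a_max for an n x n attention matrix A (a_max taken as the
    largest attention weight; an assumption about the paper's a_max)."""
    n = A.shape[0]
    return n * A.max()

def effective_rank(A):
    """Entropy-based effective rank of the singular-value spectrum
    (one common estimator for r_eff; the paper may use another)."""
    s = np.linalg.svd(A, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

def capacity_mismatch(r_eff, d_ctx):
    """M = (r_eff - d_ctx) / r_eff: fraction of attention capacity
    exceeding the context's intrinsic dimension."""
    return (r_eff - d_ctx) / r_eff

def num_components(eps, rho):
    """N* = log(1/eps) / log(rho), valid for rho > 1
    (matching the paper's log-positivity assumption)."""
    return np.log(1.0 / eps) / np.log(rho)
```

As a sanity check, uniform attention over \(n\) tokens gives \(a_{\max} = 1/n\) and hence \(G = 1\), the degenerate (ungrounded) case, consistent with \(G \geq 1 + \gamma\) since uniform attention has zero spectral gap.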
All 56 narrative theorems are machine-verified with zero sorry statements; two structural facts (log positivity for \(\rho > 1\)) are trusted by the kernel. Beyond the spectral characterization, we define — motivated by Markov chain perturbation theory (Cho & Meyer, 2001) — a single scalar metric, the Perturbation Resilience Index: \(\text{PRI} = \gamma \cdot (1 - 1/G)\), where \(\gamma \in [0, 1]\) is the spectral gap (defined as \(\gamma = 1 - |\lambda_2|\), with \(\lambda_2\) the subdominant eigenvalue in modulus) and \(G\) the grounding ratio. We prove PRI \(\in [0, 1)\), monotone in both quantities, zero for uniform attention or zero spectral gap, and that the complementary Structural Risk \(\text{SR} = 1 - \text{PRI}\) provides a single-number measure of attention-structure vulnerability. (Empirical validation on GPT-2: the perturbation bound holds in 99.3% of 10,368 measurements, and SR correlates with attention instability at \(r = 0.52\), \(p < 10^{-100}\). Calibration from attention stability to output correctness on frontier models is future work.) The framework extends to multi-head composition, layer propagation, training dynamics (including grokking as an interpretability phase transition), information-theoretic bounds, and spectral concept decomposition.
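The PRI and SR definitions can be made concrete with a short sketch. This is a hedged illustration, not the paper's verified pipeline: estimating \(\lambda_2\) via a dense eigendecomposition of a small stochastic attention matrix is our implementation choice.

```python
import numpy as np

def spectral_gap(A):
    """gamma = 1 - |lambda_2|, with lambda_2 the subdominant
    eigenvalue in modulus of the (stochastic) attention matrix."""
    ev = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
    return 1.0 - ev[1]

def pri(A):
    """Perturbation Resilience Index: PRI = gamma * (1 - 1/G),
    with G = n * a_max the grounding ratio."""
    n = A.shape[0]
    G = n * A.max()
    return spectral_gap(A) * (1.0 - 1.0 / G)

def structural_risk(A):
    """SR = 1 - PRI: single-number attention-structure vulnerability."""
    return 1.0 - pri(A)
```

Consistent with the claims proved in the paper, uniform attention yields \(G = 1\) and hence PRI \(= 0\) (SR \(= 1\)), while a doubly stochastic matrix with concentrated diagonal mass yields a PRI strictly between 0 and 1.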
One-sentence summary: We define and formally verify a single metric — the Perturbation Resilience Index — that quantifies LLM output reliability through the spectral gap and grounding ratio of the attention matrix.
Novelty
A single coherent spectral/Markov-chain-style vocabulary (grounding ratio, capacity mismatch, PRI) that ties attention concentration, an effective-rank-based hallucination proxy, and a log-threshold interpretability story to the same spectral-gap quantity, packaged with a claimed machine-checked proof layer.