ML Spectral Intelligence
Abstract
We derive neural scaling laws, transformer convergence rates, and self-improvement limits from a single principle: the eigenvalue decay of the data covariance matrix. For data with spectral exponent \(s\) (eigenvalues \(\lambda_k \sim k^{-s}\)), we prove:
1. Scaling Law: The compute-optimal loss scales as \(L^*(C) \sim C^{-(s-1)/(s+1)}\), with optimal allocation \(N \sim C^{1/(s+1)}\), \(D \sim C^{s/(s+1)}\). For language (\(s \approx 1\)): \(N \approx D\) (Chinchilla).
2. Transformer Convergence: Residual attention with spectral gap \(\lambda_2\) and residual step size \(\varepsilon\) drives tokens to clusters at rate \((1 - \varepsilon \lambda_2)^L\) in the depth \(L\).
3. Self-Improvement Limits: Self-improvement on synthetic data converges at fixed compute (the quality sequence is bounded and monotone), but under growing compute it is unbounded: there is no fundamental ceiling.
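The scaling-law claim can be checked numerically. The sketch below (ours, not the paper's code) does two things: (i) evaluates the stated exponents as functions of \(s\), and (ii) recovers the \(N \sim C^{1/(s+1)}\) allocation by brute force from one loss ansatz that is consistent with those exponents — the specific form \(L(N, D) = N^{-(s-1)} + D^{-(s-1)/s}\) under \(C = N D\) is our assumption, not taken from the paper.

```python
import numpy as np

def scaling_exponents(s):
    """Return (loss, N, D) exponents: L* ~ C^-l, N ~ C^n, D ~ C^d."""
    return (s - 1) / (s + 1), 1 / (s + 1), s / (s + 1)

def optimal_split(s, C=1e30):
    """Grid-search x minimizing the (assumed) loss ansatz
    L = N^-(s-1) + D^-((s-1)/s) with N = C^x, D = C^(1-x)."""
    xs = np.linspace(0.01, 0.99, 9801)
    loss = C ** (-(s - 1) * xs) + C ** (-((s - 1) / s) * (1 - xs))
    return xs[np.argmin(loss)]

for s in (1.0, 2.0, 3.0):
    l, n, d = scaling_exponents(s)
    print(f"s={s}: L* ~ C^-{l:.3f}, N ~ C^{n:.3f}, D ~ C^{d:.3f}")

print(optimal_split(3.0))  # approaches 1/(s+1) = 0.25 as C grows
```

For \(s = 1\) the \(N\) and \(D\) exponents coincide at \(1/2\), i.e. parameters and tokens grow together, which is the Chinchilla allocation cited above.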
The mathematical structure is machine-verified in Lean 4 (38 files, ~200 theorems, zero sorry). The spectral exponent \(s\) is a single measurable number that determines scaling speed, optimal resource allocation, convergence dynamics, and self-improvement rates. Qualitative predictions — the ordering of scaling behaviour across data types and the Chinchilla-optimal allocation for language — are validated against synthetic experiments. However, the hard-truncation model overpredicts exact scaling exponents by up to 50\(\times\) for structured data (\(s = 3\)); a soft-truncation correction improves the fit but introduces a new tension with Chinchilla allocation. Mapping the data spectral exponent \(s_{\text{data}}\) to the effective learning exponent \(s_{\text{eff}}\) remains the key open problem.
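Since the claim is that \(s\) is a single measurable number, it is worth showing one way to measure it. The sketch below (ours, not the paper's procedure) draws synthetic data whose population covariance eigenvalues decay as \(\lambda_k \sim k^{-s}\), then recovers \(s\) by a log-log linear fit to the head of the sample covariance spectrum; the choice of fitting the top 32 eigenvalues is an arbitrary illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, s_true = 64, 20000, 2.0

# Independent Gaussian coordinates scaled so that Var(x_k) = k^-s,
# giving a population covariance spectrum lambda_k = k^-s.
scales = np.arange(1, d + 1) ** (-s_true / 2)
X = rng.standard_normal((n, d)) * scales

# Sample covariance eigenvalues, sorted in decreasing order.
eigs = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]

# Log-log fit on the head of the spectrum: slope ~ -s.
k = np.arange(1, 33)
slope, _ = np.polyfit(np.log(k), np.log(eigs[:32]), 1)
print(-slope)  # ~ s_true
```

With \(n \gg d\), sample-eigenvalue fluctuations are at the percent level, so the fitted slope sits close to the true exponent; in practice the fit range matters when the spectrum has a non-power-law head or tail.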