Neural Scaling Laws Formalized: Why Chinchilla Works (A Machine-Verified Derivation)
Abstract
Neural scaling laws — the empirical observation that test loss decreases as a power law in the compute budget — are the foundation of modern AI training strategy. Every major laboratory trains billion-dollar models by extrapolating scaling curves, yet why these power laws hold remains unexplained. We derive neural scaling laws from a single structural assumption: the eigenvalue spectrum of the data covariance matrix decays as \(\lambda_k \sim k^{-s}\), where the spectral exponent \(s > 1\) characterizes the data distribution. From this assumption, the scaling law follows. Under a hard-truncation (sharp-cutoff) model, the optimal test loss scales as \(L^*(C) \sim C^{-(s-1)/(s+1)}\); under the more realistic soft-truncation model appropriate for ridge regression and neural networks, the exponent is \((s-1)/s\). In both cases, the Chinchilla result — \(N \propto D\), the rule that prompted a rebalancing of compute at every frontier lab — emerges as the limiting case \(s \to 1^+\) corresponding to Zipfian (language) data. For \(s = 2\) (natural images): \(N \sim C^{1/3}\), \(D \sim C^{2/3}\) — images need proportionally more data per parameter than language. For \(s = 3\) (audio): \(N \sim C^{1/4}\), \(D \sim C^{3/4}\). These are falsifiable predictions. We also show that grokking — the sudden emergence of generalization after prolonged memorization — admits a spectral explanation: if mode \(k\) takes time \(t_k \sim k^s\) to learn, slow generalization modes emerge abruptly after fast memorization modes saturate. The hard-truncation model (73 theorems across 12 files) is formally verified in Lean 4 with Mathlib, with zero `sorry` placeholders and no unresolved goals. Preliminary numerical validation on synthetic spectral data confirms the qualitative predictions. This is the first machine-checked derivation of neural scaling laws.
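The allocation exponents quoted above can be sanity-checked numerically. The sketch below assumes a hypothetical two-term loss \(L(N, D) = N^{-(s-1)} + D^{-(s-1)/s}\) (spectral-tail error plus a data-estimation term; this exact decomposition is an illustrative assumption chosen to be consistent with the stated exponents, not a formula taken from the paper) and recovers \(N^* \sim C^{1/(s+1)}\) by grid search under the constraint \(N \cdot D = C\):

```python
import numpy as np

def optimal_allocation(s, C_values, n_grid=4000):
    """For each compute budget C, minimize the assumed two-term loss
    L(N, D) = N^-(s-1) + D^-((s-1)/s)  subject to N * D = C,
    by grid search over N. Returns the loss-minimizing N for each C."""
    N_stars = []
    for C in C_values:
        N = np.logspace(0, np.log10(C), n_grid)  # candidate model sizes
        D = C / N                                # data budget forced by N * D = C
        L = N ** -(s - 1) + D ** -((s - 1) / s)  # hypothetical loss decomposition
        N_stars.append(N[np.argmin(L)])
    return np.array(N_stars)

s = 2.0  # spectral exponent for natural images, per the abstract
C_values = np.logspace(6, 12, 7)
N_stars = optimal_allocation(s, C_values)
# Fit the slope of log N* vs log C; prediction is 1/(s+1).
slope = np.polyfit(np.log(C_values), np.log(N_stars), 1)[0]
print(f"fitted exponent for N*(C): {slope:.3f}  (predicted 1/(s+1) = {1/(s+1):.3f})")
```

For \(s = 2\) the fitted exponent should land near \(1/3\), matching the natural-image prediction; rerunning with `s = 3.0` should give roughly \(1/4\), the audio case.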