
Why Neural Networks Scale: A Complete Latent-Theoretic Foundation

Dr. Tamás Nagy · Updated 2026-04-09 · Draft · machine_learning · Lean-Verified
Mathematics verified. Core theorems are machine-checked in Lean 4. Prose and presentation may not have been human-reviewed.

Abstract

We present a unified mathematical theory of neural scaling laws derived from the spectral structure of data distributions. The central object is the Latent Number \(\rho \in (0, \infty)\), which measures the rate at which a distribution's spectral coefficients decay. From \(\rho\) alone, we derive: (1) the scaling exponent \(\alpha = \beta \cdot \log \rho\) linking optimizer efficiency \(\beta\) to data structure; (2) a spectral phase transition at \(\rho = 1\) explaining grokking; (3) generalization bounds \(O(\sqrt{N^{\rho}/n})\) with concentration \(P(\text{gap} > \varepsilon) \leq 2\exp(-2n\varepsilon^2/N^{\rho})\); (4) transformer expressivity requiring \(O(N^{2})\) parameters per head; (5) the inevitability of double descent when variance saturates at \(\sigma^2 N^{\rho}/n\); (6) sparse MoE efficiency \(A \cdot N^{2} < K \cdot D^2\); (7) emergent abilities as predictable phase transitions ordered by \(N^{\rho_T}\); (8) catastrophic forgetting as \(\rho\) collapse; (9) information bottleneck optimality at \(N^{\rho}\) modes; (10) optimization landscape smoothness \(\propto \log\rho\); (11) alignment efficiency scaling as \(N^{\rho_{\text{pref}}}\) with reward hacking when \(\rho_{\text{rew}} < \rho_{\text{pref}}\); and (12) adversarial robustness governed by the attack surface \(D - N^{\rho}\). The chain of lemmas is machine-checked in the Lean 4 proof environment (146 theorems in 11 files; see §15.6). The theory makes testable predictions: scaling exponents are computable from data spectra, grokking onset is predictable from \(\rho(t)\) dynamics, capability emergence ordering is determined by \(N^{\rho_T}\), and adversarial vulnerability is bounded by the gap between ambient and effective dimension.
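Two quantities in the abstract are directly computable: the Latent Number \(\rho\), defined as the decay rate of a distribution's spectral coefficients, and the concentration bound \(P(\text{gap} > \varepsilon) \leq 2\exp(-2n\varepsilon^2/N^{\rho})\). The sketch below is illustrative only and is not taken from the paper's Lean development: it assumes an exact power-law spectrum \(\lambda_k \sim k^{-\rho}\) and recovers \(\rho\) as the negated log-log slope, then evaluates the stated bound. The function names `estimate_rho` and `gap_bound` are hypothetical.

```python
import math

def estimate_rho(coeffs):
    """Estimate the Latent Number rho as the negated least-squares
    slope of log|lambda_k| against log k, assuming lambda_k ~ k^(-rho)."""
    xs = [math.log(k + 1) for k in range(len(coeffs))]
    ys = [math.log(abs(c)) for c in coeffs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return -num / den

def gap_bound(n_samples, N, rho, eps):
    """Abstract's concentration bound: P(gap > eps) <= 2 exp(-2 n eps^2 / N^rho)."""
    return 2.0 * math.exp(-2.0 * n_samples * eps ** 2 / N ** rho)

# Synthetic spectrum with exact decay rate rho = 1.5.
coeffs = [(k + 1) ** -1.5 for k in range(100)]
rho = estimate_rho(coeffs)  # recovers 1.5 exactly on noiseless power-law data
bound = gap_bound(1000, 100, rho, 0.1)
```

On noisy spectra the slope fit would only approximate \(\rho\); the bound itself tightens as the sample count \(n\) grows and loosens as the effective dimension \(N^{\rho}\) grows, matching the \(O(\sqrt{N^{\rho}/n})\) generalization rate quoted above.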

Keywords: neural scaling laws, grokking, double descent, spectral theory, Latent Number, effective dimension, transformer expressivity, sparse activation, alignment, adversarial robustness, information bottleneck

MSC 2020: 68T07 (Machine learning), 41A25 (Approximation by polynomials), 60E15 (Inequalities)

Length: 6,042 words
Claims: 1 theorem
Status: draft
Target: JMLR / NeurIPS

Connects To

The Latent: Finite Sufficient Representations of Smooth Syst...
Formal Foundations of Stochastic Gradient Descent
