Why do large language models work?
Neural Scaling Laws
Why does model performance follow power laws in compute, data, and parameters? And when should we expect it to break?
Progress
85% (machine-verified derivation)
Current approach
Formal derivation of Chinchilla scaling from information-theoretic bounds; Lean-verified power-law emergence.
Status notes
Chinchilla exponents derived from first principles and Lean-checked. Adam optimizer convergence separately formalized.
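For reference, the parametric form this derivation targets is the standard Chinchilla loss model (Hoffmann et al., 2022); the fitted exponents quoted below are their empirical values, given here as context rather than as outputs of the formalization:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad N_{\mathrm{opt}}(C) \propto C^{\frac{\beta}{\alpha+\beta}}, \quad D_{\mathrm{opt}}(C) \propto C^{\frac{\alpha}{\alpha+\beta}},$$

where the optimal allocation follows from minimizing $L$ subject to the compute constraint $C \approx 6ND$. With Hoffmann et al.'s fits $\alpha \approx 0.34$ and $\beta \approx 0.28$, both compute exponents land close to $1/2$, recovering the rule of scaling parameters and training tokens in roughly equal proportion.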
Direct contributions
3 papers
Machine Learning
Working Paper
Lean
Flagship
Neural Scaling Laws Formalized: Why Chinchilla Works (A Machine-Verified Derivation)
Neural scaling laws — the empirical observation that test loss decreases as a power law in compute budget — are the foundation of modern AI training strategy. Every major laboratory trains billion-dollar models by extrapolating scaling curves, yet *why* these power laws hold remains unexplained.
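To make "machine-verified" concrete, here is a minimal Lean 4 / Mathlib sketch of what a formal power-law statement can look like; `IsPowerLaw` and the constants are illustrative assumptions, not the paper's actual development:

```lean
-- Hypothetical sketch (Lean 4 + Mathlib); `IsPowerLaw` is an
-- illustrative definition, not the paper's formalization.
import Mathlib.Analysis.SpecialFunctions.Pow.Real

/-- Test loss `L` follows a power law in compute `C`:
an irreducible loss `E` plus `A / C ^ γ` with `A, γ > 0`. -/
def IsPowerLaw (L : ℝ → ℝ) (E A γ : ℝ) : Prop :=
  0 < A ∧ 0 < γ ∧ ∀ C, 0 < C → L C = E + A / C ^ γ

-- An instance with arbitrary illustrative constants.
example : IsPowerLaw (fun C => 2 + 10 / C ^ (0.5 : ℝ)) 2 10 0.5 :=
  ⟨by norm_num, by norm_num, fun C _ => rfl⟩
```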
Machine Learning
Draft
Lean
Why Neural Networks Scale: A Complete Latent-Theoretic Foundation
We present a unified mathematical theory of neural scaling laws derived from the spectral structure of data distributions. The central object is the Latent Number $\rho \in (0, \infty)$, which measures the rate at which a distribution's spectral coefficients decay.
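As one illustration of the kind of condition such a quantity can encode (an assumed convention for this sketch; the paper's precise definition may differ), $\rho$ can index polynomial decay of the spectral coefficients $\lambda_k$:

$$\lambda_k \asymp k^{-\rho} \qquad (k \to \infty),$$

so larger $\rho$ means faster decay, i.e. a distribution whose structure concentrates in fewer latent directions. In spectral analyses of linear and kernel models, decay rates of exactly this form are what give rise to power-law loss curves, which is the mechanism the abstract points to.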
Formal Verification
Working Paper
Lean
Adam's Convergence Proof Was Wrong: A Machine-Checked Verification of the Bug and the Fix
Adam (Kingma & Ba, 2015) is deep learning's most-cited optimizer, with over 100,000 citations and native implementation in every major framework: TensorFlow, PyTorch, JAX, Keras. Its original convergence proof — Theorem 4.1, published at ICLR 2015 — claimed $R_T = O(\sqrt{T})$ regret on convex problems.
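For context, the update rule under analysis is the standard Adam recursion (Kingma & Ba, 2015), with gradient $g_t$, decay rates $\beta_1, \beta_2$, step size $\alpha$, and stabilizer $\epsilon$:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2,$$
$$\hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat v_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_{t+1} = \theta_t - \alpha\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}.$$

The known gap, identified by Reddi, Kale & Kumar (ICLR 2018), is that the Theorem 4.1 proof implicitly requires $\Gamma_t = \sqrt{v_t}/\alpha_t - \sqrt{v_{t-1}}/\alpha_{t-1} \geq 0$ at every step, a monotonicity Adam's exponential moving average does not guarantee; AMSGrad restores it by taking $\hat v_t = \max(\hat v_{t-1}, v_t)$.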