Why do large language models work?
Neural Scaling Laws
Why does model performance follow power laws in compute, data, and parameters? And when should we expect it to break?
Progress
85% (machine-verified derivation)
Current approach
Formal derivation of Chinchilla scaling from information-theoretic bounds; Lean-verified power-law emergence.
Status notes
Chinchilla exponents derived from first principles and Lean-checked. Adam optimizer convergence separately formalized.
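For reference, the parametric form this derivation targets is the standard Chinchilla loss model (Hoffmann et al., 2022); the fitted exponents quoted below are their empirical values, given here as context rather than as outputs of the formalization:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad N_{\mathrm{opt}}(C) \propto C^{\frac{\beta}{\alpha+\beta}}, \quad D_{\mathrm{opt}}(C) \propto C^{\frac{\alpha}{\alpha+\beta}},$$

where the optimal allocation follows from minimizing $L$ subject to the compute constraint $C \approx 6ND$. With Hoffmann et al.'s fits $\alpha \approx 0.34$ and $\beta \approx 0.28$, both compute exponents land close to $1/2$, recovering the rule of scaling parameters and training tokens in roughly equal proportion.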
Direct contributions
3 papers
Machine Learning
Working Paper
Lean
Flagship
Neural Scaling Laws Formalized: Why Chinchilla Works (A Machine-Verified Derivation)
Neural scaling laws — the empirical observation that test loss decreases as a power law in compute budget — are the foundation of modern AI training strategy. Every major laboratory trains billion-dollar models by extrapolating scaling curves, yet *why* these power laws hold remains unexplained.
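To make "machine-verified" concrete, here is a minimal Lean 4 / Mathlib sketch of what a formal power-law statement can look like; `IsPowerLaw` and the constants are illustrative assumptions, not the paper's actual development:

```lean
-- Hypothetical sketch (Lean 4 + Mathlib); `IsPowerLaw` is an
-- illustrative definition, not the paper's formalization.
import Mathlib.Analysis.SpecialFunctions.Pow.Real

/-- Test loss `L` follows a power law in compute `C`:
an irreducible loss `E` plus `A / C ^ γ` with `A, γ > 0`. -/
def IsPowerLaw (L : ℝ → ℝ) (E A γ : ℝ) : Prop :=
  0 < A ∧ 0 < γ ∧ ∀ C, 0 < C → L C = E + A / C ^ γ

-- An instance with arbitrary illustrative constants.
example : IsPowerLaw (fun C => 2 + 10 / C ^ (0.5 : ℝ)) 2 10 0.5 :=
  ⟨by norm_num, by norm_num, fun C _ => rfl⟩
```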
Machine Learning
Draft
Lean
Why Neural Networks Scale: A Complete Latent-Theoretic Foundation
We present a unified mathematical theory of neural scaling laws derived from the spectral structure of data distributions. The central object is the Latent Number $\rho \in (0, \infty)$, which measures the rate at which a distribution's spectral coefficients decay.
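As one illustration of the kind of condition such a quantity can encode (an assumed convention for this sketch; the paper's precise definition may differ), $\rho$ can index polynomial decay of the spectral coefficients $\lambda_k$:

$$\lambda_k \asymp k^{-\rho} \qquad (k \to \infty),$$

so larger $\rho$ means faster decay, i.e. a distribution whose structure concentrates in fewer latent directions. In spectral analyses of linear and kernel models, decay rates of exactly this form are what give rise to power-law loss curves, which is the mechanism the abstract points to.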
Formal Verification
Working Paper
Lean
Adam's Convergence Proof Was Wrong: A Machine-Checked Verification of the Bug and the Fix
Adam (Kingma & Ba, 2015) is deep learning's most-cited optimizer, with over 100,000 citations and native implementation in every major framework: TensorFlow, PyTorch, JAX, Keras. Its original convergence proof — Theorem 4.1, published at ICLR 2015 — claimed $R_T = O(\sqrt{T})$ regret on convex problems.
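For context, the update rule under analysis is the standard Adam recursion (Kingma & Ba, 2015), with gradient $g_t$, decay rates $\beta_1, \beta_2$, step size $\alpha$, and stabilizer $\epsilon$:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2,$$
$$\hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat v_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_{t+1} = \theta_t - \alpha\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}.$$

The known gap, identified by Reddi, Kale & Kumar (ICLR 2018), is that the Theorem 4.1 proof implicitly requires $\Gamma_t = \sqrt{v_t}/\alpha_t - \sqrt{v_{t-1}}/\alpha_{t-1} \geq 0$ at every step, a monotonicity Adam's exponential moving average does not guarantee; AMSGrad restores it by taking $\hat v_t = \max(\hat v_{t-1}, v_t)$.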