SGD Is Right: A Machine-Checked Proof That Stochastic Gradient Descent Converges
Abstract
Stochastic gradient descent (SGD) is the algorithm that trains every neural network. From GPT-4 to AlphaFold, from DALL-E to Gemini, every parameter update is a variant of the rule proposed by Robbins and Monro in 1951: take a step against a noisy gradient, shrink the step size, repeat. The convergence of SGD is the foundational theorem of modern machine learning — yet it has never been formally verified.
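In symbols, the Robbins–Monro rule sketched above takes the standard textbook form (our notation, not the paper's; \(g_t\) denotes an unbiased stochastic gradient oracle, which the abstract does not spell out):

```latex
x_{t+1} = x_t - \eta_t\, g_t,
\qquad
\mathbb{E}\,[\,g_t \mid x_t\,] = \nabla f(x_t),
\qquad
\mathbb{E}\,\|g_t - \nabla f(x_t)\|^2 \le \sigma^2,
```

with step sizes \(\eta_t\) shrinking over time, e.g. \(\eta_t \propto 1/\sqrt{t}\) in the convex case.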
We provide the first machine-checked proof that SGD converges. Fourteen Lean 4 files, approximately 95 theorems, zero sorry, zero axioms: from the convexity foundations through the stochastic gradient model, the distance contraction, and the telescoping argument to the final convergence rates. The unified main theorem states: (i) for convex functions, the average iterate converges at rate \(O(1/\sqrt{T})\), matching the Nemirovski-Yudin information-theoretic lower bound; (ii) for strongly convex functions, the last iterate converges at rate \(O(1/T)\); (iii) strong convexity strictly helps — \(1/T < 1/\sqrt{T}\) for all \(T > 1\); (iv) mini-batch SGD achieves a \(\sqrt{B}\) speedup, reducing variance by a factor of \(B\). All bounds are dimension-free: they depend on the noise variance \(\sigma^2\) and the initial distance \(D\), not on the number of parameters.
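For orientation, the standard forms of the bounds behind rates (i), (ii), and (iv) read as follows (our presentation; the constants actually verified in the development may differ):

```latex
\mathbb{E}\big[f(\bar{x}_T) - f(x^*)\big] \;\le\; \frac{D\sigma}{\sqrt{T}}
\quad \text{(convex, } \eta = D/(\sigma\sqrt{T})\text{)},
\qquad
\mathbb{E}\big[f(x_T) - f(x^*)\big] \;\le\; O\!\left(\frac{\sigma^2}{\mu T}\right)
\quad \text{(}\mu\text{-strongly convex)},
```

and averaging a mini-batch of \(B\) gradients reduces the variance to \(\sigma^2/B\), turning the first bound into \(D\sigma/\sqrt{BT}\), the stated \(\sqrt{B}\) speedup. Note that \(D\) and \(\sigma\) appear in the bounds while the dimension does not, which is the dimension-free property claimed above.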
This paper is the positive companion to "Adam Is Broken" (Nagy, 2026), which proves Adam diverges with \(R_T = \Omega(T)\). Together they form the first complete verified treatment of optimization: the broken optimizer, the correct optimizer, and the fix. If the foundations of deep learning training rest on SGD, those foundations are now machine-verified.
One-sentence summary: Every neural network is trained by SGD; we prove it converges — all in Lean 4, zero sorry.