SGD Is Right: A Machine-Checked Proof That Stochastic Gradient Descent Converges
Abstract
Stochastic gradient descent (SGD) is the algorithm that trains every neural network. From GPT-4 to AlphaFold, from DALL-E to Gemini, every parameter update is a variant of the rule proposed by Robbins and Monro in 1951: take a step against a noisy gradient, shrink the step size, repeat. The convergence of SGD is the foundational theorem of modern machine learning — yet it has never been formally verified.
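In symbols, the Robbins–Monro rule sketched above takes the standard textbook form (our notation, not the paper's; \(g_t\) denotes an unbiased stochastic gradient oracle, which the abstract does not spell out):

```latex
x_{t+1} = x_t - \eta_t\, g_t,
\qquad
\mathbb{E}\,[\,g_t \mid x_t\,] = \nabla f(x_t),
\qquad
\mathbb{E}\,\|g_t - \nabla f(x_t)\|^2 \le \sigma^2,
```

with step sizes \(\eta_t\) shrinking over time, e.g. \(\eta_t \propto 1/\sqrt{t}\) in the convex case.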
We provide the first machine-checked proof that SGD converges. Fourteen Lean 4 files, approximately 95 theorems, zero sorry, zero axioms: from the convexity foundations through the stochastic gradient model, the distance contraction, and the telescoping argument to the final convergence rates. The unified main theorem states: (i) for convex functions, the average iterate converges at rate \(O(1/\sqrt{T})\), matching the Nemirovski-Yudin information-theoretic lower bound; (ii) for strongly convex functions, the last iterate converges at rate \(O(1/T)\); (iii) strong convexity strictly helps — \(1/T < 1/\sqrt{T}\) for all \(T > 1\); (iv) mini-batch SGD achieves a \(\sqrt{B}\) speedup, reducing variance by a factor of \(B\). All bounds are dimension-free: they depend on the noise variance \(\sigma^2\) and the initial distance \(D\), not on the number of parameters.
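For orientation, the standard forms of the bounds behind rates (i), (ii), and (iv) read as follows (our presentation; the constants actually verified in the development may differ):

```latex
\mathbb{E}\big[f(\bar{x}_T) - f(x^*)\big] \;\le\; \frac{D\sigma}{\sqrt{T}}
\quad \text{(convex, } \eta = D/(\sigma\sqrt{T})\text{)},
\qquad
\mathbb{E}\big[f(x_T) - f(x^*)\big] \;\le\; O\!\left(\frac{\sigma^2}{\mu T}\right)
\quad \text{(}\mu\text{-strongly convex)},
```

and averaging a mini-batch of \(B\) gradients reduces the variance to \(\sigma^2/B\), turning the first bound into \(D\sigma/\sqrt{BT}\), the stated \(\sqrt{B}\) speedup. Note that \(D\) and \(\sigma\) appear in the bounds while the dimension does not, which is the dimension-free property claimed above.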
This paper is the positive companion to "Adam Is Broken" (Nagy, 2026), which proves Adam diverges with \(R_T = \Omega(T)\). Together they form the first complete verified treatment of optimization: the broken optimizer, the correct optimizer, and the fix. If the foundations of deep learning training rest on SGD, those foundations are now machine-verified.
One-sentence summary: Every neural network is trained by SGD; we prove it converges — all in Lean 4, zero sorry.