Adam's Convergence Proof Was Wrong: A Machine-Checked Verification of the Bug and the Fix
Abstract
Adam (Kingma & Ba, 2015) is deep learning's most-cited optimizer, with over 100,000 citations and native implementation in every major framework: TensorFlow, PyTorch, JAX, Keras. Its original convergence proof — Theorem 4.1, published at ICLR 2015 — claimed \(R_T = O(\sqrt{T})\) regret on convex problems. The proof passed peer review. For three years, nobody noticed it was wrong.
In 2018, Reddi, Kale, and Kumar showed that Adam can diverge: on a simple cyclic sequence of convex functions, Adam's regret grows linearly, \(R_T = \Omega(T)\). The bug: Kingma and Ba implicitly assumed that \(\sqrt{\hat{v}_t}\) is non-decreasing in \(t\), but for any \(\beta_2 < 1\), this is false — the exponential moving average forgets past gradients, causing \(\hat{v}_t\) to decrease after large-then-small gradient sequences. The fix is AMSGrad: replace \(\hat{v}_t = v_t / (1 - \beta_2^t)\) with \(\hat{v}_t = \max(\hat{v}_{t-1}, v_t / (1 - \beta_2^t))\). One max operation restores the monotonicity that the original proof requires, recovering \(R_T = O(\sqrt{T})\) convergence.
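The non-monotonicity is easy to observe numerically. The following minimal sketch (illustrative only; \(\beta_2 = 0.9\) and the gradient sequence are chosen for exposition, not taken from the Reddi et al. construction) shows Adam's bias-corrected \(\hat{v}_t\) dropping after a large-then-small gradient sequence, while AMSGrad's running max stays non-decreasing:

```python
beta2 = 0.9
grads = [10.0, 0.1, 0.1, 0.1, 0.1]  # one large gradient, then small ones

v = 0.0            # EMA of squared gradients
ams = 0.0          # AMSGrad's running maximum
v_hat_adam, v_hat_ams = [], []
for t, g in enumerate(grads, start=1):
    v = beta2 * v + (1 - beta2) * g * g   # exponential moving average
    corrected = v / (1 - beta2 ** t)      # Adam's bias correction
    v_hat_adam.append(corrected)
    ams = max(ams, corrected)             # AMSGrad: one extra max
    v_hat_ams.append(ams)

# Adam's v̂_t decreases once the EMA forgets the large gradient ...
assert any(b < a for a, b in zip(v_hat_adam, v_hat_adam[1:]))
# ... while AMSGrad's v̂_t is non-decreasing by construction.
assert all(b >= a for a, b in zip(v_hat_ams, v_hat_ams[1:]))
```

Because the proof of Theorem 4.1 divides by \(\sqrt{\hat{v}_t}\) and telescopes under the assumption that these denominators only grow, any such decrease breaks the telescoping step.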
We provide a machine-checked formalization of this story in Lean 4. Twelve files and ~96 declarations with zero `sorry` cover the EMA foundations, the monotonicity violation, the Reddi counterexample, and Adam's linear divergence — all substantively proved. For AMSGrad, we verify the structural argument: monotonicity of \(\hat{v}_t\) is restored by the max operation, and the \(O(\sqrt{T})\) convergence bound follows from monotonicity plus standard telescoping — though the quantitative one-step contraction linking AMSGrad's update rule to the regret inequality is taken as a hypothesis, not derived from first principles. The unified main theorem states: (A) Adam diverges with \(R_T = \Omega(T)\), (B) AMSGrad converges with \(R_T = O(\sqrt{T})\) (given the one-step contraction), (C) AMSGrad strictly dominates Adam. A new cross-domain bridge theorem further shows that AMSGrad's monotone effective rate fits the same contraction template as Bellman value iteration. If ICLR 2015 had required Lean proofs, the unproved monotonicity assumption would have been caught before publication.
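The core of the restored-monotonicity argument is short enough to sketch here. The snippet below is illustrative Lean 4 (assuming Mathlib; the names `vhat` and `vhat_mono` are hypothetical, not the repository's actual declarations): given any sequence of bias-corrected second moments, taking a running max yields a `Monotone` function by `le_max_left`.

```lean
-- Illustrative sketch, not the repository's declarations.
-- `vcorr t` stands for Adam's bias-corrected v_t / (1 - β₂^t);
-- AMSGrad replaces it with a running maximum.
def vhat (vcorr : ℕ → ℝ) : ℕ → ℝ
  | 0     => vcorr 0
  | t + 1 => max (vhat vcorr t) (vcorr (t + 1))

-- Monotonicity is immediate: each step is a max with the previous value.
theorem vhat_mono (vcorr : ℕ → ℝ) : Monotone (vhat vcorr) :=
  monotone_nat_of_le_succ fun t => le_max_left _ _
```

This is exactly the property the original Theorem 4.1 assumed without proof; once it holds by construction, the standard telescoping argument goes through.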
One-sentence summary: The most-cited optimizer in deep learning had a bug in its convergence proof for three years; we provide a machine-checked formalization of both the bug and the structural fix.
Novelty
A zero-`sorry` Lean 4 formalization that pinpoints the exact monotonicity assumption error in Adam’s original convergence proof, machine-checks a concrete divergence construction, and machine-checks the structural (monotonicity→telescoping→regret) convergence argument for AMSGrad with one key inequality assumed.