Adam's Convergence Proof Was Wrong: A Machine-Checked Verification of the Bug and the Fix
Abstract
Adam (Kingma & Ba, 2015) is deep learning's most-cited optimizer, with over 100,000 citations and native implementation in every major framework: TensorFlow, PyTorch, JAX, Keras. Its original convergence proof — Theorem 4.1, published at ICLR 2015 — claimed \(R_T = O(\sqrt{T})\) regret on convex problems. The proof passed peer review. For three years, nobody noticed it was wrong.
In 2018, Reddi, Kale, and Kumar showed that Adam can diverge: on a simple cyclic sequence of convex functions, Adam's regret grows linearly, \(R_T = \Omega(T)\). The bug: Kingma and Ba implicitly assumed that \(\sqrt{\hat{v}_t}\) is non-decreasing in \(t\), but for any \(\beta_2 < 1\), this is false — the exponential moving average forgets past gradients, causing \(\hat{v}_t\) to decrease after large-then-small gradient sequences. The fix is AMSGrad: replace \(\hat{v}_t = v_t / (1 - \beta_2^t)\) with \(\hat{v}_t = \max(\hat{v}_{t-1}, v_t / (1 - \beta_2^t))\). One max operation restores the monotonicity that the original proof requires, recovering \(R_T = O(\sqrt{T})\) convergence.
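The non-monotonicity is easy to observe numerically. The following minimal sketch (illustrative only; \(\beta_2 = 0.9\) and the gradient sequence are chosen for exposition, not taken from the Reddi et al. construction) shows Adam's bias-corrected \(\hat{v}_t\) dropping after a large-then-small gradient sequence, while AMSGrad's running max stays non-decreasing:

```python
beta2 = 0.9
grads = [10.0, 0.1, 0.1, 0.1, 0.1]  # one large gradient, then small ones

v = 0.0            # EMA of squared gradients
ams = 0.0          # AMSGrad's running maximum
v_hat_adam, v_hat_ams = [], []
for t, g in enumerate(grads, start=1):
    v = beta2 * v + (1 - beta2) * g * g   # exponential moving average
    corrected = v / (1 - beta2 ** t)      # Adam's bias correction
    v_hat_adam.append(corrected)
    ams = max(ams, corrected)             # AMSGrad: one extra max
    v_hat_ams.append(ams)

# Adam's v̂_t decreases once the EMA forgets the large gradient ...
assert any(b < a for a, b in zip(v_hat_adam, v_hat_adam[1:]))
# ... while AMSGrad's v̂_t is non-decreasing by construction.
assert all(b >= a for a, b in zip(v_hat_ams, v_hat_ams[1:]))
```

Because the proof of Theorem 4.1 divides by \(\sqrt{\hat{v}_t}\) and telescopes under the assumption that these denominators only grow, any such decrease breaks the telescoping step.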
We provide a machine-checked formalization of this story in Lean 4. Twelve files and ~96 declarations with zero `sorry` cover the EMA foundations, the monotonicity violation, the Reddi counterexample, and Adam's linear divergence — all substantively proved. For AMSGrad, we verify the structural argument: monotonicity of \(\hat{v}_t\) is restored by the max operation, and the \(O(\sqrt{T})\) convergence bound follows from monotonicity plus standard telescoping — though the quantitative one-step contraction linking AMSGrad's update rule to the regret inequality is taken as a hypothesis, not derived from first principles. The unified main theorem states: (A) Adam diverges with \(R_T = \Omega(T)\), (B) AMSGrad converges with \(R_T = O(\sqrt{T})\) (given the one-step contraction), (C) AMSGrad strictly dominates Adam. A new cross-domain bridge theorem further shows that AMSGrad's monotone effective rate fits the same contraction template as Bellman value iteration. If ICLR 2015 had required Lean proofs, the unproved monotonicity assumption would have been caught before publication.
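The core of the restored-monotonicity argument is short enough to sketch here. The snippet below is illustrative Lean 4 (assuming Mathlib; the names `vhat` and `vhat_mono` are hypothetical, not the repository's actual declarations): given any sequence of bias-corrected second moments, taking a running max yields a `Monotone` function by `le_max_left`.

```lean
-- Illustrative sketch, not the repository's declarations.
-- `vcorr t` stands for Adam's bias-corrected v_t / (1 - β₂^t);
-- AMSGrad replaces it with a running maximum.
def vhat (vcorr : ℕ → ℝ) : ℕ → ℝ
  | 0     => vcorr 0
  | t + 1 => max (vhat vcorr t) (vcorr (t + 1))

-- Monotonicity is immediate: each step is a max with the previous value.
theorem vhat_mono (vcorr : ℕ → ℝ) : Monotone (vhat vcorr) :=
  monotone_nat_of_le_succ fun t => le_max_left _ _
```

This is exactly the property the original Theorem 4.1 assumed without proof; once it holds by construction, the standard telescoping argument goes through.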
One-sentence summary: The most-cited optimizer in deep learning had a bug in its convergence proof for three years; we provide a machine-checked formalization of both the bug and the structural fix.
Novelty
A zero-`sorry` Lean 4 formalization that pinpoints the exact monotonicity assumption error in Adam’s original convergence proof, machine-checks a concrete divergence construction, and machine-checks the structural (monotonicity→telescoping→regret) convergence argument for AMSGrad with one key inequality assumed.