Verified Transformer Dynamics: Token Clustering Convergence in Lean 4
Abstract
Transformers are the dominant architecture in modern machine learning, underpinning GPT-4, Claude, Gemini, and every frontier language model, yet no formal convergence theory exists for how self-attention shapes token representations across layers. We provide the first machine-checked proof that transformer self-attention drives token representations toward a common cluster. The key result: for a residual transformer whose doubly stochastic attention matrix \(A\) has spectral gap \(\lambda_2\), with residual step size \(\varepsilon \in (0, 1]\), the token diameter satisfies
\[d(X_L) \leq (1 - \varepsilon \cdot \lambda_2)^L \cdot d_0\]
where \(d_0\) is the initial token diameter and \(L\) is the number of layers. This exponential bound implies: (i) the diameter contracts geometrically per layer, (ii) token representations converge to a consensus state as \(L \to \infty\), and (iii) for any target precision \(\delta > 0\), there exists a critical depth \(L_0\) beyond which all tokens are \(\delta\)-close. The convergence rate depends on exactly one architectural quantity: the spectral gap \(\lambda_2\) of the attention matrix.
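The bound and the critical depth \(L_0\) from claim (iii) can be checked numerically. The sketch below (all concrete choices of \(n\), \(d\), \(\varepsilon\), and \(A\) are illustrative, not from the paper) iterates the residual update \(X_{l+1} = (1-\varepsilon)X_l + \varepsilon A X_l\) with a symmetric doubly stochastic \(A\) and compares the observed token diameter against \((1 - \varepsilon \lambda_2)^L d_0\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps, L = 6, 4, 0.5, 30  # illustrative sizes, not from the paper

# Symmetric doubly stochastic attention matrix: a convex mix of the
# identity and uniform averaging (rows and columns each sum to 1).
A = 0.5 * np.eye(n) + 0.5 * np.full((n, n), 1.0 / n)

# Spectral gap: 1 minus the second-largest eigenvalue of A.
eigs = np.sort(np.linalg.eigvalsh(A))[::-1]
gap = 1.0 - eigs[1]

def diameter(X):
    # Maximum pairwise Euclidean distance between token rows.
    diffs = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diffs, axis=-1).max()

X = rng.standard_normal((n, d))
d0 = diameter(X)
for _ in range(L):
    # Residual self-attention layer: X <- (1 - eps) X + eps A X.
    X = (1 - eps) * X + eps * (A @ X)

bound = (1 - eps * gap) ** L * d0
print(diameter(X) <= bound + 1e-9)  # observed diameter stays within the bound

# Claim (iii): critical depth L0 after which all tokens are delta-close.
delta = 1e-3
L0 = int(np.ceil(np.log(delta / d0) / np.log(1 - eps * gap)))
print((1 - eps * gap) ** L0 * d0 <= delta)
```

For this particular \(A\), the pairwise differences contract by exactly \(1 - \varepsilon \lambda_2\) per layer, so the bound is met with equality up to floating-point error; general doubly stochastic matrices satisfy it with slack.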
We also establish two complementary results. The depth-diversity tradeoff (Theorem 2): deeper networks lose token diversity at an exponential rate, so the same mechanism that enables convergence imposes an expressiveness cost. The spectral gap necessity (Theorem 3): without a spectral gap, no contraction occurs, so the convergence property is tight. The three-part main theorem (exponential bound, asymptotic convergence, and effective convergence) is assembled from 12 Lean 4 files with zero sorry placeholders, constituting the first formally verified theorem about the transformer architecture. The proof imports GeometricTail from the Spectral Fenton framework, revealing that the same geometric decay governing financial risk (eigenvalue decay in portfolio correlations) also governs transformer convergence (the spectral gap of the attention matrix): one spectral framework, two domains.
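The abstract does not show the Lean development itself, but the inductive core of the exponential bound can be sketched as follows (an illustrative statement, not the paper's actual source; here \(r\) stands for the per-layer contraction factor \(1 - \varepsilon \lambda_2\) and \(d\,l\) for the diameter after layer \(l\)):

```lean
import Mathlib

-- Sketch: a per-layer contraction d (l+1) ≤ r * d l yields the
-- geometric bound d L ≤ r ^ L * d 0 by induction on the depth L.
theorem geometric_diameter_bound (d : ℕ → ℝ) (r : ℝ)
    (hr : 0 ≤ r)
    (hstep : ∀ l, d (l + 1) ≤ r * d l) :
    ∀ L, d L ≤ r ^ L * d 0 := by
  intro L
  induction L with
  | zero => simp
  | succ n ih =>
    calc d (n + 1) ≤ r * d n := hstep n
      _ ≤ r * (r ^ n * d 0) := mul_le_mul_of_nonneg_left ih hr
      _ = r ^ (n + 1) * d 0 := by ring
```

The paper's per-layer contraction hypothesis is where the doubly stochastic structure and the spectral gap of \(A\) enter; the induction itself only needs \(0 \leq r\).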
One-sentence summary: Transformer attention provably drives tokens to clusters at rate \((1 - \varepsilon \cdot \lambda_2)^L\), and we have the machine-checked proof.