
Proved Safe: A Machine-Verified Theory of AI Safety from the Eigenvalue Spectrum

Tamás Nagy, Ph.D. · Updated 2026-03-31 · Draft · Machine Learning · Lean-Verified
Mathematics verified: core theorems are machine-checked in Lean 4. Prose and presentation may not have been human-reviewed.

Abstract

We present a unified, machine-verified theory of AI safety in which the eigenvalue decay rate \(s\) of the data covariance determines a formal chain: scaling laws, self-improvement ceilings, robustness certificates, and a quantitative safety budget. Our contribution is five new connecting theorems — results that only hold because individually verified components are combined — together with the Latent construction that makes the unification precise:

(i) The Scaling–Ceiling Duality (Theorem 1). The observable scaling exponent \(\alpha\) and the self-improvement ceiling growth rate \(1/s\) are algebraically locked: \(\alpha = (s-1)/(s+1)\) and \(K^*(N) \sim N^{(1-\alpha)/(1+\alpha)}\). This yields a falsifiable prediction: language models (\(\alpha \approx 0.05\)) have near-linear ceiling growth (\(K^* \sim N^{0.9}\)), while image models (\(\alpha \approx 0.33\)) have square-root growth (\(K^* \sim N^{0.50}\)). Language model self-improvement is the most dangerous regime.
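The exponent identity behind the duality follows by eliminating \(s\) between the two stated formulas:

```latex
% From the scaling exponent, solve for s, then substitute into the
% ceiling growth rate 1/s:
\alpha = \frac{s-1}{s+1}
  \;\Longrightarrow\;
  s = \frac{1+\alpha}{1-\alpha}
  \;\Longrightarrow\;
  \frac{1}{s} = \frac{1-\alpha}{1+\alpha},
\qquad
K^*(N) \sim N^{1/s} = N^{(1-\alpha)/(1+\alpha)}.
```

Plugging in the quoted values checks out: \(\alpha \approx 0.05\) gives exponent \(0.95/1.05 \approx 0.90\), and \(\alpha \approx 0.33\) gives \(0.67/1.33 \approx 0.50\).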

(ii) The Compute–Safety Theorem (Theorem 2). For a network trained with weight decay \(\lambda_{\text{wd}}\) to compute-optimal loss \(L^* \sim AC^{-\alpha}\), the certified robustness radius satisfies \(r \geq \frac{m}{2}(\lambda_{\text{wd}}/A)^{L/2} C^{\alpha L/2}\). More compute yields safer models — not by hope, but by theorem.
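As a sanity check on the monotonicity claim, a minimal numeric sketch of the Theorem 2 bound; all constant values here (\(m\), \(\lambda_{\text{wd}}\), \(A\), \(L\), \(\alpha\)) are hypothetical placeholders, not values from the paper:

```python
def certified_radius(C, m=1.0, lam_wd=0.1, A=10.0, L=4, alpha=0.05):
    """Lower bound on the certified robustness radius (Theorem 2):
    r >= (m/2) * (lam_wd / A)**(L/2) * C**(alpha * L / 2).
    All default constants are illustrative placeholders."""
    return 0.5 * m * (lam_wd / A) ** (L / 2) * C ** (alpha * L / 2)

# The bound is increasing in compute C: more compute, larger certified radius.
r_small = certified_radius(1e18)
r_large = certified_radius(1e22)
assert r_large > r_small
```

Because \(\alpha L / 2 > 0\), the bound grows as a power of \(C\), which is the "more compute yields safer models" direction of the theorem.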

(iii) The Safety Budget from Spectrum (Theorem 3). The complete safety budget is a closed-form function of the spectral parameter: \[\sigma_{\text{safety}}(s, C) \geq \frac{m \cdot A \cdot \varepsilon\lambda_2}{2} \cdot \frac{(\lambda_{\text{wd}}/A)^{L/2} \cdot C^{\alpha(L/2 - 1)}}{1 + C^{1/s}}\] where \(A\) is the loss prefactor, \(\lambda_{\text{wd}}\) the weight decay coefficient, \(\varepsilon\) the residual mixing rate, \(\lambda_2\) the attention spectral gap, and the remaining symbols are architectural constants or determined by \(s\).
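Since \(\alpha\) is itself fixed by \(s\) via Theorem 1, the budget really is a function of \((s, C)\) alone once the architectural constants are set. A sketch evaluating the closed form, with every constant chosen as an illustrative placeholder:

```python
def safety_budget(s, C, m=1.0, A=10.0, eps=0.01, lam2=0.5, lam_wd=0.1, L=4):
    """Lower bound on sigma_safety(s, C) from Theorem 3.
    alpha is determined by the spectral parameter s via Theorem 1;
    all other constants are hypothetical placeholders."""
    alpha = (s - 1) / (s + 1)
    prefactor = 0.5 * m * A * eps * lam2
    numerator = (lam_wd / A) ** (L / 2) * C ** (alpha * (L / 2 - 1))
    return prefactor * numerator / (1 + C ** (1 / s))

# At fixed compute, a high-s (fast spectral decay) domain retains a far
# larger budget than a low-s domain such as language.
assert safety_budget(2.0, 1e20) > safety_budget(1.1, 1e20)
```

The comparison reflects the abstract's qualitative claim: low-\(s\) domains pay a large \(C^{1/s}\) penalty in the denominator that their small \(\alpha\) cannot offset.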

(iv) The Ceiling–Safety Degradation (Theorem 4). Under self-improvement to the ceiling \(K^* \sim N^{1/s}\), the post-improvement safety budget satisfies \(\sigma_{\text{safety}}^{\text{post}} \geq \sigma_{\text{safety}}^{\text{pre}} / ((1 + K^* \alpha_g)(1 + K^*))\). The degradation rate is controlled by \(s\): high-\(s\) domains (audio, physics) degrade slowly; low-\(s\) domains (language) degrade fast.
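A quick numeric illustration of the degradation factor, reading the denominator as \((1 + K^* \alpha_g)(1 + K^*)\) with \(K^* = N^{1/s}\); the value of \(\alpha_g\) and the domain \(s\) values are hypothetical placeholders:

```python
def degradation_factor(N, s, alpha_g=0.1):
    """Multiplicative safety retained after self-improvement to the
    ceiling K* ~ N^(1/s) (Theorem 4): 1 / ((1 + K*·alpha_g)(1 + K*)).
    alpha_g and the unit prefactor on K* are illustrative placeholders."""
    K_star = N ** (1.0 / s)
    return 1.0 / ((1 + K_star * alpha_g) * (1 + K_star))

# Low-s domains (language, s ~ 1.1) degrade much faster than
# high-s domains (audio/physics, s ~ 3) at the same scale N.
language = degradation_factor(1e9, s=1.1)
physics = degradation_factor(1e9, s=3.0)
assert language < physics
```

The mechanism is visible directly: a smaller \(s\) inflates the ceiling \(K^* = N^{1/s}\), and the retained budget falls roughly as \(1/K^{*2}\).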

(v) The Blind-Zone Certification Gap (Theorem 5). Of the \(K^*(N)\) modes a system can operate on, only \(K^*_{\text{self}}(N) \leq K^*(N)\) are self-verifiable. The remaining \(K^*(N) - K^*_{\text{self}}(N) = \Theta(N^{1/s})\) modes constitute a certification gap — capable but uncertified modes that require external verification (formal proof).

These five theorems formalize the arrows that were previously asserted as analogies. The underlying component results — scaling laws, self-improvement bounds, robustness certificates, atten

Length: 8,080 words
Claims: 8 theorems
Status: Draft

Connects To

Formal Foundations of Stochastic Gradient Descent

Browse all Machine Learning papers →