
When Q-Learning Meets Black-Scholes: A Machine-Verified Bridge Between Reinforcement Learning and Option Pricing

Tamás Nagy, Ph.D. · Updated 2026-03-05 · Draft · Quantitative Finance · Lean-Verified
Mathematics verified. Core theorems are machine-checked in Lean 4. Prose and presentation may not have been human-reviewed.

Abstract

Reinforcement learning and quantitative finance independently discovered the same mathematical framework. Q-learning's Bellman update \(Q \leftarrow R + \gamma \max_{a'} Q(s',a')\) and the COS backward induction for American options \(V_t = \max\{g, \, e^{-r\Delta t} \sum_j M_{kj} V_{t+1,j}\}\) are syntactically different but mathematically identical: both are fixed-point iterations of a \(\gamma\)-contraction on a value function space. We formalize this equivalence — and seven related structural connections — in Lean 4 (28 Lean declarations comprising type infrastructure, definitional lemmas, and five substantive proofs; zero sorry; compiled via lake build), establishing, to our knowledge, the first machine-verified formalization that RL value iteration and American option backward induction instantiate the same Bellman operator.
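The claimed equivalence can be checked numerically. The following is a minimal sketch (not the paper's Lean development): both the Q-learning target and the COS backward step instantiate the operator \(T(V) = \max\{g, \gamma M V\}\), which is a \(\gamma\)-contraction in the sup norm whenever \(M\) is row-stochastic; the matrix, payoff, and discount below are illustrative.

```python
import numpy as np

# Shared operator: immediate payoff vs. discounted continuation value.
# In RL, g plays the role of reward and M of transition probabilities;
# in COS pricing, g is the exercise payoff and gamma = e^{-r dt}.
rng = np.random.default_rng(0)
n = 6
M = rng.random((n, n))
M /= M.sum(axis=1, keepdims=True)    # row-stochastic transition matrix
g = rng.random(n)                    # payoff / reward vector
gamma = 0.95                         # discount factor

def bellman(V):
    """One backward step: max of exercise payoff and continuation."""
    return np.maximum(g, gamma * (M @ V))

# Contraction property: ||T(V1) - T(V2)||_inf <= gamma * ||V1 - V2||_inf,
# since max(a, .) is 1-Lipschitz and ||M||_inf = 1.
V1, V2 = rng.random(n), rng.random(n)
lhs = np.max(np.abs(bellman(V1) - bellman(V2)))
rhs = gamma * np.max(np.abs(V1 - V2))
assert lhs <= rhs + 1e-12

# Banach fixed-point iteration converges from any starting value.
V = np.zeros(n)
for _ in range(1000):
    V = bellman(V)
assert np.max(np.abs(V - bellman(V))) < 1e-10
```

Value iteration and backward induction differ only in how \(M\), \(g\), and \(\gamma\) are populated; the fixed-point machinery is identical.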

The formalization spans three tiers. The infrastructure tier defines MDP types, LP structures, and conversion functions that bridge the RL and finance vocabularies. The structural tier contains five substantive proofs — including actionvalue_shift_bound (genuine algebra with sum manipulation), residual_strong_monotonicity (triangle inequality argument), merton_hamiltonian_at_optimum (field simplification on a meaningful Hamiltonian expression), and bellman_lp_feasibility_equiv (definition unfolding with real structure). The definitional tier verifies type-level correspondences (e.g., that the COS backward step has the same type signature as Bellman value iteration) via Lean's definitional equality checker. We are explicit: several equivalences (Theorems D, E, F) are arithmetically shallow — their value lies in the verified infrastructure connecting two historically separate formalisms, not in proof complexity.
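To illustrate what a type-level correspondence in the definitional tier could look like, here is a hedged Lean 4 sketch; the identifiers (`BellmanStep`, `rlBackup`, `cosBackward`) are hypothetical stand-ins, not the paper's actual declarations.

```lean
import Mathlib.Data.Real.Basic

/-- A one-step value update on a state space `S`. -/
def BellmanStep (S : Type) : Type :=
  (S → ℝ) → (S → ℝ)

/-- Deterministic RL backup: reward plus discounted next-state value. -/
def rlBackup {S : Type} (r : S → ℝ) (γ : ℝ) (next : S → S) : BellmanStep S :=
  fun V s => r s + γ * V (next s)

/-- COS-style backward step: exercise payoff versus discounted continuation. -/
def cosBackward {S : Type} (g : S → ℝ) (disc : ℝ)
    (cont : (S → ℝ) → S → ℝ) : BellmanStep S :=
  fun V s => max (g s) (disc * cont V s)
```

Because both updates inhabit the same type `BellmanStep S`, any iteration or composition lemma proved once applies to both.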

The equivalence enables bidirectional algorithmic transfer:

(RL \(\to\) Finance.) (i) Momentum-accelerated backward induction: applying Polyak/Nesterov momentum to the COS backward step yields faster convergence for deep out-of-the-money American options, where the standard fixed-point iteration stalls. (ii) Experience replay for volatility surface calibration: store \((K, T, \sigma_{\text{impl}})\) tuples and replay them during COS calibration, mitigating catastrophic forgetting across maturities. (iii) Neural value function approximation: replace the COS coefficient grid with a neural network, enabling continuous-state American pricing without discretization.
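Item (i) can be sketched in a few lines. The model below is deliberately simplified: the backward step is a linear policy-evaluation contraction \(V \leftarrow g + \gamma M V\) standing in for the COS continuation update, and the momentum coefficient `beta` is an illustrative choice, not the paper's tuned value.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
M = rng.random((n, n))
M /= M.sum(axis=1, keepdims=True)   # row-stochastic: step is a contraction
g = rng.random(n)
gamma = 0.99                        # near-1 discount: plain iteration stalls

def step(V):
    return g + gamma * (M @ V)

def solve(beta, tol=1e-8, max_iter=100_000):
    """Fixed-point iteration with Polyak-style momentum (beta=0: plain)."""
    V_prev = V = np.zeros(n)
    for k in range(1, max_iter + 1):
        V_new = step(V) + beta * (V - V_prev)   # momentum extrapolation
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, k
        V_prev, V = V, V_new
    return V, max_iter

V_plain, iters_plain = solve(beta=0.0)
V_mom, iters_mom = solve(beta=0.8)

V_star = np.linalg.solve(np.eye(n) - gamma * M, g)   # exact fixed point
assert np.allclose(V_plain, V_star, atol=1e-5)
assert np.allclose(V_mom, V_star, atol=1e-5)
assert iters_mom < iters_plain      # momentum reaches tolerance sooner
```

Both variants converge to the same fixed point; momentum only changes the trajectory, which is why it transfers cleanly between the two settings.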

(Finance \(\to\) RL.) (iv) Risk-constrained policy optimization: import coherent risk measures (subadditive, monotone, positive-homogeneous, translation-invariant) into the RL reward, guaranteeing that safe policies compose safely. (v) Shadow prices for constraint budgets: the Lagrange multiplier \(\lambda^* = \partial V^*/\partial b\) attached to a constraint with budget \(b\) measures the marginal value of relaxing that constraint, giving constrained RL a principled exchange rate between reward and constraint slack.
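Item (iv) can be made concrete with expected shortfall (CVaR), the standard example of a coherent risk measure. The sketch below is our framing of that item, not the paper's code: it checks the four coherence axioms on sampled data, where `alpha` and the distributions are illustrative.

```python
import numpy as np

def cvar(losses, alpha=0.1):
    """Mean of the worst alpha-fraction of loss samples (expected shortfall)."""
    losses = np.sort(np.asarray(losses))[::-1]   # descending: worst first
    k = max(1, int(np.ceil(alpha * losses.size)))
    return losses[:k].mean()

rng = np.random.default_rng(2)
x, y = rng.normal(size=1000), rng.normal(size=1000)

# The four coherence axioms, verified on paired samples:
assert cvar(x + y) <= cvar(x) + cvar(y) + 1e-12   # subadditivity
assert np.isclose(cvar(x + 3.0), cvar(x) + 3.0)   # translation invariance
assert np.isclose(cvar(2.0 * x), 2.0 * cvar(x))   # positive homogeneity
assert cvar(np.minimum(x, y)) <= cvar(y) + 1e-12  # monotonicity

# Risk-adjusted RL objective: expected return minus a CVaR penalty on losses.
returns = rng.normal(loc=1.0, scale=0.5, size=1000)
objective = returns.mean() - 0.5 * cvar(-returns)
```

Subadditivity is what makes the "safe policies compose safely" claim work: the risk of a mixed policy is bounded by the sum of the component risks.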

Length: 7,990 words
Claims: 1 theorem
Status: Draft
Target: NeurIPS 2026 (Finance & RL Workshop) / Journal of Financial Economics / Mathematical Finance

Novelty

The genuine intellectual delta is the formal Lean 4 infrastructure connecting RL and finance vocabularies — the observation that Bellman and COS backward induction are the same contraction is well-known informally, but nobody has built a machine-checked type-level bridge between the two formalisms before.
