Why Does LoRA Work? The Spectral Theory of Low-Rank Adaptation
Abstract
Low-Rank Adaptation (LoRA; Hu et al., 2021) fine-tunes large language models by adding rank-\(r\) updates \(\Delta W = AB\) with \(r \ll d\). In practice, \(r = 4\)--\(16\) works remarkably well, but no theory explains why or predicts the optimal \(r\) for a given task. We provide a spectral theory: the fine-tuning data's eigenvalue spectrum decays at rate \(\rho\), and the optimal LoRA rank is \(K^ = \lceil\log(1/\tau_{\text{MP}})/\log\rho\rceil\), where \(\tau_{\text{MP}}\) is the Marchenko--Pastur noise threshold. This counts the number of eigenvalues (signal modes) above the random matrix noise floor. On 10 synthetic fine-tuning tasks with controlled \(\rho\), \(K^\) matches the empirically optimal rank in 9 out of 10 cases (90\% match rate, mean error 0.8 ranks). The spectral decay rate \(\rho\) is estimable from the data alone via SVD of the OLS solution, with 1--12\% accuracy. The practical implication: compute \(\rho\) from your fine-tuning dataset, calculate \(K^\), set LoRA rank = \(K^\). No hyperparameter search needed. The theoretical foundation is the Universal Spectral Representation Theorem (Nagy, 2026b), which guarantees dimension-free convergence.