Phylogenetic Tree Reconstruction via the Latent Framework
Abstract
Phylogenetic inference turns molecular sequences into historical relationships. Classical likelihood-based methods excel in practice, yet the information geometry of alignment data relative to tree topology is rarely summarized in coordinates comparable across studies. Such comparability matters when asking how much signal remains at a given evolutionary distance, or how many samples are needed to resolve a deep split.
This paper studies phylogenies through the Latent framework applied to alignment tensors induced by substitution models. The Latent Number \(\rho\) measures compressibility of site-pattern variation relative to a saturated multinomial baseline, while the effective dimension \(N^\ast\) counts orthogonal signal directions needed to reconstruct tree features within tolerance. Together they quantify the intrinsic difficulty of a phylogenetic instance beyond raw sequence length.
We then define a confidence-aware reconstruction rule, Latent-Gated Quartet Assembly (LGQA). Each quartet is embedded in Latent coordinates, scored against its three possible splits, and accepted only when the best-versus-second-best margin exceeds a threshold calibrated by \(\rho\), \(N^\ast\), and effective sample size; otherwise the quartet is left unresolved. The accepted quartets are then aggregated with confidence weights into a global tree, so the method avoids forcing decisions in intrinsically low-signal regimes.
Thirty-six machine-checked theorems in the Lean 4 proof bundle (elysium/fields/bio_phylogenetics/platonic.py), grouped into six themes—mutation-model constraints, signal decay along branches, reconstruction stability, topology sensitivity, molecular-clock embeddings, and cross-domain bridges to coalescent and epidemic models—encode the real-arithmetic dependency scaffold used by the Latent pipeline and by the proposed gating rule. They record ordering and positivity structure at the level verified by the kernel, while classical JC69 spectral algebra is standard material cited from the references. Numerical experiments under Jukes–Cantor evolution on four-taxon star-like trees, balanced eight-taxon trees, and eight-taxon caterpillars show reconstruction error below \(0.5\) in the Latent metric, monotone decay of signal with phylogenetic distance, and error reduction as sample size \(M\) grows. Sixteen of sixteen tests pass; these experiments validate the ingredients entering LGQA under JC69 rather than constituting a full head-to-head benchmark against mature phylogenetic software.
Novelty
Applying the Latent framework's compressibility measure (rho) and effective dimension (N*) to phylogenetic pattern tensors, and turning them into a confidence-aware Latent-Gated Quartet Assembly (LGQA) rule for abstention-aware tree reconstruction. The core JC69 spectral algebra remains standard.