Maximum Likelihood Phylogenetics

Overview

Maximum likelihood (ML) phylogenetics is a statistical framework that seeks the tree (topology and branch lengths) under which the observed sequence data are most probable, given an explicit model of molecular evolution. Unlike distance-based methods, which reduce each pair of sequences to a single distance, ML uses every site in the alignment: it computes the probability of each alignment column given the tree and the substitution model and, assuming sites evolve independently, multiplies these per-site probabilities (in practice, it sums their logarithms). The tree with the highest likelihood is taken as the best estimate of evolutionary history. ML methods are computationally intensive but have well-understood statistical properties, including consistency when the model is correctly specified, and they allow competing models to be compared on the same data.
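As a rough illustration of how the probability of the data given a tree can be computed, the sketch below evaluates the log-likelihood of a small alignment on a fixed tree using Felsenstein's pruning algorithm under the JC69 model. The tree encoding (a leaf is a sequence string, an internal node is a tuple of (child, branch_length) pairs) and the function names are assumptions made for this example, not the interface of any particular phylogenetics package. Sites are assumed to evolve independently, and branch lengths are fixed rather than optimized.

```python
import math

BASES = "ACGT"

def jc69_prob(i, j, t):
    """JC69 probability that base i is replaced by base j along a branch
    of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partials(node, site):
    """Conditional likelihoods at `node` for one alignment column.

    Illustrative encoding: a leaf is a sequence string; an internal node
    is a tuple of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        # Observed base at a leaf: probability 1 for that base, 0 otherwise.
        return [1.0 if node[site] == b else 0.0 for b in BASES]
    result = [1.0] * 4
    for child, t in node:
        child_partials = partials(child, site)
        for x, parent_base in enumerate(BASES):
            # Sum over the possible bases at the child end of the branch.
            result[x] *= sum(
                jc69_prob(parent_base, child_base, t) * child_partials[y]
                for y, child_base in enumerate(BASES)
            )
    return result

def log_likelihood(tree, n_sites):
    """Sum of per-site log-likelihoods, with equal base frequencies (0.25)
    at the root as JC69 assumes."""
    total = 0.0
    for site in range(n_sites):
        root = partials(tree, site)
        total += math.log(sum(0.25 * p for p in root))
    return total

# Hypothetical three-taxon example with arbitrary branch lengths.
inner = (("ACGT", 0.1), ("ACGA", 0.1))
tree = ((inner, 0.05), ("ACTT", 0.2))
print(log_likelihood(tree, 4))
```

A full ML analysis would repeat this calculation many times while optimizing branch lengths and searching over topologies. The sketch covers only the scoring step for one fixed tree.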

Key Concepts

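The likelihood calculation depends on a substitution model, a continuous-time Markov process describing the rates at which one nucleotide (or amino acid) changes into another. Commonly used nucleotide models range from JC69 (equal base frequencies and a single substitution rate) to GTR (general time-reversible, with unequal base frequencies and six exchangeability rates). Rate heterogeneity among sites is typically modeled by adding a gamma distribution (+G), a proportion of invariant sites (+I), or both. Because the number of possible topologies grows super-exponentially with the number of taxa, exhaustive search is infeasible for all but the smallest datasets, so heuristic tree searches (e.g., subtree pruning and regrafting, or SPR) are used to explore tree space. Model selection using information criteria such as AIC or BIC identifies the best-fitting substitution model among the candidates, balancing goodness of fit against the number of free parameters.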
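The model-selection step can be sketched in a few lines. The example below computes AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L for a set of fitted models and ranks them, with lower scores being better. The function names, the log-likelihood values, and the use of alignment length as the sample size n for BIC are assumptions made for illustration. The parameter counts cover only the substitution model (0 for JC69, 8 for GTR, one more for +G), on the assumption that all models are fitted on the same tree so that branch-length parameters add the same amount to every score.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * log_lik

def rank_models(fits, n_sites, criterion="AIC"):
    """Return (model_name, score) pairs sorted from best (lowest) to worst.

    `fits` maps a model name to (log_likelihood, free_parameter_count).
    """
    scores = {}
    for name, (log_lik, k) in fits.items():
        if criterion == "AIC":
            scores[name] = aic(log_lik, k)
        elif criterion == "BIC":
            scores[name] = bic(log_lik, k, n_sites)
        else:
            raise ValueError(f"unknown criterion: {criterion}")
    return sorted(scores.items(), key=lambda item: item[1])

# Made-up log-likelihoods, for illustration only.
example_fits = {
    "JC69": (-2500.0, 0),
    "GTR": (-2420.0, 8),
    "GTR+G": (-2400.0, 9),
}

for name, score in rank_models(example_fits, n_sites=1000, criterion="BIC"):
    print(f"{name}: {score:.2f}")
```

BIC penalizes extra parameters more heavily than AIC once ln n exceeds 2, so the two criteria can disagree on long alignments.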

Applications

ML phylogenetics is widely applied in molecular systematics, viral phylodynamics, and comparative genomics. It underlies many large-scale species trees and is one of the most commonly used methods in published phylogenetic studies. The approach is routinely used to trace transmission from sequence data collected during pathogen outbreaks, to study viral evolutionary dynamics and inform viral classification, and to resolve deep relationships among bacteria, where rapid evolution can obscure phylogenetic signal.