Overview
Multiple sequence alignment (MSA) arranges three or more biological sequences so that homologous residues occupy the same column, exposing regions conserved across an entire family. Whereas pairwise alignment reveals similarity between two sequences, MSA captures the evolutionary depth of a homologous group, highlighting residues that have been maintained over long evolutionary timescales. These conserved positions are often critical for structure, catalysis, or regulation. MSA is a prerequisite for phylogenetic tree construction, protein domain identification, and the generation of sequence logos that visualize conservation patterns.
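As a rough illustration of how conservation can be quantified per column, the sketch below computes the information content (in bits) that sequence logos typically plot. It assumes the input is an already aligned list of equal-length strings with "-" as the gap character; the function name column_information and the alphabet_size parameter are illustrative choices, and no small-sample correction is applied.

```python
import math
from collections import Counter


def column_information(alignment, alphabet_size=20):
    """Return per-column information content in bits, ignoring gaps.

    A fully conserved column scores log2(alphabet_size); a column with
    evenly mixed residues scores close to zero.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    max_bits = math.log2(alphabet_size)
    scores = []
    for col in range(length):
        # Collect the residues in this column, skipping gap characters.
        residues = [seq[col] for seq in alignment if seq[col] != "-"]
        if not residues:
            scores.append(0.0)
            continue
        total = len(residues)
        counts = Counter(residues)
        # Shannon entropy of the observed residue frequencies.
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        scores.append(max_bits - entropy)
    return scores


# Toy DNA alignment (alphabet of 4): the first column is fully conserved.
toy = ["ACGT", "ACGA", "ACCA", "AC-T"]
scores = column_information(toy, alphabet_size=4)
```

Columns with the highest values are the candidates for structurally or functionally constrained positions.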
Key Concepts
Progressive alignment, the core strategy of tools such as Clustal Omega and MUSCLE, builds an MSA by first constructing a guide tree from pairwise distances and then aligning sequences and sub-alignments in the order the tree dictates, starting with the most closely related. Because gaps placed early are not revisited, early errors propagate into the final alignment; iterative methods address this by repeatedly realigning subsets of sequences to improve the overall objective score. Consistency-based tools like T-Coffee favor a residue pairing when it is also supported by the pairwise alignments of both sequences to a third sequence. MAFFT uses fast Fourier transforms to identify homologous segments quickly, which makes it practical for very large datasets. Quality assessment metrics such as the sum-of-pairs score and the column score measure how well an alignment agrees with a trusted reference, and trimming tools remove poorly aligned columns before downstream analysis.
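The two quality metrics named above can be sketched as follows, under the common benchmarking definitions: the sum-of-pairs score is the fraction of residue pairs aligned in a reference that are also aligned in the test alignment, and the column score is the fraction of reference columns reproduced exactly. The function names are hypothetical, "-" is assumed to be the gap character, and both alignments are assumed to list the same sequences in the same order.

```python
from itertools import combinations


def _columns(alignment):
    """Describe each column as a tuple of residue indices (None for a gap)."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    next_index = [0] * len(alignment)  # next ungapped position per sequence
    columns = []
    for col in range(length):
        column = []
        for i, row in enumerate(alignment):
            if row[col] == "-":
                column.append(None)
            else:
                column.append(next_index[i])
                next_index[i] += 1
        columns.append(tuple(column))
    return columns


def _aligned_pairs(columns):
    """Collect every pair of residues that share a column."""
    pairs = set()
    for column in columns:
        for i, j in combinations(range(len(column)), 2):
            if column[i] is not None and column[j] is not None:
                pairs.add((i, column[i], j, column[j]))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs that the test alignment recovers."""
    ref_pairs = _aligned_pairs(_columns(reference))
    test_pairs = _aligned_pairs(_columns(test))
    return len(ref_pairs & test_pairs) / len(ref_pairs) if ref_pairs else 0.0


def column_score(test, reference):
    """Fraction of reference columns that appear unchanged in the test alignment."""
    ref_cols = _columns(reference)
    test_cols = set(_columns(test))
    return sum(c in test_cols for c in ref_cols) / len(ref_cols) if ref_cols else 0.0


# Toy example: the test alignment misplaces one residue of the third sequence.
reference = ["ACG-T", "ACGAT", "AC-AT"]
test = ["ACG-T", "ACGAT", "ACA-T"]
sp = sum_of_pairs_score(test, reference)
cs = column_score(test, reference)
```

The column score is the stricter of the two: a single misplaced residue invalidates its whole column, whereas the sum-of-pairs score only loses the pairs involving that residue.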
Applications
In bacterial comparative genomics, MSA identifies genes conserved across pathogenic strains. Profiles built from an MSA make homology searches in DNA sequencing projects more sensitive than single-sequence queries, and conserved columns reveal functionally important residues that inform protein structure prediction. In evolutionary biology, an MSA is the required input for maximum-likelihood and Bayesian phylogenetic inference, which in turn enable reconstruction of ancestral sequences and dating of speciation events.