Hidden Markov Models in Sequence Analysis

Overview

A hidden Markov model (HMM) is a statistical model that represents a sequence of observable events as being generated by an underlying sequence of unobserved (hidden) states. In bioinformatics, HMMs model biological sequences where the hidden states might represent exons and introns, protein secondary structure elements, or conserved columns in a multiple sequence alignment. The power of the HMM framework lies in its ability to capture position-specific conservation patterns, insertions, and deletions through a unified probabilistic architecture trained from known examples.

Key Concepts

An HMM is defined by three sets of parameters: transition probabilities between hidden states, emission probabilities of observing a symbol from each state, and initial state probabilities. The Viterbi algorithm finds the single most probable sequence of hidden states for a given observation, for example the most likely gene structure for a genomic DNA sequence. The forward-backward algorithm computes the posterior probability of each state at each position given the entire sequence, which can be used to assess prediction confidence. Profile HMMs, built from multiple sequence alignments, model protein domain families. The HMMER software package uses profile HMMs for sensitive remote homology detection and is generally more sensitive than BLAST for divergent sequences.
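
The two decoding algorithms described above can be illustrated with a minimal sketch. It assumes a hypothetical two-state model ("coding" and "noncoding") over the DNA alphabet; the state names and all probabilities are invented for illustration and are not taken from any real gene finder. The function viterbi works in log space, while posteriors uses unscaled probabilities and is therefore only suitable for short sequences.

```python
import math

# Hypothetical toy model: every number below is an illustrative assumption.
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(obs):
    """Return the most probable hidden-state path for obs (log space)."""
    # score[s] = log probability of the best path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    backptrs = []
    for sym in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            # Best predecessor state for s at this position
            best = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            ptr[s] = best
            score[s] = (prev[best] + math.log(TRANS[best][s])
                        + math.log(EMIT[s][sym]))
        backptrs.append(ptr)
    # Trace back from the best final state
    path = [max(STATES, key=lambda s: score[s])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return path[::-1]


def posteriors(obs):
    """Return P(state at t | whole sequence) for each t (forward-backward)."""
    # Forward pass: fwd[t][s] = P(obs[0..t], state at t = s)
    fwd = [{s: START[s] * EMIT[s][obs[0]] for s in STATES}]
    for sym in obs[1:]:
        prev = fwd[-1]
        fwd.append({s: EMIT[s][sym] * sum(prev[r] * TRANS[r][s] for r in STATES)
                    for s in STATES})
    # Backward pass: bwd[t][s] = P(obs[t+1..] | state at t = s)
    bwd = [{s: 1.0 for s in STATES}]
    for sym in reversed(obs[1:]):
        nxt = bwd[0]
        bwd.insert(0, {s: sum(TRANS[s][r] * EMIT[r][sym] * nxt[r] for r in STATES)
                       for s in STATES})
    total = sum(fwd[-1].values())  # likelihood of the whole sequence
    return [{s: fwd[t][s] * bwd[t][s] / total for s in STATES}
            for t in range(len(obs))]


if __name__ == "__main__":
    seq = "GCGCGCGCATATATAT"
    for sym, state, post in zip(seq, viterbi(seq), posteriors(seq)):
        print(sym, state, round(post[state], 3))
```

Run as a script, the sketch prints each symbol with its Viterbi state and the posterior probability of that state. A posterior close to 0.5 marks a position where the single best path is poorly supported. Production tools avoid numerical underflow on long sequences by rescaling the forward and backward values or by working entirely in log space.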

Applications

HMMs are used for gene prediction in prokaryotic and eukaryotic genomes, identifying splice sites and coding regions. Profile HMMs classify proteins into families and superfamilies, aiding protein structure prediction and functional annotation. They are also used to model substrate specificity when classifying enzymes and to detect regulatory elements in DNA. In metagenomics, HMMs assign functional roles to sequence fragments of unknown origin.