K-mer Analysis: Sequence Composition and Frequency

Overview

K-mer analysis decomposes a biological sequence into all overlapping substrings of a fixed length k and counts how often each one occurs; a sequence of length L contains L - k + 1 k-mers. This simple technique captures the compositional properties of genomes and transcriptomes without requiring alignment, making it computationally efficient and reference-free. K-mer frequency distributions can be used to estimate genome size, heterozygosity, repeat content, and sequencing error rates from raw reads before any assembly step. The choice of k involves a trade-off: small k values (k < 20) give robust counts but limited discriminative power, because short k-mers recur by chance throughout a genome, while large k values (k > 50) offer high specificity but lower k-mer coverage, because each read yields fewer k-mers and a single sequencing error corrupts up to k of them.
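
The decomposition described above can be sketched in a few lines of Python. This is a minimal illustration, not how production counters work: the function name count_kmers is hypothetical, the whole sequence is held in memory, and k-mers are counted on the given strand only, whereas real tools typically merge each k-mer with its reverse complement.

```python
from collections import Counter

VALID_BASES = frozenset("ACGT")

def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq (hypothetical helper).

    A sequence of length L yields L - k + 1 windows; windows that
    contain N or any other non-ACGT character are skipped.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if VALID_BASES.issuperset(kmer):
            counts[kmer] += 1
    return counts

# Toy example: a 7-base sequence has 7 - 3 + 1 = 5 overlapping 3-mers.
counts = count_kmers("ATGATGA", 3)
print(counts.most_common())  # [('ATG', 2), ('TGA', 2), ('GAT', 1)]
```

A plain dictionary is enough for short sequences; at genome scale the number of distinct k-mers makes memory the limiting factor, which is why dedicated counters exist.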

Key Concepts

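The k-mer spectrum, a histogram of the number of distinct k-mers observed at each multiplicity, is approximately Poisson-distributed around the coverage depth in ideal data. K-mers created by sequencing errors are mostly unique to a single read, so they accumulate at very low multiplicities (predominantly 1), while genuine genomic k-mers form a peak at the expected k-mer coverage depth. Genomic repeats produce additional peaks at multiples of that depth. Tools such as Jellyfish and KMC count k-mers efficiently: Jellyfish uses a lock-free hash table, whereas KMC relies on disk-based partitioning and sorting, and some other counters are built on suffix arrays. Beyond counting, k-mer-based methods include k-mer distance (one minus the fraction of k-mers shared between two samples, as in the Jaccard distance) for phylogeny, k-mer coverage for estimating genome size (the total number of genuine k-mer occurrences divided by the peak depth), and k-mer spectra for error correction by discarding or correcting k-mers below a coverage threshold.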
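
As a rough sketch of these ideas, the following Python code builds a spectrum from k-mer counts, estimates genome size from it, and computes a Jaccard distance between two k-mer sets. All function names and the min_depth cutoff are illustrative assumptions; in particular, taking the most populated multiplicity as the coverage peak is a simplification that dedicated spectrum-fitting tools replace with a statistical model.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count overlapping k-mers in seq (hypothetical helper)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum(counts):
    """Map each multiplicity to the number of distinct k-mers seen that often."""
    return Counter(counts.values())

def estimate_genome_size(spec, min_depth=2):
    """Estimate genome size as total k-mer occurrences / peak depth.

    K-mers below min_depth are treated as sequencing errors and ignored.
    The peak is taken to be the most populated remaining multiplicity,
    which is an assumption that only holds for clean, simple spectra.
    """
    solid = {depth: n for depth, n in spec.items() if depth >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak = max(solid, key=lambda depth: solid[depth])
    total = sum(depth * n for depth, n in solid.items())
    return total / peak

def jaccard_distance(kmers_a, kmers_b):
    """One minus the fraction of distinct k-mers shared by two samples."""
    a, b = set(kmers_a), set(kmers_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

# Made-up spectrum: 500 error singletons, 1000 single-copy k-mers at
# depth 20, and 50 two-copy repeat k-mers at depth 40.
toy_spectrum = {1: 500, 20: 1000, 40: 50}
print(estimate_genome_size(toy_spectrum))  # 1100.0

# "ACGTA" and "CGTAC" share 2 of 4 distinct 3-mers.
print(jaccard_distance(kmer_counts("ACGTA", 3), kmer_counts("CGTAC", 3)))  # 0.5
```

In the toy spectrum the repeat k-mers contribute twice each to the estimate (1000 + 2 x 50 = 1100), which is the behavior one wants: repeated sequence counts toward genome size once per copy.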

Applications

K-mer analysis is integral to quality control in next-generation sequencing, where it is used to detect contamination and estimate coverage before assembly. In bacterial genetics, k-mer-based methods distinguish strains by their characteristic compositional signatures. Metagenomic binning uses k-mer frequency vectors (commonly tetranucleotide frequencies) to cluster contigs that originate from the same organism. Read error-correction tools use k-mer counts to identify bases that give rise to low-frequency k-mers and replace them so that the resulting k-mers match high-frequency ones.
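
To make the binning idea concrete, the sketch below turns a contig into a normalized k-mer frequency vector and compares contigs by Euclidean distance. The function names and toy sequences are illustrative assumptions, and the approach is deliberately simplified: practical binners usually also merge reverse complements and combine composition with read-coverage information.

```python
from itertools import product
from math import sqrt

def frequency_vector(seq, k=4):
    """Return the relative frequency of each of the 4**k possible k-mers.

    The vector is ordered lexicographically (AAAA, AAAC, ...), and
    windows containing non-ACGT characters are ignored.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    vec = [0.0] * len(kmers)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            vec[j] += 1.0
    total = sum(vec)
    return [v / total for v in vec] if total else vec

def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy contigs: two AT-rich fragments and one GC-rich fragment (k=2 for brevity).
contig_a = frequency_vector("ATATATATAT", k=2)
contig_b = frequency_vector("ATATATAT", k=2)
contig_c = frequency_vector("GCGCGCGCGC", k=2)

# Contigs with similar composition end up closer together.
print(euclidean(contig_a, contig_b) < euclidean(contig_a, contig_c))  # True
```

Normalizing to relative frequencies keeps contigs of different lengths comparable, which matters because assembled contigs vary widely in size.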