Sequence Assembly: Reconstructing DNA from Fragments

Overview

Sequence assembly is the computational process of reconstructing a complete genome or transcriptome from the millions or billions of DNA fragments (reads) produced by sequencers. Because short-read sequencing machines typically read only 150–300 base pairs at a time, assembly algorithms must find overlaps between reads, merge them into longer contiguous sequences (contigs), and order and orient contigs into scaffolds. The difficulty of assembly increases with genome size, repetitive content, and heterozygosity. Assembly quality is measured by metrics such as N50 (the length L such that contigs of length L or longer together contain at least 50% of the assembled bases) and the total number of contigs.
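The N50 metric is easiest to understand through a small calculation. The sketch below is illustrative only: the function name n50 and the contig lengths are hypothetical, and it uses the common definition in which contigs are sorted from longest to shortest and accumulated until they cover at least half of the assembled bases.

```python
def n50(contig_lengths):
    """Return N50: the largest length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Accumulate contigs from longest to shortest.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running sum to >= 50%.
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (in base pairs) for illustration.
lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(lengths))  # prints 70: 80 + 70 = 150, half of the 300 total
```

Note that N50 describes contiguity only. An assembly that wrongly joins unrelated contigs can raise N50, which is why the metric is usually reported alongside the contig count and other checks.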

Methods

Two main assembly paradigms exist. Overlap-layout-consensus (OLC), used by long-read assemblers such as Canu (Flye follows a related repeat-graph approach), computes pairwise overlaps between reads, constructs an overlap graph, lays out paths through it, and derives a consensus sequence. De Bruijn graph assemblers, such as SPAdes and Velvet, decompose reads into k-mers and build a graph in which k-mers are nodes and edges connect k-mers that overlap by k-1 bases. Because identical k-mers collapse into a single node, the graph grows mainly with genome size and sequencing errors rather than with the number of reads, so this approach scales efficiently to large genomes with high coverage. Hybrid assemblers combine the accuracy of short reads with the long-range information of long reads to resolve repetitive regions. Metagenomic assemblers like MEGAHIT handle mixed microbial communities by accommodating varying coverage depths across species.
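The de Bruijn construction described above can be sketched in a few lines. This is a toy illustration, not how SPAdes or Velvet are implemented: the function names (kmers, build_de_bruijn, walk_unambiguous) and the reads are hypothetical, and the sketch ignores reverse complements, sequencing errors, and k-mer coverage, all of which real assemblers must handle.

```python
def kmers(read, k):
    """Return all k-mers of a read, in order of occurrence."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Build a graph whose nodes are k-mers. An edge joins two k-mers that
    occur consecutively in a read, so they overlap by k-1 bases."""
    graph = {}
    for read in reads:
        ks = kmers(read, k)
        for km in ks:
            graph.setdefault(km, set())  # every k-mer becomes a node
        for left, right in zip(ks, ks[1:]):
            graph[left].add(right)
    return graph


def walk_unambiguous(graph, start):
    """Spell a contig by following edges while the current node has exactly
    one successor. Simplification: in-degree is not checked, so this is not
    a full unitig extraction."""
    contig = start
    node = start
    seen = {start}
    while len(graph[node]) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on a cycle
            break
        contig += nxt[-1]  # each step adds one new base
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical overlapping reads from the sequence ATGGCGTGCAA.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
graph = build_de_bruijn(reads, k=4)
print(walk_unambiguous(graph, "ATGG"))  # prints ATGGCGTGCAA
```

Shared k-mers such as GGCG appear in more than one read but occupy a single node, which is why the graph does not grow with coverage depth in the error-free case. A repeat longer than k-1 bases would give some node two successors, and the walk would stop there. This is the ambiguity that long reads or paired-end information help resolve.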

Applications

De novo genome assembly produces reference genomes for newly sequenced organisms, including bacteria, plants, and vertebrates. In next-generation sequencing projects, assembly typically precedes annotation and downstream analysis. Transcriptome assembly from RNA sequencing (RNA-seq) reads reveals alternatively spliced isoforms. In recombinant DNA work, Sanger sequencing reads of cloned inserts are assembled to verify plasmid constructs.