NCBI Databases: GenBank, RefSeq, and SRA

Overview

The National Center for Biotechnology Information (NCBI), part of the U.S. National Library of Medicine, maintains one of the most widely used collections of molecular biology databases. GenBank, its primary nucleotide sequence repository, accepts direct submissions from researchers and exchanges data with its partners in the International Nucleotide Sequence Database Collaboration (INSDC): the European Nucleotide Archive (ENA) at EMBL-EBI and the DNA Data Bank of Japan (DDBJ). RefSeq provides curated, non-redundant reference sequences for genomes, transcripts, and proteins. The Sequence Read Archive (SRA) stores raw sequencing reads from high-throughput platforms. Together, these resources form the backbone of public molecular data access.

Key Concepts

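GenBank records include the sequence itself, biological annotations, and bibliographic metadata. Accession numbers provide stable identifiers for citation, and an optional version suffix (a dot followed by a number) identifies a specific revision of the sequence. GenBank accessions consist of a letter prefix followed by digits (e.g., U49845), whereas RefSeq accessions carry a two-letter prefix and an underscore (e.g., NM_001234). RefSeq differs from GenBank in ownership and curation: a GenBank record belongs to its submitter and may duplicate other submissions, while RefSeq records are derived from INSDC data and maintained by NCBI as a non-redundant set. Some RefSeq records are reviewed manually by NCBI staff, and others, such as model transcripts and proteins with XM_ and XP_ prefixes, are produced by automated annotation pipelines, so the level of review varies from record to record. SRA archives sequencing reads in a compressed format, typically storing both the base calls and their quality scores. The Entrez search system provides unified querying across NCBI databases, so that related records in different databases (for example, a nucleotide record and the protein it encodes) can be retrieved through a single interface.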
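
The accession conventions described above can be checked mechanically. The following is a minimal sketch in Python, not an official NCBI parser: the function name describe_accession and the prefix table are illustrative assumptions, and the table covers only a subset of the RefSeq prefixes.

```python
import re

# Illustrative subset of RefSeq accession prefixes (not exhaustive).
REFSEQ_PREFIXES = {
    "NC": "complete genomic molecule",
    "NG": "genomic region",
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "NP": "protein",
    "XM": "predicted mRNA (model)",
    "XR": "predicted non-coding RNA (model)",
    "XP": "predicted protein (model)",
    "WP": "non-redundant protein",
}

# Two uppercase letters, an underscore, an identifier body,
# and an optional ".version" suffix.
_REFSEQ_RE = re.compile(r"^([A-Z]{2})_([A-Z0-9]+?)(?:\.(\d+))?$")


def describe_accession(accession):
    """Return (style, description, version) for an accession string.

    Hypothetical helper: it classifies by format only and does not
    check that the accession exists in any database.
    """
    text = accession.strip().upper()
    match = _REFSEQ_RE.match(text)
    if match:
        prefix, _body, version = match.groups()
        description = REFSEQ_PREFIXES.get(
            prefix, "RefSeq-style, prefix not in table"
        )
        return ("RefSeq-style", description, int(version) if version else None)

    # No two-letter-plus-underscore prefix: treat as an INSDC-style accession.
    _base, _dot, version = text.partition(".")
    return (
        "INSDC-style",
        "GenBank/ENA/DDBJ-style accession",
        int(version) if version.isdigit() else None,
    )
```

For example, describe_accession("NM_000546.6") returns ("RefSeq-style", "mRNA", 6), while describe_accession("U49845") is classified as INSDC-style with no version.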

Applications

NCBI databases support routine tasks across molecular biology. A common first step in characterizing an unknown sequence is a BLAST search against nucleotide collections that include GenBank, which identifies similar sequences that have already been annotated. In next-generation sequencing studies, reads are typically aligned to a reference genome, often a RefSeq assembly, for variant calling, and the raw reads are deposited in SRA so that others can reanalyze them. Bacterial genetics studies rely on RefSeq genomes for comparative genomics and for identifying species-specific genes and virulence factors.
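
Retrieval steps in workflows like these are often scripted through E-utilities, the programmatic interface to Entrez. The sketch below only builds request URLs using the Python standard library and sends nothing over the network. The helper names esearch_url and efetch_fasta_url are assumptions made for illustration, and the query term is an arbitrary example.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL that lists matching record IDs as JSON."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_fasta_url(db, accession):
    """Build an EFetch URL that returns one record as FASTA text."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Search SRA for runs from one organism (example term only).
sra_search = esearch_url("sra", "Escherichia coli[Organism]", retmax=5)

# Fetch a RefSeq transcript from the nucleotide database.
refseq_fetch = efetch_fasta_url("nuccore", "NM_000546")
```

The resulting URLs can be passed to any HTTP client. NCBI asks automated clients to limit their request rate, so scripts that loop over many records should pause between calls.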