Biological Databases: A Comprehensive Overview

Overview

Biological databases are the foundational infrastructure of bioinformatics. They collect, curate, organize, and distribute the vast quantities of molecular data generated by modern experimental biology. These databases range from small, specialized resources maintained by individual laboratories to massive international repositories such as GenBank, which contains billions of nucleotide sequences. A dataset becomes more valuable once it is deposited in a public database, because it can then be discovered, reused, and integrated with other data types. Data standards, controlled vocabularies, and cross-references between databases make this reuse and integration possible.
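
As a rough illustration of how a controlled vocabulary supports interoperability, the sketch below maps free-text annotations from different submitters onto a single shared term identifier. The vocabulary, the term IDs (`TERM:0001`, `TERM:0002`), and the function names are hypothetical placeholders and do not come from any real ontology or library.

```python
# Hypothetical controlled vocabulary: term IDs, labels, and synonyms are
# placeholders for illustration, not entries from a real ontology.
VOCABULARY = {
    "TERM:0001": {"label": "DNA binding", "synonyms": {"binds DNA", "DNA-binding"}},
    "TERM:0002": {"label": "membrane transport", "synonyms": {"transport across membrane"}},
}


def build_index(vocabulary):
    """Map every label and synonym (case-insensitively) to its term ID."""
    index = {}
    for term_id, entry in vocabulary.items():
        for name in {entry["label"], *entry["synonyms"]}:
            index[name.casefold()] = term_id
    return index


def normalize(annotation, index):
    """Return the term ID for a free-text annotation, or None if it is unknown."""
    # Collapse runs of whitespace and ignore case before the lookup.
    key = " ".join(annotation.split()).casefold()
    return index.get(key)


index = build_index(VOCABULARY)
print(normalize("  DNA-Binding ", index))
print(normalize("binds dna", index))
print(normalize("unreviewed free text", index))
```

Because both spellings resolve to the same identifier, records annotated by different groups can be compared or joined on that identifier rather than on free text.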

Key Concepts

Primary databases store original experimental data (raw nucleotide sequences, protein sequences, and three-dimensional structures) with minimal interpretation. Secondary databases contain curated, annotated, or derived information built by analyzing primary data. Databases can also be grouped by the kind of data they hold, independently of the primary/secondary distinction. Sequence databases (GenBank, UniProt) store nucleotide and protein sequences. Structure databases (the Protein Data Bank, PDB) archive macromolecular coordinates. Functional databases (Gene Ontology, KEGG, Reactome) describe biological processes, pathways, and molecular functions. Cross-references connect records across databases, enabling integrated queries.
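
The role of cross-references can be sketched with a small in-memory model. The example below is a minimal illustration, not a description of how any real database is implemented: the `Record` class, the `linked_records` function, and every accession (`NUC_0001`, `PROT_0001`, and so on) are hypothetical. It follows cross-references breadth-first from one record and groups everything reachable by database type, which is the essence of an integrated query.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Record:
    accession: str   # placeholder identifier, not a real accession
    database: str    # kind of toy database holding the record
    category: str    # "primary" or "secondary"
    xrefs: list[str] = field(default_factory=list)  # accessions in other databases


# Hypothetical records; real databases use their own identifier schemes.
RECORDS = {
    r.accession: r
    for r in [
        Record("NUC_0001", "sequence", "primary", ["PROT_0001"]),
        Record("PROT_0001", "sequence", "primary", ["NUC_0001", "STRUCT_0001", "FUNC_0001"]),
        Record("STRUCT_0001", "structure", "primary", ["PROT_0001"]),
        Record("FUNC_0001", "function", "secondary", []),
    ]
}


def linked_records(start, records):
    """Follow cross-references breadth-first; group reachable accessions by database."""
    seen = {start}
    queue = deque([start])
    by_database = {}
    while queue:
        current = records[queue.popleft()]
        by_database.setdefault(current.database, []).append(current.accession)
        for target in current.xrefs:
            # Skip dangling references and records that were already visited.
            if target in records and target not in seen:
                seen.add(target)
                queue.append(target)
    return by_database


result = linked_records("NUC_0001", RECORDS)
print(result)
```

Starting from a single nucleotide record, the traversal reaches the linked protein, structure, and functional annotation without the caller needing to know in advance which databases hold them.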

Applications

Biological databases support nearly every area of molecular biology. Researchers use them to retrieve reference sequences for DNA sequencing projects, search for homologous protein structures, and look up enzyme classification and nomenclature for metabolic reconstruction. Database integration is essential for systems biology, where data from multiple sources must be combined to build predictive models of cellular behavior.
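
One simple way to picture database integration is as a merge of per-source tables on a shared identifier. The sketch below assumes three hypothetical sources (`sequence_length`, `pathway`, `has_structure`) keyed by made-up gene identifiers; the `integrate` function is illustrative only. Identifiers that a source does not cover are kept and marked with `None`, so gaps in coverage stay visible to downstream modeling.

```python
def integrate(sources):
    """Merge per-source annotations into one record per shared identifier."""
    merged = {}
    for source_name, table in sources.items():
        for identifier, value in table.items():
            merged.setdefault(identifier, {})[source_name] = value
    # Mark sources that have no entry for an identifier explicitly.
    for record in merged.values():
        for source_name in sources:
            record.setdefault(source_name, None)
    return merged


# Hypothetical sources and values, keyed by placeholder gene identifiers.
sources = {
    "sequence_length": {"GENE_A": 1200, "GENE_B": 850},
    "pathway": {"GENE_A": "hypothetical pathway 1"},
    "has_structure": {"GENE_B": True, "GENE_C": False},
}

merged = integrate(sources)
for identifier in sorted(merged):
    print(identifier, merged[identifier])
```

In practice the hard part is agreeing on the shared identifier, which is where the cross-references and data standards described earlier come in.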