UniProt and Protein Sequence Databases

Overview

UniProt (Universal Protein Resource) is a comprehensive, freely accessible repository of protein sequence and functional information. It is maintained by the UniProt Consortium, which comprises the European Bioinformatics Institute (EMBL-EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR). UniProt integrates data from genome sequencing projects, literature curation, and computational predictions to provide a single, comprehensive view of each known protein. Its three main components (UniProtKB, UniRef, and UniParc) serve different analytical needs.

Key Concepts

The UniProt Knowledgebase (UniProtKB) is divided into two sections. Swiss-Prot contains manually curated, reviewed entries with detailed annotation derived from the literature. TrEMBL contains automatically annotated, unreviewed entries that await manual curation. UniRef clusters protein sequences at three identity levels (UniRef100, UniRef90, and UniRef50, corresponding to 100%, 90%, and 50% sequence identity) to reduce redundancy and speed up sequence similarity searches. UniParc is a comprehensive, non-redundant archive: it stores each distinct protein sequence once and tracks the major source databases in which that sequence appears. Each UniProtKB entry includes the sequence, functional information, subcellular location, post-translational modifications, domains and sites, and cross-references to other databases.
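
The split between Swiss-Prot and TrEMBL is visible in the FASTA header of every UniProtKB record, which begins with "sp" for reviewed entries and "tr" for unreviewed ones. The following is a minimal Python sketch of reading such a header, assuming the standard UniProtKB layout (">db|accession|entry name", then the protein name, then fields such as OS= and GN=). The names FastaHeader and parse_header are illustrative and not part of any UniProt tool, and the TrEMBL-style header in the usage example is made up.

```python
import re
from typing import NamedTuple, Optional


class FastaHeader(NamedTuple):
    """Selected fields from a UniProtKB FASTA header (illustrative type)."""
    reviewed: bool           # True for Swiss-Prot (sp), False for TrEMBL (tr)
    accession: str
    entry_name: str
    protein_name: str
    organism: Optional[str]  # OS= field, if present
    gene: Optional[str]      # GN= field, if present


# >db|accession|entry_name followed by free text
HEADER_RE = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s+(.*)$")
# KEY= markers that separate the trailing fields
FIELD_RE = re.compile(r"\b(OS|OX|GN|PE|SV)=")


def parse_header(line: str) -> FastaHeader:
    """Parse one UniProtKB-style FASTA header line."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a UniProtKB-style FASTA header: {line!r}")
    db, accession, entry_name, rest = match.groups()

    # re.split with a capturing group yields:
    # [protein name, key1, value1, key2, value2, ...]
    parts = FIELD_RE.split(rest)
    protein_name = parts[0].strip()
    fields = {key: value.strip() for key, value in zip(parts[1::2], parts[2::2])}

    return FastaHeader(
        reviewed=(db == "sp"),
        accession=accession,
        entry_name=entry_name,
        protein_name=protein_name,
        organism=fields.get("OS"),
        gene=fields.get("GN"),
    )


if __name__ == "__main__":
    # Swiss-Prot header for human insulin.
    print(parse_header(
        ">sp|P01308|INS_HUMAN Insulin OS=Homo sapiens OX=9606 GN=INS PE=1 SV=1"
    ))
    # Made-up TrEMBL-style header, for illustration only.
    print(parse_header(
        ">tr|A0A000EXMPL|A0A000EXMPL_9ZZZZ Uncharacterized protein "
        "OS=Example organism OX=0 PE=4 SV=1"
    ))
```

A workflow that should use only curated annotation could filter records on the reviewed flag before further analysis.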

Applications

UniProt serves as the standard protein reference in many bioinformatics workflows. Researchers use it to retrieve sequences for protein structure prediction, to identify proteins in proteomics and mass spectrometry experiments by searching spectra against a sequence database, and to look up functional annotations, such as enzyme classification and nomenclature, for metabolic pathway studies.
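
Sequence retrieval of this kind is often scripted. The following is a hedged Python sketch, assuming the public REST endpoint at rest.uniprot.org, which serves one entry's FASTA record at /uniprotkb/<accession>.fasta. The helper names fasta_url, read_sequence, and fetch_sequence are illustrative, not part of an official client, and the sketch omits error handling and retries.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the UniProtKB REST service.
BASE_URL = "https://rest.uniprot.org/uniprotkb"


def fasta_url(accession: str) -> str:
    """Build the URL of the FASTA record for one UniProtKB accession."""
    return f"{BASE_URL}/{quote(accession.strip(), safe='')}.fasta"


def read_sequence(fasta_text: str) -> str:
    """Return the sequence of the first FASTA record, without line breaks."""
    lines = fasta_text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected FASTA text beginning with a '>' header line")
    residues = []
    for line in lines[1:]:
        if line.startswith(">"):  # start of the next record: stop here
            break
        residues.append(line.strip())
    return "".join(residues)


def fetch_sequence(accession: str, timeout: float = 30.0) -> str:
    """Download one entry and return its sequence (requires network access)."""
    with urlopen(fasta_url(accession), timeout=timeout) as response:
        return read_sequence(response.read().decode("utf-8"))


if __name__ == "__main__":
    # P01308 is the accession of human insulin.
    print(fasta_url("P01308"))
```

Calling fetch_sequence("P01308") would return the insulin precursor sequence as a single string, ready to pass to a structure prediction tool or to add to a search database.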