Transformers in Biology

Overview

Transformers are neural network architectures that process sequential data with self-attention rather than recurrent or convolutional connections. Unlike RNNs, which process tokens one at a time, transformers process an entire sequence in parallel: each token attends to every other token, so the model can learn contextual dependencies regardless of how far apart the tokens are. This architecture has had a major impact on the analysis of biological data. Protein and DNA sequences are naturally represented as sequences of tokens, and AlphaFold has shown that attention-based models can predict many protein structures from amino acid sequences with near-atomic accuracy. Biological language models built on transformers are now widely used in genomics and drug discovery.
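The idea that each token attends to every other token can be illustrated with a minimal sketch of single-head scaled dot-product attention. This is a toy illustration, not any particular model's implementation: the function names are hypothetical, the token embeddings are used directly as queries, keys, and values (a real layer would first apply learned projection matrices), and only the Python standard library is used to keep the example dependency-free.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    queries, keys: one d_k-dimensional vector per token.
    values: one vector per token.
    Returns (outputs, weights), where weights[i][j] is how strongly
    token i attends to token j. Each row of weights sums to 1.
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w_j * v[c] for w_j, v in zip(w, values)) for c in range(d_v)]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy input: three tokens with 2-dimensional embeddings (assumed values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Every row of `weights` is computed independently of the others, which is why the whole sequence can be processed in parallel rather than token by token.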

Methods

The original Transformer architecture consists of stacked encoder and decoder layers, each combining multi-head self-attention with feed-forward networks, and uses positional encodings to represent sequence order. BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only model that randomly masks tokens and learns to predict them from the surrounding context. Autoregressive language models such as GPT are decoder-only and generate tokens sequentially, each conditioned on the tokens before it. In biology, AlphaFold2 uses attention-based modules that operate on multiple sequence alignments and residue-pair representations to predict three-dimensional atomic coordinates, including side-chain torsion angles. DNABERT and SpliceBERT apply BERT-style pre-training to nucleotide sequences and learn functional representations of genomic fragments. ESM (Evolutionary Scale Modeling) trains transformers on protein sequences, and the learned representations are used to predict structure and function. Recent advances include graph transformers for chemistry and diffusion models for molecule generation.
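The masked-token objective described above can be sketched as follows. The sketch follows the scheme of the original BERT (about 15% of positions are selected; of those, 80% become a mask token, 10% become a random token, and 10% are left unchanged). The function name, the tiny 3-mer vocabulary, and the seed are illustrative assumptions, and genomic models may use a different scheme, such as masking contiguous spans.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for masked language modeling.

    labels[i] holds the original token at positions the model must
    predict and None everywhere else.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            masked[i] = MASK               # 80%: replace with the mask token
        elif r < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels

# Toy 3-mer vocabulary and input (assumed for illustration only).
vocab = ["ACG", "CGT", "GTA", "TAC"]
tokens = vocab * 5
masked, labels = mask_tokens(tokens, vocab, seed=0)
```

During training, the loss is computed only at positions where `labels` is not `None`, so the model learns to reconstruct tokens from their context on both sides.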

Practical protocol

A practical workflow for genomic analysis with transformers begins with data preparation. For a regulatory element prediction task, the researcher collects 1,000 base pair (bp) genomic sequences around known transcription start sites from Ensembl. The sequences are tokenized into 6-bp fragments (hexamers), which gives a vocabulary of 4^6 = 4,096 possible hexamers in addition to the model's special tokens. A pre-trained DNABERT checkpoint is loaded with the Hugging Face Transformers library and fine-tuned on 50,000 annotated promoter regions, using sliding windows that overlap by 200 bp for data augmentation. Training uses the AdamW optimizer with a learning rate of 2e-5 for 10 epochs and a batch size of 16 on a GPU with 24 GB of memory. Fine-tuning takes about 4 hours.

The fine-tuned model achieves 91% accuracy in classifying promoter versus non-promoter regions on a held-out test set. Visualizing the attention weights shows that the model has learned known transcription factor binding motifs without any explicit motif supervision. Applied to unannotated regions of the genome, the model predicts 5,000 new putative promoters, 30% of which are supported experimentally by ChIP-seq.

A notable example from protein modeling: the 15-billion-parameter ESM-2 model predicted the structure of an uncharacterized bacterial protein and revealed a previously unknown fold that was later confirmed by X-ray crystallography, suggesting that transformers can discover new protein folds from sequence data alone.
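The tokenization and windowing steps of the protocol above can be sketched in plain Python. This is a minimal sketch under stated assumptions: the hexamers are taken as overlapping k-mers with a stride of 1, the 500 bp window length is a hypothetical value (the protocol gives only the 200 bp overlap), special tokens and ambiguous bases such as N are not handled, and all function names are illustrative rather than part of any library.

```python
from itertools import product

K = 6  # hexamers, as in the protocol

def build_vocab(k=K):
    """Map every possible k-mer over A, C, G, T to an integer id."""
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: idx for idx, kmer in enumerate(kmers)}

def kmer_tokenize(seq, k=K):
    """Split a sequence into overlapping k-mers (stride 1, an assumption)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def sliding_windows(seq, window, overlap):
    """Cut a sequence into fixed-length windows sharing `overlap` bases."""
    step = window - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than window")
    return [seq[s:s + window] for s in range(0, len(seq) - window + 1, step)]

vocab = build_vocab()                    # 4**6 = 4096 entries
tokens = kmer_tokenize("ACGTACGTAC")     # 10 bp -> 5 overlapping hexamers
token_ids = [vocab[t] for t in tokens]

# A 1,000 bp placeholder sequence cut into hypothetical 500 bp windows
# that overlap by 200 bp, as a simple form of data augmentation.
example_seq = "ACGT" * 250
windows = sliding_windows(example_seq, window=500, overlap=200)
```

With these assumed values each 1,000 bp sequence yields two full windows, starting at positions 0 and 300; a shorter window or a larger overlap would yield more training examples per sequence.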

Applications

Transformers predict protein structures from amino acid sequences, annotate functional elements in genomes from DNA sequence, and model gene expression from chromatin data. They are also used to design new proteins for biotechnological applications, generate molecules with desired properties in drug discovery, and distinguish pathogenic from benign variants in clinical genomics.