Random Forests in Bioinformatics

Overview

Random forests are an ensemble learning method that builds multiple decision trees during training and combines their predictions through majority voting (classification) or averaging (regression). Each tree is trained on a different bootstrap sample of the data, and only a random subset of features is considered at each split, which increases diversity among the trees and reduces overfitting. In bioinformatics, random forests are popular because they cope with high-dimensional data, are robust to noise, and provide built-in feature importance values that indicate which biological predictors contribute most to the predictions.
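As a minimal sketch of the idea, the snippet below fits a small forest with scikit-learn and inspects the individual trees. The synthetic data from make_classification is only a stand-in for a real expression matrix, and the sizes (200 samples, 500 features, 100 trees) are illustrative assumptions. Note that scikit-learn combines trees by averaging their class probabilities rather than by a strict hard vote; with fully grown trees the two usually agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a high-dimensional dataset (not real biological data).
X, y = make_classification(
    n_samples=200, n_features=500, n_informative=10, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees in the ensemble
    max_features="sqrt",  # random subset of features considered at each split
    bootstrap=True,       # each tree is fitted on a bootstrap sample
    random_state=0,
)
forest.fit(X, y)

# Predictions of every individual tree for the first five samples.
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])

# Fraction of trees predicting class 1 for each of those samples.
vote_fraction = per_tree.mean(axis=0)

# Combined prediction of the whole ensemble.
ensemble_pred = forest.predict(X[:5])
print(vote_fraction, ensemble_pred)
```

Comparing vote_fraction with ensemble_pred shows how the ensemble decision emerges from trees that individually disagree.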

Methods

Bagging trains each tree on a bootstrap sample and averages the predictions, reducing variance without substantially increasing bias. Random feature selection at each split decorrelates the trees, which improves generalization. Feature importance is measured either by the mean decrease in impurity (Gini or entropy) or by the drop in accuracy when a feature's values are permuted. The out-of-bag (OOB) error, computed for each sample from the trees whose bootstrap sample did not include it, provides an internal validation estimate during training without a separate validation set. Random forests handle mixed data types (continuous and categorical) naturally; support for missing values depends on the implementation, and some implementations require imputation beforehand.
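The OOB estimate and the two importance measures can be illustrated with scikit-learn, as a rough sketch under stated assumptions: the data are synthetic, the sizes are arbitrary, and permutation importance is computed on a held-out split with accuracy as the score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 50 features, of which 5 are informative (illustrative only).
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=5, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# oob_score=True scores each training sample using only the trees
# whose bootstrap sample did not contain it.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X_train, y_train)
oob_error = 1.0 - forest.oob_score_

# Mean decrease in impurity, normalized so the values sum to 1.
mdi = forest.feature_importances_

# Permutation importance: mean drop in accuracy when one feature is shuffled.
perm = permutation_importance(
    forest, X_test, y_test, n_repeats=5, scoring="accuracy", random_state=0
)

top_mdi = mdi.argsort()[::-1][:5]
top_perm = perm.importances_mean.argsort()[::-1][:5]
print(f"OOB error: {oob_error:.3f}")
print("Top features by impurity decrease:", top_mdi)
print("Top features by permutation:", top_perm)
```

The two rankings need not agree exactly. Impurity-based importance is computed on the training data, whereas the permutation variant here is evaluated on held-out samples.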

Practical protocol

A random forest workflow for cancer classification starts with a gene expression dataset of 20,000 genes and 500 samples with known cancer subtypes. The data are split into 70% training and 30% test sets. A random forest of 1,000 trees is built, with each tree trained on a bootstrap sample and sqrt(p) features (about 141 of the 20,000 genes) sampled at each split. The OOB error is monitored during training to confirm that 1,000 trees are enough: it stabilizes at around 500 trees. Features are ranked by importance, and the top 50 genes that distinguish the subtypes are selected. The most important genes map to known signaling pathways that are specific to each subtype. On the held-out test set, the model achieves 92% accuracy, compared to 85% for a linear SVM. The class probabilities for each prediction are calibrated using Platt scaling to provide interpretable confidence values, and the model is saved as a serialized file for clinical use.

In a drug discovery application, a random forest was trained on 10,000 small molecules with known activity against a kinase target, using 2,000 molecular descriptors. The model correctly classified 88% of the test compounds, and the most important descriptors (molecular weight, LogP, and number of hydrogen bond donors) were consistent with established pharmacological knowledge. From a library of 50,000 compounds, the model predicted 15 new actives, 12 of which were confirmed in biochemical assays: a hit rate of 80% (12/15), compared to 20% for random screening.
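The cancer-classification protocol described above could be sketched roughly as follows. This is an illustration under several assumptions, not the original pipeline: the data are synthetic and scaled down (300 samples, 1,000 features, at most 200 trees) so that the example runs quickly, the file name rf_calibrated.joblib is hypothetical, and Platt scaling is taken to correspond to method="sigmoid" in scikit-learn's CalibratedClassifierCV.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an expression matrix with three subtypes
# (scaled down from the 500 x 20,000 dataset described in the text).
X, y = make_classification(
    n_samples=300, n_features=1000, n_informative=20,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# 70% training / 30% test, stratified by subtype.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Grow the forest in stages (warm_start keeps the existing trees)
# and record the OOB error at each size to see where it levels off.
forest = RandomForestClassifier(
    max_features="sqrt", oob_score=True, warm_start=True, random_state=0
)
oob_errors = {}
for n_trees in (50, 100, 200):
    forest.set_params(n_estimators=n_trees)
    forest.fit(X_train, y_train)
    oob_errors[n_trees] = 1.0 - forest.oob_score_

# Column indices of the 50 highest-ranked features ("genes").
top50 = np.argsort(forest.feature_importances_)[::-1][:50]

# Accuracy on the held-out test set.
test_accuracy = forest.score(X_test, y_test)

# Platt scaling: a sigmoid is fitted to the forest's outputs via 3-fold CV.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0),
    method="sigmoid", cv=3,
)
calibrated.fit(X_train, y_train)
proba = calibrated.predict_proba(X_test)

# Serialize the calibrated model and load it back.
model_path = os.path.join(tempfile.mkdtemp(), "rf_calibrated.joblib")
joblib.dump(calibrated, model_path)
reloaded = joblib.load(model_path)

print("OOB error by forest size:", oob_errors)
print(f"Test accuracy: {test_accuracy:.3f}")
```

A serialized model of this kind should only be loaded from a trusted source and with the same library versions that produced it.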

Applications

Random forests classify cancer subtypes from gene expression profiles and qPCR data, predict drug responses from genomic and proteomic data, and identify biomarker signatures in mass spectrometry data. They are also used to detect outliers in the quality control of high-throughput experiments and to prioritize driver mutations in whole-genome sequencing data.