Supervised Learning for Biological Classification

Overview

Supervised learning trains a model on a dataset of input-output pairs, in which the correct output (the label) is known for every input, so that the model can predict the output for new, unseen inputs. In bioinformatics, this paradigm maps naturally onto problems such as classifying a tissue sample as cancerous or healthy, predicting whether a genetic variant is pathogenic, or assigning a function to an uncharacterized protein. The model learns decision boundaries that separate classes in feature space, guided by a loss function that penalizes prediction errors on the training data.
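To make the role of the loss function concrete, the following is a minimal sketch of a binary classifier (logistic regression) fitted by gradient descent on the mean cross-entropy loss, using only the Python standard library. The toy feature values, the labels, the learning rate, and all function names are assumptions made for illustration; they are not taken from any real dataset or library.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(weights, bias, x):
    # Linear score w . x + b; the decision boundary is where this equals 0.
    return sum(w * v for w, v in zip(weights, x)) + bias

def log_loss(weights, bias, X, y):
    # Mean binary cross-entropy: large when confident predictions are wrong.
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(score(weights, bias, xi))
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / len(X)

def fit(X, y, lr=0.5, epochs=500):
    # Batch gradient descent on the mean cross-entropy loss.
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(score(weights, bias, xi)) - yi
            for j, v in enumerate(xi):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias

def predict(weights, bias, x):
    # Assign class 1 when the predicted probability is at least 0.5.
    return 1 if sigmoid(score(weights, bias, x)) >= 0.5 else 0

# Hypothetical toy data: two made-up expression features per sample,
# label 1 = cancerous, 0 = healthy.
X = [[0.2, 0.1], [0.4, 0.3], [0.3, 0.2], [1.8, 1.9], [2.1, 1.7], [1.9, 2.2]]
y = [0, 0, 0, 1, 1, 1]

initial_loss = log_loss([0.0, 0.0], 0.0, X, y)
weights, bias = fit(X, y)
final_loss = log_loss(weights, bias, X, y)
predictions = [predict(weights, bias, x) for x in X]
```

Training lowers the loss from its starting value, and the learned weights and bias define a linear decision boundary between the two classes. A real analysis would evaluate the model on held-out samples rather than on the training data.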

Methods

Logistic regression models the probability of class membership as a logistic function of a linear combination of the features; it remains a standard baseline for binary classification, especially when interpretability is valued. Support vector machines find the separating hyperplane with maximum margin and perform well in the high-dimensional settings typical of omics data. Random forests aggregate many decision trees, each trained on a bootstrap sample of the data, and provide built-in feature importance measures. Gradient-boosted trees (XGBoost, LightGBM) add trees sequentially, each one fitted to correct the errors of its predecessors, and are often among the top performers in structured-data competitions. Neural networks with hidden layers can learn complex non-linear boundaries but typically require more data and careful regularization. When classes are imbalanced, all of these methods benefit from class-weight adjustment or resampling.
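As an illustration of class-weight adjustment, the sketch below computes weights inversely proportional to class frequency, so that each class contributes equally to a weighted loss. The formula n_samples / (n_classes * class_count) is one common heuristic (it is the one scikit-learn uses for class_weight="balanced"); the function name and the 90/10 label split are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    # Weight each class by n_samples / (n_classes * class_count),
    # so rarer classes receive proportionally larger weights.
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical imbalanced variant dataset: 90 benign, 10 pathogenic.
labels = ["benign"] * 90 + ["pathogenic"] * 10
class_weights = balanced_class_weights(labels)

# Total weight per class after reweighting (equal across classes).
weighted_totals = {
    cls: class_weights[cls] * labels.count(cls) for cls in class_weights
}
```

With these weights, a misclassified pathogenic variant costs nine times as much as a misclassified benign one, which offsets the 9:1 imbalance. Resampling (oversampling the minority class or undersampling the majority class) is an alternative way to achieve a similar effect.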

Applications

In bacterial genetics, supervised learning classifies bacterial pathogens from genomic markers. In cancer biochemistry, it predicts patient prognosis. Applied to DNA microarray and other gene expression data, it helps identify differentially expressed genes. It also supports clinical decision-making by stratifying patients into risk groups, and it is used to annotate regulatory elements across the genome.