Feature Selection in Biological Data Analysis

Overview

Feature selection is the process of identifying a subset of the most informative variables (genes, proteins, metabolites, or spectral peaks) in a high-dimensional dataset. Biological experiments often measure tens of thousands of features on relatively few samples, creating a “large p, small n” problem in which models readily overfit and generalize poorly. Feature selection mitigates this by removing irrelevant and redundant variables, which can improve model accuracy, reduce training time, and yield more interpretable models that highlight the likely biological drivers of a phenotype.

Methods

Filter methods score each feature independently using univariate statistics such as t-tests, ANOVA, mutual information, or correlation with the target variable. They are computationally efficient but ignore interactions among features. Wrapper methods evaluate feature subsets by training and assessing a model on each candidate subset; recursive feature elimination (RFE) with a support vector machine is a popular example. Wrappers capture interactions but are computationally expensive, because a model must be refit for every subset considered. Embedded methods perform selection during model training: L1 regularization (Lasso) drives the coefficients of irrelevant features to zero, and tree-based models provide feature importance scores as a byproduct. Domain knowledge from pathway databases can further guide selection toward biologically meaningful variables.
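
As an illustration of the filter approach described above, the following is a minimal sketch in Python using only the standard library. It scores each feature with the absolute value of Welch's t-statistic between two classes and keeps the top k. The function names (welch_t, filter_select), the toy expression matrix, and the choice of k = 2 are all assumptions made for this example; a real analysis would also need multiple-testing correction and would run the selection inside cross-validation so that test samples do not influence which features are kept.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent samples (unequal variances)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def filter_select(X, y, k):
    """Score every feature by |t| between class 0 and class 1; keep the top k.

    Returns the sorted indices of the selected features and all scores.
    """
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        group0 = [row[j] for row, label in zip(X, y) if label == 0]
        group1 = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append(abs(welch_t(group0, group1)))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores


# Hypothetical toy data: 6 samples x 4 features (e.g. expression levels).
# Features 0 and 2 differ between the classes; features 1 and 3 do not.
X = [
    [5.1, 2.0, 0.9, 3.3],
    [4.9, 2.4, 1.1, 3.0],
    [5.3, 1.9, 1.0, 3.6],
    [8.0, 2.1, 3.1, 3.2],
    [8.4, 2.3, 2.9, 3.5],
    [7.9, 2.0, 3.0, 3.1],
]
y = [0, 0, 0, 1, 1, 1]

selected, scores = filter_select(X, y, k=2)
print(selected)  # [0, 2]
```

Because each feature is scored on its own, this sketch shares the limitation noted above: it cannot detect features that are informative only in combination. Libraries such as scikit-learn offer ready-made counterparts for all three families (for example SelectKBest for filters, RFE for wrappers, and Lasso for embedded selection).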

Applications

Feature selection is essential in biomarker discovery, where it isolates a handful of diagnostic or prognostic markers from high-dimensional gene expression data such as DNA microarrays. In proteomics and mass spectrometry, it identifies discriminating spectral features for disease classification. The same principles apply to mass spectrometry-based metabolomics, where selection methods pinpoint metabolites that distinguish treatment groups and so support the development of targeted clinical assays.