Support Vector Machines (SVM) in Bioinformatics

Overview

Support vector machines (SVMs) are supervised learning models that find the hyperplane separating the classes with the maximum margin, that is, the largest distance between the decision boundary and the nearest training points. The resulting decision function depends only on the support vectors (the data points on or inside the margin), which helps SVMs generalize well even when the number of features far exceeds the number of samples. Combined with kernel functions, SVMs capture nonlinear relationships without explicitly transforming the data, which makes them well suited to gene expression classification, protein-protein interaction prediction, and diagnosis based on methylation patterns.
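
For reference, the maximum-margin problem and the resulting decision function can be written in the usual textbook notation. The symbols (w, b, alpha_i, K, gamma) are conventional and are introduced here only for illustration; this is the hard-margin form, which assumes the classes are separable.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad y_i\,(w^{\top} x_i + b) \ge 1, \quad i = 1,\dots,n
\qquad \text{(margin width } 2/\lVert w \rVert\text{)}

f(x) = \operatorname{sign}\Big( \sum_{i \in \mathrm{SV}} \alpha_i\, y_i\, K(x_i, x) + b \Big),
\qquad K_{\mathrm{RBF}}(x, x') = \exp\!\big(-\gamma \lVert x - x' \rVert^{2}\big)
```

Only training points with nonzero alpha_i (the support vectors, SV) appear in the sum, and the data enter only through the kernel K, so the high-dimensional mapping is never computed explicitly.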

Methods

Linear SVMs find a separating hyperplane in the original feature space and are suitable when the classes are approximately linearly separable. Kernel SVMs use the kernel trick: a kernel function such as the RBF, polynomial, or sigmoid kernel computes inner products in an implicit higher-dimensional space, so nonlinear decision boundaries can be learned without ever computing the projection itself. Soft-margin SVMs tolerate misclassified training points through a cost parameter C that trades off margin width against training error; a larger C penalizes errors more strongly and yields a narrower margin. Class-weighted SVMs address class imbalance by penalizing misclassification of the minority class more heavily. Hyperparameters (C, and gamma for the RBF kernel) are typically selected by grid search with cross-validation.
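
The difference between a linear and an RBF-kernel SVM, together with class weighting, can be illustrated with a minimal sketch. It assumes scikit-learn is available and uses synthetic two-ring data rather than biological data; the variable names and parameter values are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, nonlinearly separable data: two concentric rings (illustrative only).
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)

# Create a class imbalance by keeping only 50 of the 200 inner-ring samples.
rng = np.random.default_rng(0)
majority = np.flatnonzero(y == 0)
minority = rng.choice(np.flatnonzero(y == 1), size=50, replace=False)
keep = np.concatenate([majority, minority])
X, y = X[keep], y[keep]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" scales C inversely to class frequency,
# so minority-class errors cost more than majority-class errors.
linear_svm = SVC(kernel="linear", C=1.0, class_weight="balanced")
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale", class_weight="balanced")

linear_svm.fit(X_train, y_train)
rbf_svm.fit(X_train, y_train)

linear_acc = linear_svm.score(X_test, y_test)
rbf_acc = rbf_svm.score(X_test, y_test)

print(f"linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
print(f"support vectors per class (RBF): {rbf_svm.n_support_}")
```

On ring-shaped data no straight line separates the classes, so the linear model is expected to do poorly while the RBF model can follow the circular boundary.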

Practical protocol

An SVM workflow for methylation-based cancer classification begins with DNA methylation data covering 450,000 CpG sites from 200 tumor and 100 normal samples. The data are quantile-normalized and divided into 70% training and 30% test sets, stratified by disease status. To avoid leaking test-set information into the model, low-variance features are filtered on the training set, leaving 50,000 informative probes, and PCA is fit on the training set to reduce the dimensionality to 100 components that explain 70% of the variance; the same transformations are then applied to the test set. An SVM with an RBF kernel is trained; the hyperparameters C (range 0.1 to 1000) and gamma (range 0.0001 to 0.1) are optimized by grid search with 5-fold cross-validation, selecting the pair that maximizes the AUC. The optimized model achieves an AUC of 0.94 on the test set. Because an RBF-kernel SVM has no per-feature coefficients, the most influential CpG regions are identified indirectly, for example by permutation importance, and are then mapped to gene promoters and enhancers. A permutation test, in which the model is refit on data with shuffled labels, is used to evaluate the statistical significance of the classification performance.

In a blood biomarker discovery application, SVMs classified samples from Alzheimer’s patients versus healthy controls using peripheral blood methylation data and achieved 88% accuracy with 25 biomarker CpG regions. These regions were located in genes involved in synaptic plasticity and neuroinflammation, providing both a diagnostic tool and mechanistic insights. The model was validated in an independent cohort of 400 patients, where it maintained 85% accuracy and remained robust across all disease stages.
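
The tumor-versus-normal workflow described in this section can be sketched with scikit-learn as follows. This is a scaled-down illustration under several assumptions, not the protocol's actual implementation: the data are synthetic (2,000 simulated probes instead of 450,000), PCA keeps 20 components instead of 100, StandardScaler stands in for quantile normalization, and names such as `pipe` and `param_grid` are hypothetical. Putting the filter and PCA inside a Pipeline means they are refit on the training portion of every cross-validation fold.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (
    GridSearchCV, StratifiedKFold, permutation_test_score, train_test_split
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for a beta-value matrix: 200 tumor + 100 normal samples.
n_tumor, n_normal, n_probes = 200, 100, 2000
y = np.array([1] * n_tumor + [0] * n_normal)
X = rng.beta(2.0, 2.0, size=(n_tumor + n_normal, n_probes))
# Half of the probes are nearly constant and should be removed by the filter.
X[:, 1000:] = rng.normal(0.5, 0.01, size=(n_tumor + n_normal, 1000))
# Simulate hypermethylation of the first 100 probes in tumors.
X[y == 1, :100] = np.clip(X[y == 1, :100] + 0.2, 0.0, 1.0)

# 70/30 split, stratified by class label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

pipe = Pipeline([
    ("filter", VarianceThreshold(threshold=0.01)),  # drop low-variance probes
    ("scale", StandardScaler()),                    # stand-in for normalization
    ("pca", PCA(n_components=20, random_state=0)),  # dimensionality reduction
    ("svm", SVC(kernel="rbf", class_weight="balanced")),
])

# Grid over C and gamma, spanning the ranges given in the text.
param_grid = {
    "svm__C": [0.1, 1, 10, 100, 1000],
    "svm__gamma": [1e-4, 1e-3, 1e-2, 1e-1],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=cv)
search.fit(X_train, y_train)

# Held-out AUC from the signed distance to the decision boundary.
test_auc = roc_auc_score(y_test, search.decision_function(X_test))

# Permutation test: refit on label-shuffled data to build a null distribution.
cv_auc, perm_scores, p_value = permutation_test_score(
    search.best_estimator_, X_train, y_train,
    scoring="roc_auc", cv=cv, n_permutations=30, random_state=0,
)

print("best parameters:", search.best_params_)
print(f"test AUC: {test_auc:.3f}")
print(f"cross-validated AUC: {cv_auc:.3f}, permutation p-value: {p_value:.3f}")
```

With only 30 permutations the smallest attainable p-value is 1/31; a real analysis would normally use many more permutations, at a proportionally higher computational cost.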

Applications

SVMs are used to classify tumors by type from gene expression data, predict protein-protein interactions from sequence features, distinguish disease stages in microscopy data, and classify the pharmacological activity of compounds in drug discovery. Their ability to handle data with many features and few samples makes them a common choice for high-dimensional problems in bioinformatics.