Natural Language Processing in Bioinformatics

Overview

Natural language processing (NLP) develops computational methods for understanding, interpreting and generating human language. In bioinformatics, NLP helps manage the volume of biomedical literature, which grows by more than a million articles each year. Beyond text mining, NLP techniques are now applied directly to biological sequences: DNA, RNA and proteins are treated as languages with their own syntax and semantics, which lets biological language models learn functional properties from unlabeled sequence data.

Methods

Literature mining uses named entity recognition (NER) to extract genes, diseases and drugs, together with the relationships between them, from abstracts and full-text articles. Embedding models such as word2vec, GloVe and BioBERT convert words and phrases into dense vectors that capture semantic similarity. Large language models (LLMs) such as GPT and BioMedLM generate human-like text and answer questions about biological knowledge. Biological language models such as DNABERT and SpliceBERT treat DNA sequences as a language: they learn contextual representations of DNA fragments by masking tokens and predicting them from the surrounding context, analogous to BERT for natural language. Applications include document classification for systematic literature reviews, extraction of protein-protein interactions from text, and question answering in support of clinical diagnosis.
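The masking idea behind biological language models can be illustrated with a minimal sketch. It assumes overlapping k-mer tokenization (the scheme used by the original DNABERT; other models may tokenize differently) and a 15% masking rate borrowed from BERT. The function names `kmer_tokenize` and `mask_tokens` are hypothetical and not taken from any library. Note that with overlapping k-mers a single masked token can be inferred from its neighbours, so practical implementations generally mask contiguous spans; the sketch ignores this for brevity.

```python
import random

MASK = "[MASK]"  # placeholder token, as in BERT-style vocabularies

def kmer_tokenize(seq, k=6):
    """Split a DNA sequence into overlapping k-mers with stride 1."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random subset of tokens with MASK.

    Returns the masked token list and a dict mapping each masked
    position to the original token the model would have to predict.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    labels = {i: tokens[i] for i in positions}
    return masked, labels

# Toy sequence, invented for illustration only.
tokens = kmer_tokenize("ATGGCGTACGTTAGC", k=3)
masked, labels = mask_tokens(tokens)
print(tokens)   # 13 overlapping 3-mers
print(masked)   # same list with two positions replaced by [MASK]
print(labels)   # the hidden tokens, keyed by position
```

During pre-training, a model receives `masked` as input and is scored on how well it recovers the entries in `labels`; no functional annotation is needed, which is why such models can be trained on unlabeled genomes.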

Practical protocol

A literature mining workflow might begin with a collection of 10,000 PubMed abstracts retrieved by querying the Entrez API with specific terms such as “breast cancer gene expression.” The corpus is preprocessed through tokenization, stopword removal and stemming. A pre-trained NER model, typically BioBERT fine-tuned on the BioCreative dataset, is then applied to identify entities: genes, diseases and drugs. Relation extraction uses a dependency-based approach: sentences containing two entities of interest are parsed into dependency trees, and paths between the entities that contain verbs such as “activated”, “inhibited” or “regulated” are classified as functional relationships by an SVM classifier trained on the BioCreative corpus.

The pipeline reaches 85% precision on known interactions. The extracted relationships flow into a knowledge database that is updated quarterly. In one example, this method identified 2,000 gene-disease associations that were not captured in curated databases such as DisGeNET, 150 of which were later validated by independent studies. Adding contextual word embeddings improved performance by 12% compared to standard NER.

In a genome annotation application, DNABERT pre-trained on the human genome was fine-tuned for splice site prediction and achieved 96% accuracy in identifying exon-intron boundaries across the genome.
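The relation extraction step and its precision measurement can be sketched in a few lines. This is a deliberately simplified stand-in: instead of dependency paths and an SVM classifier, it uses surface word order and flags any trigger verb that appears between two already-recognized entities. The entity set, sentences, gold labels and the function names `extract_relations` and `precision` are all invented for illustration and do not come from the workflow described above.

```python
import re

# Trigger verbs from the protocol, mapped to a relation label.
TRIGGERS = {
    "activated": "activation",
    "inhibited": "inhibition",
    "regulated": "regulation",
}

def extract_relations(sentence, entities):
    """Return (entity_a, relation, entity_b) triples for every trigger
    verb found between two adjacent known entities in the sentence."""
    tokens = re.findall(r"[A-Za-z0-9\-]+", sentence)
    hits = [(i, t) for i, t in enumerate(tokens) if t in entities]
    triples = []
    for (i, a), (j, b) in zip(hits, hits[1:]):
        for tok in tokens[i + 1:j]:
            relation = TRIGGERS.get(tok.lower())
            if relation:
                triples.append((a, relation, b))
    return triples

def precision(predicted, gold):
    """Fraction of predicted triples that appear in the gold set."""
    predicted = set(predicted)
    if not predicted:
        return 0.0
    return len(predicted & set(gold)) / len(predicted)

# Toy inputs: entities are assumed to come from an upstream NER step.
entities = {"TP53", "CDKN1A", "MDM2", "BRCA1"}
sentences = [
    "TP53 activated CDKN1A in tumor cells.",
    "MDM2 inhibited TP53 after DNA damage.",
    "BRCA1 was not regulated by TP53.",  # negated and passive
]
gold = {
    ("TP53", "activation", "CDKN1A"),
    ("MDM2", "inhibition", "TP53"),
}

predicted = [t for s in sentences for t in extract_relations(s, entities)]
print(predicted)
print(precision(predicted, gold))
```

The third sentence yields a false positive because the heuristic cannot see negation or the passive voice, which lowers precision on this toy set to two out of three. Errors of this kind are the motivation for analyzing dependency paths rather than word order alone.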

Applications

NLP accelerates systematic literature reviews by classifying molecular biology articles, extracts protein-drug relationships from the biomedical literature, and powers chatbots that help interpret biological data. Biological language models predict functional elements in genomes and annotate protein sequences without alignment.