Using genetic algorithms to select most predictive protein features.

Title: Using genetic algorithms to select most predictive protein features.
Publication Type: Journal Article
Year of Publication: 2009
Authors: Kernytsky, A; Rost, B
Journal: Proteins
Volume: 75
Issue: 1
Pagination: 75-88
Date Published: 2009 Apr
ISSN: 1097-0134
Keywords: Algorithms; Computational Biology; Computer Simulation; Databases, Protein; Models, Molecular; Neural Networks (Computer); Protein Conformation; Proteins; Serine Endopeptidases; Structure-Activity Relationship
Abstract

Many important characteristics of proteins, such as biochemical activity and subcellular localization, present a challenge to machine-learning methods: it is often difficult to encode the appropriate input features at the residue level for the purpose of making a prediction for the entire protein. The problem is usually that the biophysics of the connection between a machine-learning method's input (sequence feature) and its output (observed phenomenon to be predicted) remains unknown; in other words, we may only know that a certain protein is an enzyme (output) without knowing which region may contain the active site residues (input). The goal then becomes to dissect a protein into a vast set of sequence-derived features and to correlate those features with the desired output. We introduce a framework that begins with a set of global sequence features and then vastly expands the feature space by generically encoding the coexistence of residue-based features. It is this combination of individual features, that is, the step from the fractions of serine and of buried residues (input space 20 + 2) to the fraction of buried serine (input space 20 * 2), that implicitly shifts the search space from global feature inputs to features that can capture very local evidence such as the individual residues of a catalytic triad. The vast feature space created is explored by a genetic algorithm (GA) paired with neural networks and support vector machines. We find that the GA is critical for selecting combinations of features that are neither too general, resulting in poor performance, nor too specific, leading to overtraining. The final framework manages to effectively sample a feature space that is far too large for exhaustive enumeration. We demonstrate the power of the concept by applying it to prediction of protein enzymatic activity.
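
The two ideas in the abstract, expanding global features into combined features and searching feature subsets with a GA, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The burial labels `b`/`e`, the function names, the GA settings (truncation selection, one-point crossover, bit-flip mutation) and the toy fitness function are all assumptions made for illustration; in the paper the fitness would come from the performance of the neural networks or support vector machines trained on the selected features.

```python
import random
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
STATES = "be"  # assumed labels: b = buried, e = exposed


def global_features(seq, acc):
    """20 + 2 features: fraction of each residue type and of each burial state."""
    n = len(seq)
    feats = {f"frac_{a}": seq.count(a) / n for a in AMINO_ACIDS}
    feats.update({f"frac_{s}": acc.count(s) / n for s in STATES})
    return feats


def combined_features(seq, acc):
    """20 * 2 features: fraction of positions holding residue a in state s."""
    n = len(seq)
    pairs = list(zip(seq, acc))
    return {f"frac_{s}_{a}": pairs.count((a, s)) / n
            for a, s in product(AMINO_ACIDS, STATES)}


def evolve(n_features, fitness, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Toy GA over boolean feature masks (illustrative settings only)."""
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection: the better half survives unchanged.
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)      # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation with probability p_mut per feature.
            child = [(not g) if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)


# Hypothetical stand-in for classifier performance: reward three
# "informative" features and penalize large feature sets.
INFORMATIVE = {3, 17, 25}


def toy_fitness(mask):
    hits = sum(mask[i] for i in INFORMATIVE)
    return hits - 0.05 * sum(mask)


if __name__ == "__main__":
    feats = combined_features("SSAS", "bebe")
    print(feats["frac_b_S"], feats["frac_e_S"])
    best = evolve(len(feats), toy_fitness)
    print(sum(best), "features selected")
```

The size penalty in `toy_fitness` mirrors, in a very crude way, the trade-off the abstract describes between feature sets that are too general and those that are too specific.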

DOI: 10.1002/prot.22211
Alternate Journal: Proteins
PubMed ID: 18798568
Grant List: R01 LM007329-06 / LM / NLM NIH HHS / United States
R01-GM079767 / GM / NIGMS NIH HHS / United States
R01-LM07329-01 / LM / NLM NIH HHS / United States
U54-GM074958-01 / GM / NIGMS NIH HHS / United States