Snap2

From Rost Lab Open


See the tool's GitHub repository for the most up-to-date information.

How do we predict functional effects?

Functional effects of mutations are predicted with SNAP2. SNAP2 is a trained classifier based on a machine learning method called a neural network. It distinguishes between effect and neutral variants/non-synonymous SNPs by taking a variety of sequence and variant features into account. The most important input signal for the prediction is the evolutionary information taken from an automatically generated multiple sequence alignment. Structural features such as predicted secondary structure and solvent accessibility are also considered. If available, annotations (i.e. known functional residues, patterns, regions) of the sequence or of close homologs are pulled in as well. In a cross-validation over 100,000 experimentally annotated variants, SNAP2 reached a sustained two-state accuracy (effect/neutral) of 82% (at an AUC of 0.9). In our hands, this constitutes an important and significant improvement over other methods.
Fig 1: SNAP prediction score correlates with severity of effect. Severe effects are shown in red and are most abundant on the right side (high scores, score>50). Intermediate effects are shown in blue and are most frequent in the middle region (intermediate scores, -40<score<50). Neutral substitutions are shown in green and are most frequent on the left (low scores, score<-40).


References:

Hecht M, Bromberg Y & Rost B. Better prediction of functional effects for sequence variants. BMC Genomics. 2015; 16(Suppl 8):S1.

Bromberg Y & Rost B. SNAP: predict effect of non-synonymous polymorphisms on function. Nucleic Acids Research. 2007; 35(11):3823-3835.

Hecht M, Bromberg Y & Rost B. News from the protein mutability landscape. Journal of Molecular Biology. 2013; 425(21):3937-3948.


What is predicted?

SNAP2 predicts the impact (effect) of single amino acid substitutions on protein function. For a given substitution, e.g. Arginine (R) at position 152 substituted by Asparagine (N), typically abbreviated to R152N, we predict a score that reflects the likelihood of this specific mutation altering the native protein function. The score ranges from -100 (strong neutral prediction) to +100 (strong effect prediction). Moreover, our analysis suggests that the prediction score is to some extent correlated with the severity of the effect, as shown in Fig. 1.
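To make the R152N notation concrete, here is a minimal Python sketch that splits such a label into wildtype residue, position, and mutant residue. The names `Substitution` and `parse_substitution` are hypothetical helpers introduced for illustration only; they are not part of SNAP2 or its input format.

```python
import re
from typing import NamedTuple

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


class Substitution(NamedTuple):
    wildtype: str  # residue in the native sequence
    position: int  # 1-based position in the sequence
    mutant: str    # residue introduced by the variant


# <wildtype letter><position digits><mutant letter>, e.g. R152N
_PATTERN = re.compile(rf"^([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}])$")


def parse_substitution(label: str) -> Substitution:
    """Parse a label such as 'R152N' into its three components."""
    match = _PATTERN.match(label.strip().upper())
    if match is None:
        raise ValueError(f"not a valid substitution label: {label!r}")
    wildtype, position, mutant = match.group(1), int(match.group(2)), match.group(3)
    if position < 1:
        raise ValueError("positions are 1-based")
    if wildtype == mutant:
        raise ValueError("wildtype and mutant residue are identical")
    return Substitution(wildtype, position, mutant)


if __name__ == "__main__":
    print(parse_substitution("R152N"))
```

Rejecting labels whose wildtype and mutant residue are identical is a design choice of this sketch, since such a label does not describe a substitution.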

Prediction visualization

We predict the effect of every possible substitution at each position of a protein (each substitution independently) and show the results as a heatmap. Dark red indicates a high score (score>50, strong signal for effect), white indicates a weak signal (-50<score<50), and green indicates a low score (score<-50, strong signal for neutral/no effect). Please note that the new webserver uses blue instead of green. Black marks the corresponding wildtype residues. Fig. 2 shows an example of such a heatmap.
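The colour thresholds above can be sketched as a small Python function. This is only an illustration of the binning described in the text. The function name `heatmap_category` is hypothetical, and assigning scores of exactly +50 or -50 to the weak category is an assumption, since the text does not say which bin the boundaries belong to.

```python
def heatmap_category(score: float) -> str:
    """Map a prediction score in [-100, 100] to a heatmap category.

    Thresholds follow the text: >50 effect, <-50 neutral, otherwise weak.
    Assumption: scores of exactly +50 or -50 are treated as weak.
    """
    if not -100 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score > 50:
        return "effect"   # dark red
    if score < -50:
        return "neutral"  # green (blue on the new webserver)
    return "weak"         # white


if __name__ == "__main__":
    for s in (87, 12, -73):
        print(s, heatmap_category(s))
```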

Fig 2: SNAP prediction scores for the human beta-2 adrenergic receptor (ADRB2_HUMAN) shown as a heatmap. Shown is the predicted effect of each mutation from the wildtype amino acid (x-axis) to any other amino acid (y-axis). Red indicates a strong signal for effect, white a weak signal (inconclusive prediction), and green a strong signal for neutral (the new version of the webserver uses blue instead of green to mark neutral variants).

Prediction algorithm

SNAP2 is a neural network-based classifier. The feed-forward multilayer perceptron consists of 848 nodes in the input layer, 25 nodes in the hidden layer, and two nodes in the output layer (one each for neutral and effect). In each training step, all samples are presented to the network and the connection weights are adjusted through a backpropagation algorithm. The final method consists of ten different models (created during 10-fold cross-validation using different subsets for training and optimization). Each model outputs one score for each output class (neutral/effect). The scores of the ten models are averaged in a jury decision. The final score is calculated as the difference between the average score for effect and the average score for neutral.
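The jury decision can be sketched in a few lines of Python. This is a minimal illustration, not the SNAP2 implementation: the function name `jury_score` is hypothetical, and the sketch assumes that each output node yields a value in [0, 1] and that the difference is multiplied by 100 to reach the -100 to +100 range. The text does not state how the scaling is done.

```python
from statistics import mean
from typing import Sequence, Tuple


def jury_score(model_outputs: Sequence[Tuple[float, float]]) -> float:
    """Combine per-model (neutral, effect) outputs into one final score.

    Assumption: each output lies in [0, 1]; the factor 100 that maps the
    difference onto [-100, 100] is a guess and is not given in the text.
    """
    if not model_outputs:
        raise ValueError("at least one model output is required")
    avg_neutral = mean(neutral for neutral, _ in model_outputs)
    avg_effect = mean(effect for _, effect in model_outputs)
    # Final score: average effect score minus average neutral score.
    return 100 * (avg_effect - avg_neutral)


if __name__ == "__main__":
    # Ten hypothetical models that all lean towards "effect".
    outputs = [(0.2, 0.8)] * 10
    print(round(jury_score(outputs), 1))
```

A positive result means the ten models on average favour effect, a negative result means they favour neutral.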


Implementation

All necessary information (e.g. secondary structure, solvent accessibility, disorder, alignments of related sequences, etc.) is produced by the PredictProtein pipeline. Feature calculation algorithms are written in Perl and transform the predictions into normalized numeric input values. The neural network implementation is provided by the Fast Artificial Neural Network Library.

Data Sets

SNAP2 was trained on ~100,000 variants from OMIM, PMD, and a set of pseudo-neutral variants based on Enzyme Commission numbers. Details can be found in the SNAP publications.

Prediction reliability score

Fig 3: Stronger predictions are more accurate. For each level of reliability (markers in the plot), cumulative accuracy (solid lines) and coverage (dashed lines) are shown separately for effect (red) and neutral (green) variant predictions above the corresponding RI.

The reliability score (Reliability Index: RI) is calculated from the final prediction score. Fig. 3 shows that stronger predictions are more reliable. This measure is meant to simplify assessing prediction strength and to immediately convey the reliability of a prediction. The score ranges from 0 (very low reliability) to 9 (very high reliability).

Accuracy of functional effect prediction

We evaluated the performance of SNAP2 in a 10-fold cross-validation and on two independent datasets. The overall two-state accuracy was above 82% in the cross-validation, with an area under the ROC curve of 0.9. We estimated a standard error of 0.013 in a bootstrapping scenario of 1,000 sets, each consisting of 50,000 samples drawn randomly without replacement.
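The resampling procedure can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the original evaluation code: the function names are hypothetical, each prediction is represented only by whether it was correct, and the standard error is assumed to be the standard deviation of the accuracy across the drawn sets.

```python
import random
from statistics import stdev
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Two-state accuracy: fraction of correctly predicted variants."""
    return sum(correct) / len(correct)


def resampled_standard_error(
    correct: Sequence[bool],
    n_sets: int = 1000,
    set_size: int = 50_000,
    seed: int = 0,
) -> float:
    """Estimate the standard error of the accuracy by repeated resampling.

    Each set is drawn without replacement from `correct`, as described in
    the text. Assumption: the standard error is the standard deviation of
    the per-set accuracies.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    pool = list(correct)
    accuracies = [accuracy(rng.sample(pool, set_size)) for _ in range(n_sets)]
    return stdev(accuracies)


if __name__ == "__main__":
    # Toy data: 820 correct and 180 wrong predictions (82% accuracy).
    toy = [True] * 820 + [False] * 180
    print(resampled_standard_error(toy, n_sets=200, set_size=500))
```

The defaults mirror the numbers in the text (1,000 sets of 50,000 samples). The toy call uses smaller values so that it runs on a small list.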


Additional sources

A web server is available here.

Contact

For questions or comments, please contact hecht@rostlab.org.
