InteractionSites2

From Rost Lab Open

Latest revision as of 15:32, 3 February 2016

Abstract

Proteins are the machinery of life: they perform the different functions that a life form needs to live. The intricate details of a protein's structure are important for its function. Determining protein structure with wet-lab experiments such as X-ray crystallography or NMR spectroscopy takes considerable effort. Because of this, the structure of only a very small fraction of proteins is currently known. It is therefore highly desirable to develop methods that can predict relevant structural and functional information from protein sequence alone. InteractionSites uses machine learning to predict which residues of a protein are likely to be part of its binding sites with other proteins. InteractionSites2 (unpublished) aims to improve the predictions for unknown proteins by using new sequence-derived features such as predicted disorder in addition to the features already used in InteractionSites, more proteins with known three-dimensional (3D) structure for training, better infrastructure, and a faster neural network implementation (FANN).

How do we predict binding sites?

InteractionSites2 predicts residues likely to be part of binding sites with the aid of predicted structural features and evolutionary information. Since no three-dimensional or high-resolution structure is available for 97% of proteins, InteractionSites2 can be used to predict protein binding sites for them. Despite the diversity of protein-protein interactions, the underlying concepts defining these interactions can be learned from protein sequence. Binding site residues are important for understanding the mechanism of protein-protein interaction, and deriving this knowledge from the information present in the protein sequence can be very valuable because binding site residues are often drug targets.

Overview

In our approach, we use neural networks to predict binding sites. The networks are trained with many combinations of parameters and features. We then use cross-validation results to gauge the performance of each network. Performance is measured as the average precision up to 50% recall. Based on this performance, three neural networks are selected for prediction.
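The performance measure can be illustrated with a short Python sketch. The page does not state exactly how the average is taken, so the following is only one plausible reading: residues are ranked by network score, precision is recorded at each true binding residue, and the values are averaged until 50% of the binding residues have been recovered. The function name and the example data are hypothetical.

```python
def average_precision_up_to_recall(scores, labels, max_recall=0.5):
    """Average the precision observed at each true positive, walking down
    the score ranking until recall reaches max_recall.

    Assumption: this is one common way to summarise the low-recall part of
    a precision-recall curve, not necessarily the exact formula used by
    InteractionSites2.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")

    # Rank residues from highest to lowest network score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    true_positives = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            precisions.append(true_positives / rank)
            if true_positives / total_positives >= max_recall:
                break
    return sum(precisions) / len(precisions)


# Hypothetical scores for six residues; 1 marks a known binding residue.
example_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
example_labels = [1, 0, 1, 0, 1, 1]
example_ap = average_precision_up_to_recall(example_scores, example_labels)
```

In this example two of the four binding residues are found within the top three ranks, so the value is the mean of the precisions 1/1 and 2/3.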

When the prediction algorithm is given a new sequence, it feeds the input to the three neural networks. The result of the predictor is the mode of the three networks' decisions, i.e. the majority vote.

What is predicted?

InteractionSites2 identifies the residues that contribute disproportionately to the overall binding energy of the interface. It does not use the three-dimensional structure of a protein to predict interaction site residues. Instead, it relies on structural and functional sequence-derived features from PredictProtein methods (Meta Disorder, PROF, etc.).

What can you expect from binding sites prediction?

InteractionSites2 is used to predict which residues are likely to bind to other proteins. For every amino acid in a test protein, the predictor gives a binary output (1 – binding, 0 – non-binding) and a reliability score (from -100 to 100) for the prediction. The neural networks capture underlying patterns in the data to make the predictions. Three neural network models are used, and the prediction is made by majority voting.

How do we predict?

The InteractionSites2 neural network is trained on proteins for which the binding site residues are known. Different sequence-based structural and functional prediction methods of PredictProtein are run on these protein sequences, and their outputs are used as input features to the neural network. The data is partitioned into training, cross-training, test and holdout sets. The neural network is trained with different hyperparameters for each of three permutations. Precision and recall are then calculated on the cross-training set, and the average precision up to 50% recall is used to select the three best models, one for each permutation.
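The selection step can be sketched in Python as follows. This is a minimal illustration, assuming that each trained network has already been scored on the cross-training set; the tuple layout, the model labels and the numbers are hypothetical and are not taken from the InteractionSites2 repository.

```python
def select_best_per_permutation(results):
    """Pick the model with the highest average precision in each permutation.

    results: iterable of (permutation, model_id, avg_precision) tuples,
    where avg_precision is the value measured on the cross-training set.
    Returns a dict mapping permutation -> (model_id, avg_precision).
    """
    best = {}
    for permutation, model_id, avg_precision in results:
        current = best.get(permutation)
        if current is None or avg_precision > current[1]:
            best[permutation] = (model_id, avg_precision)
    return best


# Hypothetical cross-training results for three permutations.
results = [
    (1, "hidden=20", 0.41),
    (1, "hidden=50", 0.44),
    (2, "hidden=20", 0.39),
    (2, "hidden=50", 0.37),
    (3, "hidden=20", 0.42),
    (3, "hidden=50", 0.45),
]
best_models = select_best_per_permutation(results)
```

The result holds exactly one model per permutation, which matches the three networks used by the final classifier.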

The test set is given as input to the three best models and evaluated. Based on this evaluation, thresholds are chosen for the binary interface prediction: all residues with a score above the threshold are predicted as binding site residues, and all residues below it as non-binding residues. The larger the difference between the score and the threshold, the higher the reliability of the prediction, and vice versa.
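The thresholding described above can be sketched in Python. The page does not say how the distance from the threshold is converted into the reliability score, so the linear scaling and the max_margin parameter below are assumptions made for illustration only, as is the choice to treat a score exactly at the threshold as non-binding.

```python
def classify_residue(score, threshold):
    """Return (label, margin) for one residue.

    label is 1 (binding) if the score lies above the threshold, else 0.
    margin is the signed distance of the score from the threshold.
    Assumption: a score exactly equal to the threshold counts as non-binding.
    """
    margin = score - threshold
    label = 1 if margin > 0 else 0
    return label, margin


def reliability(margin, max_margin):
    """Map a signed margin to the range [-100, 100].

    Hypothetical scaling: linear in the margin and clipped at +/-100.
    The actual InteractionSites2 formula is not given on this page.
    """
    scaled = 100.0 * margin / max_margin
    return max(-100.0, min(100.0, scaled))


# Hypothetical raw network output and threshold.
label, margin = classify_residue(score=0.8, threshold=0.5)
confidence = reliability(margin, max_margin=0.5)
```

A residue far above the threshold gets a score close to 100, and one far below it gets a score close to -100.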

The final classifier consists of the three best models and gives a single binary prediction and a reliability score (from -100 to 100) for each input residue. The binary prediction is calculated as the mode of the three models' predictions. The score is the average of the three models' scores.
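Combining the three models for one residue can be sketched in Python as follows. The function name and the example values are hypothetical; only the rule itself (mode of the binary predictions, mean of the scores) comes from the description above.

```python
from statistics import mean, mode


def combine_models(labels, scores):
    """Combine the per-residue outputs of the three selected models.

    labels: three binary predictions (1 = binding, 0 = non-binding).
    scores: the three corresponding reliability scores.
    Returns (majority label, average score).
    """
    if len(labels) != 3 or len(scores) != 3:
        raise ValueError("exactly three models are expected")
    # With three binary votes the mode is always unique.
    return mode(labels), mean(scores)


# Hypothetical outputs of the three models for a single residue.
final_label, final_score = combine_models([1, 0, 1], [40.0, -10.0, 60.0])
```

Because there are three voters and two classes, a tie cannot occur, which is one practical reason to use an odd number of models.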

Availability

The project is available at /mnt/project/isis in the Rostlab environment.

The repository is located at /mnt/project/isis/isis2repo

Detailed README files are provided in the repository.

Reference

InteractionSites1: Y. Ofran and B. Rost (2006): InteractionSites: interaction sites identified from sequence. Bioinformatics, 23(2), e13-e16. http://bioinformatics.oxfordjournals.org/content/23/2/e13.abstract

InteractionSites2: unpublished