InteractionSites2
Latest revision as of 15:32, 3 February 2016
Abstract
Proteins are the machinery of life: they perform the functions a life form needs to live. The intricate details of a protein's structure are important for its function. Determining protein structure with wet-lab experiments such as X-ray crystallography or NMR spectroscopy takes considerable effort. Because of this limitation, the structure of only a very small fraction of proteins is currently known. It is therefore highly desirable to develop methods that can predict relevant structural and functional information from protein sequence alone. InteractionSites uses machine learning to predict which residues of a protein are likely to be part of its binding sites with other proteins. InteractionSites2 (unpublished) aims to improve the predictions for unknown proteins by using new sequence-derived features such as predicted disorder in addition to the features already used in InteractionSites, more proteins of known three-dimensional (3D) structure for training, better infrastructure, and a faster neural network implementation (FANN).
How do we predict binding sites?
InteractionSites2 predicts the residues likely to be part of binding sites with the aid of predicted structural features and evolutionary information. Since no three-dimensional or high-resolution structure is available for 97% of proteins, InteractionSites2 can be used to predict protein binding sites for them. Despite the diversity of protein-protein interactions, the underlying concepts that define these interactions can be learned from protein sequence. Binding site residues are important for understanding the mechanism of protein-protein interaction, and because they are often drug targets, deriving this knowledge from the information present in the protein sequence can be very valuable.
Overview
In our approach, neural networks are used to predict binding sites. The networks are trained with many combinations of parameters and features. We then use cross-validation results to gauge the performance of each network, measured as the average precision up to 50% recall. Based on this performance, three neural networks are selected for prediction.
When the predictor is given a new sequence, it feeds the input to the three neural networks. The predictor's result is the mode of the three networks' decisions.
What is predicted?
InteractionSites2 identifies the residues that contribute disproportionately to the overall binding energy of the interface. It does not use the three-dimensional structure of a protein to predict interaction site residues. Instead, it relies on structural and functional sequence-derived features from PredictProtein methods (Meta Disorder, PROF, etc.).
What can you expect from binding sites prediction?
InteractionSites2 predicts which residues are likely to bind to other proteins. For every amino acid in a test protein, the predictor gives a binary output (1 = binding, 0 = non-binding) and a reliability score (from -100 to 100) for the prediction. The neural networks capture underlying patterns in the data to make the predictions. Three neural network models are employed, and the prediction is made by majority voting.
How do we predict?
The InteractionSites2 neural network is trained on proteins for which the binding site residues are known. Different sequence-based structural and functional prediction methods of PredictProtein are run on these protein sequences, and their outputs are used as input features to the neural network. The data is partitioned into training, cross-training, test and holdout sets. The neural network is trained with different hyperparameters for three permutations. Precision and recall are then calculated on the cross-training set, and the average precision up to 50% recall is used to select the three best models, one for each permutation.
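The selection criterion can be illustrated with a minimal Python sketch. It rests on an assumption: "average precision up to 50% recall" is read here as the mean of the precision values measured at each true positive in the score-ranked residue list, stopping once recall exceeds 0.5. The function names, the candidate dictionary and the toy data are hypothetical and are not taken from the InteractionSites2 repository.

```python
def average_precision_up_to_recall(scores, labels, max_recall=0.5):
    """Mean precision at each true positive in the ranking, up to max_recall.

    Assumed definition for illustration; labels are 1 (binding) or 0.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")
    # Rank residues from the highest to the lowest network score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label != 1:
            continue
        true_positives += 1
        recall = true_positives / total_positives
        if recall > max_recall:
            break
        precisions.append(true_positives / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def select_best_model(candidates, labels):
    """Return the name of the candidate with the highest metric.

    `candidates` maps a (hypothetical) model name to its per-residue scores
    on the cross-training set.
    """
    return max(
        candidates,
        key=lambda name: average_precision_up_to_recall(candidates[name], labels),
    )


if __name__ == "__main__":
    labels = [1, 0, 1, 0, 1, 1]
    candidates = {
        "model_a": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
        "model_b": [0.1, 0.9, 0.2, 0.8, 0.3, 0.4],
    }
    print(select_best_model(candidates, labels))
```

Under this reading, one such selection would be run per permutation, giving three selected models in total.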
The test set is input to the three best models and evaluated. Based on this evaluation, thresholds are chosen for the binary interface prediction: every residue with a score above the threshold is predicted as a binding site residue, and every residue below it as non-binding. The larger the difference between the score and the threshold, the higher the reliability of the prediction, and vice versa.
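The relation between score, threshold and reliability can be sketched as follows. The exact mapping used by InteractionSites2 is not stated here, so this sketch assumes a raw network output in [0, 1] and a linear rescaling of the distance from the threshold to the range -100 to 100, with the sign indicating the predicted class. A score exactly at the threshold is treated as non-binding, which is also an assumption. The function name is hypothetical.

```python
def classify_residue(score, threshold):
    """Return (label, reliability) for one raw score in [0, 1].

    Assumed mapping for illustration: the distance from the threshold is
    rescaled linearly so that score 1.0 gives +100 and score 0.0 gives -100.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    if score > threshold:
        label = 1  # predicted binding
        reliability = 100.0 * (score - threshold) / (1.0 - threshold)
    else:
        label = 0  # predicted non-binding
        reliability = -100.0 * (threshold - score) / threshold
    return label, reliability


if __name__ == "__main__":
    for raw_score in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(raw_score, classify_residue(raw_score, threshold=0.5))
```

With this scaling, residues far from the threshold on either side receive reliabilities of large magnitude, and residues close to the threshold receive values near zero.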
The final classifier consists of the three best models and gives a single binary prediction and a reliability score (from -100 to 100) for each input residue. The binary prediction is the mode of the three models' predictions. The score is the average of the three models' scores.
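The combination step described above can be sketched in a few lines of Python. The function names and the input layout (one list of labels and one list of scores per model) are assumptions made for illustration; only the rule itself, mode of the labels and mean of the scores, comes from the description.

```python
from statistics import mean, mode


def combine_models(labels, scores):
    """Combine the outputs of three models for a single residue.

    Returns (majority label, mean score). With three binary labels the
    mode is always unique, so no tie-breaking is needed.
    """
    if len(labels) != 3 or len(scores) != 3:
        raise ValueError("exactly three labels and three scores are expected")
    return mode(labels), mean(scores)


def combine_protein(per_model_labels, per_model_scores):
    """Combine per-residue outputs; each argument holds one list per model."""
    residue_labels = zip(*per_model_labels)  # regroup by residue
    residue_scores = zip(*per_model_scores)
    return [combine_models(l, s) for l, s in zip(residue_labels, residue_scores)]


if __name__ == "__main__":
    labels = [[1, 0], [1, 0], [0, 1]]  # three models, two residues
    scores = [[30.0, -60.0], [60.0, -30.0], [-30.0, 30.0]]
    print(combine_protein(labels, scores))
```

Note that the averaged score can have a sign that differs from a single dissenting model, as in the toy data above; the binary label always follows the majority.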
Availability
The project is available at /mnt/project/isis in the Rostlab environment.
The repository is located at /mnt/project/isis/isis2repo
Detailed README files are provided in the repository.
Reference
InteractionSites1: Y. Ofran and B. Rost (2006): InteractionSites: interaction sites identified from sequence. Bioinformatics, 23(2), e13-e16
InteractionSites2: unpublished