Proteins are the machinery of life: they perform the diverse functions an organism needs to live. The fine details of a protein's structure are important for its function. Determining protein structure with wet-lab experiments such as X-ray crystallography or NMR spectroscopy takes considerable effort. Because of this, the structure of only a very small fraction of proteins is currently known. It is therefore highly desirable to develop methods that can predict relevant structural and functional information from the protein sequence alone. InteractionSites uses machine learning to predict which residues of a protein are likely to be part of its binding sites with other proteins. InteractionSites2 (unpublished) improves the predictions for unknown proteins by adding new sequence-derived features such as predicted disorder to the features already used in InteractionSites, by training on more proteins with known three-dimensional (3D) structure, and by using better infrastructure and a faster neural network implementation (FANN).
How do we predict binding sites?
InteractionSites2 predicts residues likely to be part of binding sites with the aid of predicted structural features and evolutionary information. Since no three-dimensional or high-resolution structure is available for 97% of proteins, InteractionSites2 can be used to predict protein binding sites for them. Despite the diversity of protein-protein interactions, the underlying concepts defining these interactions can be learned from protein sequence. Binding site residues are key to understanding the mechanism of protein-protein interaction, and because they are often drug targets, deriving this knowledge from the information present in the protein sequence can be very valuable.
In our approach, we use neural networks to predict binding sites. The networks are trained with many combinations of parameters and features, and cross-validation results are used to gauge the performance of each network. Performance is measured as the average precision up to 50% recall. Based on this measure, three neural networks are selected for prediction.
When the prediction algorithm is given a new sequence, it feeds the input to all three neural networks. The result of the predictor is the mode of the decisions made by the three networks.
What is predicted?
InteractionSites2 identifies the residues that contribute disproportionately to the overall interface binding energy. It does not use the three-dimensional structure of proteins to predict interaction site residues. Instead, it relies on structural and functional sequence-derived features from PredictProtein methods (Meta Disorder, PROF, etc.).
What can you expect from binding sites prediction?
InteractionSites2 predicts which residues are likely to bind to other proteins. For every amino acid in a test protein, the predictor gives a binary output (1 = binding, 0 = non-binding) and a reliability score (from -100 to 100) for the prediction. The neural networks capture underlying patterns in the data to make the predictions. Three neural network models are used, and the final prediction is decided by majority voting.
How do we predict?
The InteractionSites2 neural network is trained on proteins for which the binding site residues are known. Different sequence-based structural and functional prediction methods of PredictProtein are run on these protein sequences, and their outputs are used as input features to the neural network. The data is partitioned into training, cross-training, test and holdout sets. The neural network is trained with different hyperparameters for three permutations. Precision and recall are then calculated on the cross-training set, and the average precision up to 50% recall is used to select the three best models, one for each permutation.
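The selection criterion can be illustrated with a minimal Python sketch. It is not the InteractionSites2 code: the function name and the exact averaging scheme (the mean of the precision at each correctly ranked binding residue, stopping once recall reaches 50%) are assumptions made for illustration, since the text does not spell out how the average is taken.

```python
def average_precision_up_to_recall(scores, labels, max_recall=0.5):
    """Mean precision over the ranked list until recall reaches max_recall.

    scores: raw network outputs, one per residue (higher = more likely binding)
    labels: known annotations, 1 for binding and 0 for non-binding
    Hypothetical helper; the averaging scheme is an assumption.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one binding residue is required")

    # Rank residues from the highest to the lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    true_positives = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            # Precision at this cutoff: correct predictions / predictions made.
            precisions.append(true_positives / rank)
            if true_positives / total_positives >= max_recall:
                break
    return sum(precisions) / len(precisions)


def select_best_model(candidates):
    """Pick the candidate name with the highest metric.

    candidates: dict mapping a model name to its (scores, labels) on the
    cross-training set. Names are placeholders for illustration.
    """
    return max(
        candidates,
        key=lambda name: average_precision_up_to_recall(*candidates[name]),
    )
```

Under these assumptions, `select_best_model` would be called once per permutation to obtain the three models that make up the final predictor.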
The test set is fed to the three best models and evaluated. Based on this evaluation, thresholds are chosen for the binary interface prediction: all residues with a score above the threshold are predicted as binding site residues, and those below it as non-binding residues. The larger the difference between a score and the threshold, the higher the reliability of the prediction, and vice versa.
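The thresholding step can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: the function name is hypothetical, a score exactly equal to the threshold is treated as non-binding (the text does not say), and the signed margin is returned as is, because the mapping from this difference to the -100 to 100 reliability scale is not described here.

```python
def classify_residues(scores, threshold):
    """Turn raw per-residue scores into (label, margin) pairs.

    label:  1 (binding) if the score is above the threshold, else 0.
            Treating score == threshold as non-binding is an assumption.
    margin: score - threshold; a larger absolute value means the score is
            further from the decision boundary, hence a more reliable call.
    """
    results = []
    for score in scores:
        label = 1 if score > threshold else 0
        margin = score - threshold
        results.append((label, margin))
    return results


# Example with made-up scores and a made-up threshold of 0.5.
for label, margin in classify_residues([0.75, 0.5, 0.25], threshold=0.5):
    print(label, margin)
```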
The final classifier consists of the three best models and gives a single binary prediction and a reliability score (from -100 to 100) for each input residue. The binary prediction is the mode of the three models' predictions. The score is the average of the three models' scores.
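A minimal Python sketch of this combination step is given below. The function name and the input layout (one list of labels and one list of reliability scores per model) are assumptions made for illustration. For binary labels from an odd number of models, the mode is the same as a majority vote, which is how it is computed here.

```python
from statistics import mean


def combine_models(per_model_labels, per_model_scores):
    """Combine per-residue outputs of several models into one prediction.

    per_model_labels: one list of 0/1 labels per model, same residue order
    per_model_scores: one list of reliability scores per model
    Returns a list of (label, score) pairs, one per residue.
    Hypothetical helper; intended for an odd number of models.
    """
    n_models = len(per_model_labels)
    combined = []
    # zip(*...) regroups the data so that each item holds one residue's
    # values across all models.
    for labels, scores in zip(zip(*per_model_labels), zip(*per_model_scores)):
        label = 1 if 2 * sum(labels) > n_models else 0  # majority vote
        combined.append((label, mean(scores)))          # averaged score
    return combined


# Example with three models, two residues, and made-up values.
labels = [[1, 0], [1, 1], [0, 0]]
scores = [[60, -20], [30, 10], [-30, -50]]
print(combine_models(labels, scores))
```

Averaging the scores independently of the vote follows the description above; it means the sign of the averaged score and the voted label are computed separately.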
The project is available at /mnt/project/isis in the Rostlab environment.
It is available in the following repository: /mnt/project/isis/isis2repo
Detailed README files are provided in the repo.