Proteins are the machinery of life: they perform the functions a life form needs to live. The intricate details of a protein's structure are important for its function. Determining protein structure with lab experiments such as X-ray crystallography and NMR spectroscopy takes considerable effort. Because of this limitation, the structure is currently known for only a very small fraction of proteins. It is therefore highly desirable to develop methods that can predict relevant structural and functional information from the protein sequence alone. ISIS uses machine learning to predict which residues of a protein are likely to be part of its binding sites with other proteins. To improve on the predictions ISIS gives for unknown proteins, ISIS2.0 adds new sequence-derived features such as Meta Disorder to the features already used in ISIS, trains on more proteins with known three-dimensional (3D) structure, and uses better infrastructure and a faster neural network implementation (FANN).
How do we predict binding sites?
ISIS2.0 predicts residues likely to be part of binding sites with the aid of predicted structural features and evolutionary information. Since no three-dimensional or high-resolution structure is available for 97% of proteins, ISIS2.0 can be used to predict binding sites for them. Despite the diversity of protein-protein interactions, the underlying concepts defining these interactions can be learnt from protein sequence. Binding site residues are important for understanding the mechanism of protein-protein interaction, and deriving this knowledge from the protein sequence is valuable because binding site residues are often drug targets.
In our approach, we use neural networks to predict binding sites. The networks are trained with many combinations of parameters and features. We then use the cross-validation results to gauge the performance of each network. Performance is calculated as the average precision up to 50% recall. Based on this performance, three neural networks are selected for prediction.
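The performance measure can be illustrated with a small sketch. It assumes one plausible reading of "average precision up to 50% recall": residues are ranked by network score, precision is recorded at each correctly ranked binding residue, and the walk stops once half of the true binding residues have been recovered. The function name and the exact averaging rule are assumptions made for illustration, not the project's actual code.

```python
from typing import Sequence


def average_precision_up_to_recall(
    scores: Sequence[float],
    labels: Sequence[int],
    max_recall: float = 0.5,
) -> float:
    """Mean of the precision values seen at each true positive, walking
    down the score ranking until recall reaches max_recall.

    labels use 1 for a binding residue and 0 for a non-binding residue.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one binding residue is required")

    # Rank residues from the highest to the lowest network score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    true_positives = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            precisions.append(true_positives / rank)
            # Stop as soon as the requested recall level is reached.
            if true_positives / total_positives >= max_recall:
                break
    return sum(precisions) / len(precisions)
```

A network that ranks true binding residues near the top of its score list obtains a value close to 1, which is why the measure suits the selection of a few high-precision models.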
When the prediction algorithm is given a new input, it feeds the input to the three chosen neural networks. The predictor's result is the mode of the three networks' decisions, that is, the majority vote.
What is predicted?
ISIS identifies the residues that contribute disproportionately to the overall binding energy of the interface. It does not use the three-dimensional structure of the proteins to predict interaction site residues. Instead, it relies on structural and functional sequence-derived features from methods of the PredictProtein routine such as FASTA, PSI-BLAST, META DISORDER and coils.
What can you expect from binding site prediction?
ISIS2.0 is used to predict which residues are likely to bind to other proteins. For every amino acid in a test protein, the predictor gives a binary output (1 for binding, 0 for non-binding) and a reliability score (from -100 to 100) for the prediction. Because protein-protein interactions vary widely, ISIS, which captures the basic principles for identifying these sites, performs very well on some proteins and less well on others. This can be explained by how similar the protein sequence is to the samples used in training: the more similar a protein is to the training samples, the better ISIS performs on it.
How do we predict?
The ISIS2.0 neural network is trained on proteins for which the binding site residues are known. Different sequence-based structural and functional prediction methods of the PredictProtein routine are run on these protein sequences, and their outputs are used as input features to the neural network. The data is partitioned into training, cross-training, test and holdout sets. The neural network is trained with different hyperparameters for each of three permutations. The precision-recall curve is then calculated on the cross-training set, and the average precision up to 50% recall is used to select the three best models, one for each permutation.
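The selection step can be sketched as follows. The record type, its field names and the tie-breaking rule (the first model seen wins) are hypothetical and chosen only for illustration; the sketch assumes each trained network already carries its cross-training score.

```python
from typing import Dict, List, NamedTuple


class TrainedModel(NamedTuple):
    permutation: int              # data permutation the model was trained on
    hyperparameters: str          # hypothetical label for a hyperparameter setting
    cross_training_score: float   # average precision up to 50% recall


def select_best_per_permutation(models: List[TrainedModel]) -> Dict[int, TrainedModel]:
    """Keep, for every permutation, the model with the highest score
    on the cross-training set."""
    best: Dict[int, TrainedModel] = {}
    for model in models:
        current = best.get(model.permutation)
        # Strict comparison: on a tie the earlier model is kept.
        if current is None or model.cross_training_score > current.cross_training_score:
            best[model.permutation] = model
    return best
```

With three permutations, the returned dictionary holds exactly three models, which then make up the final classifier.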
The test set is fed to the three best models, and a classification threshold for each of them is chosen based on its classification scores on the test set. The threshold is a baseline score: a residue scoring above it is predicted as a binding site, and a residue scoring below it as non-binding. The larger the difference between the threshold and a residue's score, the higher the reliability of the prediction, and vice versa.
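How a threshold turns a raw score into a binary call and a reliability value can be sketched as below. The text does not specify the mapping, so this sketch makes several assumptions: the network score lies in [0, 1], reliability is the distance from the threshold scaled linearly to the range -100 to 100, negative values denote non-binding calls, and a score exactly at the threshold counts as non-binding.

```python
from typing import Tuple


def classify_residue(score: float, threshold: float) -> Tuple[int, int]:
    """Return (binary prediction, reliability in [-100, 100]).

    Assumes 0 <= score <= 1 and 0 < threshold < 1.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")

    if score > threshold:
        # Binding: scale the distance above the threshold to 0..100.
        reliability = 100 * (score - threshold) / (1 - threshold)
        return 1, round(reliability)

    # Non-binding: scale the distance below the threshold to -100..0.
    reliability = -100 * (threshold - score) / threshold
    return 0, round(reliability)
```

For example, with a threshold of 0.5 a score of 0.75 is called binding with reliability 50, while a score of 0.0 is called non-binding with reliability -100.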
The classifier consists of the three best models and gives a binary prediction and a reliability score (from -100 to 100). The binary prediction is the mode of the three models' predictions, and the score is the average of the three models' scores. ISIS2.0 can only work if the methods of the PredictProtein routine are available to calculate the structural and functional features for the input protein sequences.
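Combining the three models can be sketched in a few lines. The input layout (one list of per-residue (binary prediction, score) pairs per model) and the function name are assumptions made for illustration.

```python
from statistics import mean, mode
from typing import List, Sequence, Tuple


def combine_models(
    per_model: Sequence[Sequence[Tuple[int, float]]],
) -> List[Tuple[int, float]]:
    """per_model[m][i] is the (binary prediction, score) of model m for
    residue i. Returns one (mode of predictions, mean score) per residue."""
    lengths = {len(outputs) for outputs in per_model}
    if len(lengths) != 1:
        raise ValueError("all models must score the same number of residues")

    combined = []
    # zip(*per_model) groups the outputs of all models residue by residue.
    for residue_outputs in zip(*per_model):
        labels = [label for label, _ in residue_outputs]
        scores = [score for _, score in residue_outputs]
        combined.append((mode(labels), mean(scores)))
    return combined
```

With an odd number of models and binary labels the mode is always unique, so the vote never ties.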
The project is available in the /mnt/project/isis directory on the roslab server. There are two repositories:
The first repository is for training the neural network with different combinations of parameters and features. The second repository is the predictor repo, which uses the best model of each permutation to predict interaction sites for proteins. Detailed READMEs are provided in the repos.