Proteins are the machinery of life: they perform the diverse functions an organism needs to live. The intricate details of a protein's structure are important for its function. Determining protein structure with lab experiments such as X-ray crystallography and NMR spectroscopy takes considerable effort. Because of this, the structure of only a very small fraction of proteins is currently known. It is therefore highly desirable to develop methods that can predict relevant structural and functional information from protein sequence alone. ISIS uses machine learning to predict which residues of a protein are likely to be part of its binding sites with other proteins. To improve the predictions obtained for unknown proteins, ISIS2.0 adds new sequence-derived features such as predicted disorder to the features already used in ISIS, trains on more proteins with known three-dimensional (3D) structure, and uses better infrastructure and a faster neural network implementation (FANN).
How do we predict binding sites?
ISIS2.0 predicts residues likely to be part of binding sites with the aid of predicted structural features and evolutionary information. Since no three-dimensional or high-resolution structure is available for 97% of proteins, ISIS2.0 can be used to predict protein binding sites for them. Despite the diversity of protein-protein interactions, the underlying concepts defining these interactions can be learned from protein sequence. Binding site residues are important for understanding the mechanism of protein-protein interaction, and deriving this knowledge from the information present in the protein sequence can be very valuable because binding site residues are often drug targets.
In our approach, we use a neural network to predict binding sites. The neural networks are trained with many combinations of parameters and features. We then use cross-validation results to gauge the performance of each of the neural networks. The performance is calculated as the average precision value for 50% recall. Based on this performance, three neural networks are selected for prediction.
When the prediction algorithm is given a new sequence, it feeds the input to the three neural networks. The mode of the decisions made by the neural networks is the result of the predictor.
What is predicted?
ISIS identifies the residues that contribute disproportionately to the overall binding energy of the interface. It does not use the three-dimensional structure of proteins to predict interaction site residues. Instead, it relies on structural and functional sequence-derived features from PredictProtein methods (Meta Disorder, PROF, etc.).
What can you expect from binding sites prediction?
ISIS 2.0 predicts which residues are likely to bind to other proteins. For every amino acid in a test protein, the predictor gives a binary output (1: binding, 0: non-binding) and a reliability score (from -100 to 100) for the prediction. ISIS is trained only on non-similar proteins, so it should not rely on sequence similarity. Because protein-protein interactions are so varied, ISIS, which captures the basic principles for identifying these sites, performs very well on some proteins and less well on others. This can be explained by the similarity of the protein sequence to the samples used in training: the more similar a protein is to the training samples, the better the prediction tends to be.
How do we predict?
The ISIS2.0 neural network is trained on proteins for which the binding site residues are known. Different sequence-based structural and functional prediction methods of PredictProtein are run on these protein sequences, and their outputs are used as input features to the neural network. The data is partitioned into training, cross-training, test and holdout sets. The neural network is trained with different hyperparameters for three permutations. Precision and recall are then calculated on the cross-training set, and the average precision up to 50% recall is used to select the three best models, one for each permutation.
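The selection criterion can be illustrated with a minimal Python sketch. It rests on assumptions that the text does not spell out: "average precision up to 50% recall" is read here as the mean of the precision values measured at each true positive while walking down the score-ranked residues until recall reaches 0.5. The function names and the layout of the `results` dictionary are hypothetical and are not taken from the ISIS2.0 repositories.

```python
from typing import Dict, Sequence, Tuple


def avg_precision_up_to_recall(
    scores: Sequence[float], labels: Sequence[int], max_recall: float = 0.5
) -> float:
    """Mean precision at each true positive until recall reaches max_recall.

    One possible reading of the criterion, not the ISIS2.0 code.
    labels: 1 for a known binding residue, 0 otherwise.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank residues from the highest to the lowest network score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            precisions.append(true_pos / rank)
            if true_pos / total_pos >= max_recall:
                break
    return sum(precisions) / len(precisions)


Outputs = Tuple[Sequence[float], Sequence[int]]


def pick_best_per_permutation(
    results: Dict[str, Dict[str, Outputs]]
) -> Dict[str, str]:
    """results maps permutation -> model name -> (scores, labels) on the
    cross-training set (hypothetical layout). Returns the best model name
    for each permutation."""
    return {
        perm: max(models, key=lambda name: avg_precision_up_to_recall(*models[name]))
        for perm, models in results.items()
    }
```

With three permutations in `results`, `pick_best_per_permutation` returns three model names, one per permutation, which mirrors the selection step described above.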
The test set is input to the three best models and evaluated. Based on this evaluation, thresholds are chosen for the binary interface prediction. This means that all residues with a score above a threshold are predicted as binding site residues and all residues below it as non-binding residues. The larger the difference between a score and the threshold, the higher the reliability of the prediction, and vice versa.
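A small Python sketch of this thresholding step follows. The exact mapping from the score-threshold difference to the -100 to 100 reliability range is not given in the text, so the linear scaling below is an assumption, as are the raw score range of 0 to 1, the treatment of a score exactly equal to the threshold, and the function name `classify_residue`.

```python
from typing import Tuple


def classify_residue(score: float, threshold: float) -> Tuple[int, int]:
    """Return (binary label, reliability) for one residue.

    Assumes a raw network score in [0, 1] and a threshold strictly
    between 0 and 1. The linear scaling is illustrative only.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    # Scores above the threshold are binding (1), all others non-binding (0).
    # A score exactly at the threshold is treated as non-binding (assumption).
    label = 1 if score > threshold else 0
    if score >= threshold:
        # Distance above the threshold, scaled so that a score of 1 gives +100.
        reliability = 100.0 * (score - threshold) / (1.0 - threshold)
    else:
        # Distance below the threshold, scaled so that a score of 0 gives -100.
        reliability = 100.0 * (score - threshold) / threshold
    return label, round(reliability)
```

Under these assumptions, a score far above the threshold yields a reliability close to 100, a score far below it yields a value close to -100, and scores near the threshold yield values near 0.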
The final classifier consists of the three best models and gives a single binary prediction and a reliability score (from -100 to 100) for each input residue. The binary prediction is calculated as the mode of the three models' predictions. The score is the average of the three models' scores.
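The combination step can be sketched in a few lines of Python. This is an illustration under assumptions, not the predictor repository's code: each model is assumed to have already produced a (binary label, reliability score) pair for a residue, and the function names are hypothetical.

```python
from statistics import mean, mode
from typing import List, Sequence, Tuple

ModelOutput = Tuple[int, float]  # (binary label, reliability score)


def combine_residue(outputs: Sequence[ModelOutput]) -> Tuple[int, float]:
    """Combine the outputs of the three models for one residue."""
    labels = [label for label, _ in outputs]
    scores = [score for _, score in outputs]
    # With three binary votes the mode is always unambiguous (majority vote).
    return mode(labels), mean(scores)


def combine_protein(
    per_model: Sequence[Sequence[ModelOutput]]
) -> List[Tuple[int, float]]:
    """per_model[m][i] is the output of model m for residue i."""
    return [combine_residue(residue_outputs) for residue_outputs in zip(*per_model)]
```

For example, if two models call a residue binding and one does not, the combined label is 1, while the combined score is the plain average of the three individual scores.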
The project is available in the /mnt/project/isis directory on the rostlab server. There are two repositories:
The first repository is for training the neural network with different combinations of parameters and features. The second repository is the predictor repository, which uses the best model of each permutation to predict interaction sites for proteins. Detailed READMEs are provided in both repositories.
Yanay Ofran; Burkhard Rost. ISIS: interaction sites identified from sequence. Bioinformatics 2006, Vol 23.