
Title: Protein-protein interaction hot spots carved into sequences
Author: Yanay Ofran & Burkhard Rost
Quote: PLoS Computational Biology, 2007, 3:e119

 

Protein-protein interaction hot spots carved into sequences

Yanay Ofran 1,2,4 & Burkhard Rost ?

1 Dept. of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA
2 Columbia University Center for Computational Biology and Bioinformatics (C2B2), 1130 St. Nicholas Avenue Rm 802, New York, NY 10032, USA
3 North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA
4 Dept. of Medical Informatics, Columbia Univ., 630 West 168th Street, New York, NY 10032, USA
* Corresponding author: ofran@rostlab.org URL http://www.rostlab.org/  Tel: +1-212-851-4669, fax: +1-212-305-7932

 



Abstract

Protein-protein interactions, a key to almost any biological process, are mediated by molecular mechanisms that are not entirely clear. The study of these mechanisms often focuses on all residues at protein-protein interfaces. However, only a small subset of all interface residues is actually essential for recognition or binding. Commonly referred to as "hot spots", these essential residues are defined as residues that impede protein-protein interactions if mutated. While no in silico tool identifies hot spots in unbound chains, numerous prediction methods were designed to identify all the residues in a protein that are likely to be a part of protein-protein interfaces. These methods typically succeed in identifying only a small fraction of all interface residues. Here, we analyzed the hypothesis that the two subsets correspond, i.e. that in silico methods may predict few residues because they preferentially predict hot spots. We demonstrate that this is indeed the case and that we can therefore predict directly from the sequence of a single protein which residues are interaction hot spots (without knowledge of the interaction partner). Our results suggest that most protein complexes are stabilized by similar basic principles. The ability to accurately and efficiently identify hot spots from sequence enables the annotation and analysis of protein-protein interaction hot spots in entire organisms and thus may benefit function prediction and drug development.



Synopsis (non-technical summary)

Interactions between proteins underlie all biological processes. Hence, to fully understand or to control biological processes we need to unravel the principles of protein interactions. The quest for these principles has focused predominantly on the entire interfaces between two interacting proteins. However, it has been shown that only a few of the interface residues are essential for the recognition of and binding to other proteins. The identification of these residues, commonly referred to as binding "hot spots", is a first step toward understanding the function of proteins and studying their interactions. Experimentally, hot spots can be identified by mutating single residues - an expensive and laborious procedure which is not applicable on a large scale. Here, we show that it is possible to identify protein interaction hot spots computationally on a large scale based on the amino acid sequence of a single protein, without requiring knowledge of its interaction partner. Our results suggest that most protein complexes are stabilized by similar basic principles. The ability to accurately and efficiently identify hot spots from sequence enables the annotation and analysis of protein-protein interaction hot spots in entire organisms and thus may benefit function prediction and drug development.


Introduction

Interactions of proteins are at the heart of almost every biological process. Thus, the understanding of biological mechanisms requires the knowledge of protein-protein interactions and the molecular principles that underlie them. Large-scale studies unravel networks of protein-protein interactions in cells and identify interacting pairs of proteins [1, 2, 3, 4, 5]. However, to fully understand these interactions, and to manipulate them, we need to identify the residues that account for binding of the proteins and stabilizing the complexes. It has been postulated that only very few of the residues in protein-protein interfaces are absolutely essential for the interaction: in a typical 1200-2000 Å² interface, less than 5% of interface residues contribute more than 2 kcal/mol to binding, which in small interfaces can mean as few as one amino acid on each protein [6]. These residues may be instrumental in understanding the interaction and could be desired drug targets [7].

The ability to predict hot spots on a large scale may assist in identifying, analyzing and comparing binding sites for drugs. Given a detailed three-dimensional (3D) structure of a complex, the residues crucial for binding are often identifiable. The Hendrickson lab, for instance, identified the most essential binding residues from their 3D structure of HIV glycoprotein (gp120) and CD4 receptor [8]. Unfortunately, 3D structures are available for less than 1% of all known pairs of interacting proteins. In the absence of 3D structures, the most conclusive way to probe the importance of particular residues for interaction is to experimentally mutate them, typically to alanine, and measure the effect of this substitution on the interaction [9, 10]. Many experiments have demonstrated that most interface residues could be mutated without affecting the affinity of the protein to its partners [11, 12]. Those few residues that, upon mutation, change the affinity are often assumed to be the most essential for the interaction and are deemed "hot spots" [6]. The limited overlap between interface residues and hot spots is demonstrated in Fig. 1, which depicts the complex of the human growth hormone and its receptor [13]. In the bound state (Fig. 1 A) a large patch on the surface of the receptor is buried in the interface. There are 31 residues on the receptor that are in physical contact with the hormone (Fig. 1 B). However, mutation experiments indicate that only six of these residues are energetically crucial for the interaction (Fig. 1 B).


  

Fig. 1 Protein-protein interfaces, hot spots and predictions. Residues that are part of a protein-protein interface often constitute a large fraction of the protein. Hot spot residues, namely residues that upon mutation hamper the interaction, are only a small fraction of these interface residues. Interestingly, methods designed to predict interface residues usually capture only a small fraction of them. A) Human growth hormone (yellow) bound to the extracellular portion of its homodimeric receptor. B) The chains of the receptor (grey) are 201 residues long. The protein-protein interface covers 31 of these residues (blue and red) on each of the chains. Mutating any one of the six residues colored in red abrogates or severely hampers the interaction. C) A prediction method (ISIS, see text) that was designed to identify all interface residues captured only five of the interface residues (colored green).

 



The ways to identify hot spots have been subject to theoretical debates. It has been pointed out that given the structural and physicochemical complexity of proteins, the physicochemical features of a protein are not a simple sum of the features of its individual residues [14]. Therefore, single mutations may not always convey accurate assessments of the contribution of a residue to the interaction [15, 16]. The theoretical validity of this argument notwithstanding, alanine scans have become the most widely used tool for identifying binding sites. While single mutations may not be tantamount to isolating the contribution of a single residue to the interaction, they are still considered a good approximation. Here, we adopt the following operational definition: if a mutation of a residue in a protein-protein interface changes the binding energy of the protein to its binding partner substantially (ΔΔG > 2.5 kcal/mol), then this residue is a hot spot residue.

To the best of our knowledge, there is currently no method that was designed to identify hot spots from sequence. However, many methods attempt to use sequence or structure to identify which residues are located in the interface between proteins [17-32]. Many of the methods that identify residues in protein-protein interfaces reach impressive levels of positive accuracy (residues correctly predicted to be in protein-protein interfaces as a fraction of all residues predicted to be in protein-protein interfaces; often also referred to as selectivity or precision, eqn. 1). However, their coverage (residues correctly predicted in interfaces as a percentage of observed interface residues; often also referred to as sensitivity or recall, eqn. 2) remains fairly low. In other words, although these methods attempt to identify all interface residues (all the residues that are colored blue or red in Fig. 1 B), they capture only a small fraction of them (for example, only the green residues in Fig. 1 C). We hypothesized that the reason for the low coverage of many prediction methods might be that the residues that are missed are more similar to the general population of surface residues than to the essential residues, i.e. they are inconsequential for the interaction. Therefore, a machine-learning algorithm trained on all protein-protein interface residues may learn to disregard the non-hot spot residues as noise, and identify only hot spot residues as the signal to be learned.

To test this hypothesis, we applied ISIS, a prediction method developed for the prediction of all interface residues [17] , to the task of predicting only hot spots. ISIS was never trained on hot spots (Methods). Instead, we trained on ALL interface residues found in PDB complexes, i.e. all interface residues were labeled positive and all other residues were labeled negative. The features on which ISIS was trained included the sequence environment of each residue (four residues on each side); the evolutionary profile of all nine residues in that window; the predicted solvent accessibility of the residue and the solvent accessibility of its immediate sequence environment (one residue on each side); the predicted secondary structure state of the residue and its immediate sequence environment (one residue on each side); and a conservation score for each residue. Like several other methods mentioned above, ISIS predicts residues in protein-protein interfaces very accurately (~90% accuracy). However, at this high level of accuracy ISIS identifies fewer than 5% of the residues that were experimentally mapped to the interface.
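The sequence-environment input described above (each residue together with four neighbours on each side) can be pictured with a small sketch. This only illustrates the windowing idea and is not the ISIS implementation: the function name `sequence_windows`, the padding character `-` for chain termini, and the example sequence are all assumptions made for illustration.

```python
def sequence_windows(sequence, flank=4, pad="-"):
    """Return one window of 2*flank + 1 residues centred on each position.

    Positions near the termini are padded with `pad`. The padding scheme is
    an assumption of this sketch; the text does not say how chain ends
    were handled.
    """
    padded = pad * flank + sequence + pad * flank
    width = 2 * flank + 1
    return [padded[i:i + width] for i in range(len(sequence))]


# Hypothetical ten-residue sequence, used only to show the window layout.
example = "MKTAYIAKQR"
windows = sequence_windows(example)
for centre, window in zip(example, windows):
    print(centre, window)
```

Each nine-residue window would then be paired with the per-position evolutionary profile and the predicted structural features before being passed to a classifier.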

The novelty here is that we applied a generic interface-prediction method to the specific task of identifying only the residues that are crucial for stabilizing the interactions, i.e. the hot spots. The results demonstrated a surprising overlap between two principally unrelated data sets: on the one hand, the subset of residues that was identified by experimental alanine mutations as hot spots, and on the other hand, the subset of residues predicted by ISIS to be protein-protein interface residues. We obtained a large data set of hot spots that were determined experimentally through alanine scans (Methods) and assessed the performance of ISIS on these hot spots. The results confirmed our hypothesis that the residues predicted by the machine learning method are, in fact, the hot spots. Analysis of the results indicated that accurate predictions of hot spots required the combination of sequence features, evolutionary information, and predicted structural features. All this information was generated from the amino acid sequence, suggesting that the commonalities of hot spots have been imprinted clearly onto amino acid sequences in the course of evolution.

 


Results

Using 296 point mutations from 30 proteins, we compared the residues predicted by ISIS to the ones experimentally identified to be hot spots (Methods). We first analyze the results for two representative examples. Then, we assess the performance in predicting hot spots based on the analysis of the entire dataset of 296 mutations. Note that although the 3D structures for most of these proteins were experimentally known, ISIS predicted interface residues from sequence alone. At no stage of the predictions did we use the experimentally determined structure. The only way in which we used 3D information was to visualize our results, as we mapped the predictions to the experimentally determined structure (Fig. 2).

 

HIV gp120/CD4 receptor complex. One of the most comprehensive alanine scans of all the complexes with known 3D structures is that between the CD4 receptor and the HIV glycoprotein gp120. This interaction involves backbone interactions, mainly on the gp120 side. However, we focused our analysis on the human CD4 receptor. Ashkenazi et al. [18] sequentially mutated many residues in the V1 domain of the CD4 receptor and studied the effect of each substitution on the binding affinity between CD4 and the HIV gp120 protein. Using a set of specific antibodies they also assessed which mutations had no effect on the structure. They identified 25 positions within a stretch of 94 residues on CD4 that upon substitution changed the affinity of CD4 substantially, without strongly altering the conformation of the protein. Within the same 94-residue segment (Fig. 2 A), we predicted 30 residues as interface residues; 19 of these were found experimentally to have a strong effect on binding. Of the six residues that ISIS missed, four were next to predicted interface residues. Five of the predictions that were not confirmed experimentally were residues that were not mutated in the study. Our method uses predicted structural features (solvent accessibility and secondary structure). Hence, its performance depends to some extent on the accuracy of these predictions. If we have a 3D structure of the unbound chain, we can improve accuracy and coverage by using the experimental rather than the predicted features. For example, when we used the unbound structure of CD4 as input for ISIS we found a few additional residues that were not identified from sequence alone. The two residues that scored highest (i.e. about which we were most confident that they participate in binding) were Arg59 and Phe43. The high-resolution structure of the complex between gp120 and CD4 [8] revealed two residues as the most important contacts between these two proteins: Arg59 and Phe43.
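As a rough illustration, the CD4 counts quoted above can be turned into positive accuracy and coverage. The sketch below assumes that the five predicted residues that were never mutated are excluded from the evaluation and that the remaining unconfirmed predictions count as false positives; the function and variable names are chosen here for illustration only.

```python
def positive_accuracy(tp, fp):
    """Fraction of evaluated predictions that are experimental hot spots."""
    return tp / (tp + fp)


def coverage(tp, fn):
    """Fraction of experimental hot spots that were predicted."""
    return tp / (tp + fn)


# Counts taken from the CD4 paragraph above.
predicted = 30      # residues predicted as interface residues
not_mutated = 5     # predicted residues with no experimental data (excluded)
confirmed = 19      # predictions confirmed by the alanine scan
hot_spots = 25      # positions with a substantial effect on affinity

tp = confirmed
fp = predicted - not_mutated - confirmed   # assumption: these count as false positives
fn = hot_spots - confirmed                 # hot spots that were missed

cd4_accuracy = positive_accuracy(tp, fp)
cd4_coverage = coverage(tp, fn)
print(f"accuracy = {cd4_accuracy:.2f}, coverage = {cd4_coverage:.2f}")
```

Under these assumptions both ratios are 19/25; with a different treatment of the unmutated residues the accuracy would change.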


  

Fig. 2 Accuracy of predictions of hot spots. The ability to identify the residues that account for most of the energy of binding is assessed both on particular proteins and on a large dataset of alanine scans. A. Alanine scans and predictions of essential interface residues in the V1 domain of CD4. The red rectangles (above sequence) mark positions that were shown to have a significant effect on the affinity of the binding between CD4 and gp120 upon substitution to alanine [18]; the same residues are colored in red on the lower left surface representation of CD4 (PDB ID 1wiq_A). The green rectangles (below sequence) mark positions predicted to participate in a protein-protein interaction; these residues are also colored in violet on the lower right. Note that five of the residues predicted in interfaces were not mutated in the alanine scan. Thus, we cannot evaluate their correctness and left them out of this analysis. B. Hot spots experimentally observed and predicted for the shaker voltage-gated potassium channel. All predictions and experimental substitutions [19] for this stretch are reported in this figure. C. Accuracy vs. coverage in predicting hot spots. The performance of ISIS (circles) and random assignment (triangles) using 296 alanine scans as gold standard. The data were compiled for a set of proteins that was not used for developing the method. The stronger the confidence in our prediction, the higher the accuracy and the lower the coverage, i.e. when we select the strongest predictions (moving upward in the figure), most of these are correct. Around an accuracy of 0.61 (right hand side of the plot), ISIS correctly predicted most of the interacting residues in our test set.

 



 

Voltage-gated potassium channel. For a variety of reasons, membrane proteins are a particularly popular target for alanine scans. One such alanine scan is available for the shaker voltage-gated K+ channel [19]. Within a region of 29 consecutive residues that have been scanned, eight have a significant effect on the affinity of the channel to its inhibitors Agitoxin2 and charybdotoxin. We used this region as input to our method, ignoring any available structural information, and predicted 13 residues (Fig. 2 B). Seven of the eight residues that were found experimentally were predicted by ISIS; the only residue that was missed is buried in the structure and hence is likely to affect the interaction indirectly through a conformational change. Of the six residues in our prediction that did not coincide with the residues implicated as important by the alanine scanning, five coincided with positions that were found to have significant although less dramatic effects on binding [19].

 

Performance over entire dataset. Within our set of alanine scans, almost all binding residues predicted by ISIS were found experimentally to have a significant effect on binding (Fig. 2 C). Furthermore, over 90% of the negative predictions (predicted not to be involved in protein-protein interactions) were confirmed experimentally to have no effect on the energy of binding. These results were particularly surprising in light of the fact that ISIS never explicitly evaluated any energetic parameters. Using different confidence thresholds, i.e. picking a different point on the curve in Fig. 2 C, it is possible to increase accuracy (true positives/all predicted positives) at the expense of coverage (true positives/all observed hot spots). Note that the results for the two examples discussed in detail (Fig. 2 A and Fig. 2 B) are similar to the performance of ISIS on the entire data set of 296 mutations.
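The trade-off between accuracy and coverage at different confidence thresholds can be sketched as follows. The scores and labels are invented toy values, not data from this study, and the function name `accuracy_coverage_curve` is hypothetical; the sketch only shows how moving the threshold moves a predictor along a curve such as the one in Fig. 2 C.

```python
def accuracy_coverage_curve(scores, is_hot_spot, thresholds):
    """Return (threshold, accuracy, coverage) for each confidence threshold.

    A residue is predicted positive when its score is >= the threshold.
    Accuracy is None when nothing is predicted at a given threshold.
    """
    total_hot = sum(is_hot_spot)
    curve = []
    for t in thresholds:
        predicted = [label for s, label in zip(scores, is_hot_spot) if s >= t]
        tp = sum(predicted)
        accuracy = tp / len(predicted) if predicted else None
        coverage = tp / total_hot if total_hot else None
        curve.append((t, accuracy, coverage))
    return curve


# Toy data (assumed for illustration): prediction scores and experimental labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [True, True, True, False, True, False, False, True]

curve = accuracy_coverage_curve(scores, labels, thresholds=[0.9, 0.6, 0.2])
for threshold, accuracy, coverage in curve:
    print(threshold, accuracy, coverage)
```

In this toy case, lowering the threshold raises coverage while accuracy falls, which is the behaviour described above.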

 


Discussion

Hot spots are easy to identify but hard to define. We used ISIS to represent methods that predict interface residues at high accuracy and low coverage. The results suggested that the system of neural networks that underlies ISIS learned to identify the hot spots, despite the fact that they were only a small subset of the samples that were labeled as interaction residues. The system effectively disregarded most of the residues observed in interfaces, i.e. the pupil (neural network) clearly ignored the teacher (labeled data). We found that the residues ignored were mostly non-hot spot residues. These results indicated that the biophysical common denominators of hot spots are so pronounced that the neural networks could identify them without specific labeling in the training phase.

What are these features that are common to hot spots? Unfortunately, we cannot simply list a few rules or features to describe these commonalities. The neural networks identified a set of complex non-linear correlations between the input features we used and hot spot residues. It is impossible to translate the subtle and complex dependencies that were identified by the neural networks into simple explanations, or a set of rules, in English. However, it is possible to infer which features are more or less relevant. To this end, we trained several systems using different combinations of input features. Neural networks that were trained only on the sequence environment of interface residues performed only slightly (although significantly) better than random (data not shown). Adding evolutionary information significantly improved performance on both interface residues and hot spots. This result was somewhat surprising given that the conservation of predicted hot spots was only marginally different from that of all other residues (Fig. 3). Conversely, predicted non-hot spot residues were only marginally less conserved than the background. In other words, although the overall difference in conservation was marginal, the addition of this information to the neural networks' input substantially improved performance. Apparently, the neural networks have learned to distinguish between conservation that is indicative of hot spots and conservation that is not. Strikingly, they did so without being trained on hot spots. This underscores why linear combinations of input features did not suffice and why the extraction of singly important commonalities would at best be misleading.


  

Fig. 3 The common features of hot spots are hard to identify without machine learning. Physicochemical, structural and evolutionary features differentiate hot spots from other residues. However, while each of these features is crucial for the success of the prediction, a simple, linear combination of them will not suffice. The distributions of residue conservation (X-axis, HSSP [36] conservation score with an average conservation of 1) are compared between the entire sequences of the proteins in the dataset (cyan), hot spots (yellow), and residues with no effect (measured by alanine scans) on protein-protein binding (magenta). The Y-axis gives the fraction of residues with a given level of conservation. The differences are marginal but the overall effect of conservation on the prediction is substantial.

 



The analysis of the contribution of each feature suggested that successful predictions of hot spots required the combination of all features. However, even when some of these features were not available, ISIS could still provide accurate predictions: e.g. 15 proteins in our dataset had fewer than 10 homologues in today's databases. For these proteins the success in predicting hot spots was lower, but still significantly higher than random (at 70% positive accuracy, >10% of the experimentally determined hot spots were identified, compared with about 20% coverage at 70% accuracy for all proteins; Fig. 2).

 

Successful hot spot predictions require specific in silico tools. We did not benchmark the ability of prediction methods other than ISIS to predict hot spots. The main reason was that no existing method (including ISIS) was designed to predict hot spots. The ability of ISIS to identify hot spots is an unintended consequence of the power of neural networks. Therefore, when comparing ISIS to other methods one should remember that this comparison does not benchmark these methods in the task for which they were originally developed. Still, the question remains whether any method designed to predict interface residues could predict hot spots at levels of accuracy as high as the ones we reported for ISIS. To address this question we applied a few representative interface prediction methods to the task of predicting hot spots. In particular, we chose methods that rely on different input features. Analysis of the results indicated that methods that did not rely on a combination of physicochemical features, evolutionary conservation and structural features failed to identify hot spots.

 

What does it take to predict hot spots? We applied several prediction methods that were designed to identify interface residues to the task of predicting hot spots. To eschew obfuscation: our aim was not to benchmark methods not designed to identify hot spots. Instead, we applied these methods to narrow down the features needed to successfully predict hot spots.

The Evolutionary Trace (ET) method [20] correlates evolutionary importance of residues with their importance for function. We used ET to represent the approach that relies predominantly on evolutionary conservation. Gallet et al. [21] have attempted to predict interaction sites from simple biophysical features; the method computes the hydrophobic moment [22] around each residue based on its sequence environment to determine whether this residue could be a binding site. ProMate [23] extracts its input from the 3D structure of an unbound protein; we used it to represent methods that rely on experimentally determined 3D structures. We also included another method that predicts interfaces exclusively using amino acid information (and no aspects of predicted structure or evolutionary profiles) [24]. We arbitrarily chose the operating point at which the coverage of hot spots was 15% (Methods) and checked the accuracy of each method for this coverage (Fig. 4). ISIS and ProMate, the two methods that were most successful, utilize physicochemical features, evolution and structural features. Of these two, ISIS is the only sequence-based method, and the structural features it uses are based on predictions. ProMate, which relies on the 3D structure, performed even better. The conclusion of this analysis is that no single feature suffices to characterize hot spots. Rather, it takes a complex combination of the aforementioned features to define a residue as a hot spot.


  

Fig. 4: Accuracy of prediction of hot spots at a coverage level of 15%. Several approaches were introduced in the past for the prediction of interface residues. We applied methods that rely on different features to the task of predicting hot spots (for which none of them was optimized). The hydrophobic moment method represented the approach that relies exclusively on local physicochemical factors. The evolutionary approach was represented by the evolutionary trace method, which relies on conservation to identify functionally important residues. A knowledge-based tool we introduced in the past represented the sequence-only approach. Finally, ProMate, a method for predicting interaction sites from unbound structure, represented the structure-based approach.

 

 



How hot spots differ from other interface residues. It is apparent that the neural networks identified some common denominators between hot spots that distinguish them from other interface residues. What these common denominators are is hard to determine given our current gold standard (namely the dataset of experimental alanine scans). The number of features we use for the prediction (189) is greater than the number of positive data points in our set of alanine scans. To determine to what extent each input feature differentiates between hot spots and other interface residues we need a substantially larger dataset of hot spots and non-hot spot residues. This could be achieved if we assume that ISIS indeed identifies hot spots. Thus, by running ISIS on a large dataset of interface residues we can create a large dataset of predicted hot spots and a large dataset of interface residues that are predicted not to be hot spots. Then, we can use these large datasets to analyze the characteristics of hot spots versus the characteristics of other interface residues. We did this using the large dataset of interface residues that was used as a test set for training ISIS. On this dataset we compared the residues that were classified by ISIS as positive (i.e. hot spots) to those that are annotated experimentally as interface residues but are classified by ISIS as negatives. Table 1 is based on the multiple sequence alignment of each protein in this dataset. For each interface residue, it shows the average occupancy of its position by each type of amino acid. We also present the average occupancy of each residue in the alignment for experimentally determined hot spots (through alanine scan). These values are presented in parentheses, as the data that underlie them are sparse (only 100 positions). Note that for some amino acids there are significant differences between hot spot and non-hot spot interface residues, while for others there are no substantial differences. 
Table 1 also presents the p-value for the difference based on a t-test. Note, for example, the roughly four-fold overrepresentation of arginine in predicted hot spots (and the extremely low p-value) relative to other interface residues. However, the percentages of lysine are virtually the same for both categories. Thus, it is not simple considerations of hydrophobicity that characterize hot spots. Four aliphatic residues are depleted in hot spots (A, V, I, L), while amide side chains are overrepresented (N and Q). However, the role of aromatics is unclear since tyrosine is enriched in hot spots, phenylalanine is depleted and tryptophan has similar propensities across the interface. The experimental values (shown in parentheses) are very close to the values obtained for the predicted hot spots, supporting our assumption that ISIS identifies hot spots. However, the limited amount of experimental data limits our ability to elaborate on this comparison. We also compared the conservation and the structural features of both groups. As shown in Fig. 3 there were hardly any differences in conservation. However, the most striking differences were found between structural features (Table 2). The secondary structure state of about 40% of the non-hot spot interface residues was loop. In the predicted hot spots, on the other hand, 57% of the residues were in a loop state. In both categories, the rest of the residues were divided roughly equally between helices and strands. Again, there is a striking agreement between the properties of predicted hot spots and the properties of experimental hot spots, despite the fact that ISIS was trained on all interface residues. Predicted hot spots were also much more accessible to solvent than other interface residues.


Table 1: Position occupancy in hot spots versus the rest of the interface*.
Columns 2 and 3: interface residues (average percentage of occurrence).

Amino acid   Non hot spots   Predicted hot spots (alanine scan)   P-val
I            7.22            2.37 (2.26)                          10^-90
V            7.77            3.05 (3.54)                          10^-82
L            10.6            4.63 (4.51)                          10^-68
R            3.28            12.7 (12.6)                          10^-62
A            7.9             4.74 (3.67)                          10^-25
Y            3.07            7.32 (8.15)                          10^-20
N            3.5             6.33 (7.31)                          10^-16
F            4.99            2.95 (2.59)                          10^-12
E            6.58            4.57 (7.05)                          10^-11
D            4.67            7.33 (9.4)                           10^-10
G            6.36            8.82 (4.15)                          10^-7
H            2.19            3.44 (2.46)                          10^-6
Q            3.3             4.22 (2.62)                          10^-4
P            4.63            5.89 (3.62)                          10^-3
C            2.66            1.89 (1.1)                           10^-3
T            5.82            5.01 (3.56)                          0.01
M            2.43            2.16 (1.03)                          0.21
W            1.67            1.45 (1.33)                          0.34
S            6.32            6.10 (8.74)                          0.46
K            5.13            5.08 (8.67)                          0.86

*  We obtained a multiple sequence alignment for each protein in our dataset. Then, for each residue that is observed to be part of protein-protein interface we calculated the average percentage occupancy for each amino acid in the multiple sequence alignments. We then differentiated between interface residues that were predicted by ISIS to be positive (hot spots) and interface residues that were predicted to be negative (non hot spots). In parenthesis we present the value for the experimentally detected hot spots. The p-value of a t-test (for the significance of the difference between predicted hot spot and non hot spot) is presented in the third column.
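To make the enrichment and depletion patterns in Table 1 easier to read, the ratio of the two occupancy columns can be computed directly. The sketch below copies the "non hot spots" and "predicted hot spots" percentages from Table 1; the dictionary name `occupancy` and the helper `fold_change` are introduced here for illustration and are not part of the original analysis.

```python
# (non-hot-spot %, predicted-hot-spot %) per amino acid, copied from Table 1.
occupancy = {
    "I": (7.22, 2.37), "V": (7.77, 3.05), "L": (10.6, 4.63), "R": (3.28, 12.7),
    "A": (7.9, 4.74),  "Y": (3.07, 7.32), "N": (3.5, 6.33),  "F": (4.99, 2.95),
    "E": (6.58, 4.57), "D": (4.67, 7.33), "G": (6.36, 8.82), "H": (2.19, 3.44),
    "Q": (3.3, 4.22),  "P": (4.63, 5.89), "C": (2.66, 1.89), "T": (5.82, 5.01),
    "M": (2.43, 2.16), "W": (1.67, 1.45), "S": (6.32, 6.10), "K": (5.13, 5.08),
}


def fold_change(amino_acid):
    """Occupancy in predicted hot spots divided by occupancy in other interface residues."""
    non_hot, hot = occupancy[amino_acid]
    return hot / non_hot


# Rank amino acids from most enriched to most depleted in predicted hot spots.
ranked = sorted(occupancy, key=fold_change, reverse=True)
for aa in ranked:
    print(f"{aa}: {fold_change(aa):.2f}")
```

On these numbers arginine is the most enriched residue (close to four-fold) and isoleucine the most depleted, while lysine stays near a ratio of one.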



Table 2: Secondary structure in hot spots versus the rest of the interface*.
Columns 2 and 3: interface residues (%).

Secondary structure   Non hot spots   Predicted hot spots (alanine scan)
Helix                 29.6            21.4 (23.7)
Strand                29.9            21.2 (22.5)
Loop                  39.9            57.4 (53.8)

*  We recorded the secondary structure state of each residue in the interface and then we compared the percentage of residues in each state between residues that were predicted to be hot spots and the rest of the interface residues. We also recorded the secondary structure state of residues that were observed experimentally to be a hot spot (in parenthesis).



 

Are all hot spots similar? Several studies suggested that hot spots have certain structural characteristics that differentiate them from other residues [25, 26]. The Baker Lab has shown that given a 3D structure of a protein complex, it is possible to predict the results of alanine scans specifically and accurately [27, 28]. This indicates that alanine scans indeed capture some genuine physicochemical commonalities of interaction hot spots that could be identified by a general method that is applicable to all protein complexes. The in silico alanine scanning is based on analysis of the 3D structure of the interface between two proteins. Thus, it requires a high-resolution structure of the protein complex, while ISIS needs only the sequence of a single chain regardless of its binding partner. On the other hand, in silico alanine scanning produces a numerical prediction of the ΔΔG, while ISIS produces a binary prediction (hot spot / not hot spot). We compared our predictions to those of the in silico alanine scanning by translating their numerical predictions to binary ones according to the cutoffs defined above. Of 55 experimental mutations with ΔΔG > 2.5 kcal/mol, in silico alanine scanning identified 36 (66%) residues as hot spots. At this coverage, ISIS reached an accuracy of about 60% while the in silico alanine scanning reached an accuracy of above 75%. Scaled to an accuracy of 80%, ISIS identified 18 of these mutations (33%). Thus, for similar levels of positive accuracy, the coverage of ISIS is roughly half of that of the in silico alanine scanning. Obviously, when structures of the complex are available, the in silico alanine scan is a powerful tool for identifying hot spots. However, when only the sequence is available, ISIS can provide accurate predictions for a substantial fraction of the hot spots. Our results indicate that some hot spots can be predicted accurately not only without relying on the 3D structure of the complex but even without the 3D structure of the unbound proteins. 
Furthermore, our predictions did not require knowledge of the binding partner. Analyzing a single protein using ISIS typically requires a few minutes. Thus, ISIS may allow large-scale analysis of hot spots at a relatively small CPU cost.

 

 


Methods

Data set. We used the ASEdb database of experimental alanine scans [12], which lists residues that were mutated to alanine and the effect (in terms of ΔΔG) each mutation had on the interaction between two proteins. We checked the correlation between the predictions and the residues that were shown experimentally to substantially affect the affinity of the proteins in a complex for each other. In order to reduce the number of cases in which the effect of the mutation on binding was not due to a change in the interface (for example, cases in which the mutation destabilized the structure), we considered only exposed residues in proteins of known structure. Thus our test set included 80 protein chains with hundreds of experimental substitutions. From among these, we analyzed the mutations that substantially changed the binding energy (ΔΔG > 2.5 kcal/mol) and those that had no effect (ΔΔG = 0). Altogether we attempted to predict the experimental effect of 296 substitutions. The predictions were performed using ISIS [17]. ISIS can take as input either the sequence or the coordinates of the 3D structure of unbound chains (the results are more accurate when using known 3D structures). However, for all values reported here, we ran ISIS from sequence alone.
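The selection of substitutions described above can be sketched as follows. This is a minimal illustration, not the code used in this work: the record layout, the `exposed` flag and all function names are hypothetical, and only the two thresholds (ΔΔG > 2.5 kcal/mol and ΔΔG = 0) are taken from the text.

```python
from typing import List, NamedTuple, Optional, Tuple

HOT_SPOT_CUTOFF = 2.5  # kcal/mol, threshold stated in the text


class Substitution(NamedTuple):
    """One alanine substitution (hypothetical record layout)."""
    chain: str
    position: int
    ddg: float     # experimental change in binding free energy, kcal/mol
    exposed: bool  # assumed to be precomputed from the known 3D structure


def label(sub: Substitution) -> Optional[bool]:
    """Return True for a hot spot, False for no effect, None if excluded."""
    if not sub.exposed:
        return None  # buried residues are not considered
    if sub.ddg > HOT_SPOT_CUTOFF:
        return True
    if sub.ddg == 0.0:
        return False
    return None  # intermediate effects are left out of the analysis


def build_test_set(subs: List[Substitution]) -> List[Tuple[Substitution, bool]]:
    """Keep only substitutions that are clearly hot spots or clearly neutral."""
    selected = []
    for sub in subs:
        lab = label(sub)
        if lab is not None:
            selected.append((sub, lab))
    return selected
```

Substitutions with intermediate ΔΔG values are dropped rather than forced into one of the two classes, which mirrors the choice described above of analyzing only the two extremes.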

Measuring performance. The accuracy and coverage of ISIS were measured using ratios derived from three counts: TP (true positives), the number of residues predicted by ISIS (below) to be in a protein-protein interface and observed to be hot spots, i.e. found to have an extreme effect on binding (ΔΔG > 2.5 kcal/mol); FP (false positives), the number of residues predicted to be in protein-protein interfaces but found, upon mutation, to have no effect on binding (ΔΔG = 0); and FN (false negatives), the number of residues predicted not to be in a protein-protein interface but observed to have a strong effect on binding (ΔΔG > 2.5 kcal/mol). We used:

  $\mathrm{Accuracy} = \frac{TP}{TP + FP}$   (Eq. 1)

  $\mathrm{Coverage} = \frac{TP}{TP + FN}$   (Eq. 2)
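As an illustration, the two ratios can be computed from paired predictions and experimental labels as sketched below. The sketch assumes the conventional definitions accuracy = TP / (TP + FP) and coverage = TP / (TP + FN), which are consistent with the counts defined above; the function names are hypothetical.

```python
from typing import Sequence, Tuple


def count_outcomes(predicted: Sequence[bool],
                   is_hot_spot: Sequence[bool]) -> Tuple[int, int, int]:
    """Count TP, FP and FN over residues with an experimental label.

    predicted[i]   -- True if residue i is predicted to be in an interface
    is_hot_spot[i] -- True if ddG > 2.5 kcal/mol, False if ddG = 0
    """
    tp = fp = fn = 0
    for pred, hot in zip(predicted, is_hot_spot):
        if pred and hot:
            tp += 1
        elif pred and not hot:
            fp += 1
        elif not pred and hot:
            fn += 1
    return tp, fp, fn


def accuracy(tp: int, fp: int) -> float:
    """Fraction of positive predictions that are hot spots (assumed Eq. 1)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def coverage(tp: int, fn: int) -> float:
    """Fraction of hot spots that are predicted (assumed Eq. 2)."""
    return tp / (tp + fn) if (tp + fn) else 0.0
```

True negatives do not enter either ratio, so both measures depend only on the residues that are predicted positive or observed to be hot spots.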

ISIS. ISIS is a knowledge-based method we developed to identify interface residues from sequence [17]. It is based on a system of neural networks and uses as input the sequence environment of each residue, its evolutionary profile (the frequency of each type of amino acid at a given position of the alignment), and its predicted secondary structure and solvent accessibility. In particular, when a sequence is submitted as a query, ISIS runs PSI-BLAST [29], generates a multiple sequence alignment and produces an evolutionary profile for each residue. These data are then sent to PROF [30, 31], a system of neural networks that predicts the secondary structure state and the solvent accessibility of each residue. Finally, the sequence environment, the evolutionary profile and the predicted structural features serve as input to another neural network, which annotates each residue as interface or non-interface. ISIS was trained on a non-redundant version of all transient protein-protein interfaces [32] in the PDB (the 3D structures were used only to identify the residues that are spatially in the interface; no experimental 3D information was used for training).
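The notion of an evolutionary profile used above (the frequency of each amino acid type at each alignment position) can be illustrated with a minimal sketch. It is not the ISIS or PSI-BLAST implementation: it assumes an alignment given as equal-length strings, counts every sequence with equal weight, ignores gaps and applies no pseudocounts. All names are hypothetical.

```python
from collections import Counter
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(alignment: List[str], column: int) -> Dict[str, float]:
    """Frequency of each amino acid in one alignment column (gaps skipped)."""
    residues = [seq[column] for seq in alignment if seq[column] in AMINO_ACIDS]
    counts = Counter(residues)
    total = len(residues)
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


def evolutionary_profile(alignment: List[str]) -> List[Dict[str, float]]:
    """One frequency table per alignment position."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [column_profile(alignment, i) for i in range(length)]
```

For example, for the toy alignment `["ACD", "ACE", "A-D", "GCD"]` the first column is 75% A and 25% G, and the gap in the second column is left out of the count, so that column is 100% C.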

Training the neural network: First level prediction. We trained standard feed-forward neural networks with back-propagation and a momentum term on windows of 9 consecutive residues. A window was defined as positive if the central residue had any atom within 6 Å of any atom in a different protein. This yielded a set with 59,559 positive samples. We trained on two thirds of the data and tested on the remaining third.
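The construction of training samples described above can be sketched as follows. This is an illustration under stated assumptions, not the original code: the distance cutoff is taken to be 6 Å, sequence termini are padded with a placeholder character (the text does not say how termini were handled), and atoms are plain (x, y, z) tuples. All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

WINDOW = 9            # window length stated in the text
CONTACT_CUTOFF = 6.0  # assumed to be in Angstrom

Atom = Tuple[float, float, float]


def windows(sequence: str, size: int = WINDOW, pad: str = "-") -> List[str]:
    """One window per residue, centered on that residue.

    Termini are padded with `pad` (an assumption made for this sketch).
    """
    half = size // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + size] for i in range(len(sequence))]


def is_contact(residue_atoms: Sequence[Atom],
               partner_atoms: Sequence[Atom],
               cutoff: float = CONTACT_CUTOFF) -> bool:
    """True if any atom of the residue lies within `cutoff` of any partner atom."""
    return any(math.dist(a, b) <= cutoff
               for a in residue_atoms
               for b in partner_atoms)
```

A window is then labeled positive when `is_contact` holds for its central residue. Only the labels depend on 3D coordinates; the window itself is built from sequence alone.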

Second level refinement filter. Next, we filtered the raw network predictions. Our analysis of protein interfaces at the sequence level suggested that most interacting residues have other interacting residues in their sequence neighborhood. Therefore, we eliminated predictions with fewer than seven raw predictions within ten adjacent residues (five on either side).
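One possible reading of this filter is sketched below. It assumes that the ten adjacent residues are the five on each side of the residue in question, that the residue itself is not counted among its neighbors, and that windows are simply truncated at the termini. The text does not specify these details, and the function name is hypothetical.

```python
from typing import List, Sequence


def refine(raw: Sequence[bool], flank: int = 5,
           min_neighbors: int = 7) -> List[bool]:
    """Keep a raw positive prediction only if enough neighbors are also positive.

    raw[i] is True if the first-level network predicts residue i as interface.
    """
    n = len(raw)
    refined = []
    for i, predicted in enumerate(raw):
        if not predicted:
            refined.append(False)
            continue
        lo, hi = max(0, i - flank), min(n, i + flank + 1)
        # Count positives in the neighborhood, excluding residue i itself.
        neighbors = sum(1 for p in raw[lo:hi] if p) - 1
        refined.append(neighbors >= min_neighbors)
    return refined
```

Under these assumptions, isolated raw predictions and predictions at the edge of a predicted stretch are removed, while the core of a dense stretch is kept.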

Random model. To obtain the coverage and accuracy expected at random, we reshuffled the predictions in the following way: each protein was represented by two strings of the same length, one representing its sequence and the other representing the predictions ("P" for an interacting residue, "-" for a non-interacting residue). Then, we split the prediction string in half and assigned the predictions of the first half of the sequence to the second half and vice versa. This process accounted for any size effect that could be caused by the number of predictions and for any effect caused by the heterogeneous distribution of contacting residues along the sequence. Furthermore, it enabled us to find a specific expectation for each scaling of the prediction. We generated different random models for different values of the ROC-like curve (Fig. 2C). Our background model captured how random our predictions were rather than how well we could predict interface residues at random.
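The reshuffling step can be sketched as a swap of the two halves of the prediction string. The handling of odd-length proteins (the middle residue moves with the second half) is an assumption of this sketch, and the function name is hypothetical.

```python
def swap_halves(predictions: str) -> str:
    """Assign the predictions of the first half to the second half and vice versa.

    `predictions` holds one character per residue: 'P' for a predicted
    interacting residue, '-' otherwise. For odd lengths the middle residue
    is moved with the second half (an assumption made here).
    """
    mid = len(predictions) // 2
    return predictions[mid:] + predictions[:mid]
```

For example, `swap_halves("PPP-----")` gives `"----PPP-"`. The number of positive predictions and their clustering along the chain are preserved; only their placement relative to the sequence changes, which is what makes the swapped string usable as a background model.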

No overlap between data sets used for development and for assessment. ISIS was developed on a data set of 1134 chains in 333 complexes, which contained 59,559 residue contacts. In the assessment of ISIS, no sequence that was used for training had any significant similarity to any of the sequences that were used for testing. That is, no protein in the test set could have been modeled from any protein in the development sets by homology-based predictions [33, 34].

Implementing and applying other methods. We chose methods that represent the variety of approaches for predicting interaction sites. ProMate [23] is a structure-based method that extracts features from an unbound chain and uses them to predict the binding site. We also chose three sequence-based methods: a sequence-only method [17], an evolution-based method (evolutionary trace [20, 35]) and a biophysics-based one (hydrophobic moment [21]). The first two were available as servers for public use. The hydrophobic moment method was not publicly available; thus we implemented it for the purpose of this analysis. We chose an operating point of coverage = 15%, which was the highest coverage reached by the hydrophobic moment tool.

Comparing hot spots to other interface residues. We used the dataset of interface residues that was originally used to test ISIS [17]. This dataset contains over 20,000 interface residues, 2,182 of which were classified by ISIS as positive. To zoom in on the differences between hot spots and other interface residues, we compared the features of these 2,182 residues to the features of the residues that were classified as negative. The results of the comparison for amino acids are presented in Table 1 and are based on the evolutionary profile we used for prediction. For each interface residue, we used a multiple sequence alignment to check how often each amino acid is present at this position. We performed the same analysis for all the positions that were found experimentally, by alanine scanning, to be hot spots. Table 1 shows the average percentage occupancy of each amino acid in all positively predicted positions and in all negatively predicted interface residues.

 

 

Acknowledgements

Thanks to Jinfeng Liu and Paul Glick (Columbia) for computer assistance, and to Mickey Kosloff (Columbia), Guy Nimrod, Gilad Wainreb, Uri Rom (all Tel Aviv University) for help with graphics. Special thanks also to Lawrence Shapiro, Wayne Hendrickson, Barry Honig, David Hirsh, and Oliver Hobert (all Columbia) for helpful discussions. The work of YO and BR was supported by the grant R01-GM64633-01 from the National Institute of General Medical Sciences (NIGMS) at the National Institutes of Health (NIH). Thanks also to the reviewers who suggested very insightful additional analysis. Last but not least, thanks to all those who maintain excellent databases and to all experimentalists who enabled this work by making their data publicly available.

 



References

1.Uetz, P., Giot, L., Cagney, G., Mansfield, T. A., Judson, R. S. et al. (2000). A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403, 623-627.

2.Gavin, A. C., Bosche, M., Krause, R., Grandi, P., Marzioch, M. et al. (2002). Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, 141-147.

3.Ho, Y., Gruhler, A., Heilbut, A., Bader, G. D., Moore, L. et al. (2002). Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415, 180-183.

4.Giot, L., Bader, J. S., Brouwer, C., Chaudhuri, A., Kuang, B. et al. (2003). A protein interaction map of Drosophila melanogaster. Science, 302, 1727-1736.

5.Li, S., Armstrong, C. M., Bertin, N., Ge, H., Milstein, S. et al. (2004). A map of the interactome network of the metazoan C. elegans. Science, 303, 540-3.

6.Bogan, A. A. & Thorn, K. S. (1998). Anatomy of hot spots in protein interfaces. J Mol Biol, 280, 1-9.

7.DeLano, W. L. (2002). Unraveling hot spots in binding interfaces: progress and challenges. Curr Opin Struct Biol, 12, 14-20.

8.Kwong, P. D., Wyatt, R., Robinson, J., Sweet, R. W., Sodroski, J. et al. (1998). Structure of an HIV gp120 envelope glycoprotein in complex with the CD4 receptor and a neutralizing human antibody. Nature, 393, 648-659.

9.Wells, J. A. (1991). Systematic mutational analyses of protein-protein interfaces. Methods Enzymol, 202, 390-411.

10.Morrison, K. L. & Weiss, G. A. (2001). Combinatorial alanine-scanning. Curr Opin Chem Biol, 5, 302-7.

11.Clackson, T. & Wells, J. A. (1995). A hot spot of binding energy in a hormone-receptor interface. Science, 267, 383-6.

12.Thorn, K. S. & Bogan, A. A. (2001). ASEdb: a database of alanine mutations and their effects on the free energy of binding in protein interactions. Bioinformatics, 17, 284-5.

13.de Vos, A. M., Ultsch, M. & Kossiakoff, A. A. (1992). Human growth hormone and extracellular domain of its receptor: crystal structure of the complex. Science, 255, 306-12.

14.Horovitz, A. (1996). Double-mutant cycles: a powerful tool for analyzing protein structure and function. Fold Des, 1, R121-6.

15.Vaughan, C. K., Buckle, A. M. & Fersht, A. R. (1999). Structural response to mutation at a protein-protein interface. J Mol Biol, 286, 1487-506.

16.Reichmann, D., Rahat, O., Albeck, S., Meged, R., Dym, O. et al. (2005). The modular architecture of protein-protein binding interfaces. Proc Natl Acad Sci U S A, 102, 57-62.

17.Ofran, Y. & Rost, B. (2007). ISIS: Interaction Sites Identified from Sequence. Bioinformatics, 23, e13-6.

18.Ashkenazi, A., Presta, L. G., Marsters, S. A., Camerato, T. R., Rosenthal, K. A. et al. (1990). Mapping the CD4 binding site for human immunodeficiency virus by alanine-scanning mutagenesis. Proc Natl Acad Sci U S A, 87, 7150-4.

19.Ranganathan, R., Lewis, J. H. & MacKinnon, R. (1996). Spatial localization of the K+ channel selectivity filter by mutant cycle-based structure analysis. Neuron, 16, 131-9.

20.Lichtarge, O., Bourne, H. R. & Cohen, F. E. (1996). An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol, 257, 342-58.

21.Gallet, X., Charloteaux, B., Thomas, A. & Brasseur, R. (2000). A fast method to predict protein interaction sites from sequences. J Mol Biol, 302, 917-26.

22.Eisenberg, D., Weiss, R. M. & Terwilliger, T. C. (1982). The helical hydrophobic moment: a measure of the amphiphilicity of a helix. Nature, 299, 371-4.

23.Neuvirth, H., Raz, R. & Schreiber, G. (2004). ProMate: a structure based prediction program to identify the location of protein-protein binding sites. J Mol Biol, 338, 181-99.

24.Ofran, Y. & Rost, B. (2003). Predicted protein-protein interaction sites from local sequence information. FEBS Lett, 544, 236-9.

25.Halperin, I., Wolfson, H. & Nussinov, R. (2004). Protein-protein interactions; coupling of structurally conserved residues and of hot spots across interfaces. Implications for docking. Structure, 12, 1027-38.

26.Keskin, O., Ma, B. & Nussinov, R. (2005). Hot regions in protein--protein interactions: the organization and contribution of structurally conserved hot spot residues. J Mol Biol, 345, 1281-94.

27.Kortemme, T. & Baker, D. (2002). A simple physical model for binding energy hot spots in protein-protein complexes. Proc Natl Acad Sci U S A, 99, 14116-21.

28.Kortemme, T., Kim, D. E. & Baker, D. (2004). Computational alanine scanning of protein-protein interfaces. Sci STKE, 2004, pl2.

29.Altschul, S., Madden, T., Shaffer, A., Zhang, J., Zhang, Z. et al. (1997). Gapped Blast and PSI-Blast: a new generation of protein database search programs. Nucleic Acids Research, 25, 3389-3402.

30.Rost, B., Yachdav, G. & Liu, J. (2004). The PredictProtein server. Nucleic Acids Research, 32, W321-W326.

31.Rost, B. (2005). How to use protein 1D structure predicted by PROFphd. In The Proteomics Protocols Handbook (Walker, J. E., eds.), pp. 875-901, Humana, Totowa NJ.

32.Ofran, Y. & Rost, B. (2003). Analysing six types of protein-protein interfaces. J Mol Biol, 325, 377-87.

33.Aloy, P., Oliva, B., Querol, E., Aviles, F. X. & Russell, R. B. (2002). Structural similarity to link sequence space: New potential superfamilies and implications for structural genomics. Protein Science, 11, 1101-1116.

34.Aloy, P. & Russell, R. B. (2002). Interrogating protein interaction networks through structural biology. Proceedings of the National Academy of Sciences, 99, 5896-5901.

35.Innis, C. A., Shi, J. & Blundell, T. L. (2000). Evolutionary trace analysis of TGF-beta and related growth factors: implications for site-directed mutagenesis. Protein Eng, 13, 839-47.

36.Schneider, R. & Sander, C. (1996). The HSSP database of protein structure-sequence alignments. Nucleic Acids Research, 24, 201-205.

 



Contact:    admin@rostlab.org Version:    Aug 10, 2007